Re: [PATCH] KVM: move memslot invalidation later than possible failures

All of lore.kernel.org
 help / color / mirror / Atom feed

From: Sean Christopherson <seanjc@google.com>
To: Yan Zhao <yan.y.zhao@intel.com>
Cc: kvm@vger.kernel.org, linux-kernel@vger.kernel.org, pbonzini@redhat.com
Subject: Re: [PATCH] KVM: move memslot invalidation later than possible failures
Date: Thu, 10 Nov 2022 00:33:25 +0000	[thread overview]
Message-ID: <Y2xG1SY/kNULHFck@google.com> (raw)
In-Reply-To: <Y2tNGHF5Lbjk4DQV@yzhao56-desk.sh.intel.com>

On Wed, Nov 09, 2022, Yan Zhao wrote:
> On Tue, Nov 08, 2022 at 05:32:50PM +0000, Sean Christopherson wrote:
> > On Tue, Nov 08, 2022, Yan Zhao wrote:
> > > For memslot delete and move, kvm_invalidate_memslot() is required before
> > > the real changes committed.
> > > Besides swapping to an inactive slot, kvm_invalidate_memslot() will call
> > > kvm_arch_flush_shadow_memslot() and further kvm_page_track_flush_slot() in
> > > arch x86.
> > > And according to the definition in kvm_page_track_notifier_node, users can
> > > drop write-protection for the pages in the memory slot on receiving
> > > .track_flush_slot.
> > 
> > Ugh, that's a terrible API.  The entire page track API is a mess, e.g. KVMGT is
> > forced to grab its own references to KVM and also needs to manually acquire/release
> > mmu_lock in some flows but not others.
> > 
> > Anyways, this is a flaw in the page track API that should be fixed.  Flushing a
> > slot should not be overloaded to imply "this slot is gone", it should be a flush
> > command, no more, no less.
> hmm, but KVM also registers to the page track
> "node->track_flush_slot = kvm_mmu_invalidate_zap_pages_in_memslot;"
> And in kvm_mmu_invalidate_zap_pages_in_memslot, memslot (actaully all the shadow
> page tables) is zapped. And during the zap, unaccount_shadowed() will drop the
> write protection. But KVM is ok because the dropped write-protection can be
> rebuilt during rebuilding the shadow page table.
> So, for .track_flush_slot, it's expected that "users can drop write-protection
> for the pages in the memory slot", right?

No.  KVM isn't actually dropping write-protection, because for the internal KVM
case, KVM obliterates all of its page tables.

> > AFAICT, KVMGT never flushes anything, so fixing the bug should be a simple matter
> > of adding another hook that's invoked when the memory region change is committed.
> >
> Do you mean adding a new hook in page track, e.g. .track_slot_change?
> Then right before committing slot changes, call this interface to notify slot
> DELETE/MOVE?
> 
> Not only KVMGT, KVM can also be affected by this failure to MOVE if it wants to
> support below usecase:
> 1. KVM pre-populated a memslot
> 2. memslot MOVE happened
> 3. KVM pre-populates the new memslot (MOVE succeeds) or the old one (MOVE fails).
> So also add a new interface for the MOVE failure?
> 
> > That would allow KVMGT to fix another bug.  If a memory region is moved and the
> > new region partially overlaps the old region, KVMGT technically probably wants to
> > retain its write-protection scheme.  Though that's probably not worth supporting,
> > might be better to say "don't move memory regions if KVMGT is enabled", because
> > AFAIK there is no VMM that actually moves memory regions (KVM's MOVE support was
> > broken for years and no one noticed).
> >
> So, could we disable MOVE in KVM at all?

Ideally, yes.  Practically?  Dunno.  It's difficult to know whether or not there
are users out there.

> > Actually, given that MOVE + KVMGT probably doesn't work regardless of where the
> > region is moved to, maybe we can get away with rejecting MOVE if an external
> > page tracker cares about the slot in question.
> > 
> > > However, if kvm_prepare_memory_region() fails, the later
> > > kvm_activate_memslot() will only swap back the original slot, leaving
> > > previous write protection not recovered.
> > > 
> > > This may not be a problem for kvm itself as a page tracker user, but may
> > > cause problem to other page tracker users, e.g. kvmgt, whose
> > > write-protected pages are removed from the write-protected list and not
> > > added back.
> > > 
> > > So call kvm_prepare_memory_region first for meta data preparation before
> > > the slot invalidation so as to avoid failure and recovery.
> > 
> > IIUC, this is purely a theoretical bug and practically speaking can't be a problem
> > in practice, at least not without completely breaking KVMGT.
> > 
> > On DELETE, kvm_prepare_memory_region() will never fail on x86 (s390's ultravisor
> > protected VM case is the only scenario where DELETE can fail on any architecture).
> > The common code in kvm_prepare_memory_region() does nothing for DELETE, ditto for
> > kvm_arch_prepare_memory_region().
> But as long as with current code sequence, we can't relying on that
> kvm_prepare_memory_region() will never fail for DELETE.
> Or, we need to call kvm_prepare_memory_region() only for !DELETE to avoid future
> possible broken.

Agreed, I just don't want to touch the common memslots code unless it's necessary.

> > And for MOVE, it can only fail in two scenarios: (1) the end gfn is beyond the
> > max gfn, i.e. userspace gave bad input or (2) the new memslot is unaligned but
> > the old memslot was not, and so KVM needs to allocate new metadata due to the new
> > memslot needed larger arrays.
> kvm_prepare_memory_region() can also fail for MOVE during live migration when
> memslot->dirty_bitmap allocation is failed in kvm_alloc_dirty_bitmap().
> and in x86, kvm_arch_prepare_memory_region() can also fail for MOVE if
> kvm_alloc_memslot_metadata() fails due to -ENOMEM. 
> BTW, I don't find the "(2) the new memslot is unaligned but the old memslot was not,
> and so KVM needs to allocate new metadata due to the new memslot needed
> larger arrays". 
> > 
> > As above MOVE is not handled correctly by KVMGT, so I highly doubt there is a VMM
> > that supports KVMGT and moves relevant memory regions, let alove does something
> > that can result in MOVE failing _and_ moves the memslot that KVMGT is shadowing.
> > 
> > Heh, and MOVE + KVM_MEM_LOG_DIRTY_PAGES is also arguably broken, as KVM reuses
> > the existing dirty bitmap but doesn't shift the dirty bits.  This is likely
> Do you mean, for the new slot in MOVE, the new dirty bitmap should be
> cleared? Otherwise, why shift is required, given mem->userspace_addr and npages
> are not changed and dirty_bitmap is indexed using rel_gfn 
> (rel_gfn = gfn - memslot->base_gfn) and both QEMU/KVM aligns the bitmap
> size to BITS_PER_LONG.

Oh, you're right.  I forgot that userspace would also be doing a gfn-relative
calculation in the ridiculously theoretically scenario that a memslot is moved
while it's being dirty-logged.

> > another combination that KVM can simply reject.
> > 
> > Assuming this is indeed purely theoretical, that should be called out in the
> > changelog.  Or as above, simply disallow MOVE in this case, which probably has
> > my vote.
> >
> Yes, currently it's a purely theoretical bug, as I'm not seeing MOVE is triggered
> in upstream QEMU.
> 
> <...>
> 
> > I'm not 100% sure that simply moving kvm_invalidate_memslot() is functionally
> > correct.  It might be, but there are so many subtleties in this code that I don't
> > want to change this ordering unless absolutely necessary, or at least not without
> > an in-depth analysis and a pile of tests.  E.g. it's possible one or more
> > architectures relies on unmapping, flushing, and invalidating the old region
> > prior to preparing the new region, e.g. to reuse arch memslot data.
> yes. what about just moving kvm_arch_flush_shadow_memslot() and
> kvm_arch_guest_memory_reclaimed() to later than kvm_arch_prepare_memory_region()?

I'm not dead set against tweaking the memslots flow, but I don't want to do so to
fix this extremely theoretical bug.  In other words, I want to fix this by giving
KVM-GT a more appropriate hook, not by shuffling common KVM code to fudge around
a poor KVM x86 API.

And if KVM-GT moves away from track_flush_slot(), we can delete the hook entirely
after cleaning up clean up another pile of ugly: KVM always registers a page-track
notifier because it relies on the track_flush_slot() call to invoke
kvm_mmu_invalidate_zap_pages_in_memslot(), even when KVM isn't tracking anything.
I'll send patches for this; if/when both land, track_flush_slot() can be deleted
on top.

next prev parent reply	other threads:[~2022-11-10  0:33 UTC|newest]

Thread overview: 5+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2022-11-08  8:44 [PATCH] KVM: move memslot invalidation later than possible failures Yan Zhao
2022-11-08 17:32 ` Sean Christopherson
2022-11-09  6:47   ` Yan Zhao
2022-11-10  0:33     ` Sean Christopherson [this message]
2022-11-10  0:50       ` Yan Zhao

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=Y2xG1SY/kNULHFck@google.com \
    --to=seanjc@google.com \
    --cc=kvm@vger.kernel.org \
    --cc=linux-kernel@vger.kernel.org \
    --cc=pbonzini@redhat.com \
    --cc=yan.y.zhao@intel.com \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link

Be sure your reply has a Subject: header at the top and a blank line before the message body.

This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.