From: Sean Christopherson <seanjc@google.com>
To: Anish Moorthy <amoorthy@google.com>
Cc: Isaku Yamahata <isaku.yamahata@gmail.com>,
	Marc Zyngier <maz@kernel.org>,
	Oliver Upton <oliver.upton@linux.dev>,
	jthoughton@google.com, kvm@vger.kernel.org
Subject: Re: [WIP Patch v2 04/14] KVM: x86: Add KVM_CAP_X86_MEMORY_FAULT_EXIT and associated kvm_run field
Date: Tue, 21 Mar 2023 12:43:10 -0700	[thread overview]
Message-ID: <ZBoIzo8FGxSyUJ2I@google.com> (raw)
In-Reply-To: <CAF7b7mof8HkcaSthEO8Wu9kf8ZHjE9c1TDzQGAYDYv7FN9+k9w@mail.gmail.com>

On Tue, Mar 21, 2023, Anish Moorthy wrote:
> On Tue, Mar 21, 2023 at 8:21 AM Sean Christopherson <seanjc@google.com> wrote:
> >
> > On Mon, Mar 20, 2023, Anish Moorthy wrote:
> > > On Mon, Mar 20, 2023 at 8:53 AM Sean Christopherson <seanjc@google.com> wrote:
> > > > Filling kvm_run::memory_fault but not exiting to userspace is ok because userspace
> > > > never sees the data, i.e. userspace is completely unaware.  This behavior is not
> > > > ideal from a KVM perspective as allowing KVM to fill the kvm_run union without
> > > > exiting to userspace can lead to other bugs, e.g. effective corruption of the
> > > > kvm_run union, but at least from a uABI perspective, the behavior is acceptable.
> > >
> > > Actually, I don't think the idea of filling in kvm_run.memory_fault
> > > for -EFAULTs which don't make it to userspace works at all. Consider
> > > the direct_map function, which bubbles its -EFAULT to
> > > kvm_mmu_do_page_fault. kvm_mmu_do_page_fault is called from both
> > > kvm_arch_async_page_ready (which ignores the return value), and by
> > > kvm_mmu_page_fault (where the return value does make it to userspace).
> > > Populating kvm_run.memory_fault anywhere in or under
> > > kvm_mmu_do_page_fault seems an immediate no-go, because a wayward
> > > kvm_arch_async_page_ready could (presumably) overwrite/corrupt an
> > > already-set kvm_run.memory_fault / other kvm_run field.
> >
> > This particular case is a non-issue.  kvm_check_async_pf_completion() is called
> > only when the current task has control of the vCPU, i.e. is the current "running"
> > vCPU.  That's not a coincidence either, invoking kvm_mmu_do_page_fault() without
> > having control of the vCPU would be fraught with races, e.g. the entire KVM MMU
> > context would be unstable.
> >
> > That will hold true for all cases.  Using a vCPU that is not loaded (not the
> > current "running" vCPU in KVM's misleading terminology) to access guest memory is
> > simply not safe, as the vCPU state is non-deterministic.  There are paths where
> > KVM accesses, and even modifies, vCPU state asynchronously, e.g. for IRQ delivery
> > and making requests, but those are very controlled flows with dedicated machinery
> > to make them SMP safe.
> >
> > That said, I agree that there's a risk that KVM could clobber vcpu->run by
> > hitting an -EFAULT without the vCPU loaded, but that's a solvable problem, e.g.
> > the helper to fill KVM_EXIT_MEMORY_FAULT could be hardened to yell if called
> > without the target vCPU being loaded:
> >
> >         int kvm_handle_efault(struct kvm_vcpu *vcpu, ...)
> >         {
> >                 preempt_disable();
> >                 if (WARN_ON_ONCE(vcpu != __this_cpu_read(kvm_running_vcpu)))
> >                         goto out;
> >
> >                 vcpu->run->exit_reason = KVM_EXIT_MEMORY_FAULT;
> >                 ...
> >         out:
> >                 preempt_enable();
> >                 return -EFAULT;
> >         }
> >
> > FWIW, I completely agree that filling KVM_EXIT_MEMORY_FAULT without guaranteeing
> > that KVM "immediately" exits to userspace isn't ideal, but given the amount of
> > historical code that we need to deal with, it seems like the lesser of all evils.
> > Unless I'm misunderstanding the use cases, unnecessarily filling kvm_run is a far
> > better failure mode than KVM not filling kvm_run when it should, i.e. false
> > positives are ok, false negatives are fatal.
> 
> Don't you have this in reverse?

No, I don't think so.

> False negatives will just result in userspace not having useful extra
> information for the -EFAULT it receives from KVM_RUN, in which case userspace
> can do what you mentioned all VMMs do today and just terminate the VM.

And that is _really_ bad behavior if we have any hope of userspace actually being
able to rely on this functionality.  E.g. any false negative when userspace is
trying to do postcopy demand paging will be fatal to the VM.

> Whereas a false positive might cause a double-write to the KVM_RUN struct,
> either putting incorrect information in kvm_run.memory_fault or

Recording unused information on -EFAULT in kvm_run doesn't make the information
incorrect.

> corrupting another member of the union.

Only if KVM accesses guest memory after initiating an exit to userspace, which
would be a KVM bug irrespective of kvm_run.memory_fault.  We actually have exactly
this type of bug today in the trainwreck that is KVM's MMIO emulation[1], but
KVM gets away with the shoddy behavior by virtue of the scenario simply not being
triggered by any real-world code.

And if we're really concerned about clobbering state, we could add hardening/auditing
code to ensure that KVM actually exits when kvm_run.exit_reason is set (though there
are a non-zero number of exceptions, e.g. the aforementioned MMIO mess, nested SVM/VMX
pages, and probably a few others).
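
Very roughly, something like the below (an untested sketch, not a real patch; it
assumes exit_reason gets reset at the start of every KVM_RUN and that all exits
funnel through a single helper, neither of which is true today, and the helper
name is made up):

        /*
         * Sketch only.  Assumes vcpu->run->exit_reason is reset to
         * KVM_EXIT_UNKNOWN at the start of each KVM_RUN and that every exit
         * is funneled through this helper; neither holds in current KVM.
         */
        static void kvm_set_exit_reason(struct kvm_vcpu *vcpu, u32 reason)
        {
                /*
                 * Yell if a previously filled exit never made it out to
                 * userspace, modulo the known offenders (MMIO emulation,
                 * nested SVM/VMX pages, etc.).
                 */
                WARN_ON_ONCE(vcpu->run->exit_reason != KVM_EXIT_UNKNOWN);

                vcpu->run->exit_reason = reason;
        }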

Prior to cleanups a few years back[2], emulation failures had issues similar to
what we are discussing, where KVM would fail to exit to userspace, not fill kvm_run,
etc.  Those are the types of bugs I want to avoid here.

[1] https://lkml.kernel.org/r/ZBNrWZQhMX8AHzWM%40google.com
[2] https://lore.kernel.org/kvm/20190823010709.24879-1-sean.j.christopherson@intel.com

> > > That in turn looks problematic for the
> > > memory-fault-exit-on-fast-gup-failure part of this series, because
> > > there are at least a couple of cases for which kvm_mmu_do_page_fault
> > > will -EFAULT. One is the early-efault-on-fast-gup-failure case which
> > > was the original purpose of this series. Another is a -EFAULT from
> > > FNAME(fetch) (passed up through FNAME(page_fault)). There might be
> > > other cases as well. But unless userspace can/should resolve *all*
> > > such -EFAULTs in the same manner, a kvm_run.memory_fault populated in
> > > "kvm_mmu_page_fault" wouldn't be actionable.
> >
> > Killing the VM, which is what all VMMs do today in response to -EFAULT, is an
> > action.  As I've pointed out elsewhere in this thread, userspace needs to be able
> > to identify "faults" that it (userspace) can resolve without a hint from KVM.
> >
> > In other words, KVM is still returning -EFAULT (or a variant thereof), the _only_
> > difference, for all intents and purposes, is that userspace is given a bit more
> > information about the source of the -EFAULT.
> >
> > > At least, not without a whole lot of plumbing code to make it so.
> >
> > Plumbing where?
> 
> In this example, I meant plumbing code to get a kvm_run.memory_fault.flags
> which is more specific than (eg) MEMFAULT_REASON_PAGE_FAULT_FAILURE from the
> -EFAULT paths under kvm_mmu_page_fault. My idea for how userspace would
> distinguish fast-gup failures was that kvm_faultin_pfn would set a special
> bit in kvm_run.memory_fault.flags to indicate its failure. But (still
> assuming that we shouldn't have false-positive kvm_run.memory_fault fills) if
> the memory_fault can only be populated from kvm_mmu_page_fault then either
> failures from FNAME(page_fault) and kvm_faultin_pfn will be indistinguishable
> to userspace, or those functions will need to plumb more specific exit
> reasons all the way up to kvm_mmu_page_fault.

Setting a flag that essentially says "failure when handling a guest page fault"
is problematic on multiple fronts.  Tying the ABI to KVM's internal implementation
is not an option, i.e. the ABI would need to be defined as "on page faults from
the guest".  And then the resulting behavior would be non-deterministic, e.g.
userspace would see different behavior if KVM accessed a "bad" gfn via emulation
instead of in response to a guest page fault.  And because of hardware TLBs, it
would even be possible for the behavior to be non-deterministic on the same
platform running the same guest code (though this would be extremely unlikely
in practice).

And even if userspace is ok with only handling guest page faults _today_, I highly
doubt that will hold forever.  I.e. at some point there will be a use case that
wants to react to uaccess failures on fast-only memslots.

Ignoring all of those issues, simply flagging "this -EFAULT occurred when
handling a guest page fault" isn't precise enough for userspace to blindly resolve
the failure.  Even if KVM went through the trouble of setting information if and
only if get_user_page_fast_only() failed while handling a guest page fault,
userspace would still need/want a way to verify that the failure was expected and
can be resolved, e.g. to guard against userspace bugs due to wrongly unmapping
or mprotecting a page.
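
E.g. a userspace-side sketch of the kind of sanity check I have in mind
(illustrative only; the memory_fault gpa/len fields follow this series' proposed
layout and may change, and the vmm_*() helpers are stand-ins for whatever
bookkeeping the VMM actually uses to track its postcopy state):

        #include <stdbool.h>
        #include <stdint.h>
        #include <linux/kvm.h>

        /* Stand-ins for the VMM's own postcopy bookkeeping, not KVM uAPI. */
        extern bool vmm_gpa_is_under_postcopy(uint64_t gpa, uint64_t len);
        extern int vmm_demand_fetch_and_map(uint64_t gpa, uint64_t len);

        /*
         * Returns 0 if the fault was expected and resolved, -1 if it points at
         * a VMM bug (e.g. a wrongly unmapped or mprotect()ed page), in which
         * case the VMM terminates the VM just as it would for a bare -EFAULT.
         */
        static int handle_memory_fault_exit(struct kvm_run *run)
        {
                uint64_t gpa = run->memory_fault.gpa;
                uint64_t len = run->memory_fault.len;

                if (!vmm_gpa_is_under_postcopy(gpa, len))
                        return -1;

                return vmm_demand_fetch_and_map(gpa, len);
        }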

> But, since you've made this point elsewhere, my guess is that your answer is
> that it's actually userspace's job to detect the "specific" reason for the
> fault and resolve it.

Yes, it's userspace's responsibility.  I simply don't see how KVM can provide
information that userspace doesn't already have without creating an unmaintainable
uABI, at least not without some deep, deep plumbing into gup().  I.e. unless gup()
were changed to explicitly communicate that it failed because of a uffd equivalent,
at best a flag in kvm_run would be a hint that userspace _might_ be able to resolve
the fault.  And even if we modified gup(), we'd still have all the open questions
about what to do when KVM encounters a fault on a uaccess.

Thread overview: 60+ messages
2023-03-15  2:17 [WIP Patch v2 00/14] Avoiding slow get-user-pages via memory fault exit Anish Moorthy
2023-03-15  2:17 ` [WIP Patch v2 01/14] KVM: selftests: Allow many vCPUs and reader threads per UFFD in demand paging test Anish Moorthy
2023-03-15  2:17 ` [WIP Patch v2 02/14] KVM: selftests: Use EPOLL in userfaultfd_util reader threads and signal errors via TEST_ASSERT Anish Moorthy
2023-03-15  2:17 ` [WIP Patch v2 03/14] KVM: Allow hva_pfn_fast to resolve read-only faults Anish Moorthy
2023-03-15  2:17 ` [WIP Patch v2 04/14] KVM: x86: Add KVM_CAP_X86_MEMORY_FAULT_EXIT and associated kvm_run field Anish Moorthy
2023-03-17  0:02   ` Isaku Yamahata
2023-03-17 18:33     ` Anish Moorthy
2023-03-17 19:30       ` Oliver Upton
2023-03-17 21:50       ` Sean Christopherson
2023-03-17 22:44         ` Anish Moorthy
2023-03-20 15:53           ` Sean Christopherson
2023-03-20 18:19             ` Anish Moorthy
2023-03-20 22:11             ` Anish Moorthy
2023-03-21 15:21               ` Sean Christopherson
2023-03-21 18:01                 ` Anish Moorthy
2023-03-21 19:43                   ` Sean Christopherson [this message]
2023-03-22 21:06                     ` Anish Moorthy
2023-03-22 23:17                       ` Sean Christopherson
2023-03-28 22:19                     ` Anish Moorthy
2023-04-04 19:34                       ` Sean Christopherson
2023-04-04 20:40                         ` Anish Moorthy
2023-04-04 22:07                           ` Sean Christopherson
2023-04-05 20:21                             ` Anish Moorthy
2023-03-17 18:35   ` Oliver Upton
2023-03-15  2:17 ` [WIP Patch v2 05/14] KVM: x86: Implement memory fault exit for direct_map Anish Moorthy
2023-03-15  2:17 ` [WIP Patch v2 06/14] KVM: x86: Implement memory fault exit for kvm_handle_page_fault Anish Moorthy
2023-03-15  2:17 ` [WIP Patch v2 07/14] KVM: x86: Implement memory fault exit for setup_vmgexit_scratch Anish Moorthy
2023-03-15  2:17 ` [WIP Patch v2 08/14] KVM: x86: Implement memory fault exit for FNAME(fetch) Anish Moorthy
2023-03-15  2:17 ` [WIP Patch v2 09/14] KVM: Introduce KVM_CAP_MEMORY_FAULT_NOWAIT without implementation Anish Moorthy
2023-03-17 18:59   ` Oliver Upton
2023-03-17 20:15     ` Anish Moorthy
2023-03-17 20:54       ` Sean Christopherson
2023-03-17 23:42         ` Anish Moorthy
2023-03-20 15:13           ` Sean Christopherson
2023-03-20 19:53             ` Anish Moorthy
2023-03-17 20:17     ` Sean Christopherson
2023-03-20 22:22       ` Oliver Upton
2023-03-21 14:50         ` Sean Christopherson
2023-03-21 20:23           ` Oliver Upton
2023-03-21 21:01             ` Sean Christopherson
2023-03-15  2:17 ` [WIP Patch v2 10/14] KVM: x86: Implement KVM_CAP_MEMORY_FAULT_NOWAIT Anish Moorthy
2023-03-17  0:32   ` Isaku Yamahata
2023-03-15  2:17 ` [WIP Patch v2 11/14] KVM: arm64: Allow user_mem_abort to return 0 to signal a 'normal' exit Anish Moorthy
2023-03-17 18:18   ` Oliver Upton
2023-03-15  2:17 ` [WIP Patch v2 12/14] KVM: arm64: Implement KVM_CAP_MEMORY_FAULT_NOWAIT Anish Moorthy
2023-03-17 18:27   ` Oliver Upton
2023-03-17 19:00     ` Anish Moorthy
2023-03-17 19:03       ` Oliver Upton
2023-03-17 19:24       ` Sean Christopherson
2023-03-15  2:17 ` [WIP Patch v2 13/14] KVM: selftests: Add memslot_flags parameter to memstress_create_vm Anish Moorthy
2023-03-15  2:17 ` [WIP Patch v2 14/14] KVM: selftests: Handle memory fault exits in demand_paging_test Anish Moorthy
2023-03-17 17:43 ` [WIP Patch v2 00/14] Avoiding slow get-user-pages via memory fault exit Oliver Upton
2023-03-17 18:13   ` Sean Christopherson
2023-03-17 18:46     ` David Matlack
2023-03-17 18:54       ` Oliver Upton
2023-03-17 18:59         ` David Matlack
2023-03-17 19:53           ` Anish Moorthy
2023-03-17 22:03             ` Sean Christopherson
2023-03-20 15:56               ` Sean Christopherson
2023-03-17 20:35 ` Sean Christopherson
