Re: [RFC PATCH v12 07/33] KVM: Add KVM_EXIT_MEMORY_FAULT exit to report faults to userspace

kvm.vger.kernel.org archive mirror
 help / color / mirror / Atom feed

From: Sean Christopherson <seanjc@google.com>
To: Anish Moorthy <amoorthy@google.com>
Cc: Xiaoyao Li <xiaoyao.li@intel.com>,
	Paolo Bonzini <pbonzini@redhat.com>,
	Marc Zyngier <maz@kernel.org>,
	Oliver Upton <oliver.upton@linux.dev>,
	Huacai Chen <chenhuacai@kernel.org>,
	Michael Ellerman <mpe@ellerman.id.au>,
	Anup Patel <anup@brainfault.org>,
	kvm@vger.kernel.org, kvmarm@lists.linux.dev,
	kvm-riscv@lists.infradead.org, linux-kernel@vger.kernel.org,
	Chao Peng <chao.p.peng@linux.intel.com>,
	Fuad Tabba <tabba@google.com>,
	Jarkko Sakkinen <jarkko@kernel.org>,
	Yu Zhang <yu.c.zhang@linux.intel.com>,
	Isaku Yamahata <isaku.yamahata@intel.com>,
	Xu Yilun <yilun.xu@intel.com>, Vlastimil Babka <vbabka@suse.cz>,
	Vishal Annapurve <vannapurve@google.com>,
	Ackerley Tng <ackerleytng@google.com>,
	Maciej Szmigiero <mail@maciej.szmigiero.name>,
	David Hildenbrand <david@redhat.com>,
	Quentin Perret <qperret@google.com>,
	Michael Roth <michael.roth@amd.com>, Wang <wei.w.wang@intel.com>,
	Liam Merwick <liam.merwick@oracle.com>,
	Isaku Yamahata <isaku.yamahata@gmail.com>
Subject: Re: [RFC PATCH v12 07/33] KVM: Add KVM_EXIT_MEMORY_FAULT exit to report faults to userspace
Date: Mon, 2 Oct 2023 18:42:57 -0700	[thread overview]
Message-ID: <ZRtxoaJdVF1C2Mvy@google.com> (raw)
In-Reply-To: <CAF7b7mrf-y9DNdsreOAedGJueOThnYE=ascFd4=rvW0Z4rhTQg@mail.gmail.com>

On Mon, Oct 02, 2023, Anish Moorthy wrote:
> On Fri, Sep 22, 2023 at 9:28 AM Sean Christopherson <seanjc@google.com> wrote:
> >
> > > So when exit reason is KVM_EXIT_MEMORY_FAULT, how can we tell which field in
> > > the first union is valid?
> >
> > /facepalm
> >
> > At one point, I believe we had discussed a second exit reason field?  But yeah,
> > as is, there's no way for userspace to glean anything useful from the first union.
> 
> Oh, was this an objective? When I was pushing for the second union
> this I was just trying to make sure all the efault annotations
> wouldn't clobber *other* exits. But yeah, I don't/didn't see a
> meaningful way to have valid information in both structs.

Clobbering other exits means KVM is already broken, because simply accessing memory
in guest context after initiating an exit is a KVM bug as it would violate ordering
and maybe causality.   E.g. the only reason the preemption case (see below) isn't
completely buggy is specifically because it's host paravirt behavior.

In other words, ignoring preemption for the moment, not clobbering other exits isn't
useful because whatever buggy KVM behavior caused the clobbering already happened,
i.e. the VM is already in trouble either way.  The only realistic options are to fix
the KVM bugs, or to effectively take an errata and say "don't do that" (like we've
done for the silly PUSHD to MMIO case).

> > The more I think about this, the more I think it's a fool's errand.  Even if KVM
> > provides the exit_reason history, userspace can't act on the previous, unfulfilled
> > exit without *knowing* that it's safe/correct to process the previous exit.  I
> > don't see how that's remotely possible.
> >
> > Practically speaking, there is one known instance of this in KVM, and it's a
> > rather riduclous edge case that has existed "forever".  I'm very strongly inclined
> > to do nothing special, and simply treat clobbering an exit that userspace actually
> > cares about like any other KVM bug.
> >
> > > When exit reason is not KVM_EXIT_MEMORY_FAULT, how can we know the info in
> > > the second union run.memory is valid without a run.memory.valid field?
> >
> > Anish's series adds a flag in kvm_run.flags to track whether or not memory_fault
> > has been filled.  The idea is that KVM would clear the flag early in KVM_RUN, and
> > then set the flag when memory_fault is first filled.
> >
> >         /* KVM_CAP_MEMORY_FAULT_INFO flag for kvm_run.flags */
> >         #define KVM_RUN_MEMORY_FAULT_FILLED (1 << 8)
> >
> > I didn't propose that flag here because clobbering memory_fault from the page
> > fault path would be a flagrant KVM bug.
> >
> > Honestly, I'm becoming more and more skeptical that separating memory_fault is
> > worthwhile, or even desirable.  Similar to memory_fault clobbering something else,
> > userspace can only take action if memory_fault is clobbered if userspace somehow
> > knows that it's safe/correct to do so.
> >
> > Even if KVM exits "immediately" after initially filling memory_fault, the fact
> > that KVM is exiting for a different reason (or a different memory fault) means
> > that KVM did *something* between filling memory_fault and actually exiting.  And
> > it's completely impossible for usersepace to know what that "something" was.
> 
> Are you describing a scenario in which memory_fault is (initially)
> filled, then something else happens to fill memory_fault (thus
> clobbering it), then KVM_RUN exits? I'm confused by the tension
> between the "KVM exits 'immediately'" and "KVM did *something* between
> filling memory_fault and actually existing" statements here.

Yes, I'm describing a hypothetical scenario.  Immediately was in quotes because
even if KVM returns from the *current* function straightaway, it's possible that
control is deep in a call stack, i.e. KVM may "immediately" try to exit from the
current function's perspective, but in reality it may take a while to actually
get out to userspace.

> > > E.g. in the splat from selftests[1], KVM reacts to a failure during Real Mode
> > event injection by synthesizing a triple fault
> >
> >         ret = emulate_int_real(ctxt, irq);
> >
> >         if (ret != X86EMUL_CONTINUE) {
> >                 kvm_make_request(KVM_REQ_TRIPLE_FAULT, vcpu);
> >
> > There are multiple KVM bugs at play: read_emulate() and write_emulate() incorrectly
> > assume *all* failures should be treated like MMIO, and conversely ->read_std() and
> > ->write_std() don't handle *any* failures as MMIO.
> >
> > Circling back to my "capturing the history is pointless" assertion, by the time
> > userspace gets an exit, the vCPU is already in shutdown, and KVM has clobbered
> > memory_fault something like five times.  There is zero chance userspace can do
> > anything but shed a tear for the VM and move on.
> >
> > The whole "let's annotate all memory faults" idea came from my desire to push KVM
> > towards a future where all -EFAULT exits are annotated[2].  I still think we should
> > point KVM in that general direction, i.e. implement something that _can_ provide
> > 100% "coverage" in the future, even though we don't expect to get there anytime soon.
> >
> > I bring that up because neither private memory nor userfault-on-missing needs to
> > annotate anything other than -EFAULT during guest page faults.  I.e. all of this
> > paranoia about clobbering memory_fault isn't actually buying us anything other
> > than noise and complexity.  The cases we need to work _today_ are perfectly fine,
> > and _if_ some future use cases needs all/more paths to be 100% accurate, then the
> > right thing to do is to make whatever changes are necessary for KVM to be 100%
> > accurate.
> >
> > In other words, trying to gracefully handle memory_fault clobbering is pointless.
> > KVM either needs to guarantee there's no clobbering (guest page fault paths) or
> > treat the annotation as best effort and informational-only (everything else at
> > this time).  Future features may grow the set of paths that needs strong guarantees,
> > but that just means fixing more paths and treating any violation of the contract
> > like any other KVM bug.
> 
> Ok, so if we're restricting the exit to just the places it's totally
> accurate (page-fault paths) then, IIUC,
> 
> - There's no reason to attach it to EFAULT, ie it becomes a "normal" exit

No, I still want at least partial line of sight to being able to provide useful
information to userspace on EFAULT.  Making KVM_EXIT_MEMORY_FAULT a "normal" exit
pretty much squashes any hope of that.

> - I should go drop the patches annotating kvm_vcpu_read/write_page
> from my series

Hold up on that.  I'd prefer to keep them as there's still value in giving userspace
debug information.  All I'm proposing is that we would firmly state in the
documentation that those paths must be treated as informational-only.

The whole kvm_steal_time_set_preempted() mess does give me pause though.  That
helper isn't actually problematic, but only because it uses copy_to_user_nofault()
directly :-/

But that doesn't necessarily mean we need to abandon the entire idea, e.g. it
might not be a terrible idea to explicitly differentiate accesses to guest memory
for paravirt stuff, from accesses to guest memory on behalf of the guest.

Anyways, don't do anything just yet.

> - The helper function [a] for filling the memory_fault field
> (downgraded back into the current union) can drop the "has the field
> already been filled?" check/WARN.

That would need to be dropped regardless because it's user-triggered (sadly).

> - [KVM_CAP_USERFAULT_ON_MISSING] The memslot flag check [b] needs to
> be moved back from __gfn_to_pfn_memslot() into
> user_mem_abort()/kvm_handle_error_pfn() since the slot flag-triggered
> fast-gup failures *have* to result in the memory fault exits, and we
> only want to do those in the two SLAT-failure paths (for now).

I'll look at this more closely when I review your series (slowly, slowly getting
there).  There's no right or wrong answer here, it's more a question of what's the
easiest to maintain.

> [a] https://lore.kernel.org/all/20230908222905.1321305-5-amoorthy@google.com/
> [b] https://lore.kernel.org/all/20230908222905.1321305-11-amoorthy@google.com/
> 
> > And if we stop being unnecessarily paranoid, KVM_RUN_MEMORY_FAULT_FILLED can also
> > go away.  The flag came about in part because *unconditionally* sanitizing
> > kvm_run.exit_reason at the start of KVM_RUN would break KVM's ABI, as userspace
> > may rely on the exit_reason being preserved when calling back into KVM to complete
> > userspace I/O (or MMIO)[3].  But the goal is purely to avoid exiting with stale
> > memory_fault information, not to sanitize every other existing exit_reason, and
> > that can be achieved by simply making the reset conditional.
> >
> > ...
> >
> > diff --git a/virt/kvm/kvm_main.c b/virt/kvm/kvm_main.c
> > index 96fc609459e3..d78e97b527e5 100644
> > --- a/virt/kvm/kvm_main.c
> > +++ b/virt/kvm/kvm_main.c
> > @@ -4450,6 +4450,16 @@ static long kvm_vcpu_ioctl(struct file *filp,
> >                                 synchronize_rcu();
> >                         put_pid(oldpid);
> >                 }
> > +
> > +               /*
> > +                * Reset the exit reason if the previous userspace exit was due
> > +                * to a memory fault.  Not all -EFAULT exits are annotated, and
> > +                * so leaving exit_reason set to KVM_EXIT_MEMORY_FAULT could
> > +                * result in feeding userspace stale information.
> > +                */
> > +               if (vcpu->run->exit_reason == KVM_EXIT_MEMORY_FAULT)
> > +                       vcpu->run->exit_reason = KVM_EXIT_UNKNOWN
> > +
> >                 r = kvm_arch_vcpu_ioctl_run(vcpu);
> 
> Under my reading of the earlier block I'm not sure why we need to keep
> this around. The original idea behind a canary of this type was to
> avoid stomping on non-memory-fault exits in cases where something
> caused an (ignored) annotated memory fault before the exit could be
> completed. But if the annotations are going to be restricted in
> general to just the page fault paths, then we can just eliminate the
> sentinel check (and just this block) entirely, right?

This isn't a canary, it's to ensure KVM doesn't feed userspace garbage.  As above,
I'm not saying we throw away all of the code for the "optional" paths, I'm saying
that we only commit to 100% accuracy for the paths that the two use cases need to
work, i.e. the page fault handlers.

next prev parent reply	other threads:[~2023-10-03  1:43 UTC|newest]

Thread overview: 83+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2023-09-14  1:54 [RFC PATCH v12 00/33] KVM: guest_memfd() and per-page attributes Sean Christopherson
2023-09-14  1:54 ` [RFC PATCH v12 01/33] KVM: Tweak kvm_hva_range and hva_handler_t to allow reusing for gfn ranges Sean Christopherson
2023-09-15  6:47   ` Xiaoyao Li
2023-09-15 21:05     ` Sean Christopherson
2023-09-14  1:55 ` [RFC PATCH v12 02/33] KVM: Use gfn instead of hva for mmu_notifier_retry Sean Christopherson
2023-09-14  3:07   ` Binbin Wu
2023-09-14 14:19     ` Sean Christopherson
2023-09-20  6:07   ` Xu Yilun
2023-09-20 13:55     ` Sean Christopherson
2023-09-21  2:39       ` Xu Yilun
2023-09-21 14:24         ` Sean Christopherson
2023-09-14  1:55 ` [RFC PATCH v12 03/33] KVM: PPC: Drop dead code related to KVM_ARCH_WANT_MMU_NOTIFIER Sean Christopherson
2023-09-14  1:55 ` [RFC PATCH v12 04/33] KVM: PPC: Return '1' unconditionally for KVM_CAP_SYNC_MMU Sean Christopherson
2023-09-14  1:55 ` [RFC PATCH v12 05/33] KVM: Convert KVM_ARCH_WANT_MMU_NOTIFIER to CONFIG_KVM_GENERIC_MMU_NOTIFIER Sean Christopherson
2023-10-09 16:42   ` Anup Patel
2023-09-14  1:55 ` [RFC PATCH v12 06/33] KVM: Introduce KVM_SET_USER_MEMORY_REGION2 Sean Christopherson
2023-09-15  6:59   ` Xiaoyao Li
2023-09-14  1:55 ` [RFC PATCH v12 07/33] KVM: Add KVM_EXIT_MEMORY_FAULT exit to report faults to userspace Sean Christopherson
2023-09-22  6:03   ` Xiaoyao Li
2023-09-22 14:30     ` Sean Christopherson
2023-09-22 16:28     ` Sean Christopherson
2023-09-22 16:35       ` Sean Christopherson
2023-10-02 22:33       ` Anish Moorthy
2023-10-03  1:42         ` Sean Christopherson [this message]
2023-10-03 22:59           ` Anish Moorthy
2023-10-03 23:46             ` Sean Christopherson
2023-10-05 22:07               ` Anish Moorthy
2023-10-05 22:46                 ` Sean Christopherson
2023-10-10 22:21                   ` David Matlack
2023-10-13 18:45                     ` Sean Christopherson
2023-09-14  1:55 ` [RFC PATCH v12 08/33] KVM: Add a dedicated mmu_notifier flag for reclaiming freed memory Sean Christopherson
2023-09-14  1:55 ` [RFC PATCH v12 09/33] KVM: Drop .on_unlock() mmu_notifier hook Sean Christopherson
2023-09-14  1:55 ` [RFC PATCH v12 10/33] KVM: Set the stage for handling only shared mappings in mmu_notifier events Sean Christopherson
2023-09-18  1:14   ` Binbin Wu
2023-09-18 15:57     ` Sean Christopherson
2023-09-18 18:07   ` Michael Roth
2023-09-19  0:08     ` Sean Christopherson
2023-09-14  1:55 ` [RFC PATCH v12 11/33] KVM: Introduce per-page memory attributes Sean Christopherson
2023-09-15  6:32   ` Yan Zhao
2023-09-20 21:00     ` Sean Christopherson
2023-09-21  1:21       ` Yan Zhao
2023-09-25 17:37         ` Sean Christopherson
2023-09-18  7:51   ` Binbin Wu
2023-09-20 21:03     ` Sean Christopherson
2023-09-27  5:19       ` Binbin Wu
2023-10-03 12:47   ` Fuad Tabba
2023-10-03 15:59     ` Sean Christopherson
2023-10-03 18:33       ` Fuad Tabba
2023-10-03 20:51         ` Sean Christopherson
2023-09-14  1:55 ` [RFC PATCH v12 12/33] mm: Add AS_UNMOVABLE to mark mapping as completely unmovable Sean Christopherson
2023-09-14  1:55 ` [RFC PATCH v12 13/33] security: Export security_inode_init_security_anon() for use by KVM Sean Christopherson
2023-09-14  1:55 ` [RFC PATCH v12 14/33] KVM: Add KVM_CREATE_GUEST_MEMFD ioctl() for guest-specific backing memory Sean Christopherson
2023-09-15  6:11   ` Yan Zhao
2023-09-18 16:36   ` Michael Roth
2023-09-20 23:44     ` Sean Christopherson
2023-09-19  9:01   ` Binbin Wu
2023-09-20 14:24     ` Sean Christopherson
2023-09-21  5:58       ` Binbin Wu
2023-09-21 19:10   ` Sean Christopherson
2023-09-14  1:55 ` [RFC PATCH v12 15/33] KVM: Add transparent hugepage support for dedicated guest memory Sean Christopherson
2023-09-14  1:55 ` [RFC PATCH v12 16/33] KVM: x86: "Reset" vcpu->run->exit_reason early in KVM_RUN Sean Christopherson
2023-09-14  1:55 ` [RFC PATCH v12 17/33] KVM: x86: Disallow hugepages when memory attributes are mixed Sean Christopherson
2023-09-14  1:55 ` [RFC PATCH v12 18/33] KVM: x86/mmu: Handle page fault for private memory Sean Christopherson
2023-09-15  5:40   ` Yan Zhao
2023-09-15 14:26     ` Sean Christopherson
2023-09-18  0:54       ` Yan Zhao
2023-09-21 14:59         ` Sean Christopherson
2023-09-21  5:51       ` Binbin Wu
2023-09-14  1:55 ` [RFC PATCH v12 19/33] KVM: Drop superfluous __KVM_VCPU_MULTIPLE_ADDRESS_SPACE macro Sean Christopherson
2023-09-14  1:55 ` [RFC PATCH v12 20/33] KVM: Allow arch code to track number of memslot address spaces per VM Sean Christopherson
2023-09-14  1:55 ` [RFC PATCH v12 21/33] KVM: x86: Add support for "protected VMs" that can utilize private memory Sean Christopherson
2023-09-14  1:55 ` [RFC PATCH v12 22/33] KVM: selftests: Drop unused kvm_userspace_memory_region_find() helper Sean Christopherson
2023-09-14  1:55 ` [RFC PATCH v12 23/33] KVM: selftests: Convert lib's mem regions to KVM_SET_USER_MEMORY_REGION2 Sean Christopherson
2023-09-14  1:55 ` [RFC PATCH v12 24/33] KVM: selftests: Add support for creating private memslots Sean Christopherson
2023-09-14  1:55 ` [RFC PATCH v12 25/33] KVM: selftests: Add helpers to convert guest memory b/w private and shared Sean Christopherson
2023-09-14  1:55 ` [RFC PATCH v12 26/33] KVM: selftests: Add helpers to do KVM_HC_MAP_GPA_RANGE hypercalls (x86) Sean Christopherson
2023-09-14  1:55 ` [RFC PATCH v12 27/33] KVM: selftests: Introduce VM "shape" to allow tests to specify the VM type Sean Christopherson
2023-09-14  1:55 ` [RFC PATCH v12 28/33] KVM: selftests: Add GUEST_SYNC[1-6] macros for synchronizing more data Sean Christopherson
2023-09-14  1:55 ` [RFC PATCH v12 29/33] KVM: selftests: Add x86-only selftest for private memory conversions Sean Christopherson
2023-09-14  1:55 ` [RFC PATCH v12 30/33] KVM: selftests: Add KVM_SET_USER_MEMORY_REGION2 helper Sean Christopherson
2023-09-14  1:55 ` [RFC PATCH v12 31/33] KVM: selftests: Expand set_memory_region_test to validate guest_memfd() Sean Christopherson
2023-09-14  1:55 ` [RFC PATCH v12 32/33] KVM: selftests: Add basic selftest for guest_memfd() Sean Christopherson
2023-09-14  1:55 ` [RFC PATCH v12 33/33] KVM: selftests: Test KVM exit behavior for private memory/access Sean Christopherson

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=ZRtxoaJdVF1C2Mvy@google.com \
    --to=seanjc@google.com \
    --cc=ackerleytng@google.com \
    --cc=amoorthy@google.com \
    --cc=anup@brainfault.org \
    --cc=chao.p.peng@linux.intel.com \
    --cc=chenhuacai@kernel.org \
    --cc=david@redhat.com \
    --cc=isaku.yamahata@gmail.com \
    --cc=isaku.yamahata@intel.com \
    --cc=jarkko@kernel.org \
    --cc=kvm-riscv@lists.infradead.org \
    --cc=kvm@vger.kernel.org \
    --cc=kvmarm@lists.linux.dev \
    --cc=liam.merwick@oracle.com \
    --cc=linux-kernel@vger.kernel.org \
    --cc=mail@maciej.szmigiero.name \
    --cc=maz@kernel.org \
    --cc=michael.roth@amd.com \
    --cc=mpe@ellerman.id.au \
    --cc=oliver.upton@linux.dev \
    --cc=pbonzini@redhat.com \
    --cc=qperret@google.com \
    --cc=tabba@google.com \
    --cc=vannapurve@google.com \
    --cc=vbabka@suse.cz \
    --cc=wei.w.wang@intel.com \
    --cc=xiaoyao.li@intel.com \
    --cc=yilun.xu@intel.com \
    --cc=yu.c.zhang@linux.intel.com \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link

Be sure your reply has a Subject: header at the top and a blank line before the message body.

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).