All of lore.kernel.org
 help / color / mirror / Atom feed
From: Sean Christopherson <seanjc@google.com>
To: Yan Zhao <yan.y.zhao@intel.com>
Cc: Isaku Yamahata <isaku.yamahata@intel.com>,
	kvm@vger.kernel.org, sagis@google.com,  chao.gao@intel.com,
	pbonzini@redhat.com, rick.p.edgecombe@intel.com,
	 linux-kernel@vger.kernel.org
Subject: Re: [PATCH] KVM: x86/tdp_mmu: Trigger the callback only when an interesting change
Date: Fri, 27 Sep 2024 06:51:58 -0700	[thread overview]
Message-ID: <Zva4aORxE9ljlMNe@google.com> (raw)
In-Reply-To: <ZvUS+Cwg6DyA62EC@yzhao56-desk.sh.intel.com>

On Thu, Sep 26, 2024, Yan Zhao wrote:
> On Thu, Sep 12, 2024 at 05:07:57PM -0700, Sean Christopherson wrote:
> > On Thu, Sep 12, 2024, Isaku Yamahata wrote:
> > > KVM MMU behavior
> > > ================
> > > The leaf SPTE state machine is coded in make_spte().  Consider AD bits and
> > > the present bits for simplicity.  The two functionalities and AD bits
> > > support are related in this context.  unsync (manipulate D bit and W bit,
> > > and handle write protect fault) and access tracking (manipulate A bit and
> > > present bit, and hand handle page fault).  (We don't consider dirty page
> > > tracking for now as it's future work of TDX KVM.)
> > > 
> > > * If AD bit is enabled:
> > > D bit state change for dirty page tracking
> > > On the first EPT violation without prefetch,
> > > - D bits are set.
> > > - Make SPTE writable as TDX supports only RXW (or if write fault).
> > >   (TDX KVM doesn't support write protection at this state. It's future work.)
> > > 
> > > On the second EPT violation.
> > > - clear D bits if write fault.
> > 
> > Heh, I was literally just writing changelogs for patches to address this (I told
> > Sagi I would do it "this week"; that was four weeks ago).
> > 
> > This is a bug in make_spte().  Replacing a W=1,D=1 SPTE with a W=1,D=0 SPTE is
> > nonsensical.  And I'm pretty sure it's an outright but for the TDP MMU (see below).
> > 
> > Right now, the fixes for make_spte() are sitting toward the end of the massive
> > kvm_follow_pfn() rework (80+ patches and counting), but despite the size, I am
> > fairly confident that series can land in 6.13 (lots and lots of small patches).
> > 
> > ---
> > Author:     Sean Christopherson <seanjc@google.com>
> > AuthorDate: Thu Sep 12 16:23:21 2024 -0700
> > Commit:     Sean Christopherson <seanjc@google.com>
> > CommitDate: Thu Sep 12 16:35:06 2024 -0700
> > 
> >     KVM: x86/mmu: Flush TLBs if resolving a TDP MMU fault clears W or D bits
> >     
> >     Do a remote TLB flush if installing a leaf SPTE overwrites an existing
> >     leaf SPTE (with the same target pfn) and clears the Writable bit or the
> >     Dirty bit.  KVM isn't _supposed_ to clear Writable or Dirty bits in such
> >     a scenario, but make_spte() has a flaw where it will fail to set the Dirty
> >     if the existing SPTE is writable.
> >     
> >     E.g. if two vCPUs race to handle faults, the KVM will install a W=1,D=1
> >     SPTE for the first vCPU, and then overwrite it with a W=1,D=0 SPTE for the
> >     second vCPU.  If the first vCPU (or another vCPU) accesses memory using
> >     the W=1,D=1 SPTE, i.e. creates a writable, dirty TLB entry, and that is
> >     the only SPTE that is dirty at the time of the next relevant clearing of
> >     the dirty logs, then clear_dirty_gfn_range() will not modify any SPTEs
> >     because it sees the D=0 SPTE, and thus will complete the clearing of the
> >     dirty logs without performing a TLB flush.
> But it looks that kvm_flush_remote_tlbs_memslot() will always be invoked no
> matter clear_dirty_gfn_range() finds a D bit or not.

Oh, right, I forgot about that.  I'll tweak the changelog to call that out before
posting.  Hmm, and I'll drop the Cc: stable@ too, as commit b64d740ea7dd ("kvm:
x86: mmu: Always flush TLBs when enabling dirty logging") was a bug fix, i.e. if
anything should be backported it's that commit.

> kvm_mmu_slot_apply_flags
>   |kvm_mmu_slot_leaf_clear_dirty
>   |  |kvm_tdp_mmu_clear_dirty_slot
>   |    |clear_dirty_gfn_range
>   |kvm_flush_remote_tlbs_memslot
>      
> >     Opportunistically harden the TDP MMU against clearing the Writable bit as
> >     well, both to prevent similar bugs for write-protection, but also so that
> >     the logic can eventually be deduplicated into spte.c (mmu_spte_update() in
> >     the shadow MMU has similar logic).
> >     
> >     Fix the bug in the TDP MMU's page fault handler even though make_spte() is
> >     clearly doing odd things, e.g. it marks the page dirty in its slot but
> >     doesn't set the Dirty bit.  Precisely because replacing a leaf SPTE with
> >     another leaf SPTE is so rare, the cost of hardening KVM against such bugs
> >     is negligible.  The make_spte() will be addressed in a future commit.
> >     
> >     Fixes: bb18842e2111 ("kvm: x86/mmu: Add TDP MMU PF handler")
> >     Cc: David Matlack <dmatlack@google.com>
> >     Cc: stable@vger.kernel.org
> >     Signed-off-by: Sean Christopherson <seanjc@google.com>
> > 
> > diff --git a/arch/x86/kvm/mmu/tdp_mmu.c b/arch/x86/kvm/mmu/tdp_mmu.c
> > index 3b996c1fdaab..7c6d1c610f0e 100644
> > --- a/arch/x86/kvm/mmu/tdp_mmu.c
> > +++ b/arch/x86/kvm/mmu/tdp_mmu.c
> > @@ -1038,7 +1038,9 @@ static int tdp_mmu_map_handle_target_level(struct kvm_vcpu *vcpu,
> >         else if (tdp_mmu_set_spte_atomic(vcpu->kvm, iter, new_spte))
> >                 return RET_PF_RETRY;
> >         else if (is_shadow_present_pte(iter->old_spte) &&
> > -                !is_last_spte(iter->old_spte, iter->level))
> > +                (!is_last_spte(iter->old_spte, iter->level) ||
> > +                 (is_mmu_writable_spte(old_spte) && !is_writable_pte(new_spte)) ||
> > +                 (is_dirty_spte(old_spte) && !is_dirty_spte(new_spte))))
> Should we also check is_accessed_spte() as what's done in mmu_spte_update()?

Heh, it's impossible to see in this standalone "posting", but earlier in the
massive series is a patch that removes that code.  Aptly titled "KVM: x86/mmu:
Don't force flush if SPTE update clears Accessed bit".

> It is possible for make_spte() to make the second spte !is_dirty_spte(), e.g.
> the second one is caused by a KVM_PRE_FAULT_MEMORY ioctl.

Ya, the mega-series also has a fix for that: "KVM: x86/mmu: Always set SPTE's
dirty bit if it's created as writable".

> >                 kvm_flush_remote_tlbs_gfn(vcpu->kvm, iter->gfn, iter->level);
> >  
> >         /*
> > ---
> > 
> > >  arch/x86/kvm/mmu/spte.h    |  6 ++++++
> > >  arch/x86/kvm/mmu/tdp_mmu.c | 28 +++++++++++++++++++++++++---
> > >  2 files changed, 31 insertions(+), 3 deletions(-)
> > > 
> > > diff --git a/arch/x86/kvm/mmu/spte.h b/arch/x86/kvm/mmu/spte.h
> > > index a72f0e3bde17..1726f8ec5a50 100644
> > > --- a/arch/x86/kvm/mmu/spte.h
> > > +++ b/arch/x86/kvm/mmu/spte.h
> > > @@ -214,6 +214,12 @@ extern u64 __read_mostly shadow_nonpresent_or_rsvd_mask;
> > >   */
> > >  #define FROZEN_SPTE	(SHADOW_NONPRESENT_VALUE | 0x5a0ULL)
> > >  
> > > +#define EXTERNAL_SPTE_IGNORE_CHANGE_MASK		\
> > > +	(shadow_acc_track_mask |			\
> > > +	 (SHADOW_ACC_TRACK_SAVED_BITS_MASK <<		\
> > > +	  SHADOW_ACC_TRACK_SAVED_BITS_SHIFT) |		\
> > 
> > Just make TDX require A/D bits, there's no reason to care about access tracking.
> If KVM_PRE_FAULT_MEMORY is allowed for TDX and if
> cpu_has_vmx_ept_ad_bits() is false in TDX's hardware (not sure if it's possible),

Make it a requirement in KVM that TDX hardware supports A/D bits and that KVM's
module param is enabled.  EPT A/D bits have been supported in all CPUs since
Haswell, I don't expect them to ever go away.

> access tracking bit is possbile to be changed, as in below scenario:
> 
> 1. KVM_PRE_FAULT_MEMORY ioctl calls kvm_arch_vcpu_pre_fault_memory() to map
>    a GFN, and make_spte() will call mark_spte_for_access_track() to
>    remove shadow_acc_track_mask (i.e. RWX bits) and move R+X left to
>    SHADOW_ACC_TRACK_SAVED_BITS_SHIFT.
> 2. If a concurrent page fault occurs on the same GFN on another vCPU, then
>    make_spte() in that vCPU will not see prefetch and the new_spte is
>    with RWX bits and with no bits set in SHADOW_ACC_TRACK_SAVED_MASK.

This should be fixed by the mega-series.  I'll make sure to Cc you on that series.

Thanks much for the input and feedback!

  reply	other threads:[~2024-09-27 13:52 UTC|newest]

Thread overview: 16+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2024-09-12 23:36 [PATCH] KVM: x86/tdp_mmu: Trigger the callback only when an interesting change Isaku Yamahata
2024-09-13  0:07 ` Sean Christopherson
2024-09-13  1:22   ` Isaku Yamahata
2024-09-26  7:53   ` Yan Zhao
2024-09-27 13:51     ` Sean Christopherson [this message]
2024-09-27 14:32       ` Sean Christopherson
2024-09-27 15:06         ` Sean Christopherson
2024-10-08  2:57           ` Yan Zhao
2024-10-08 19:15           ` Sean Christopherson
2024-10-09  9:21             ` Yan Zhao
2024-10-09 13:14               ` Sean Christopherson
2024-09-29  9:27         ` Yan Zhao
2024-09-30 15:54           ` Sean Christopherson
2024-10-08  0:44             ` Yan Zhao
2024-10-08 16:07               ` Sean Christopherson
2024-09-29  9:29       ` Yan Zhao

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=Zva4aORxE9ljlMNe@google.com \
    --to=seanjc@google.com \
    --cc=chao.gao@intel.com \
    --cc=isaku.yamahata@intel.com \
    --cc=kvm@vger.kernel.org \
    --cc=linux-kernel@vger.kernel.org \
    --cc=pbonzini@redhat.com \
    --cc=rick.p.edgecombe@intel.com \
    --cc=sagis@google.com \
    --cc=yan.y.zhao@intel.com \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line before the message body.
This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.