All of lore.kernel.org
 help / color / mirror / Atom feed
From: Sean Christopherson <seanjc@google.com>
To: Vipin Sharma <vipinsh@google.com>
Cc: Paolo Bonzini <pbonzini@redhat.com>,
	kvm@vger.kernel.org, linux-kernel@vger.kernel.org
Subject: Re: [PATCH v2 3/3] KVM: x86/mmu: Defer allocation of shadow MMU's hashed page list
Date: Fri, 25 Apr 2025 10:45:04 -0700	[thread overview]
Message-ID: <aAvKIPaOgdtOpXlh@google.com> (raw)
In-Reply-To: <aAbhvKa8g973-lV6@google.com>

On Mon, Apr 21, 2025, Sean Christopherson wrote:
> On Tue, Apr 15, 2025, Sean Christopherson wrote:
> > On Tue, Apr 15, 2025, Vipin Sharma wrote:
> > > On 2025-04-01 08:57:14, Sean Christopherson wrote:
> > > > +static __ro_after_init HLIST_HEAD(empty_page_hash);
> > > > +
> > > > +static struct hlist_head *kvm_get_mmu_page_hash(struct kvm *kvm, gfn_t gfn)
> > > > +{
> > > > +	struct hlist_head *page_hash = READ_ONCE(kvm->arch.mmu_page_hash);
> > > > +
> > > > +	if (!page_hash)
> > > > +		return &empty_page_hash;
> > > > +
> > > > +	return &page_hash[kvm_page_table_hashfn(gfn)];
> > > > +}
> > > > +
> > > >  
> > > > @@ -2357,6 +2368,7 @@ static struct kvm_mmu_page *__kvm_mmu_get_shadow_page(struct kvm *kvm,
> > > >  	struct kvm_mmu_page *sp;
> > > >  	bool created = false;
> > > >  
> > > > +	BUG_ON(!kvm->arch.mmu_page_hash);
> > > >  	sp_list = &kvm->arch.mmu_page_hash[kvm_page_table_hashfn(gfn)];
> > > 
> > > Why do we need READ_ONCE() at kvm_get_mmu_page_hash() but not here?
> > 
> > We don't (need it in kvm_get_mmu_page_hash()).  I suspect past me was thinking
> > it could be accessed without holding mmu_lock, but that's simply not true.  Unless
> > I'm forgetting, something, I'll drop the READ_ONCE() and WRITE_ONCE() in
> > kvm_mmu_alloc_page_hash(), and instead assert that mmu_lock is held for write.
> 
> I remembered what I was trying to do.  The _writer_, kvm_mmu_alloc_page_hash(),
> doesn't hold mmu_lock, and so the READ/WRITE_ONCE() is needed.
> 
> But looking at this again, there's really no point in such games.  All readers
> hold mmu_lock for write, so kvm_mmu_alloc_page_hash() can take mmu_lock for read
> to ensure correctness.  That's far easier to reason about than taking a dependency
> on shadow_root_allocated.
> 
> For performance, taking mmu_lock for read is unlikely to generate contention, as
> this is only reachable at runtime if the TDP MMU is enabled.  And mmu_lock is
> going to be taken for write anyways (to allocate the shadow root).

Wrong again.  After way, way too many failed attempts (I tried some truly stupid
ideas) and staring, I finally remembered why it's a-ok to set arch.mmu_page_hash
outside of mmu_lock, and why it's a-ok for __kvm_mmu_get_shadow_page() to not use
READ_ONCE().  I guess that's my penance for not writing a decent changelog or
comments.

Setting the list outside of mmu_lock is safe, as concurrent readers must hold
mmu_lock in some capacity, shadow pages can only be added (or removed) from the
list when mmu_lock is held for write, and tasks that are creating a shadow root
are serialized by slots_arch_lock.  I.e. it's impossible for the list to become
non-empty until all readers go away, and so readers are guaranteed to see an empty
list even if they make multiple calls to kvm_get_mmu_page_hash() in a single
mmu_lock critical section.

__kvm_mmu_get_shadow_page() doesn't need READ_ONCE() because it's only reachable
after the task has gone through mmu_first_shadow_root_alloc(), i.e. access to
mmu_page_hash in that context is fully serialized by slots_arch_lock.

> > > My understanding is that it is in kvm_get_mmu_page_hash() to avoid compiler
> > > doing any read tear. If yes, then the same condition is valid here, isn't it?
> > 
> > The intent wasn't to guard against a tear, but to instead ensure mmu_page_hash
> > couldn't be re-read and end up with a NULL pointer deref, e.g. if KVM set
> > mmu_page_hash and then nullfied it because some later step failed.  But if
> > mmu_lock is held for write, that is simply impossible.

So yes, you were 100% correct, the only reason for WRITE_ONCE/READ_ONCE is to
ensure the compiler doesn't do something stupid and tear the accesses.

      reply	other threads:[~2025-04-25 17:45 UTC|newest]

Thread overview: 16+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2025-04-01 15:57 [PATCH v2 0/3] KVM: x86: Dynamically allocate hashed page list Sean Christopherson
2025-04-01 15:57 ` [PATCH v2 1/3] KVM: x86/mmu: Dynamically allocate shadow MMU's " Sean Christopherson
2025-04-16 15:53   ` Vipin Sharma
2025-04-01 15:57 ` [PATCH v2 2/3] KVM: x86: Allocate kvm_vmx/kvm_svm structures using kzalloc() Sean Christopherson
2025-04-16 18:24   ` Vipin Sharma
2025-04-16 19:06     ` Vipin Sharma
2025-04-16 19:57       ` Sean Christopherson
2025-04-22 22:53         ` Huang, Kai
2025-04-23 17:07           ` Sean Christopherson
2025-04-23 21:46             ` Huang, Kai
2025-04-24 18:31               ` Sean Christopherson
2025-04-01 15:57 ` [PATCH v2 3/3] KVM: x86/mmu: Defer allocation of shadow MMU's hashed page list Sean Christopherson
2025-04-15 20:06   ` Vipin Sharma
2025-04-15 21:52     ` Sean Christopherson
2025-04-22  0:24       ` Sean Christopherson
2025-04-25 17:45         ` Sean Christopherson [this message]

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=aAvKIPaOgdtOpXlh@google.com \
    --to=seanjc@google.com \
    --cc=kvm@vger.kernel.org \
    --cc=linux-kernel@vger.kernel.org \
    --cc=pbonzini@redhat.com \
    --cc=vipinsh@google.com \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line before the message body.
This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.