From: Sean Christopherson <seanjc@google.com>
To: James Houghton <jthoughton@google.com>
Cc: Paolo Bonzini <pbonzini@redhat.com>,
kvm@vger.kernel.org, linux-kernel@vger.kernel.org,
Oliver Upton <oliver.upton@linux.dev>,
Marc Zyngier <maz@kernel.org>, Peter Xu <peterx@redhat.com>
Subject: Re: [PATCH 19/22] KVM: x86/mmu: Add infrastructure to allow walking rmaps outside of mmu_lock
Date: Mon, 9 Sep 2024 15:02:43 -0700 [thread overview]
Message-ID: <Zt9wg6h_bPp8BKtd@google.com> (raw)
In-Reply-To: <CADrL8HW-mOAyF0Gcw7UbkvEvEfcHDxEir0AiStkqYzD5x8ZGpg@mail.gmail.com>
On Mon, Sep 09, 2024, James Houghton wrote:
> On Mon, Sep 9, 2024 at 1:02 PM Sean Christopherson <seanjc@google.com> wrote:
> >
> > On Mon, Sep 09, 2024, James Houghton wrote:
> > > On Fri, Aug 9, 2024 at 12:44 PM Sean Christopherson <seanjc@google.com> wrote:
> > > > + */
> > > > +#define KVM_RMAP_LOCKED BIT(1)
> > > > +
> > > > +static unsigned long kvm_rmap_lock(struct kvm_rmap_head *rmap_head)
> > > > +{
> > > > + unsigned long old_val, new_val;
> > > > +
> > > > + old_val = READ_ONCE(rmap_head->val);
> > > > + if (!old_val)
> > > > + return 0;
> > > > +
> > > > + do {
> > > > + /*
> > > > + * If the rmap is locked, wait for it to be unlocked before
> > > > + * trying to acquire the lock, e.g. to avoid bouncing the cache line.
> > > > + */
> > > > + while (old_val & KVM_RMAP_LOCKED) {
> > > > + old_val = READ_ONCE(rmap_head->val);
> > > > + cpu_relax();
> > > > + }
> > > > +
> > > > + /*
> > > > + * Recheck for an empty rmap, it may have been purged by the
> > > > + * task that held the lock.
> > > > + */
> > > > + if (!old_val)
> > > > + return 0;
> > > > +
> > > > + new_val = old_val | KVM_RMAP_LOCKED;
> > > > + } while (!try_cmpxchg(&rmap_head->val, &old_val, new_val));
> > >
> > > I think we (technically) need an smp_rmb() here. I think cmpxchg
> > > implicitly has that on x86 (and this code is x86-only), but should we
> > > nonetheless document that we need smp_rmb() (if it is indeed required)?
> > > Perhaps we could/should condition the smp_rmb() on `if (old_val)`.
> >
> > Hmm, no, not smp_rmb(). If anything, the appropriate barrier here would be
> > smp_mb__after_spinlock(), but I'm pretty sure even that is misleading, and arguably
> > even wrong.
>
> I don't think smp_mb__after_spinlock() is right either. This seems to
> be used following the acquisition of a spinlock to promote the memory
> ordering from an acquire barrier (that is implicit with the lock
> acquisition, e.g. [1]) to a full barrier. IIUC, we have no need for a
> stronger-than-usual barrier. But I guess I'm not really sure.
>
> In this case, I'm complaining that we don't have the usual memory
> ordering restrictions that come with a spinlock.
What makes you think that?
> > For the !old_val case, there is an address/data dependency that can't be broken by
> > the CPU without violating the x86 memory model (all future actions with relevant
> > memory loads depend on rmap_head->val being non-zero). And AIUI, in the Linux
> > kernel memory model, READ_ONCE() is responsible for ensuring that the address
> > dependency can't be morphed into a control dependency by the compiler and
> > subsequently reordered by the CPU.
> >
> > I.e. even if this were arm64, ignoring the LOCK CMPXCHG path for the moment, I
> > don't _think_ an smp_{r,w}mb() pair would be needed, as arm64's definition of
> > __READ_ONCE() promotes the operation to an acquire.
> >
> > Back to the LOCK CMPXCHG path, KVM_RMAP_LOCKED implements a rudimentary spinlock,
> > hence my smp_mb__after_spinlock() suggestion. Though _because_ it's a spinlock,
> > the rmaps are fully protected by the critical section.
>
> I feel like a spinlock must include the appropriate barriers for it to
> correctly function as a spinlock, so I'm not sure I fully understand
> what you mean here.
On TSO architectures, the atomic _is_ the barrier. E.g. atomic_try_cmpxchg_acquire()
eventually resolves to atomic_try_cmpxchg() on x86. And jumping back to the
"we don't have the usual memory ordering restrictions that come with a spinlock",
x86's virt_spin_lock() uses atomic_try_cmpxchg(). So while using the acquire
variant here is obviously not wrong, it also feels somewhat weird. Though some
of that is undoubtedly due to explicit "acquire" semantics being rather rare in
x86.
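To make that concrete, here's roughly what the lock would look like if it were
written with explicit acquire/release ordering (purely an illustrative sketch,
not what I'm proposing; on x86 try_cmpxchg_acquire() resolves to the same LOCK
CMPXCHG as try_cmpxchg(), so this only changes anything on weakly ordered
architectures):

  /*
   * Sketch only: the KVM_RMAP_LOCKED bit-spinlock spelled out with the
   * generic acquire/release atomic helpers.
   */
  static unsigned long kvm_rmap_lock_sketch(struct kvm_rmap_head *rmap_head)
  {
          unsigned long old_val, new_val;

          old_val = READ_ONCE(rmap_head->val);
          if (!old_val)
                  return 0;

          do {
                  /* Wait for the holder to drop the lock before retrying. */
                  while (old_val & KVM_RMAP_LOCKED) {
                          cpu_relax();
                          old_val = READ_ONCE(rmap_head->val);
                  }

                  /* Recheck for an empty rmap, the holder may have purged it. */
                  if (!old_val)
                          return 0;

                  new_val = old_val | KVM_RMAP_LOCKED;
                  /* Acquire pairs with the release in kvm_rmap_unlock_sketch(). */
          } while (!try_cmpxchg_acquire(&rmap_head->val, &old_val, new_val));

          return old_val;
  }

  static void kvm_rmap_unlock_sketch(struct kvm_rmap_head *rmap_head,
                                     unsigned long new_val)
  {
          WARN_ON_ONCE(new_val & KVM_RMAP_LOCKED);
          /* Release pairs with the acquire in kvm_rmap_lock_sketch(). */
          smp_store_release(&rmap_head->val, new_val);
  }

I.e. the explicit ordering buys nothing on x86, it just documents semantics
that the plain atomic already provides.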
> > And for the SPTEs, there is no required ordering. The reader (aging
> > thread) can observe a !PRESENT or a PRESENT SPTE, and must be prepared for
> > either. I.e. there is no requirement that the reader observe a PRESENT
> > SPTE if there is a valid rmap.
>
> This makes sense.
>
> > So, unless I'm missing something, I would prefer to not add a smp_mb__after_spinlock(),
> > even though it's a nop on x86 (unless KCSAN_WEAK_MEMORY=y), because it suggests
> > an ordering requirement that doesn't exist.
>
> So we have: the general kvm_rmap_lock() and the read-only
> kvm_rmap_lock_readonly(), as introduced by the next patch[2]. I'll use
> those names (sorry if it's confusing).
>
> For kvm_rmap_lock(), we are always holding mmu_lock for writing. So
> any changes we make to the rmap will be properly published to other
> threads that subsequently grab kvm_rmap_lock() because we had to
> properly release and then re-acquire mmu_lock, which comes with the
> barriers I'm saying we need.
>
> For kvm_rmap_lock_readonly(), we don't hold mmu_lock, so there is no
> smp_rmb() or equivalent. Without an smp_rmb() somewhere, I claim that
> it is possible that we may observe external changes to the
> pte_list_desc while we are in this critical section (for a
> sufficiently weak architecture). The changes that the kvm_rmap_lock()
> (mmu_lock) side made were half-published with an smp_wmb() (really
> [3]), but the read side didn't use a load-acquire or smp_rmb(), so it
> hasn't held up its end of the deal.
>
> I don't think READ_ONCE() has the guarantees we need to be a
> sufficient replacement for smp_rmb() or a load-acquire that a real
> lock would use, although I agree with you that, on arm64, it
> apparently *is* a sufficient replacement.
>
> Now this isn't a problem if the kvm_rmap_lock_readonly() side can
> tolerate changes to pte_list_desc while in the critical section. I
> don't think this is true (given for_each_rmap_spte_lockless),
> therefore an smp_rmb() or equivalent is (technically) needed.
>
> Am I confused?
Yes, I think so. kvm_rmap_lock_readonly() creates a critical section that prevents
any pte_list_desc changes. rmap_head->val, and every pte_list_desc that is pointed
at by rmap_head->val in the KVM_RMAP_MULTI case, is protected and cannot change.
The SPTE _value_ that is pointed at by rmap_head->val or pte_list_desc.sptes[]
can change, but the pointers themselves cannot. And with aging, the code is
completely tolerant of an unstable SPTE _value_ because test-only doesn't care
about false negatives/positives, and test-and-clear is itself an atomic access,
i.e. it won't corrupt a SPTE (and is also tolerant of false positives/negatives).
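To illustrate the tolerance argument, a rough sketch of aging a single SPTE
(hypothetical helper, assuming A/D bits are enabled; the actual series builds
this on top of the lockless rmap walk):

  /*
   * Sketch (hypothetical helper, A/D bits assumed enabled): the sptep pointer
   * is stabilized by KVM_RMAP_LOCKED, the SPTE value it points at is not.
   */
  static bool kvm_age_spte_sketch(u64 *sptep, bool clear)
  {
          /*
           * The value may be stale by the time it's tested; worst case is a
           * false positive or false negative, which aging tolerates.
           */
          u64 spte = READ_ONCE(*sptep);

          if (!is_accessed_spte(spte))
                  return false;

          if (clear)
                  /*
                   * Atomic RMW on a single bit, so a racing SPTE update can't
                   * be corrupted by this (and can't corrupt this).
                   */
                  clear_bit((ffs(shadow_accessed_mask) - 1),
                            (unsigned long *)sptep);

          return true;
  }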
>
> (Though all of this works just fine as written on x86.)
>
> [1]: https://elixir.bootlin.com/linux/v6.11-rc7/source/kernel/locking/rwbase_rt.c#L62
> [2]: https://lore.kernel.org/kvm/20240809194335.1726916-21-seanjc@google.com/
> [3]: https://elixir.bootlin.com/linux/v6.11-rc7/source/kernel/locking/rwbase_rt.c#L190