From mboxrd@z Thu Jan 1 00:00:00 1970 Received: from mail-yb1-f201.google.com (mail-yb1-f201.google.com [209.85.219.201]) (using TLSv1.2 with cipher ECDHE-RSA-AES128-GCM-SHA256 (128/128 bits)) (No client certificate requested) by smtp.subspace.kernel.org (Postfix) with ESMTPS id 07B5737165 for ; Tue, 10 Sep 2024 01:42:56 +0000 (UTC) Authentication-Results: smtp.subspace.kernel.org; arc=none smtp.client-ip=209.85.219.201 ARC-Seal:i=1; a=rsa-sha256; d=subspace.kernel.org; s=arc-20240116; t=1725932578; cv=none; b=nTvpcXt/och+3AREk30jvJO+UQwqV1WW0h5pC3fLmsERJBPoRU2WJc8CemA0heKT+1Q4pncI64DitchrOlUJZjRz2nROOAvNq3A1/ujafpHLeER44WIKnOdn69NY9ITvRDSLkAotcRZ2HPQosE8vQL6xQncnMCXr5m7opb4YTcQ= ARC-Message-Signature:i=1; a=rsa-sha256; d=subspace.kernel.org; s=arc-20240116; t=1725932578; c=relaxed/simple; bh=+/PCdjNe2dfM+dCZ+mpa07o3EmJ3SWGbRgQEt41wOVk=; h=Date:In-Reply-To:Mime-Version:References:Message-ID:Subject:From: To:Cc:Content-Type; b=gLM5t0ryRo48Wy/M//Bl7DeosX2aYvt/bfzb+9cY7/elcuvFMXzo4dF+1VERj231loThwtXltHeftVwyxHBrr3n9aizCSFLG3e+p8k5pnUNVCAIA0g1oAS7Z6MKiJ44I2IdLpQj1C8v+wk6si8IU7qhxauRWMFL0plGsIvrjq1I= ARC-Authentication-Results:i=1; smtp.subspace.kernel.org; dmarc=pass (p=reject dis=none) header.from=google.com; spf=pass smtp.mailfrom=flex--seanjc.bounces.google.com; dkim=pass (2048-bit key) header.d=google.com header.i=@google.com header.b=SJRu183N; arc=none smtp.client-ip=209.85.219.201 Authentication-Results: smtp.subspace.kernel.org; dmarc=pass (p=reject dis=none) header.from=google.com Authentication-Results: smtp.subspace.kernel.org; spf=pass smtp.mailfrom=flex--seanjc.bounces.google.com Authentication-Results: smtp.subspace.kernel.org; dkim=pass (2048-bit key) header.d=google.com header.i=@google.com header.b="SJRu183N" Received: by mail-yb1-f201.google.com with SMTP id 3f1490d57ef6-e02a4de4f4eso11424823276.1 for ; Mon, 09 Sep 2024 18:42:56 -0700 (PDT) DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=google.com; s=20230601; t=1725932576; x=1726537376; darn=vger.kernel.org; h=content-transfer-encoding:cc:to:from:subject:message-id:references :mime-version:in-reply-to:date:from:to:cc:subject:date:message-id :reply-to; bh=s3ZJS4NCqs2daZcr/ed8Z/zqWEoFadweNh0PNCQlatA=; b=SJRu183NRVz7rbPeE6UQx7xwvsV4MNRoOXgiG6Jc3TqrgVx7zHEx8Ckx8l8aXQwI3s Wx1aHOupatPA2/RiVGDH65L8p4jY/uqXK6FDfoqaEntsj46NYW1OC5UV2kD/X+M8ncCh BcAC+GIxWvx3MLSWJm54P6In/n9YXwgz4KbuJ3V1FO/GhJcwQ1/UHhapQM58K8yloqaM Iag7A7AX807oku7o7MX6WcrYCnLWkrdmCBw+tQ8jPoKqVIvfsJshXf1vEbXgABuEWSB8 VkGSFtV+emOSSLGe0E5pnMw5FdnsqcNdqjhaZlf5zrajc2gPtrRvFxXHc8HchkOT4O7C bCcQ== X-Google-DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=1e100.net; s=20230601; t=1725932576; x=1726537376; h=content-transfer-encoding:cc:to:from:subject:message-id:references :mime-version:in-reply-to:date:x-gm-message-state:from:to:cc:subject :date:message-id:reply-to; bh=s3ZJS4NCqs2daZcr/ed8Z/zqWEoFadweNh0PNCQlatA=; b=tzMG6Rj7xWPLoeaI/8pmVCmiz6dLgVq3LAOdlZwQWdp8FXk/mT8nfVLogl/D2h26Vy XOt1dj9A5LGb6SEKAkrJJMHifWwMDw+g2/p4yt0GHlm0L7TlCEnND+UOgx1qvhiBoYrX 9AocFLcASsbpxpB/QzSpcuSGEBeVcGE0tJS3VDCHL5zfFYT4tRVoY6c1xeorHOAFBnkU 8Ukx4QVcR1yR/+00ULv0d9APjgbgmK6da1H7VtSRGWzfXCcaorTdh2kZhCfTK8nr5V9f SNP7Lq8PESA12l8r8ulqM1YAV4QAvMJuRJ+cfaiHk6N8tt/j6ovjZs7OV+q2dTC9cUzs APmQ== X-Forwarded-Encrypted: i=1; AJvYcCULkLFtpP0KQndmiAmJELHFcIzZlh7SqjN9yvr04kAcrozzTwPhW6yzqib69Y9tdlejwgw=@vger.kernel.org X-Gm-Message-State: AOJu0YwQ5SZgWnfNbwTKvXo3H/wgsgd6j6hCmTOEb0k50Rza3HxPCxkw s7rgHjKwhOJWfzvldwQXMcVGMQTjTtiFxIcvtBZ1v4lSKAoB+zp2k53ntKzG32KpIq/ZoXO+c1y vSw== X-Google-Smtp-Source: AGHT+IHR0cISWbNSGYzTHxwwKccQKA3RLZdDOFknSYtGDe+7IRQ53sx+JM4pizATZulD7JxEZowAqbSaKqY= X-Received: from zagreus.c.googlers.com ([fda3:e722:ac3:cc00:7f:e700:c0a8:5c37]) (user=seanjc job=sendgmr) by 2002:a25:5f08:0:b0:e0b:f1fd:1375 with SMTP id 3f1490d57ef6-e1d34a1fb19mr19944276.10.1725932575899; Mon, 09 Sep 2024 18:42:55 -0700 (PDT) Date: Mon, 9 Sep 2024 18:42:54 -0700 In-Reply-To: Precedence: bulk X-Mailing-List: kvm@vger.kernel.org List-Id: List-Subscribe: List-Unsubscribe: Mime-Version: 1.0 References: <20240809194335.1726916-1-seanjc@google.com> <20240809194335.1726916-20-seanjc@google.com> Message-ID: Subject: Re: [PATCH 19/22] KVM: x86/mmu: Add infrastructure to allow walking rmaps outside of mmu_lock From: Sean Christopherson To: James Houghton Cc: Paolo Bonzini , kvm@vger.kernel.org, linux-kernel@vger.kernel.org, Oliver Upton , Marc Zyngier , Peter Xu Content-Type: text/plain; charset="utf-8" Content-Transfer-Encoding: quoted-printable On Mon, Sep 09, 2024, James Houghton wrote: > On Mon, Sep 9, 2024 at 3:02=E2=80=AFPM Sean Christopherson wrote: > > > > On Mon, Sep 09, 2024, James Houghton wrote: > > > On Mon, Sep 9, 2024 at 1:02=E2=80=AFPM Sean Christopherson wrote: > > > > Hmm, no, not smp_rmb(). If anything, the appropriate barrier here = would be > > > > smp_mb__after_spinlock(), but I'm pretty sure even that is misleadi= ng, and arguably > > > > even wrong. > > > > > > I don't think smp_mb__after_spinlock() is right either. This seems to > > > be used following the acquisition of a spinlock to promote the memory > > > ordering from an acquire barrier (that is implicit with the lock > > > acquisition, e.g. [1]) to a full barrier. IIUC, we have no need for a > > > stronger-than-usual barrier. But I guess I'm not really sure. > > > > > > In this case, I'm complaining that we don't have the usual memory > > > ordering restrictions that come with a spinlock. > > > > What makes you think that? >=20 > Ok I was under the impression that try_cmpxchg() did not carry memory > ordering guarantees (i.e., I thought it was try_cmpxchg_relaxed()). > Sorry.... >=20 > So the way I would write this is with try_cmpxchg_relaxed() in the > loop and then smp_rmb() after we break out of the loop, at least for > the `old_val !=3D 0` case. Kinda like this [4]. >=20 > Did you really want try_cmpxchg(), not try_cmpxchg_relaxed()? Heh, why would I care? On x86, they are the same thing. Yeah, I could tea= se out some subtle difference to super precisely describe KVM's behavior, but that= more or less defeats the value proposition of total-store ordered architectures:= take on more complexity in hardware (and possibly loss in performance) to allow = for simpler software. > Just comparing against bit_spin_lock, test_and_set_bit_lock() > documents the requirement for acquire barrier semantics[5]. Again, that's generic library code. > [4]: https://elixir.bootlin.com/linux/v6.10.9/source/kernel/locking/rtmut= ex.c#L253 > [5]: https://elixir.bootlin.com/linux/v6.10.9/source/include/asm-generic/= bitops/instrumented-lock.h#L51 ... > > And jumping back to the "we don't have the usual memory ordering > > restrictions that come with a spinlock", x86's virt_spin_lock() uses > > atomic_try_cmpxchg(). So while using the acquire variant here is obvio= usly > > not wrong, it also feels somewhat weird. >=20 > Yeah that's fine. atomic_try_cmpxchg() is at least as strong as > atomic_try_cmpxchg_acquire(), so there is no issue. >=20 > But if virt_spin_lock() were written to use atomic_try_cmpxchg_relaxed() = (and > nothing else) instead, then you'd complain right? I'd complain because someone made me think hard for no reason. :-) > It would work on x86 (I think?), but it's not written properly! That's > basically what I'm saying in this thread. >=20 > > Though some of that is undoubtedly due to explicit "acquire" semantics > > being rather rare in x86. > > > > > > And for the SPTEs, there is no required ordering. The reader (agin= g > > > > thread) can observe a !PRESENT or a PRESENT SPTE, and must be prepa= red for > > > > either. I.e. there is no requirement that the reader observe a PRE= SENT > > > > SPTE if there is a valid rmap. > > > > > > This makes sense. > > > > > > > So, unless I'm missing something, I would prefer to not add a smp_m= b__after_spinlock(), > > > > even though it's a nop on x86 (unless KCSAN_WEAK_MEMORY=3Dy), becau= se it suggests > > > > an ordering requirement that doesn't exist. > > > > > > So we have: the general kvm_rmap_lock() and the read-only > > > kvm_rmap_lock_readonly(), as introduced by the next patch[2]. I'll us= e > > > those names (sorry if it's confusing). > > > > > > For kvm_rmap_lock(), we are always holding mmu_lock for writing. So > > > any changes we make to the rmap will be properly published to other > > > threads that subsequently grab kvm_rmap_lock() because we had to > > > properly release and then re-acquire mmu_lock, which comes with the > > > barriers I'm saying we need. > > > > > > For kvm_rmap_lock_readonly(), we don't hold mmu_lock, so there is no > > > smp_rmb() or equivalent. Without an smp_rmb() somewhere, I claim that > > > it is possible that there may observe external changes to the > > > pte_list_desc while we are in this critical section (for a > > > sufficiently weak architecture). The changes that the kvm_rmap_lock() > > > (mmu_lock) side made were half-published with an smp_wmb() (really > > > [3]), but the read side didn't use a load-acquire or smp_rmb(), so it > > > hasn't held up its end of the deal. > > > > > > I don't think READ_ONCE() has the guarantees we need to be a > > > sufficient replacement for smp_rmb() or a load-acquire that a real > > > lock would use, although I agree with you that, on arm64, it > > > apparently *is* a sufficient replacement. > > > > > > Now this isn't a problem if the kvm_rmap_lock_readonly() side can > > > tolerate changes to pte_list_desc while in the critical section. I > > > don't think this is true (given for_each_rmap_spte_lockless), > > > therefore an smp_rmb() or equivalent is (technically) needed. > > > > > > Am I confused? > > > > Yes, I think so. kvm_rmap_lock_readonly() creates a critical section t= hat prevents > > any pte_list_desc changes. rmap_head->val, and every pte_list_desc tha= t is pointed > > at by rmap_head->val in the KVM_RMAP_MULTI case, is protected and canno= t change. >=20 > I take back what I said about this working on x86. I think it's > possible for there to be a race. >=20 > Say... >=20 > 1. T1 modifies pte_list_desc then unlocks kvm_rmap_unlock(). > 2. T2 then locks kvm_rmap_lock_readonly(). >=20 > The modifications that T1 has made are not guaranteed to be visible to > T2 unless T1 has an smp_wmb() (or equivalent) after the modfication > and T2 has an smp_rmb() before reading the data. >=20 > Now the way you had it, T2, because it uses try_cmpxchg() with full > ordering, will effectively do a smp_rmb(). But T1 only does an > smp_wmb() *after dropping the mmu_lock*, so there is a race. While T1 > still holds the mmu_lock but after releasing the kvm_rmap_lock(), T2 > may enter its critical section and then *later* observe the changes > that T1 made. >=20 > Now this is impossible on x86 (IIUC) if, in the compiled list of > instructions, T1's writes occur in the same order that we have written > them in C. I'm not sure if WRITE_ONCE guarantees that this reordering > at compile time is forbidden. >=20 > So what I'm saying is: >=20 > 1. kvm_rmap_unlock() must have an smp_wmb(). No, because beating a dead horse, this is not generic code, this is x86.=20 > 2. If you change kvm_rmap_lock() to use try_cmpxchg_relaxed() (which > is what I think you want),=20 No, I don't. If this were generic code, then it would be a different conve= rsation, but using a "relaxed" atomic in x86 specific code is nonsensical, because s= uch an operation simply does not exist in the world of x86. There are memory oper= ations that have relaxed ordering, but none of them are atomic; they are very much= the exact opposite of atomic. E.g. the only references to _relaxed that you'll find in x86 are to overrid= e the generics to x86's non-relaxed semantics. $ git grep _relaxed arch/x86 arch/x86/include/asm/io.h:#define readb_relaxed(a) __readb(a) arch/x86/include/asm/io.h:#define readw_relaxed(a) __readw(a) arch/x86/include/asm/io.h:#define readl_relaxed(a) __readl(a) arch/x86/include/asm/io.h:#define writeb_relaxed(v, a) __writeb(v, a) arch/x86/include/asm/io.h:#define writew_relaxed(v, a) __writew(v, a) arch/x86/include/asm/io.h:#define writel_relaxed(v, a) __writel(v, a) arch/x86/include/asm/io.h:#define readq_relaxed(a) __readq(a) arch/x86/include/asm/io.h:#define writeq_relaxed(v, a) __writeq(v, a) There are a few places in KVM x86 where _acquire() and _release() variants = are used, but that's done purely to document the roles and who is doing what, a= nd almost always (maybe always?) when there are lockless programming shenaniga= ns in play. E.g. kvm_recalculate_apic_map() is probably the closest example, where KVM = uses atomic_read_acquire(), atomic_cmpxchg_acquire(), and atomic_cmpxchg_release= () to document the ordering between marking the map as dirty and actually process= ing the map. But that is still quite different, as apic_map_dirty is not a spi= nlock, there isn't a direct data/address dependency between apic_map_dirty and all= of the data it consumes, _and_ the data that is guarded by apic_map_dirty can = be modified by other vCPUs _while it is being processed_. Hmm, actually, sev_lock_two_vms() is arguably a better comparison. That's = also a rudimentary spinlock (try-lock only behavior). If kvm_rmap_head.val were an int, i.e. could be unionized with an atomic_t,= then I wouldn't be opposed to doing this in the locking code to document things: s/READ_ONCE/atomic_read_acquire s/WRITE_ONCE/atomic_set_release s/try_cmpxchg/atomic_cmpxchg_acquire But it's an "unsigned long", and so trying to massage it into the above isn= 't a net positive for me. This is gnarly x86 specific code; it's very unlikely = that someone will be able to understand the rest of the rmap code, but won't be = able to understand that the code is safe due to x86's strong memory ordering. > then you must also have an smp_rmb() following a successful > cmpxchg/acquisition (at least for the case where we then follow the > pte_list_desc pointer).