From mboxrd@z Thu Jan 1 00:00:00 1970
Date: Tue, 9 Jul 2024 10:49:34 -0700
Subject: Re: [PATCH v5 4/9] mm: Add test_clear_young_fast_only MMU notifier
From: Sean Christopherson
To: James Houghton
Cc: Yu Zhao, Andrew Morton, Paolo Bonzini, Ankit Agrawal, Axel Rasmussen,
	Catalin Marinas, David Matlack, David Rientjes, James Morse,
	Jonathan Corbet, Marc Zyngier, Oliver Upton, Raghavendra Rao Ananta,
	Ryan Roberts, Shaoqin Huang, Suzuki K Poulose, Wei Xu, Will Deacon,
	Zenghui Yu, kvmarm@lists.linux.dev, kvm@vger.kernel.org,
	linux-arm-kernel@lists.infradead.org, linux-doc@vger.kernel.org,
	linux-kernel@vger.kernel.org, linux-mm@kvack.org

On Mon, Jul 08, 2024, James Houghton wrote:
> On Fri, Jun 28, 2024 at 7:38 PM James Houghton wrote:
> >
> > On Mon, Jun 17, 2024 at 11:37 AM Sean Christopherson wrote:
> > >
> > > On Mon, Jun 17, 2024, James Houghton wrote:
> > > > On Fri, Jun 14, 2024 at 4:17 PM Sean Christopherson wrote:
> > > > > Ooh!  Actually, after fiddling a bit to see how feasible fast-aging in the shadow
> > > > > MMU would be, I'm pretty sure we can go straight there for nested TDP.  Or rather,
> > > > > I suspect/hope we can get close enough for an initial merge, which would allow
> > > > > aging_is_fast to be a property of the mmu_notifier, i.e. would simplify things
> > > > > because KVM wouldn't need to communicate MMU_NOTIFY_WAS_FAST for each notification.
> > > > >
> > > > > Walking KVM's rmaps requires mmu_lock because adding/removing rmap entries is done
> > > > > in such a way that a lockless walk would be painfully complex.  But if there is
> > > > > exactly _one_ rmap entry for a gfn, then slot->arch.rmap[...] points directly at
> > > > > that one SPTE.  And with nested TDP, unless L1 is doing something uncommon, e.g.
> > > > > mapping the same page into multiple L2s, the overwhelming majority of rmaps
> > > > > have only one entry.  That's not the case for legacy shadow paging because kernels
> > > > > almost always map a pfn using multiple virtual addresses, e.g. Linux's direct map
> > > > > along with any userspace mappings.
> >
> > Hi Sean, sorry for taking so long to get back to you.
> >
> > So just to make sure I have this right: if L1 is using TDP, the gfns
> > in L0 will usually only be mapped by a single spte. If L1 is not using
> > TDP, then all bets are off. Is that true?
> >
> > If that is true, given that we don't really have control over whether
> > or not L1 decides to use TDP, the lockless shadow MMU walk will work,
> > but, if L1 is not using TDP, it will often return false negatives
> > (says "old" for an actually-young gfn). So then I don't really
> > understand conditioning the lockless shadow MMU walk on us (L0) using
> > the TDP MMU[1]. We care about L1, right?
>
> Ok I think I understand now. If L1 is using shadow paging, L2 is
> accessing memory the same way L1 would, so we use the TDP MMU at L0
> for this case (if tdp_mmu_enabled). If L1 is using TDP, then we must
> use the shadow MMU, so that's the interesting case.

Yep.

> > (Maybe you're saying that, when the TDP MMU is enabled, the only cases
> > where the shadow MMU is used are cases where gfns are practically
> > always mapped by a single shadow PTE.
> > This isn't how I understood your
> > mail, but this is what your hack-a-patch[1] makes me think.)
>
> So it appears that this interpretation is actually what you meant.

Yep.

> > [1] https://lore.kernel.org/linux-mm/ZmzPoW7K5GIitQ8B@google.com/
> >
> > > > ...
> > >
> > > > Hmm, interesting. I need to spend a little bit more time digesting this.
> > > >
> > > > Would you like to see this included in v6? (It'd be nice to avoid the
> > > > WAS_FAST stuff....) Should we leave it for a later series? I haven't
> > > > formed my own opinion yet.
> > >
> > > I would say it depends on the viability and complexity of my idea.  E.g. if it
> > > pans out more or less like my rough sketch, then it's probably worth taking on
> > > the extra code+complexity in KVM to avoid the whole WAS_FAST goo.
> > >
> > > Note, if we do go this route, the implementation would need to be tweaked to
> > > handle the difference in behavior between aging and last-minute checks for eviction,
> > > which I obviously didn't understand when I threw together that hack-a-patch.
> > >
> > > I need to think more about how best to handle that though, e.g. skipping GFNs with
> > > multiple mappings is probably the worst possible behavior, as we'd risk evicting
> > > hot pages.  But falling back to taking mmu_lock for write isn't all that desirable
> > > either.
> >
> > I think falling back to the write lock is more desirable than evicting
> > a young page.
> >
> > I've attached what I think could work, a diff on top of this series.
> > It builds at least. It uses rcu_read_lock/unlock() for
> > walk_shadow_page_lockless_begin/end(NULL), and it puts a
> > synchronize_rcu() in kvm_mmu_commit_zap_page().
> >
> > It doesn't get rid of the WAS_FAST things because it doesn't do
> > exactly what [1] does. It basically makes three calls now: lockless
> > TDP MMU, lockless shadow MMU, locked shadow MMU. It only calls the
> > locked shadow MMU bits if the lockless bits say !young (instead of
> > being conditioned on tdp_mmu_enabled). My choice is definitely
> > questionable for the clear path.
>
> I still don't think we should get rid of the WAS_FAST stuff.

I do :-)

> The assumption that the L1 VM will almost never share pages between L2
> VMs is questionable. The real question becomes: do we care to have
> accurate age information for this case? I think so.

I think you're conflating two different things.  WAS_FAST isn't about accuracy,
it's about supporting lookaround in conditionally fast secondary MMUs.

Accuracy only comes into play when we're talking about the last-minute check,
which, IIUC, has nothing to do with WAS_FAST because any potential lookaround
has already been performed.

> It's not completely trivial to get the lockless walking of the shadow
> MMU rmaps correct either (please see the patch I attached here[1]).

Heh, it's not correct.  Invoking synchronize_rcu() in kvm_mmu_commit_zap_page()
is illegal, as mmu_lock (rwlock) is held and synchronize_rcu() might_sleep().

For kvm_test_age_rmap_fast(), KVM can blindly read READ_ONCE(*sptep).  KVM might
read garbage, but that would be an _extremely_ rare scenario, and reporting a
zapped page as being young is acceptable in that 1 in a billion situation.
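
Something like this bare-bones sketch, i.e. what the blind read could look
like.  To be clear, kvm_test_age_rmap_fast() here is purely illustrative (the
name comes from your attached diff, not this series), and it elides the
slot/gfn/level plumbing a real rmap handler needs; is_accessed_spte() is KVM's
existing helper.

static bool kvm_test_age_rmap_fast(struct kvm_rmap_head *rmap_head)
{
	unsigned long val = READ_ONCE(rmap_head->val);

	/* Empty rmap, or multiple SPTEs (bit 0 set): punt to the locked path. */
	if (!val || (val & 1))
		return false;

	/*
	 * The SPTE can be zapped, and its page table freed, at any time.
	 * Reading garbage is possible but harmless here: the worst case is
	 * reporting a zapped page as young.
	 */
	return is_accessed_spte(READ_ONCE(*(u64 *)val));
}
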
For kvm_age_rmap_fast(), i.e. where KVM needs to write, I'm pretty sure KVM can
handle that by rechecking the rmap and using CMPXCHG to write the SPTE.  If the
rmap is unchanged, then the old SPTE value is guaranteed to be valid, in the
sense that its value most definitely came from a KVM shadow page table.  Ah,
drat, that won't work, because very theoretically, the page table could be
freed, reallocated, and rewritten with the exact same value by something other
than KVM.  Hrm.

Looking more closely, I think we can go straight to supporting rmap walks outside
of mmu_lock.  There will still be a "lock", but it will be a *very* rudimentary
lock, akin to the TDP MMU's REMOVED_SPTE approach.  Bit 0 of rmap_head->val is
used to indicate "many", while bits 63:3/31:2 on 64-bit/32-bit KVM hold the
pointer (to a SPTE or a list).  That means bit 1 is available for shenanigans.

If we use bit 1 to lock the rmap, then the fast mmu_notifier can safely walk the
entire rmap chain.  And with a reader/writer scheme, the rmap walks that are
performed under mmu_lock don't need to lock the rmap, which means flows like
kvm_mmu_zap_collapsible_spte() don't need to be modified to avoid recursive
self-deadlock.  Lastly, the locking can be conditioned on the rmap being valid,
i.e. having at least one SPTE.  That way the common case of a gfn not having
any rmaps is a glorified nop.

Adding the locking isn't actually all that difficult, with the *huge* caveat
that the below patch is compile-tested only.  The vast majority of the churn is
to make it so existing code ignores the new KVM_RMAP_LOCKED bit.

I don't know that we should pursue such an approach in this series unless we
have to.  E.g. if we can avoid WAS_FAST or don't have to carry too much
intermediate complexity, then it'd probably be better to land the TDP MMU
support first and then add nested TDP support later.

At the very least, it does make me more confident that a fast walk of the rmaps
is very doable (at least for nested TDP), i.e. makes me even more steadfast
against adding WAS_FAST.

> And the WAS_FAST functionality isn't even that complex to begin with.

I agree the raw code isn't terribly complex, but it's not trivial either.  And the
concept and *behavior* is complex, which is just as much of a maintenance burden
as the code itself.  E.g. it requires knowing that KVM has multiple MMUs buried
behind a single mmu_notifier, and that a "hit" on the fast MMU will trigger
lookaround on the fast MMU, but not the slow MMU.  Understanding and describing
the implications of that behavior isn't easy.  E.g. if GFN=X is young in the TDP
MMU, but X+1..X+N are young only in the shadow MMU, is doing lookaround and making
decisions based purely on the TDP MMU state the "right" behavior?

I also really don't like bleeding KVM details into the mmu_notifier APIs.  The
need for WAS_FAST is 100% a KVM limitation.  AFAIK, no other secondary MMU has
multiple MMU implementations active behind a single notifier, and other than
lack of support, nothing fundamentally prevents a fast query in the shadow MMU.

---
 arch/x86/kvm/mmu/mmu.c | 163 ++++++++++++++++++++++++++++++++---------
 1 file changed, 128 insertions(+), 35 deletions(-)

diff --git a/arch/x86/kvm/mmu/mmu.c b/arch/x86/kvm/mmu/mmu.c
index 842a3a4cdfe9..bfcfdc0a8600 100644
--- a/arch/x86/kvm/mmu/mmu.c
+++ b/arch/x86/kvm/mmu/mmu.c
@@ -935,9 +935,59 @@ static struct kvm_memory_slot *gfn_to_memslot_dirty_bitmap(struct kvm_vcpu *vcpu
  * About rmap_head encoding:
  *
  * If the bit zero of rmap_head->val is clear, then it points to the only spte
Otherwise, (rmap_head->val & ~1) points to a struct + * in this rmap chain. Otherwise, (rmap_head->val & ~3) points to a struct * pte_list_desc containing more mappings. */ +#define KVM_RMAP_MANY BIT(0) +#define KVM_RMAP_LOCKED BIT(1) + +static unsigned long kvm_rmap_lock(struct kvm_rmap_head *rmap_head) +{ + unsigned long old_val, new_val; + + old_val =3D READ_ONCE(rmap_head->val); + if (!old_val) + return 0; + + do { + while (old_val & KVM_RMAP_LOCKED) { + old_val =3D READ_ONCE(rmap_head->val); + cpu_relax(); + } + if (!old_val) + return 0; + + new_val =3D old_val | KVM_RMAP_LOCKED; + } while (!try_cmpxchg(&rmap_head->val, &old_val, new_val)); + + return old_val; +} + +static unsigned long kvm_rmap_write_lock(struct kvm_rmap_head *rmap_head) +{ + return kvm_rmap_lock(rmap_head); +} + +static void kvm_rmap_write_ulock(struct kvm_rmap_head *rmap_head, + unsigned long new_val) +{ + WARN_ON_ONCE(new_val & KVM_RMAP_LOCKED); + WRITE_ONCE(rmap_head->val, new_val); +} + +static unsigned long kvm_rmap_read_lock(struct kvm_rmap_head *rmap_head) +{ + return kvm_rmap_lock(rmap_head); +} + +static void kvm_rmap_read_unlock(struct kvm_rmap_head *rmap_head, + unsigned long old_val) +{ + if (!old_val) + return; + + WRITE_ONCE(rmap_head->val, old_val & ~KVM_RMAP_LOCKED); +} =20 /* * Returns the number of pointers in the rmap chain, not counting the new = one. @@ -945,21 +995,24 @@ static struct kvm_memory_slot *gfn_to_memslot_dirty_b= itmap(struct kvm_vcpu *vcpu static int pte_list_add(struct kvm_mmu_memory_cache *cache, u64 *spte, struct kvm_rmap_head *rmap_head) { + unsigned long old_val, new_val; struct pte_list_desc *desc; int count =3D 0; =20 - if (!rmap_head->val) { - rmap_head->val =3D (unsigned long)spte; - } else if (!(rmap_head->val & 1)) { + old_val =3D kvm_rmap_write_lock(rmap_head); + + if (!old_val) { + new_val =3D (unsigned long)spte; + } else if (!(old_val & KVM_RMAP_MANY)) { desc =3D kvm_mmu_memory_cache_alloc(cache); - desc->sptes[0] =3D (u64 *)rmap_head->val; + desc->sptes[0] =3D (u64 *)old_val; desc->sptes[1] =3D spte; desc->spte_count =3D 2; desc->tail_count =3D 0; - rmap_head->val =3D (unsigned long)desc | 1; + new_val =3D (unsigned long)desc | KVM_RMAP_MANY; ++count; } else { - desc =3D (struct pte_list_desc *)(rmap_head->val & ~1ul); + desc =3D (struct pte_list_desc *)(old_val & ~KVM_RMAP_MANY); count =3D desc->tail_count + desc->spte_count; =20 /* @@ -968,21 +1021,25 @@ static int pte_list_add(struct kvm_mmu_memory_cache = *cache, u64 *spte, */ if (desc->spte_count =3D=3D PTE_LIST_EXT) { desc =3D kvm_mmu_memory_cache_alloc(cache); - desc->more =3D (struct pte_list_desc *)(rmap_head->val & ~1ul); + desc->more =3D (struct pte_list_desc *)(old_val & ~KVM_RMAP_MANY); desc->spte_count =3D 0; desc->tail_count =3D count; - rmap_head->val =3D (unsigned long)desc | 1; + new_val =3D (unsigned long)desc | KVM_RMAP_MANY; + } else { + new_val =3D old_val; } desc->sptes[desc->spte_count++] =3D spte; } + + kvm_rmap_write_ulock(rmap_head, new_val); + return count; } =20 -static void pte_list_desc_remove_entry(struct kvm *kvm, - struct kvm_rmap_head *rmap_head, +static void pte_list_desc_remove_entry(struct kvm *kvm, unsigned long *rma= p_val, struct pte_list_desc *desc, int i) { - struct pte_list_desc *head_desc =3D (struct pte_list_desc *)(rmap_head->v= al & ~1ul); + struct pte_list_desc *head_desc =3D (struct pte_list_desc *)(*rmap_val & = ~KVM_RMAP_MANY); int j =3D head_desc->spte_count - 1; =20 /* @@ -1009,9 +1066,9 @@ static void pte_list_desc_remove_entry(struct kvm *kv= m, * head at the 
 	if (!head_desc->more)
-		rmap_head->val = 0;
+		*rmap_val = 0;
 	else
-		rmap_head->val = (unsigned long)head_desc->more | 1;
+		*rmap_val = (unsigned long)head_desc->more | KVM_RMAP_MANY;
 	mmu_free_pte_list_desc(head_desc);
 }
 
@@ -1019,24 +1076,26 @@ static void pte_list_remove(struct kvm *kvm, u64 *spte,
 			    struct kvm_rmap_head *rmap_head)
 {
 	struct pte_list_desc *desc;
+	unsigned long rmap_val;
 	int i;
 
-	if (KVM_BUG_ON_DATA_CORRUPTION(!rmap_head->val, kvm))
-		return;
+	rmap_val = kvm_rmap_write_lock(rmap_head);
+	if (KVM_BUG_ON_DATA_CORRUPTION(!rmap_val, kvm))
+		goto out;
 
-	if (!(rmap_head->val & 1)) {
-		if (KVM_BUG_ON_DATA_CORRUPTION((u64 *)rmap_head->val != spte, kvm))
-			return;
+	if (!(rmap_val & KVM_RMAP_MANY)) {
+		if (KVM_BUG_ON_DATA_CORRUPTION((u64 *)rmap_val != spte, kvm))
+			goto out;
 
-		rmap_head->val = 0;
+		rmap_val = 0;
 	} else {
-		desc = (struct pte_list_desc *)(rmap_head->val & ~1ul);
+		desc = (struct pte_list_desc *)(rmap_val & ~KVM_RMAP_MANY);
 		while (desc) {
 			for (i = 0; i < desc->spte_count; ++i) {
 				if (desc->sptes[i] == spte) {
-					pte_list_desc_remove_entry(kvm, rmap_head,
+					pte_list_desc_remove_entry(kvm, &rmap_val,
 								   desc, i);
-					return;
+					goto out;
 				}
 			}
 			desc = desc->more;
@@ -1044,6 +1103,9 @@ static void pte_list_remove(struct kvm *kvm, u64 *spte,
 
 		KVM_BUG_ON_DATA_CORRUPTION(true, kvm);
 	}
-}
+
+out:
+	kvm_rmap_write_unlock(rmap_head, rmap_val);
+}
 
 static void kvm_zap_one_rmap_spte(struct kvm *kvm,
@@ -1058,17 +1120,19 @@ static bool kvm_zap_all_rmap_sptes(struct kvm *kvm,
 				   struct kvm_rmap_head *rmap_head)
 {
 	struct pte_list_desc *desc, *next;
+	unsigned long rmap_val;
 	int i;
 
-	if (!rmap_head->val)
+	rmap_val = kvm_rmap_write_lock(rmap_head);
+	if (!rmap_val)
 		return false;
 
-	if (!(rmap_head->val & 1)) {
-		mmu_spte_clear_track_bits(kvm, (u64 *)rmap_head->val);
+	if (!(rmap_val & KVM_RMAP_MANY)) {
+		mmu_spte_clear_track_bits(kvm, (u64 *)rmap_val);
 		goto out;
 	}
 
-	desc = (struct pte_list_desc *)(rmap_head->val & ~1ul);
+	desc = (struct pte_list_desc *)(rmap_val & ~KVM_RMAP_MANY);
 
 	for (; desc; desc = next) {
 		for (i = 0; i < desc->spte_count; i++)
@@ -1078,20 +1142,21 @@ static bool kvm_zap_all_rmap_sptes(struct kvm *kvm,
 	}
 out:
 	/* rmap_head is meaningless now, remember to reset it */
-	rmap_head->val = 0;
+	kvm_rmap_write_unlock(rmap_head, 0);
 	return true;
 }
 
 unsigned int pte_list_count(struct kvm_rmap_head *rmap_head)
 {
+	unsigned long rmap_val = READ_ONCE(rmap_head->val) & ~KVM_RMAP_LOCKED;
 	struct pte_list_desc *desc;
 
-	if (!rmap_head->val)
+	if (!rmap_val)
 		return 0;
-	else if (!(rmap_head->val & 1))
+	else if (!(rmap_val & KVM_RMAP_MANY))
 		return 1;
 
-	desc = (struct pte_list_desc *)(rmap_head->val & ~1ul);
+	desc = (struct pte_list_desc *)(rmap_val & ~KVM_RMAP_MANY);
 	return desc->tail_count + desc->spte_count;
 }
 
@@ -1134,6 +1199,7 @@ static void rmap_remove(struct kvm *kvm, u64 *spte)
  */
 struct rmap_iterator {
 	/* private fields */
+	struct kvm_rmap_head *head;
 	struct pte_list_desc *desc;	/* holds the sptep if not NULL */
 	int pos;			/* index of the sptep */
 };
@@ -1148,18 +1214,19 @@ struct rmap_iterator {
 static u64 *rmap_get_first(struct kvm_rmap_head *rmap_head,
 			   struct rmap_iterator *iter)
 {
+	unsigned long rmap_val = READ_ONCE(rmap_head->val) & ~KVM_RMAP_LOCKED;
 	u64 *sptep;
 
-	if (!rmap_head->val)
+	if (!rmap_val)
 		return NULL;
 
-	if (!(rmap_head->val & 1)) {
+	if (!(rmap_val & KVM_RMAP_MANY)) {
 		iter->desc = NULL;
-		sptep = (u64 *)rmap_head->val;
+		sptep = (u64 *)rmap_val;
 		goto out;
 	}
 
-	iter->desc = (struct pte_list_desc *)(rmap_head->val & ~1ul);
+	iter->desc = (struct pte_list_desc *)(rmap_val & ~KVM_RMAP_MANY);
 	iter->pos = 0;
 	sptep = iter->desc->sptes[iter->pos];
 out:
@@ -1553,6 +1620,32 @@ static __always_inline bool kvm_handle_gfn_range(struct kvm *kvm,
 	return ret;
 }
 
+static __always_inline bool kvm_handle_gfn_range_lockless(struct kvm *kvm,
+							   struct kvm_gfn_range *range,
+							   rmap_handler_t handler)
+{
+	struct kvm_rmap_head *rmap_head;
+	unsigned long rmap_val;
+	bool ret = false;
+	gfn_t gfn;
+	int level;
+
+	for (gfn = range->start; gfn < range->end; gfn++) {
+		for (level = PG_LEVEL_4K; level <= KVM_MAX_HUGEPAGE_LEVEL; level++) {
+			rmap_head = gfn_to_rmap(gfn, level, range->slot);
+			rmap_val = kvm_rmap_read_lock(rmap_head);
+
+			if (rmap_val)
+				ret |= handler(kvm, rmap_head, range->slot, gfn, level);
+
+			kvm_rmap_read_unlock(rmap_head, rmap_val);
+		}
+	}
+
+	return ret;
+}
+
+
 bool kvm_unmap_gfn_range(struct kvm *kvm, struct kvm_gfn_range *range)
 {
 	bool flush = false;

base-commit: 771df9ffadb8204e61d3e98f36c5067102aab78f
-- 
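
For reference, this is roughly the kind of glue I'd expect on top, e.g. for the
test-young path.  Completely untested, and not part of the diff above; it
assumes KVM's existing kvm_test_age_rmap handler and kvm_tdp_mmu_test_age_gfn(),
and only falls back to the rmaps when the TDP MMU query comes up empty.

bool kvm_test_age_gfn(struct kvm *kvm, struct kvm_gfn_range *range)
{
	bool young = false;

	/* Query the TDP MMU first, it's always walked without mmu_lock. */
	if (tdp_mmu_enabled)
		young = kvm_tdp_mmu_test_age_gfn(kvm, range);

	/* Walk the rmaps, now safe without mmu_lock thanks to the rmap lock. */
	if (!young && kvm_memslots_have_rmaps(kvm))
		young = kvm_handle_gfn_range_lockless(kvm, range,
						      kvm_test_age_rmap);

	return young;
}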