Date: Tue, 9 Jul 2024 10:49:34 -0700
Subject: Re: [PATCH v5 4/9] mm: Add test_clear_young_fast_only MMU notifier
From: Sean Christopherson
To: James Houghton
Cc: Yu Zhao, Andrew Morton, Paolo Bonzini, Ankit Agrawal, Axel Rasmussen,
	Catalin Marinas, David Matlack, David Rientjes, James Morse,
	Jonathan Corbet, Marc Zyngier, Oliver Upton, Raghavendra Rao Ananta,
	Ryan Roberts, Shaoqin Huang, Suzuki K Poulose, Wei Xu, Will Deacon,
	Zenghui Yu, kvmarm@lists.linux.dev, kvm@vger.kernel.org,
	linux-arm-kernel@lists.infradead.org, linux-doc@vger.kernel.org,
	linux-kernel@vger.kernel.org, linux-mm@kvack.org

On Mon, Jul 08, 2024, James Houghton wrote:
> On Fri, Jun 28, 2024 at 7:38 PM James Houghton wrote:
> >
> > On Mon, Jun 17, 2024 at 11:37 AM Sean Christopherson wrote:
> > >
> > > On Mon, Jun 17, 2024, James Houghton wrote:
> > > > On Fri, Jun 14, 2024 at 4:17 PM Sean Christopherson wrote:
> > > > > Ooh! Actually, after fiddling a bit to see how feasible fast-aging in the shadow
> > > > > MMU would be, I'm pretty sure we can go straight there for nested TDP. Or rather,
> > > > > I suspect/hope we can get close enough for an initial merge, which would allow
> > > > > aging_is_fast to be a property of the mmu_notifier, i.e. would simplify things
> > > > > because KVM wouldn't need to communicate MMU_NOTIFY_WAS_FAST for each notification.
> > > > >
> > > > > Walking KVM's rmaps requires mmu_lock because adding/removing rmap entries is done
> > > > > in such a way that a lockless walk would be painfully complex. But if there is
> > > > > exactly _one_ rmap entry for a gfn, then slot->arch.rmap[...] points directly at
> > > > > that one SPTE. And with nested TDP, unless L1 is doing something uncommon, e.g.
> > > > > mapping the same page into multiple L2s, the overwhelming majority of rmaps
> > > > > have only one entry. That's not the case for legacy shadow paging because kernels
> > > > > almost always map a pfn using multiple virtual addresses, e.g. Linux's direct map
> > > > > along with any userspace mappings.
> >
> > Hi Sean, sorry for taking so long to get back to you.
> >
> > So just to make sure I have this right: if L1 is using TDP, the gfns
> > in L0 will usually only be mapped by a single spte. If L1 is not using
> > TDP, then all bets are off. Is that true?
> >
> > If that is true, given that we don't really have control over whether
> > or not L1 decides to use TDP, the lockless shadow MMU walk will work,
> > but, if L1 is not using TDP, it will often return false negatives
> > (says "old" for an actually-young gfn). So then I don't really
> > understand conditioning the lockless shadow MMU walk on us (L0) using
> > the TDP MMU[1]. We care about L1, right?
>
> Ok I think I understand now. If L1 is using shadow paging, L2 is
> accessing memory the same way L1 would, so we use the TDP MMU at L0
> for this case (if tdp_mmu_enabled). If L1 is using TDP, then we must
> use the shadow MMU, so that's the interesting case.

Yep.

> > (Maybe you're saying that, when the TDP MMU is enabled, the only cases
> > where the shadow MMU is used are cases where gfns are practically
> > always mapped by a single shadow PTE. This isn't how I understood your
> > mail, but this is what your hack-a-patch[1] makes me think.)
>
> So it appears that this interpretation is actually what you meant.

Yep.

> > [1] https://lore.kernel.org/linux-mm/ZmzPoW7K5GIitQ8B@google.com/
> > > >
> > > > ...
> > >
> > > > Hmm, interesting. I need to spend a little bit more time digesting this.
> > > >
> > > > Would you like to see this included in v6? (It'd be nice to avoid the
> > > > WAS_FAST stuff....) Should we leave it for a later series? I haven't
> > > > formed my own opinion yet.
> > >
> > > I would say it depends on the viability and complexity of my idea. E.g. if it
> > > pans out more or less like my rough sketch, then it's probably worth taking on
> > > the extra code+complexity in KVM to avoid the whole WAS_FAST goo.
> > >
> > > Note, if we do go this route, the implementation would need to be tweaked to
> > > handle the difference in behavior between aging and last-minute checks for eviction,
> > > which I obviously didn't understand when I threw together that hack-a-patch.
> > >
> > > I need to think more about how best to handle that though, e.g. skipping GFNs with
> > > multiple mappings is probably the worst possible behavior, as we'd risk evicting
> > > hot pages. But falling back to taking mmu_lock for write isn't all that desirable
> > > either.
> >
> > I think falling back to the write lock is more desirable than evicting
> > a young page.
> >
> > I've attached what I think could work, a diff on top of this series.
> > It builds at least. It uses rcu_read_lock/unlock() for
> > walk_shadow_page_lockless_begin/end(NULL), and it puts a
> > synchronize_rcu() in kvm_mmu_commit_zap_page().
> >
> > It doesn't get rid of the WAS_FAST things because it doesn't do
> > exactly what [1] does. It basically makes three calls now: lockless
> > TDP MMU, lockless shadow MMU, locked shadow MMU. It only calls the
> > locked shadow MMU bits if the lockless bits say !young (instead of
> > being conditioned on tdp_mmu_enabled). My choice is definitely
> > questionable for the clear path.
>
> I still don't think we should get rid of the WAS_FAST stuff.

I do :-)

> The assumption that the L1 VM will almost never share pages between L2
> VMs is questionable. The real question becomes: do we care to have
> accurate age information for this case? I think so.

I think you're conflating two different things. WAS_FAST isn't about accuracy,
it's about supporting lookaround in conditionally fast secondary MMUs.
Accuracy only comes into play when we're talking about the last-minute check,
which, IIUC, has nothing to do with WAS_FAST because any potential lookaround
has already been performed.

> It's not completely trivial to get the lockless walking of the shadow
> MMU rmaps correct either (please see the patch I attached here[1]).

Heh, it's not correct. Invoking synchronize_rcu() in kvm_mmu_commit_zap_page()
is illegal, as mmu_lock (rwlock) is held and synchronize_rcu() might_sleep().

For kvm_test_age_rmap_fast(), KVM can blindly read READ_ONCE(*sptep). KVM might
read garbage, but that would be an _extremely_ rare scenario, and reporting a
zapped page as being young is acceptable in that 1 in a billion situation.

For kvm_age_rmap_fast(), i.e. where KVM needs to write, I'm pretty sure KVM can
handle that by rechecking the rmap and using CMPXCHG to write the SPTE. If the
rmap is unchanged, then the old SPTE value is guaranteed to be valid, in the
sense that its value most definitely came from a KVM shadow page table. Ah,
drat, that won't work, because very theoretically, the page table could be
freed, reallocated, and rewritten with the exact same value by something other
than KVM. Hrm.

Looking more closely, I think we can go straight to supporting rmap walks
outside of mmu_lock. There will still be a "lock", but it will be a *very*
rudimentary lock, akin to the TDP MMU's REMOVED_SPTE approach. Bit 0 of
rmap_head->val is used to indicate "many", while bits 63:3/31:2 on
64-bit/32-bit KVM hold the pointer (to a SPTE or a list). That means bit 1 is
available for shenanigans.

If we use bit 1 to lock the rmap, then the fast mmu_notifier can safely walk
the entire rmap chain. And with a reader/writer scheme, the rmap walks that
are performed under mmu_lock don't need to lock the rmap, which means flows
like kvm_mmu_zap_collapsible_spte() don't need to be modified to avoid
recursive self-deadlock. Lastly, the locking can be conditioned on the rmap
being valid, i.e. having at least one SPTE. That way the common case of a gfn
not having any rmaps is a glorified nop.

Adding the locking isn't actually all that difficult, with the *huge* caveat
that the below patch is compile-tested only. The vast majority of the churn is
to make it so existing code ignores the new KVM_RMAP_LOCKED bit.

I don't know that we should pursue such an approach in this series unless we
have to. E.g. if we can avoid WAS_FAST or don't have to carry too much
intermediate complexity, then it'd probably be better to land the TDP MMU
support first and then add nested TDP support later.

At the very least, it does make me more confident that a fast walk of the
rmaps is very doable (at least for nested TDP), i.e. makes me even more
steadfast against adding WAS_FAST.

> And the WAS_FAST functionality isn't even that complex to begin with.

I agree the raw code isn't terribly complex, but it's not trivial either. And
the concept and *behavior* is complex, which is just as much of a maintenance
burden as the code itself. E.g. it requires knowing that KVM has multiple MMUs
buried behind a single mmu_notifier, and that a "hit" on the fast MMU will
trigger lookaround on the fast MMU, but not the slow MMU. Understanding and
describing the implications of that behavior isn't easy. E.g. if GFN=X is
young in the TDP MMU, but X+1..X+N are young only in the shadow MMU, is doing
lookaround and making decisions based purely on the TDP MMU state the "right"
behavior?
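To make the "blindly READ_ONCE(*sptep)" idea above concrete, here is a rough,
illustrative sketch of what such a test-only helper could look like. The
helper name matches the one discussed above, is_accessed_spte() is KVM's
existing SPTE accessor, and the body is purely a sketch, not code from the
attached diff or from the patch below.

/*
 * Rough sketch only: a lockless, test-only age check for the common
 * single-entry rmap case.  Reading the SPTE without mmu_lock can race
 * with zapping and observe a stale value; falsely reporting a zapped
 * page as young is the acceptable "1 in a billion" outcome noted above.
 */
static bool kvm_test_age_rmap_fast(struct kvm_rmap_head *rmap_head)
{
	unsigned long val = READ_ONCE(rmap_head->val);

	/* Bit 0 set means "many" (a pte_list_desc), which isn't handled here. */
	if (!val || (val & 1))
		return false;

	/* A single-entry rmap stores the SPTE pointer directly in ->val. */
	return is_accessed_spte(READ_ONCE(*(u64 *)val));
}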
I also really don't like bleeding KVM details into the mmu_notifier APIs. The
need for WAS_FAST is 100% a KVM limitation. AFAIK, no other secondary MMU has
multiple MMU implementations active behind a single notifier, and other than
lack of support, nothing fundamentally prevents a fast query in the shadow MMU.

---
 arch/x86/kvm/mmu/mmu.c | 163 ++++++++++++++++++++++++++++++++---------
 1 file changed, 128 insertions(+), 35 deletions(-)

diff --git a/arch/x86/kvm/mmu/mmu.c b/arch/x86/kvm/mmu/mmu.c
index 842a3a4cdfe9..bfcfdc0a8600 100644
--- a/arch/x86/kvm/mmu/mmu.c
+++ b/arch/x86/kvm/mmu/mmu.c
@@ -935,9 +935,59 @@ static struct kvm_memory_slot *gfn_to_memslot_dirty_bitmap(struct kvm_vcpu *vcpu
  * About rmap_head encoding:
  *
  * If the bit zero of rmap_head->val is clear, then it points to the only spte
- * in this rmap chain. Otherwise, (rmap_head->val & ~1) points to a struct
+ * in this rmap chain. Otherwise, (rmap_head->val & ~3) points to a struct
  * pte_list_desc containing more mappings.
  */
+#define KVM_RMAP_MANY		BIT(0)
+#define KVM_RMAP_LOCKED		BIT(1)
+
+static unsigned long kvm_rmap_lock(struct kvm_rmap_head *rmap_head)
+{
+	unsigned long old_val, new_val;
+
+	old_val = READ_ONCE(rmap_head->val);
+	if (!old_val)
+		return 0;
+
+	do {
+		while (old_val & KVM_RMAP_LOCKED) {
+			old_val = READ_ONCE(rmap_head->val);
+			cpu_relax();
+		}
+		if (!old_val)
+			return 0;
+
+		new_val = old_val | KVM_RMAP_LOCKED;
+	} while (!try_cmpxchg(&rmap_head->val, &old_val, new_val));
+
+	return old_val;
+}
+
+static unsigned long kvm_rmap_write_lock(struct kvm_rmap_head *rmap_head)
+{
+	return kvm_rmap_lock(rmap_head);
+}
+
+static void kvm_rmap_write_ulock(struct kvm_rmap_head *rmap_head,
+				 unsigned long new_val)
+{
+	WARN_ON_ONCE(new_val & KVM_RMAP_LOCKED);
+	WRITE_ONCE(rmap_head->val, new_val);
+}
+
+static unsigned long kvm_rmap_read_lock(struct kvm_rmap_head *rmap_head)
+{
+	return kvm_rmap_lock(rmap_head);
+}
+
+static void kvm_rmap_read_unlock(struct kvm_rmap_head *rmap_head,
+				 unsigned long old_val)
+{
+	if (!old_val)
+		return;
+
+	WRITE_ONCE(rmap_head->val, old_val & ~KVM_RMAP_LOCKED);
+}
 
 /*
  * Returns the number of pointers in the rmap chain, not counting the new one.
@@ -945,21 +995,24 @@ static struct kvm_memory_slot *gfn_to_memslot_dirty_bitmap(struct kvm_vcpu *vcpu
 static int pte_list_add(struct kvm_mmu_memory_cache *cache, u64 *spte,
 			struct kvm_rmap_head *rmap_head)
 {
+	unsigned long old_val, new_val;
 	struct pte_list_desc *desc;
 	int count = 0;
 
-	if (!rmap_head->val) {
-		rmap_head->val = (unsigned long)spte;
-	} else if (!(rmap_head->val & 1)) {
+	old_val = kvm_rmap_write_lock(rmap_head);
+
+	if (!old_val) {
+		new_val = (unsigned long)spte;
+	} else if (!(old_val & KVM_RMAP_MANY)) {
 		desc = kvm_mmu_memory_cache_alloc(cache);
-		desc->sptes[0] = (u64 *)rmap_head->val;
+		desc->sptes[0] = (u64 *)old_val;
 		desc->sptes[1] = spte;
 		desc->spte_count = 2;
 		desc->tail_count = 0;
-		rmap_head->val = (unsigned long)desc | 1;
+		new_val = (unsigned long)desc | KVM_RMAP_MANY;
 		++count;
 	} else {
-		desc = (struct pte_list_desc *)(rmap_head->val & ~1ul);
+		desc = (struct pte_list_desc *)(old_val & ~KVM_RMAP_MANY);
 		count = desc->tail_count + desc->spte_count;
 
 		/*
@@ -968,21 +1021,25 @@ static int pte_list_add(struct kvm_mmu_memory_cache *cache, u64 *spte,
 		 */
 		if (desc->spte_count == PTE_LIST_EXT) {
 			desc = kvm_mmu_memory_cache_alloc(cache);
-			desc->more = (struct pte_list_desc *)(rmap_head->val & ~1ul);
+			desc->more = (struct pte_list_desc *)(old_val & ~KVM_RMAP_MANY);
 			desc->spte_count = 0;
 			desc->tail_count = count;
-			rmap_head->val = (unsigned long)desc | 1;
+			new_val = (unsigned long)desc | KVM_RMAP_MANY;
+		} else {
+			new_val = old_val;
 		}
 		desc->sptes[desc->spte_count++] = spte;
 	}
+
+	kvm_rmap_write_ulock(rmap_head, new_val);
+
 	return count;
 }
 
-static void pte_list_desc_remove_entry(struct kvm *kvm,
-				       struct kvm_rmap_head *rmap_head,
+static void pte_list_desc_remove_entry(struct kvm *kvm, unsigned long *rmap_val,
 				       struct pte_list_desc *desc, int i)
 {
-	struct pte_list_desc *head_desc = (struct pte_list_desc *)(rmap_head->val & ~1ul);
+	struct pte_list_desc *head_desc = (struct pte_list_desc *)(*rmap_val & ~KVM_RMAP_MANY);
 	int j = head_desc->spte_count - 1;
 
 	/*
@@ -1009,9 +1066,9 @@ static void pte_list_desc_remove_entry(struct kvm *kvm,
 	 * head at the next descriptor, i.e. the new head.
 	 */
 	if (!head_desc->more)
-		rmap_head->val = 0;
+		*rmap_val = 0;
 	else
-		rmap_head->val = (unsigned long)head_desc->more | 1;
+		*rmap_val = (unsigned long)head_desc->more | KVM_RMAP_MANY;
 	mmu_free_pte_list_desc(head_desc);
 }
 
@@ -1019,24 +1076,26 @@ static void pte_list_remove(struct kvm *kvm, u64 *spte,
 			    struct kvm_rmap_head *rmap_head)
 {
 	struct pte_list_desc *desc;
+	unsigned long rmap_val;
 	int i;
 
-	if (KVM_BUG_ON_DATA_CORRUPTION(!rmap_head->val, kvm))
-		return;
+	rmap_val = kvm_rmap_write_lock(rmap_head);
+	if (KVM_BUG_ON_DATA_CORRUPTION(!rmap_val, kvm))
+		goto out;
 
-	if (!(rmap_head->val & 1)) {
-		if (KVM_BUG_ON_DATA_CORRUPTION((u64 *)rmap_head->val != spte, kvm))
-			return;
+	if (!(rmap_val & KVM_RMAP_MANY)) {
+		if (KVM_BUG_ON_DATA_CORRUPTION((u64 *)rmap_val != spte, kvm))
+			goto out;
 
-		rmap_head->val = 0;
+		rmap_val = 0;
 	} else {
-		desc = (struct pte_list_desc *)(rmap_head->val & ~1ul);
+		desc = (struct pte_list_desc *)(rmap_val & ~KVM_RMAP_MANY);
 		while (desc) {
 			for (i = 0; i < desc->spte_count; ++i) {
 				if (desc->sptes[i] == spte) {
-					pte_list_desc_remove_entry(kvm, rmap_head,
+					pte_list_desc_remove_entry(kvm, &rmap_val,
 								   desc, i);
-					return;
+					goto out;
 				}
 			}
 			desc = desc->more;
@@ -1044,6 +1103,9 @@ static void pte_list_remove(struct kvm *kvm, u64 *spte,
 
 		KVM_BUG_ON_DATA_CORRUPTION(true, kvm);
 	}
+
+out:
+	kvm_rmap_write_ulock(rmap_head, rmap_val);
 }
 
 static void kvm_zap_one_rmap_spte(struct kvm *kvm,
@@ -1058,17 +1120,19 @@ static bool kvm_zap_all_rmap_sptes(struct kvm *kvm,
 				   struct kvm_rmap_head *rmap_head)
 {
 	struct pte_list_desc *desc, *next;
+	unsigned long rmap_val;
 	int i;
 
-	if (!rmap_head->val)
+	rmap_val = kvm_rmap_write_lock(rmap_head);
+	if (!rmap_val)
 		return false;
 
-	if (!(rmap_head->val & 1)) {
-		mmu_spte_clear_track_bits(kvm, (u64 *)rmap_head->val);
+	if (!(rmap_val & KVM_RMAP_MANY)) {
+		mmu_spte_clear_track_bits(kvm, (u64 *)rmap_val);
 		goto out;
 	}
 
-	desc = (struct pte_list_desc *)(rmap_head->val & ~1ul);
+	desc = (struct pte_list_desc *)(rmap_val & ~KVM_RMAP_MANY);
 
 	for (; desc; desc = next) {
 		for (i = 0; i < desc->spte_count; i++)
@@ -1078,20 +1142,21 @@ static bool kvm_zap_all_rmap_sptes(struct kvm *kvm,
 	}
 out:
 	/* rmap_head is meaningless now, remember to reset it */
-	rmap_head->val = 0;
+	kvm_rmap_write_ulock(rmap_head, 0);
 	return true;
 }
 
 unsigned int pte_list_count(struct kvm_rmap_head *rmap_head)
 {
+	unsigned long rmap_val = READ_ONCE(rmap_head->val) & ~KVM_RMAP_LOCKED;
 	struct pte_list_desc *desc;
 
-	if (!rmap_head->val)
+	if (!rmap_val)
 		return 0;
-	else if (!(rmap_head->val & 1))
+	else if (!(rmap_val & KVM_RMAP_MANY))
 		return 1;
 
-	desc = (struct pte_list_desc *)(rmap_head->val & ~1ul);
+	desc = (struct pte_list_desc *)(rmap_val & ~KVM_RMAP_MANY);
 	return desc->tail_count + desc->spte_count;
 }
 
@@ -1134,6 +1199,7 @@ static void rmap_remove(struct kvm *kvm, u64 *spte)
  */
 struct rmap_iterator {
 	/* private fields */
+	struct rmap_head *head;
 	struct pte_list_desc *desc;	/* holds the sptep if not NULL */
 	int pos;			/* index of the sptep */
 };
@@ -1148,18 +1214,19 @@ struct rmap_iterator {
 static u64 *rmap_get_first(struct kvm_rmap_head *rmap_head,
 			   struct rmap_iterator *iter)
 {
+	unsigned long rmap_val = READ_ONCE(rmap_head->val) & ~KVM_RMAP_LOCKED;
 	u64 *sptep;
 
-	if (!rmap_head->val)
+	if (!rmap_val)
 		return NULL;
 
-	if (!(rmap_head->val & 1)) {
+	if (!(rmap_val & KVM_RMAP_MANY)) {
 		iter->desc = NULL;
-		sptep = (u64 *)rmap_head->val;
+		sptep = (u64 *)rmap_val;
 		goto out;
 	}
 
-	iter->desc = (struct pte_list_desc *)(rmap_head->val & ~1ul);
+	iter->desc = (struct pte_list_desc *)(rmap_val & ~KVM_RMAP_MANY);
 	iter->pos = 0;
 	sptep = iter->desc->sptes[iter->pos];
 out:
@@ -1553,6 +1620,32 @@ static __always_inline bool kvm_handle_gfn_range(struct kvm *kvm,
 	return ret;
 }
 
+static __always_inline bool kvm_handle_gfn_range_lockless(struct kvm *kvm,
+							   struct kvm_gfn_range *range,
+							   rmap_handler_t handler)
+{
+	struct kvm_rmap_head *rmap_head;
+	unsigned long rmap_val;
+	bool ret = false;
+	gfn_t gfn;
+	int level;
+
+	for (gfn = range->start; gfn < range->end; gfn++) {
+		for (level = PG_LEVEL_4K; level <= KVM_MAX_HUGEPAGE_LEVEL; level++) {
+			rmap_head = gfn_to_rmap(gfn, level, range->slot);
+			rmap_val = kvm_rmap_read_lock(rmap_head);
+
+			if (rmap_val)
+				ret |= handler(kvm, rmap_head, range->slot, gfn, level);
+
+			kvm_rmap_read_unlock(rmap_head, rmap_val);
+		}
+	}
+
+	return ret;
+}
+
+
 bool kvm_unmap_gfn_range(struct kvm *kvm, struct kvm_gfn_range *range)
 {
 	bool flush = false;

base-commit: 771df9ffadb8204e61d3e98f36c5067102aab78f
-- 
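As a usage illustration only: the patch above adds the lockless walker but not
its callers. A hypothetical sketch of how the x86 test-young path could wire
it up, assuming the fast-only notifier path from this series invokes the arch
hook without mmu_lock held, and assuming the existing kvm_tdp_mmu_test_age_gfn(),
kvm_memslots_have_rmaps(), and kvm_test_age_rmap() helpers are unchanged. The
ordering and fallback policy here are illustrative, not part of the patch.

/*
 * Hypothetical caller sketch, not from the series: combine the lockless
 * TDP MMU age check with the lockless shadow MMU rmap walk added above.
 */
bool kvm_test_age_gfn(struct kvm *kvm, struct kvm_gfn_range *range)
{
	bool young = false;

	if (tdp_mmu_enabled)
		young = kvm_tdp_mmu_test_age_gfn(kvm, range);

	/*
	 * With the per-rmap "lock", the shadow MMU rmaps can be walked
	 * without taking mmu_lock, so no WAS_FAST-style reporting or
	 * locked fallback is needed for the test-only path.
	 */
	if (!young && kvm_memslots_have_rmaps(kvm))
		young = kvm_handle_gfn_range_lockless(kvm, range,
						      kvm_test_age_rmap);

	return young;
}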