From mboxrd@z Thu Jan 1 00:00:00 1970 Received: from mail-pf1-f202.google.com (mail-pf1-f202.google.com [209.85.210.202]) (using TLSv1.2 with cipher ECDHE-RSA-AES128-GCM-SHA256 (128/128 bits)) (No client certificate requested) by smtp.subspace.kernel.org (Postfix) with ESMTPS id 12A411D9D6A for ; Wed, 4 Sep 2024 15:48:41 +0000 (UTC) Authentication-Results: smtp.subspace.kernel.org; arc=none smtp.client-ip=209.85.210.202 ARC-Seal:i=1; a=rsa-sha256; d=subspace.kernel.org; s=arc-20240116; t=1725464923; cv=none; b=S6tmuOp1hIIslQ4XwsawPoWyCtHeSzQKSDceySYDAYm8BdzG7gi3XNAeOd+bAiDR4RGIZochNkosiXo524LR6g09sGc3w3aFEaByim0umHvGfw8SNuFPG9Og3mqmFWQE+06pTixrrfszpn7uV4L2/ZhSxa/LnlGkBrQoPUAkX+s= ARC-Message-Signature:i=1; a=rsa-sha256; d=subspace.kernel.org; s=arc-20240116; t=1725464923; c=relaxed/simple; bh=YFXu6Fs91bSZlaLaA/3ydsvGQ81AEcDn4k3L1L/q5f8=; h=Date:In-Reply-To:Mime-Version:References:Message-ID:Subject:From: To:Cc:Content-Type; b=edCWDY1rQWz3iXgbjxqynBW5XaVxswv1yYrzFEv7jcrzGhcrrmM4PVNlLfs0kNLkXuTDzkwPVH2VZGVdGd0yWufamgXqz2fTJuHoZWv7As3vfdhJe2G1MXTv+HeRaKBvSL9FtPqVOcpx378zN3dSfNjJiAeQnkFXfv8YW3ofWEE= ARC-Authentication-Results:i=1; smtp.subspace.kernel.org; dmarc=pass (p=reject dis=none) header.from=google.com; spf=pass smtp.mailfrom=flex--seanjc.bounces.google.com; dkim=pass (2048-bit key) header.d=google.com header.i=@google.com header.b=wknBmI/F; arc=none smtp.client-ip=209.85.210.202 Authentication-Results: smtp.subspace.kernel.org; dmarc=pass (p=reject dis=none) header.from=google.com Authentication-Results: smtp.subspace.kernel.org; spf=pass smtp.mailfrom=flex--seanjc.bounces.google.com Authentication-Results: smtp.subspace.kernel.org; dkim=pass (2048-bit key) header.d=google.com header.i=@google.com header.b="wknBmI/F" Received: by mail-pf1-f202.google.com with SMTP id d2e1a72fcca58-715c530f80eso7355063b3a.0 for ; Wed, 04 Sep 2024 08:48:41 -0700 (PDT) DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=google.com; s=20230601; t=1725464921; x=1726069721; darn=vger.kernel.org; h=content-transfer-encoding:cc:to:from:subject:message-id:references :mime-version:in-reply-to:date:from:to:cc:subject:date:message-id :reply-to; bh=ouT7rYO2/YVMI+l5OQmMww73bGkck/LGVXhuvGIWT20=; b=wknBmI/FoHatxQr5XB32z+MRtf4SPtppPU7O3f691cj/FV+BC1qNN9VcWc2RkOjVDm NDhWJ+pctqpyKPO07RCm99Z06vsC9qO1eH1bzaoJX+cwWHPmplsq67uMirSqP3PYn3Y2 sw0M3UNxYaFIwCZHzaQWwGMs3WaSfhAvU2z7LYsoMyuScNpospKMrcUTzRhC0V5KB23L Bbv4Mz2msYCNMdJ2aDXyYtjTiD6evdv8Oy4OMc6I/UJuQtoQ6y8lS4UienAF7K1jYU0t Z2VdpOBvLrUklWwswrVCrXEOMLFJC+MFqKQKqnqT0T1LNpxCqfCuQH7O7+AF5nktquik Pqmw== X-Google-DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=1e100.net; s=20230601; t=1725464921; x=1726069721; h=content-transfer-encoding:cc:to:from:subject:message-id:references :mime-version:in-reply-to:date:x-gm-message-state:from:to:cc:subject :date:message-id:reply-to; bh=ouT7rYO2/YVMI+l5OQmMww73bGkck/LGVXhuvGIWT20=; b=sJdTLnoK4Fss2OA515VcHVbXzt8gRYt8Q7ppEFGb/kSRpm9MFrBLJ/Z/m4JMv67S9+ SBulN03GKbBOITShkNKDwXgLmtLZdFpcPz1awvLChTC6d+fh4SkmPqAx6Bm/mOfnWaxU w8ZA68DoHxwRtE58f2EqSRJFMG2c/2PnC8POOMcuuYHHa7YvgJnQp26zZq8M9lfsc116 7pRfIqFytbyNevT10j0v+Lt9OIezP6leiJItByUUOChUd2x19FzUlQLR+79wZeUamcdV jyfBDyJZVo5TQSZbdxnRqmgfAqW5XtRLNqvbSKfTM+qECpyThBTzgy1ju8jPsmZ3YUn7 FDmw== X-Forwarded-Encrypted: i=1; AJvYcCW94Rm9TKypu+zhnKyUkb4TPnl8zo2QZ5Fwj4yzX/U2Ztkt8hzzoQcc9z8PjK7CyHUwjhZALTnoBocDoK8=@vger.kernel.org X-Gm-Message-State: AOJu0YzxPy7O8srvHC6WmXeCftJYyDqyPwZwXb9COpPRDjxiHblcKds6 kgcul7yA0cBn8HeSNpJbln6GrKCzhMur8GSycQyXnYnKixqYcD9H22VmWzrBxnQpPIylwXzE7pg fRw== X-Google-Smtp-Source: AGHT+IGGP3PXqrQgYqgMOgQ+kMGQRkSHUgc2dXDaJi8EO81IhrHsgPA17Q69v3HAlkSASN14F+txkPEx7hA= X-Received: from zagreus.c.googlers.com ([fda3:e722:ac3:cc00:7f:e700:c0a8:5c37]) (user=seanjc job=sendgmr) by 2002:a05:6a00:9457:b0:70d:fba:c52b with SMTP id d2e1a72fcca58-7173072c955mr48049b3a.3.1725464921051; Wed, 04 Sep 2024 08:48:41 -0700 (PDT) Date: Wed, 4 Sep 2024 08:48:39 -0700 In-Reply-To: Precedence: bulk X-Mailing-List: linux-kernel@vger.kernel.org List-Id: List-Subscribe: List-Unsubscribe: Mime-Version: 1.0 References: <20240809194335.1726916-1-seanjc@google.com> <20240809194335.1726916-22-seanjc@google.com> Message-ID: Subject: Re: [PATCH 21/22] KVM: x86/mmu: Support rmap walks without holding mmu_lock when aging gfns From: Sean Christopherson To: James Houghton Cc: Paolo Bonzini , kvm@vger.kernel.org, linux-kernel@vger.kernel.org, Oliver Upton , Marc Zyngier , Peter Xu Content-Type: text/plain; charset="utf-8" Content-Transfer-Encoding: quoted-printable On Tue, Sep 03, 2024, James Houghton wrote: > On Fri, Aug 9, 2024 at 12:44=E2=80=AFPM Sean Christopherson wrote: > > > > DO NOT MERGE, yet... > > > > Cc: James Houghton > > Signed-off-by: Sean Christopherson > > --- > > arch/x86/kvm/mmu/mmu.c | 63 +++++++++++++++++++++++++++++++++++++++--- > > 1 file changed, 59 insertions(+), 4 deletions(-) > > > > diff --git a/arch/x86/kvm/mmu/mmu.c b/arch/x86/kvm/mmu/mmu.c > > index 48e8608c2738..9df6b465de06 100644 > > --- a/arch/x86/kvm/mmu/mmu.c > > +++ b/arch/x86/kvm/mmu/mmu.c > > @@ -995,13 +995,11 @@ static void kvm_rmap_unlock(struct kvm_rmap_head = *rmap_head, > > * locking is the same, but the caller is disallowed from modifying th= e rmap, > > * and so the unlock flow is a nop if the rmap is/was empty. > > */ > > -__maybe_unused > > static unsigned long kvm_rmap_lock_readonly(struct kvm_rmap_head *rmap= _head) > > { > > return __kvm_rmap_lock(rmap_head); > > } > > > > -__maybe_unused > > static void kvm_rmap_unlock_readonly(struct kvm_rmap_head *rmap_head, > > unsigned long old_val) > > { > > @@ -1743,8 +1741,53 @@ static void rmap_add(struct kvm_vcpu *vcpu, cons= t struct kvm_memory_slot *slot, > > __rmap_add(vcpu->kvm, cache, slot, spte, gfn, access); > > } > > > > -static bool kvm_rmap_age_gfn_range(struct kvm *kvm, > > - struct kvm_gfn_range *range, bool te= st_only) > > +static bool kvm_rmap_age_gfn_range_lockless(struct kvm *kvm, > > + struct kvm_gfn_range *range= , > > + bool test_only) > > +{ > > + struct kvm_rmap_head *rmap_head; > > + struct rmap_iterator iter; > > + unsigned long rmap_val; > > + bool young =3D false; > > + u64 *sptep; > > + gfn_t gfn; > > + int level; > > + u64 spte; > > + > > + for (level =3D PG_LEVEL_4K; level <=3D KVM_MAX_HUGEPAGE_LEVEL; = level++) { > > + for (gfn =3D range->start; gfn < range->end; > > + gfn +=3D KVM_PAGES_PER_HPAGE(level)) { > > + rmap_head =3D gfn_to_rmap(gfn, level, range->sl= ot); > > + rmap_val =3D kvm_rmap_lock_readonly(rmap_head); > > + > > + for_each_rmap_spte_lockless(rmap_head, &iter, s= ptep, spte) { > > + if (!is_accessed_spte(spte)) > > + continue; > > + > > + if (test_only) { > > + kvm_rmap_unlock_readonly(rmap_h= ead, rmap_val); > > + return true; > > + } > > + > > + /* > > + * Marking SPTEs for access tracking ou= tside of > > + * mmu_lock is unsupported. Report the= page as > > + * young, but otherwise leave it as-is. >=20 > Just for my own understanding, what's the main reason why it's unsafe Note, I specifically said "unsupported", not "unsafe" :-D > to mark PTEs for access tracking outside the mmu_lock? It probably can be done safely? The main issue is that marking the SPTE fo= r access tracking can also clear the Writable bit, and so we'd need to audit = all the flows that consume is_writable_pte(). Hmm, actually, that's less scary= than it first seems, because thanks to kvm_mmu_notifier_clear_young(), KVM alrea= dy clears the Writable bit in AD-disabled SPTEs without a TLB flush. E.g. mmu_spte_update() specifically looks at MMU-writable, not the Writable bit,= when deciding if a TLB flush is required. On a related note, I missed is that KVM would need to leaf SPTEs as volatil= e at all times, as your MGLRU series modified kvm_tdp_mmu_spte_need_atomic_write= (), not the common spte_has_volatile_bits(). Actually, on second though, maybe it isn't necessary for the AD-enabled cas= e. Effectively clobbering the Accessed bit is completely fine, as aging is tol= erant of false negatives and false positives, so long as they aren't excessive. = And that's doubly true if KVM x86 follows MM and doesn't force a TLB flush[1] Oooh, and triply true if KVM stops marking the folio accesses when zapping = SPTEs[2]. So yeah, after thinking though all of the moving parts, maybe we should com= mit to aging AD-disabled SPTEs out of mmu_lock. AD-disabled leaf SPTEs do end = up being "special", because KVM needs to ensure it doesn't clobber the Writable bit,= i.e. AD-disabled leaf SPTEs need to be treated as volatile at all times. But in= practice, forcing an atomic update for all AD-disabled leaf SPTEs probably doesn't ac= tually change much, because in most cases KVM is probably using an atomic access a= nyways, e.g. because KVM is clearing the Writable bit and the Writable bit is alrea= dy volatile. FWIW, marking the folio dirty if the SPTE was writable, as is done today in mmu_spte_age(), is sketchy if mmu_lock isn't held, but probably ok since th= is is invoked from an mmu_notifier and presumably the caller holds a reference to= the page/folio. But that's largely a moot point since I want to yank out that = code anyways[3]. [1] https://lore.kernel.org/all/ZsS_OmxwFzrqDcfY@google.com [2] https://lore.kernel.org/all/20240726235234.228822-82-seanjc@google.com [3] https://lore.kernel.org/all/20240726235234.228822-8-seanjc@google.com > > + if (spte_ad_enabled(spte)) > > + clear_bit((ffs(shadow_accessed_= mask) - 1), > > + (unsigned long *)spte= p); >=20 > I feel like it'd be kinda nice to de-duplicate this clear_bit() piece > with the one in kvm_rmap_age_gfn_range(). Ya, definitely no argument against adding a helper. > > + young =3D true; > > + } > > + > > + kvm_rmap_unlock_readonly(rmap_head, rmap_val); > > + } > > + } > > + return young; > > +} > > + > > +static bool __kvm_rmap_age_gfn_range(struct kvm *kvm, > > + struct kvm_gfn_range *range, bool = test_only) > > { > > struct slot_rmap_walk_iterator iterator; > > struct rmap_iterator iter; > > @@ -1783,6 +1826,18 @@ static bool kvm_rmap_age_gfn_range(struct kvm *k= vm, > > return young; > > } > > > > + > > +static bool kvm_rmap_age_gfn_range(struct kvm *kvm, > > + struct kvm_gfn_range *range, bool te= st_only) > > +{ > > + /* FIXME: This also needs to be guarded with something like ran= ge->fast_only. */ > > + if (kvm_ad_enabled()) >=20 > I expect this to be something like `if (kvm_ad_enabled() || > range->fast_only)`. With MGLRU, that means the pages will always be the l= ast > candidates for eviction, though it is still possible for them to be evict= ed > (though I think this would basically never happen). I think this is fine. >=20 > I think the only other possible choice is to record/return 'not young'/fa= lse > instead of 'young'/true if the spte is young but !spte_ad_enabled(). That > doesn't seem to be obviously better, though we *will* get correct age > information at eviction time, when !range->fast_only, at which point the = page > will not be evicted, and Accessed will be cleared. As above, I think the simpler solution overall is to support aging AD-disab= led SPTEs out of mmu_lock. The sequence of getting to that end state will be m= ore complex, but most of that complexity is going to happen irrespective of thi= s series. And it would mean KVM MGLRU support has no chance of landing in 6.12, but a= gain I think that's the reality either way.