From: James Houghton
Date: Mon, 28 Jul 2025 11:07:37 -0700
Subject: Re: [PATCH v5 3/7] KVM: x86/mmu: Recover TDP MMU NX huge pages using MMU read lock
To: Sean Christopherson
Cc: Paolo Bonzini, Vipin Sharma, David Matlack, kvm@vger.kernel.org, linux-kernel@vger.kernel.org

On Wed, Jul 23, 2025 at 1:35 PM Sean Christopherson wrote:
>
> On Mon, Jul 07, 2025, James Houghton wrote:
> > From: Vipin Sharma
> >
> > Use MMU read lock to recover TDP MMU NX huge pages. Iterate
>
> Wrap at ~75 chars.

Oh, I did indeed wrap the text pretty aggressively for this patch. Not
sure why that happened.

> > over the huge pages list under tdp_mmu_pages_lock protection and
> > unaccount the page before dropping the lock.
> >
> > We must not zap an SPTE if:
>
> No pronouns!

Right.

> > - The SPTE is a root page.
> > - The SPTE does not point at the SP's page table.
> >
> > If the SPTE does not point at the SP's page table, then something else
> > has change the SPTE, so we cannot safely zap it.
> >
> > Warn if zapping SPTE fails and current SPTE is still pointing to same
> > page table. This should never happen.
> >
> > There is always a race between dirty logging, vCPU faults, and NX huge
> > page recovery for backing a gfn by an NX huge page or an executable
> > small page. Unaccounting sooner during the list traversal is increasing
> > the window of that race. Functionally, it is okay, because accounting
> > doesn't protect against iTLB multi-hit bug, it is there purely to
> > prevent KVM from bouncing a gfn between two page sizes. The only
> > downside is that a vCPU will end up doing more work in tearing down all
> > the child SPTEs. This should be a very rare race.
> >
> > Zapping under MMU read lock unblocks vCPUs which are waiting for MMU
> > read lock. This optimizaion is done to solve a guest jitter issue on
> > Windows VM which was observing an increase in network latency.
>
> With slight tweaking:
>
>   Use MMU read lock to recover TDP MMU NX huge pages. To prevent
>   concurrent modification of the list of potential huge pages, iterate
>   over the list under tdp_mmu_pages_lock protection and unaccount the
>   page before dropping the lock.
>
>   Zapping under MMU read lock unblocks vCPUs which are waiting for MMU
>   read lock, which solves a guest jitter issue on Windows VMs which were
>   observing an increase in network latency.
>
>   Do not zap an SPTE if:
>   - The SPTE is a root page.
>   - The SPTE does not point at the SP's page table.
>
>   If the SPTE does not point at the SP's page table, then something else
>   has change the SPTE, so KVM cannot safely zap it.

"has changed" (my mistake)

>   Warn if zapping SPTE fails and current SPTE is still pointing to same
>   page table, as it should be impossible for the CMPXCHG to fail due to
>   all other write scenarios being mutually exclusive.
>
>   There is always a race between dirty logging, vCPU faults, and NX huge
>   page recovery for backing a gfn by an NX huge page or an executable
>   small page. Unaccounting sooner during the list traversal increases the
>   window of that race, but functionally, it is okay. Accounting doesn't
>   protect against iTLB multi-hit bug, it is there purely to prevent KVM
>   from bouncing a gfn between two page sizes. The only downside is that a
>   vCPU will end up doing more work in tearing down all the child SPTEs.
>   This should be a very rare race.

Thanks, this is much better.
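(FWIW, in the tdp_mmu.c side of the patch, which isn't quoted in this
reply, those two "do not zap" conditions come out to roughly the sketch
below. Illustrative only; kvm_tdp_mmu_read_spte(), is_shadow_present_pte(),
and spte_to_child_pt() are existing TDP MMU helpers, and the retry/CMPXCHG
logic around this is elided.)

        /* A root page has no parent SPTE, so there is nothing to zap. */
        if (!sp->ptep)
                return false;

        old_spte = kvm_tdp_mmu_read_spte(sp->ptep);

        /*
         * If the SPTE no longer points at this SP's page table, something
         * else has changed the SPTE, so it cannot be safely zapped here;
         * the zap itself is then a CMPXCHG against old_spte.
         */
        if (!is_shadow_present_pte(old_spte) ||
            spte_to_child_pt(old_spte, sp->role.level + 1) != sp->spt)
                return false;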
"has changed" (my mistake) > > Warn if zapping SPTE fails and current SPTE is still pointing to same > page table, as it should be impossible for the CMPXCHG to fail due to all > other write scenarios being mutually exclusive. > > There is always a race between dirty logging, vCPU faults, and NX huge > page recovery for backing a gfn by an NX huge page or an executable > small page. Unaccounting sooner during the list traversal increases the > window of that race, but functionally, it is okay. Accounting doesn't > protect against iTLB multi-hit bug, it is there purely to prevent KVM > from bouncing a gfn between two page sizes. The only downside is that a > vCPU will end up doing more work in tearing down all the child SPTEs. > This should be a very rare race. Thanks, this is much better. > > > Suggested-by: Sean Christopherson > > Signed-off-by: Vipin Sharma > > Co-developed-by: James Houghton > > Signed-off-by: James Houghton > > --- > > arch/x86/kvm/mmu/mmu.c | 107 ++++++++++++++++++++++++------------- > > arch/x86/kvm/mmu/tdp_mmu.c | 42 ++++++++++++--- > > 2 files changed, 105 insertions(+), 44 deletions(-) > > > > diff --git a/arch/x86/kvm/mmu/mmu.c b/arch/x86/kvm/mmu/mmu.c > > index b074f7bb5cc58..7df1b4ead705b 100644 > > --- a/arch/x86/kvm/mmu/mmu.c > > +++ b/arch/x86/kvm/mmu/mmu.c > > @@ -7535,12 +7535,40 @@ static unsigned long nx_huge_pages_to_zap(struc= t kvm *kvm, > > return ratio ? DIV_ROUND_UP(pages, ratio) : 0; > > } > > > > +static bool kvm_mmu_sp_dirty_logging_enabled(struct kvm *kvm, > > + struct kvm_mmu_page *sp) > > +{ > > + struct kvm_memory_slot *slot =3D NULL; > > + > > + /* > > + * Since gfn_to_memslot() is relatively expensive, it helps to sk= ip it if > > + * it the test cannot possibly return true. On the other hand, i= f any > > + * memslot has logging enabled, chances are good that all of them= do, in > > + * which case unaccount_nx_huge_page() is much cheaper than zappi= ng the > > + * page. > > And largely irrelevant, because KVM should unaccount the NX no matter wha= t. I > kinda get what you're saying, but honestly it adds a lot of confusion, es= pecially > since unaccount_nx_huge_page() is in the caller. > > > + * > > + * If a memslot update is in progress, reading an incorrect value= of > > + * kvm->nr_memslots_dirty_logging is not a problem: if it is beco= ming > > + * zero, gfn_to_memslot() will be done unnecessarily; if it is be= coming > > + * nonzero, the page will be zapped unnecessarily. Either way, t= his only > > + * affects efficiency in racy situations, and not correctness. > > + */ > > + if (atomic_read(&kvm->nr_memslots_dirty_logging)) { > > Short-circuit the function to decrease indentation, and so that "slot" do= esn't > need to be NULL-initialized. > > > + struct kvm_memslots *slots; > > + > > + slots =3D kvm_memslots_for_spte_role(kvm, sp->role); > > + slot =3D __gfn_to_memslot(slots, sp->gfn); > > Then this can be: > > slot =3D __gfn_to_memslot(kvm_memslots_for_spte_role(kvm, sp->rol= e), sp->gfn); > > without creating a stupid-long line. > > > + WARN_ON_ONCE(!slot); > > And then: > > if (WARN_ON_ONCE(!slot)) > return false; > > return kvm_slot_dirty_track_enabled(slot); > > With a comment cleanup: > > struct kvm_memory_slot *slot; > > /* > * Skip the memslot lookup if dirty tracking can't possibly be en= abled, > * as memslot lookups are relatively expensive. 
> > @@ -7559,8 +7590,17 @@ static void kvm_recover_nx_huge_pages(struct kvm *kvm,
> >          rcu_read_lock();
> >
> >          for ( ; to_zap; --to_zap) {
> > -                if (list_empty(nx_huge_pages))
> > +#ifdef CONFIG_X86_64
>
> These #ifdefs still make me sad, but I also still think they're the least awful
> solution.  And hopefully we will jettison 32-bit sooner than later :-)

Yeah I couldn't come up with anything better. :(
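(For the archives, the loop body under that #ifdef comes out roughly as
follows; a simplified sketch, with the actual zap, rescheduling, and TLB
flush handling elided. unaccount_nx_huge_page() and the
possible_nx_huge_page_link list field are from the existing NX recovery
code.)

        for ( ; to_zap; --to_zap) {
                spin_lock(&kvm->arch.tdp_mmu_pages_lock);
                if (list_empty(nx_huge_pages)) {
                        spin_unlock(&kvm->arch.tdp_mmu_pages_lock);
                        break;
                }

                sp = list_first_entry(nx_huge_pages, struct kvm_mmu_page,
                                      possible_nx_huge_page_link);

                /*
                 * Unaccount the page while holding tdp_mmu_pages_lock so
                 * that a concurrent recovery pass can't pick the same SP.
                 * This widens the dirty-logging race described in the
                 * changelog, but only costs efficiency, never correctness.
                 */
                unaccount_nx_huge_page(kvm, sp);
                spin_unlock(&kvm->arch.tdp_mmu_pages_lock);

                /* ... zap sp via CMPXCHG, under mmu_lock held for read ... */
        }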