From mboxrd@z Thu Jan 1 00:00:00 1970 Received: from mail-pg1-f202.google.com (mail-pg1-f202.google.com [209.85.215.202]) (using TLSv1.2 with cipher ECDHE-RSA-AES128-GCM-SHA256 (128/128 bits)) (No client certificate requested) by smtp.subspace.kernel.org (Postfix) with ESMTPS id D39DF358372 for ; Thu, 25 Jun 2026 21:45:48 +0000 (UTC) Authentication-Results: smtp.subspace.kernel.org; arc=none smtp.client-ip=209.85.215.202 ARC-Seal:i=1; a=rsa-sha256; d=subspace.kernel.org; s=arc-20240116; t=1782423950; cv=none; b=Hr/MTvudiPUdg7KaaxpvS61aXRGpvjBRDGEAkgpJorS4seZ0ndDkzSfJIZ8scExBeXmNf3rWwzp8XTnBhLSviszdGb7PsdoYVCFWBmHSHXFUPPCnafauL6a2R1o7dX1eOsTPEqt6o1rfHBcYKvqhu5gGJi4j1mdbphfVPVmt7+A= ARC-Message-Signature:i=1; a=rsa-sha256; d=subspace.kernel.org; s=arc-20240116; t=1782423950; c=relaxed/simple; bh=R695RNKosz53FgQpxHNo3CW1d1pFwQ7ud5ZvpBjzuJg=; h=Date:In-Reply-To:Mime-Version:References:Message-ID:Subject:From: To:Cc:Content-Type; b=tKGUQFMoCUIxaL5PbIYVbnG/zga1kzyO70XpaDd+Q0FXOfMb/fXJY08eS+wi0kMQ/G+LYCm0VfMWAB9wkgk8vwFEMWb6JPrjRz9FCO6tidOHLxGkQgEjtnfggU3u4l8FMKqs09dkTMsWXE84aEDKTjSDTcLDjOlVpMk2VI7ldcE= ARC-Authentication-Results:i=1; smtp.subspace.kernel.org; dmarc=pass (p=reject dis=none) header.from=google.com; spf=pass smtp.mailfrom=flex--seanjc.bounces.google.com; dkim=pass (2048-bit key) header.d=google.com header.i=@google.com header.b=RSZHMyvK; arc=none smtp.client-ip=209.85.215.202 Authentication-Results: smtp.subspace.kernel.org; dmarc=pass (p=reject dis=none) header.from=google.com Authentication-Results: smtp.subspace.kernel.org; spf=pass smtp.mailfrom=flex--seanjc.bounces.google.com Authentication-Results: smtp.subspace.kernel.org; dkim=pass (2048-bit key) header.d=google.com header.i=@google.com header.b="RSZHMyvK" Received: by mail-pg1-f202.google.com with SMTP id 41be03b00d2f7-c89956023dbso160264a12.0 for ; Thu, 25 Jun 2026 14:45:48 -0700 (PDT) DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=google.com; s=20251104; t=1782423948; x=1783028748; darn=vger.kernel.org; h=cc:to:from:subject:message-id:references:mime-version:in-reply-to :date:from:to:cc:subject:date:message-id:reply-to; bh=KLKv2w/oZldsKRU7x2mvxBStpob28cqfWxBp8cLvb0Q=; b=RSZHMyvKuXxhtdaUczjj5Gjgf6qmydvLtzixBHQmGCvH/lnWkt8B862Y8NE7953V3l XYN+eG0/x7wPIA5eFbkfAqyyNmV6ZKro2NRFsP39JYdZLrCf2HOkPsOFvQ7qRuqoVfHf 73+CzcqUwrPRO0G6Q98VfN1yOCxoWe/AkrXFEUpo49GQS7iRSsOQ55RJodTbrBxI7/un 7vRYkMRx68AqeUodrrvQVie7CX//oQSD8o4r/l9N9RFSCHEgJqCJ/gYZDkiPcVXb5pWJ upK5k9bu1A+u0qhNRmGLNJlW5w+WfyT6o8ifbluNkl69cAAAsXXTK/dYsIG2+cYrcdVF iDBw== X-Google-DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=1e100.net; s=20251104; t=1782423948; x=1783028748; h=cc:to:from:subject:message-id:references:mime-version:in-reply-to :date:x-gm-message-state:from:to:cc:subject:date:message-id:reply-to; bh=KLKv2w/oZldsKRU7x2mvxBStpob28cqfWxBp8cLvb0Q=; b=jsYqMirnhja5O2slO0Db7jzZSr2RwgOo7lrEOG6EMAQSGqKegQS65z4wY1VzSK/XUa ZvGxAoEVAdeCyQ/HY1p70e1DzXC5v211Rp42ncZsmxoYVkj/NxeONEhX5rDsucaVpnzO o+9+bKsctoqjLQ7ZmO+9QWXUXxiaA26SB9jZ5hCL5SmwBvp/nyQjrR4dVB8lG6Gal0Jk m4C5FNaRy/SxJJ43MLnT4vY0mSWHSXBKIr127JSj+IG7gEW5rX+Tx/ugBSoc82D43pjR O4C2E/bOBOYYMNnAoNpWvQ5SRuui72ud/oEGxHi3eYhBtW5nm8iTYz+EYSWLcjcwFPvP z8Hw== X-Forwarded-Encrypted: i=1; AHgh+Ro4AEpjtBTAzb9eFHDybcC4ynwc/PSM7Rwa8O2ZOiggcWO3N5OihR9GZXrgHeXKM5SipeQ=@vger.kernel.org X-Gm-Message-State: AOJu0YwpNAx93S8cDbk088weB6OPjp6IrkXzInWyOVycLK1WxJxOkbyj gg6ESav4lqKuvpVANPaaeNZRDUyxKdb3+2OHBihHlNuxFbZO2dIbftXYrQN5I5iQOsJCLi/Hu5I POhyRkw== X-Received: from pfbfj39.prod.google.com ([2002:a05:6a00:3a27:b0:82f:8cba:4285]) (user=seanjc job=prod-delivery.src-stubby-dispatcher) by 2002:a05:6a00:464f:b0:842:2f09:fde9 with SMTP id d2e1a72fcca58-845b39afe3emr5212047b3a.8.1782423947743; Thu, 25 Jun 2026 14:45:47 -0700 (PDT) Date: Thu, 25 Jun 2026 14:45:47 -0700 In-Reply-To: <20260520111325.693807-1-aha310510@gmail.com> Precedence: bulk X-Mailing-List: kvm@vger.kernel.org List-Id: List-Subscribe: List-Unsubscribe: Mime-Version: 1.0 References: <20260520111325.693807-1-aha310510@gmail.com> Message-ID: Subject: Re: [PATCH] KVM: pfncache: track HVA invalidations for HVA-based caches From: Sean Christopherson To: Jeongjun Park Cc: Paolo Bonzini , David Woodhouse , Paul Durrant , kvm@vger.kernel.org, linux-kernel@vger.kernel.org, syzbot+0948c82180d475ad24e2@syzkaller.appspotmail.com Content-Type: text/plain; charset="us-ascii" On Wed, May 20, 2026, Jeongjun Park wrote: > HVA-based gfn_to_pfn caches are not necessarily backed by a KVM memslot. > When an MMU notifier invalidation targets such an HVA, KVM's global > mmu_invalidate_seq is not guaranteed to change because that sequence is > advanced through the memslot-based invalidation path. > > This matters during hva_to_pfn_retry(). The refresh path temporarily > marks the cache invalid and drops gpc->lock while resolving the HVA and > creating a kernel mapping. If an overlapping HVA invalidation completes in > that window, the notifier may observe gpc->valid == false and therefore > leave no state behind for the in-progress refresh. For an HVA outside all > memslots, the refresh cannot rely on mmu_invalidate_seq to detect the > event either. > > To prevent this, we must add a per-cache HVA invalidation sequence. No, there are multiple ways this could be handled. Instead of saying "we must", state what change is being (as you did below), and then briefly explain the alternatives and _why_ the chosen approach was deemed to be the best approach. > Bump the sequence whenever the cached HVA overlaps an MMU notifier range, > regardless of the current valid state. Snapshot the sequence before > dropping gpc->lock in hva_to_pfn_retry(), and retry the refresh if it > changes before the new mapping is published. > > Reported-by: syzbot+0948c82180d475ad24e2@syzkaller.appspotmail.com > Closes: https://lore.kernel.org/all/6a0c5f2c.a00a0220.2c7954.0000.GAE@google.com Given that there's no reproducer, how did you test this? > Fixes: b9220d32799a ("KVM: x86/xen: allow shared_info to be mapped by fixed HVA") > Signed-off-by: Jeongjun Park > --- > include/linux/kvm_types.h | 1 + > virt/kvm/pfncache.c | 42 ++++++++++++++++++++++++++++++++------- > 2 files changed, 36 insertions(+), 7 deletions(-) > > diff --git a/include/linux/kvm_types.h b/include/linux/kvm_types.h > index a568d8e6f4e8..ff3b8aa73561 100644 > --- a/include/linux/kvm_types.h > +++ b/include/linux/kvm_types.h > @@ -85,6 +85,7 @@ struct gfn_to_pfn_cache { > u64 generation; > gpa_t gpa; > unsigned long uhva; > + unsigned long hva_invalidate_seq; > struct kvm_memory_slot *memslot; > struct kvm *kvm; > struct list_head list; > diff --git a/virt/kvm/pfncache.c b/virt/kvm/pfncache.c > index 728d2c1b488a..296b06482ebc 100644 > --- a/virt/kvm/pfncache.c > +++ b/virt/kvm/pfncache.c > @@ -19,6 +19,24 @@ > > #include "kvm_mm.h" > > +static inline bool gpc_uhva_in_range(struct gfn_to_pfn_cache *gpc, > + unsigned long start, unsigned long end) > +{ > + return gpc->uhva >= start && gpc->uhva < end; > +} > + > +static inline bool gpc_should_invalidate(struct gfn_to_pfn_cache *gpc, > + unsigned long start, unsigned long end) Rather than providing a "should invalidate" helper, which is makes it difficult to understand the conditions for invalidation, i.e. requires jumping around the code base, I think we should provide: static bool gpc_hva_is_valid(struct gfn_to_pfn_cache *gpc) { return kvm_gpc_is_hva_active(gpc) || gpc->valid; } and open code the start/end checks. We could keep gpc_uhva_in_range() (but drop the 'u'), but IMO it's not a net positive. E.g. yield: /* Only a single page so no need to care about length */ if (gpc_hva_is_valid(gpc) && gpc->uhva >= start && gpc->uhva < end) { read_unlock_irq(&gpc->lock); Which, for me, makes it easier to understand _this_ flow. I might have questions about what constitutes a "valid hva", but I don't need to learn those details to understand the invalidation flow. > +{ > + if (!gpc_uhva_in_range(gpc, start, end)) > + return false; > + > + if (kvm_gpc_is_hva_active(gpc)) > + return true; > + > + return gpc->valid && !is_error_noslot_pfn(gpc->pfn); > +} > /* > * MMU notifier 'invalidate_range_start' hook. > */ > @@ -32,8 +50,7 @@ void gfn_to_pfn_cache_invalidate_start(struct kvm *kvm, unsigned long start, > read_lock_irq(&gpc->lock); > > /* Only a single page so no need to care about length */ > - if (gpc->valid && !is_error_noslot_pfn(gpc->pfn) && > - gpc->uhva >= start && gpc->uhva < end) { > + if (gpc_should_invalidate(gpc, start, end)) { > read_unlock_irq(&gpc->lock); > > /* > @@ -45,9 +62,11 @@ void gfn_to_pfn_cache_invalidate_start(struct kvm *kvm, unsigned long start, > */ > > write_lock_irq(&gpc->lock); > - if (gpc->valid && !is_error_noslot_pfn(gpc->pfn) && Unless I'm forgetting/missing something, it should be impossible for gpc->pfn to be garbage if gpc->valid is true. As a follow-up patch, turn that into a WARN_ON_ONCE()? Assuming I'm not missing something... > - gpc->uhva >= start && gpc->uhva < end) > + if (gpc_should_invalidate(gpc, start, end)) { > + if (kvm_gpc_is_hva_active(gpc)) > + gpc->hva_invalidate_seq++; Hmm, I think I would rather just add a per-gpc mmu_invalidate_seq and drop usage of kvm->mmu_invalidate_seq entirely in the gpc code. Well, if we can get this working, which I'm not sure we can without a pile of surgery I don't really want to do. CPU0 (hva_to_pfn_retry()) CPU1 (invalidate) ------------------------- ------------------------- invalidate_range_start() mn_active_invalidate_count = 1 hva_invalidate_seq = X write_unlock_irq() write_lock_irq() gpc->active = true gpc->uhva = A hva_seq = X write_unlock_irq() pfn = Z invalidate_range_end() mn_active_invalidate_count = 0 mn_active_invalidate_count == 0 hva_seq == hva_invalidate_seq use-after-free In other words, the mmu_invalidate_seq scheme requires that the sequence counter be incremented in the same critical section as mmu_invalidate_in_progress. Note, the current use of mn_active_invalidate_count is "fine" because it outlives mmu_invalidate_in_progress, i.e. mn_active_invalidate_count achieves the "same critical section" rule by proxy. We could add gfn_to_pfn_cache_invalidate_end(), but I really, really don't want to add more serialization to the mmu_notifier flow when it's all but guaranteed to be unnecessary. And we can't optimize it to skip bumping sequence counters if gfn_to_pfn_cache_invalidate_start() doesn't detect relevant GPS, unless we block GPC refresh while mn_active_invalidate_count is elevated, as is done for memslots, otherwise KVM would incorrectly skip bumping counters if a GPC becomes relevant after the fact. And blocking memslot updates is a bit of a disaster because it's inherently unfair, and that's for a path that's already slow. Oh! But we can simply piggyback mn_invalidate_lock and use a per-VM GPC sequence counter, then the end() path doesn't need to re-walk GPCs. That will regress GPCs in the presense of unrelated mmu_notifier events, but we already know that's a lurking problem, and we know how to fix it[*] (I think). There's a "sleeping while atomic" problem that will occur on RT if/when GPCs take mn_invalidate_lock when refreshing the PFN, thanks to IRQs being disabled, but that's not a problem until we implement the range-based optimization, because until then, the reader side can be lockless. [*] https://lore.kernel.org/all/Zw8DINUkJJKDByXE@google.com Not yet tested, and I think I'd want to land this at the same time as the range-based optimzation, to avoid regressing "normal" GPC usage (only AWS uses the HVA-based Xen API, so getting a fix to LTS kernels isn't a priority, AFAIK), but this? diff --git a/include/linux/kvm_host.h b/include/linux/kvm_host.h index 0bdfa3699352..d07e3cbff983 100644 --- a/include/linux/kvm_host.h +++ b/include/linux/kvm_host.h @@ -854,6 +854,8 @@ struct kvm { gfn_t mmu_invalidate_range_start; gfn_t mmu_invalidate_range_end; + unsigned long gpc_invalidate_seq; + struct list_head devices; u64 manual_dirty_log_protect; struct dentry *debugfs_dentry; diff --git a/virt/kvm/kvm_main.c b/virt/kvm/kvm_main.c index 98da4c889ffc..5c2e92e4a085 100644 --- a/virt/kvm/kvm_main.c +++ b/virt/kvm/kvm_main.c @@ -812,6 +812,15 @@ static void kvm_mmu_notifier_invalidate_range_end(struct mmu_notifier *mn, /* Pairs with the increment in range_start(). */ spin_lock(&kvm->mn_invalidate_lock); + kvm->gpc_invalidate_seq++; + + /* + * As with the MMU sequence counter and mmu_invalidate_in_progress, the + * GPC sequence increase must be visible before the invalidate count + * goes to zero. Pairs with the smp_rmb() in gpc_invalidate_retry(). + */ + smp_wmb(); + if (!WARN_ON_ONCE(!kvm->mn_active_invalidate_count)) --kvm->mn_active_invalidate_count; wake = !kvm->mn_active_invalidate_count; diff --git a/virt/kvm/pfncache.c b/virt/kvm/pfncache.c index 728d2c1b488a..a427daea986f 100644 --- a/virt/kvm/pfncache.c +++ b/virt/kvm/pfncache.c @@ -124,7 +124,7 @@ static void gpc_unmap(kvm_pfn_t pfn, void *khva) #endif } -static inline bool mmu_notifier_retry_cache(struct kvm *kvm, unsigned long mmu_seq) +static inline bool gpc_invalidate_retry(struct kvm *kvm, unsigned long mmu_seq) { /* * mn_active_invalidate_count acts for all intents and purposes @@ -142,14 +142,14 @@ static inline bool mmu_notifier_retry_cache(struct kvm *kvm, unsigned long mmu_s return true; /* - * Ensure mn_active_invalidate_count is read before - * mmu_invalidate_seq. This pairs with the smp_wmb() in - * mmu_notifier_invalidate_range_end() to guarantee either the - * old (non-zero) value of mn_active_invalidate_count or the - * new (incremented) value of mmu_invalidate_seq is observed. + * Ensure mn_active_invalidate_count is read before gpc_invalidate_seq. + * This pairs with the smp_wmb() in + * kvm_mmu_notifier_invalidate_range_end() to guarantee either the + * old (non-zero) value of mn_active_invalidate_count or the new + * (incremented) value of gpc_invalidate_seq is observed. */ smp_rmb(); - return kvm->mmu_invalidate_seq != mmu_seq; + return kvm->gpc_invalidate_seq != mmu_seq; } static kvm_pfn_t hva_to_pfn_retry(struct gfn_to_pfn_cache *gpc) @@ -232,7 +232,7 @@ static kvm_pfn_t hva_to_pfn_retry(struct gfn_to_pfn_cache *gpc) * attempting to refresh. */ WARN_ON_ONCE(gpc->valid); - } while (mmu_notifier_retry_cache(gpc->kvm, mmu_seq)); + } while (gpc_invalidate_retry(gpc->kvm, mmu_seq)); gpc->valid = true; gpc->pfn = new_pfn;