The Linux Kernel Mailing List
* Re: [PATCH v4 01/10] KVM: nVMX: Implement cache for L1 MSR bitmap
From: Sean Christopherson @ 2026-05-11 23:08 UTC
  To: Fred Griffoul
  Cc: kvm, pbonzini, vkuznets, shuah, dwmw, linux-kselftest,
	linux-kernel, Fred Griffoul

On Fri, Jan 02, 2026, Fred Griffoul wrote:
> From: Fred Griffoul <fgriffo@amazon.co.uk>
> 
> Introduce a gfn_to_pfn_cache to optimize L1 MSR bitmap access by
> replacing map/unmap operations. This optimization reduces overhead
> during L2 VM-entry where nested_vmx_prepare_msr_bitmap() merges L1's MSR
> intercepts with L0's requirements.
> 
> The current implementation using kvm_vcpu_map_readonly() and
> kvm_vcpu_unmap() incurs significant overhead, mostly with
> unmanaged guest memory.
> 
> The cache is initialized when entering VMX operation and deactivated
> when VMX operation ends.
> 
> Signed-off-by: Fred Griffoul <fgriffo@amazon.co.uk>
> ---
>  arch/x86/kvm/vmx/nested.c | 42 +++++++++++++++++++++++++++++++++++----
>  arch/x86/kvm/vmx/vmx.h    |  2 ++
>  2 files changed, 40 insertions(+), 4 deletions(-)
> 
> diff --git a/arch/x86/kvm/vmx/nested.c b/arch/x86/kvm/vmx/nested.c
> index 6137e5307d0f..f05828aca7e5 100644
> --- a/arch/x86/kvm/vmx/nested.c
> +++ b/arch/x86/kvm/vmx/nested.c
> @@ -316,6 +316,34 @@ static void vmx_switch_vmcs(struct kvm_vcpu *vcpu, struct loaded_vmcs *vmcs)
>  	vcpu->arch.regs_dirty = 0;
>  }
>  
> +/*
> + * Maps a single guest page starting at @gpa and locks the cache for access.
> + */
> +static int nested_gpc_lock(struct gfn_to_pfn_cache *gpc, gpa_t gpa)

These helpers belong in virt/kvm/pfncache.c, not in vmx/nested.c.  The names also
need to be changed (which is super obvious when they land in pfncache.c).  Nested
locks are very much a thing in the kernel; even with the nVMX context, these read
as "lock a GPC within another GPC critical section".

And this does waaaay more than just "lock" the gpc.  E.g. by the end of the series
this picks up:

		err = kvm_gpc_activate(gpc, gpa, PAGE_SIZE);
		if (err) {
			/*
			 * Deactivate nested state caches to prevent
			 * kvm_gpc_invalid() from returning true in subsequent
			 * is_nested_state_invalid() calls. This prevents an
			 * infinite loop while entering guest mode.
			 */
			if (gpc->vcpu)
				kvm_gpc_deactivate(gpc);

			return err;
		}

That's a pretty gross hack, as is the addition and use of kvm_gpc_invalid() to
detect that a vmcs12 page has been invalidated (though that may just be a naming
issue).

The use of nested_gpc_lock() in nested_vmx_prepare_msr_bitmap() and
nested_vmx_handle_enlightened_vmptrld() also feels forced.  The usage there is
_much_ more aligned with nested_gpc_lock_if_active(), in the sense that it's
temporarily establishing a host mapping versus grabbing a pfn that gets shoved into
a non-MMU structure.

> +{
> +	int err;
> +
> +	if (!PAGE_ALIGNED(gpa))
> +		return -EINVAL;
> +retry:
> +	read_lock(&gpc->lock);
> +	if (!kvm_gpc_check(gpc, PAGE_SIZE) || (gpc->gpa != gpa)) {

The order of these checks is bizarre.  It's not problematic per se, but it's
jarring to read.

> +		read_unlock(&gpc->lock);
> +		err = kvm_gpc_activate(gpc, gpa, PAGE_SIZE);

And the use of kvm_gpc_activate() in the retry loop is also bizarre.  It's "fine",
but it makes the lifecycle of the code hard to follow.

The other thing that stands out in later patches is the error handling.  I really
dislike the number of unlock() calls that are being added.

The good news is, I think pretty much all of my complaints are symptoms of major
gaps in the core gpc APIs.  E.g. you're smushing largely unrelated things into
nested_gpc_lock() because the retry loop is fugly.  If we add better APIs, then
the nVMX code shouldn't need to combine different concepts in weird ways, just
to avoid copy+pasting an ugly loop.
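
To make that concrete: the check+activate retry could be hoisted into a common
pfncache.c helper, roughly along these lines (sketch only, name invented):

	int kvm_gpc_lock_gpa(struct gfn_to_pfn_cache *gpc, gpa_t gpa,
			     unsigned long len)
	{
		int ret;

		if (!PAGE_ALIGNED(gpa))
			return -EINVAL;

		for (;;) {
			read_lock(&gpc->lock);

			/* Success: return with gpc->lock held for read. */
			if (gpc->gpa == gpa && kvm_gpc_check(gpc, len))
				return 0;

			read_unlock(&gpc->lock);

			ret = kvm_gpc_activate(gpc, gpa, len);
			if (ret)
				return ret;
		}
	}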

Happily, the majority of that can be sorted out in advance of converting nVMX.
E.g. instead of the stealtime conversion looking like so:

	if (!read_trylock(&gpc->lock))
		return;

	if (!kvm_gpc_check(gpc, sizeof(*st)))
		goto out_unlock_gpc;

	st = (struct kvm_steal_time *)gpc->khva;
	WRITE_ONCE(st->preempted, preempted);
	vcpu->arch.st.preempted = KVM_VCPU_PREEMPTED;

	kvm_gpc_mark_dirty_in_slot(gpc);

  out_unlock_gpc:
	read_unlock(&gpc->lock);


it can instead be something like:

	CLASS(gpc_try_map_local, st_map)(gpc, sizeof(st->preempted));
	if (IS_ERR(st_map))
		return;

	st = *st_map;
	WRITE_ONCE(st->preempted, KVM_VCPU_PREEMPTED);
	vcpu->arch.st.preempted = KVM_VCPU_PREEMPTED;
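

Under the hood, the guard could be built on linux/cleanup.h's DEFINE_CLASS().
Purely illustrative (the failure convention here, a NULL khva rather than the
IS_ERR() style above, and all the names are just placeholders):

	struct gpc_map {
		struct gfn_to_pfn_cache *gpc;
		void *khva;	/* NULL if the trylock or check failed */
	};

	static inline struct gpc_map gpc_map_local_init(struct gfn_to_pfn_cache *gpc,
							unsigned long len)
	{
		struct gpc_map map = { .gpc = gpc };

		if (!read_trylock(&gpc->lock))
			return map;

		if (!kvm_gpc_check(gpc, len)) {
			read_unlock(&gpc->lock);
			return map;
		}

		map.khva = gpc->khva;
		return map;
	}

	DEFINE_CLASS(gpc_try_map_local, struct gpc_map,
		     if (_T.khva) read_unlock(&_T.gpc->lock),	/* destructor */
		     gpc_map_local_init(gpc, len),		/* constructor */
		     struct gfn_to_pfn_cache *gpc, unsigned long len);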


I have patches that appear to work.  I'll post a new version of David's series[*]
with the new APIs, and we can go from there (dropping the IRQ interactions makes
adding the CLASS() stuff easier, and you'll have to rebase regardless).

[*] https://lore.kernel.org/all/20260508181717.3230988-1-dwmw2@infradead.org

> +		if (err)
> +			return err;
> +
> +		goto retry;
> +	}
> +
> +	return 0;
> +}
> +
> +static void nested_gpc_unlock(struct gfn_to_pfn_cache *gpc)

This is a completely superfluous helper.  Looking ahead, many of the wrappers are
similarly useless, and IMO do more harm than good.  E.g. having to bounce through
multiple implementations for something like this:

	if (!pi_test_and_clear_on(pi_desc)) {
		nested_unlock_pi_desc(vmx);
		return 0;
	}

is *really* frustrating when it's nothing more than:

	if (!pi_test_and_clear_on(pi_desc)) {
		read_unlock(&vmx->nested.pi_desc_cache->lock);
		return 0;
	}

All of that should be a moot point with better APIs, so this is really "in the
future..." type of feedback.


* Re: [PATCH v4 04/10] KVM: nVMX: Implement cache for L1 APIC pages
From: Sean Christopherson @ 2026-05-11 23:35 UTC
  To: Fred Griffoul
  Cc: kvm, pbonzini, vkuznets, shuah, dwmw, linux-kselftest,
	linux-kernel, Fred Griffoul

On Fri, Jan 02, 2026, Fred Griffoul wrote:
> From: Fred Griffoul <fgriffo@amazon.co.uk>
> 
> Replace kvm_host_map usage with gfn_to_pfn_cache for L1 APIC
> virtualization pages (APIC access, virtual APIC, and posted interrupt
> descriptor pages) to improve performance with unmanaged guest memory.
> 
> The conversion involves several changes:
> 
> - Page loading in nested_get_vmcs12_pages(): load vmcs02 fields with
>   pfncache PFNs after each cache has been checked and possibly activated
>   or refreshed, during OUTSIDE_GUEST_MODE vCPU mode.
> 
> - Invalidation window handling: since nested_get_vmcs12_pages() runs in
>   OUTSIDE_GUEST_MODE, there's a window where caches can be invalidated
>   by MMU notifications before entering IN_GUEST_MODE. Implement the
>   is_nested_state_invalid() callback to monitor cache validity between
>   OUTSIDE_GUEST_MODE and IN_GUEST_MODE transitions. This triggers
>   KVM_REQ_GET_NESTED_STATE_PAGES when needed.
> 
> - Cache access in event callbacks: the virtual APIC and posted interrupt
>   descriptor pages are accessed by KVM in has_events() and
>   check_events() nested_ops callbacks. These use the kernel HVA following
>   the pfncache pattern of check/refresh, with both callbacks able to sleep
>   if cache refresh is required.
> 
> This eliminates expensive memremap/memunmap cycles for each L2 VM
> entry/exit, providing substantial performance improvements when using
> unmanaged memory.
> 
> Signed-off-by: Fred Griffoul <fgriffo@amazon.co.uk>
> ---
>  arch/x86/kvm/vmx/nested.c | 182 +++++++++++++++++++++++++++++---------
>  arch/x86/kvm/vmx/vmx.h    |   8 +-
>  include/linux/kvm_host.h  |   5 ++
>  3 files changed, 150 insertions(+), 45 deletions(-)

Please split this up; one "cache" per patch.  It's a bit gratuitous, but the usage
of the vapic and pid structures is just complex enough to make this more than a
straightforward, "boring" conversion.

> @@ -3112,8 +3173,9 @@ static int nested_vmx_check_controls(struct kvm_vcpu *vcpu,
>  static int nested_vmx_check_controls_late(struct kvm_vcpu *vcpu,
>  					  struct vmcs12 *vmcs12)
>  {
> -	void *vapic = to_vmx(vcpu)->nested.virtual_apic_map.hva;
> -	u32 vtpr = vapic ? (*(u32 *)(vapic + APIC_TASKPRI)) >> 4 : 0;
> +	struct vcpu_vmx *vmx = to_vmx(vcpu);
> +	void *vapic;
> +	u32 vtpr = 0;
>  
>  	/*
>  	 * Don't bother with the consistency checks if KVM isn't configured to
> @@ -3130,6 +3192,12 @@ static int nested_vmx_check_controls_late(struct kvm_vcpu *vcpu,
>  	if (!warn_on_missed_cc)
>  		return 0;
>  
> +	vapic = nested_lock_vapic(vmx);
> +	if (vapic) {
> +		vtpr = (*(u32 *)(vapic + APIC_TASKPRI)) >> 4;
> +		nested_unlock_vapic(vmx);

This is *very* dangerous.  vapic will still be a legal kernel pointer, and the
code _relies_ on it to be non-NULL, but the pointer could become stale at any time.
Happily, my CLASS() approach makes this a non-issue.
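
E.g. the above could collapse to something like this (assuming a
vmx->nested.vapic_cache gpc; names illustrative):

	u32 vtpr = 0;

	CLASS(gpc_try_map_local, vapic_map)(&vmx->nested.vapic_cache, PAGE_SIZE);
	if (!IS_ERR(vapic_map))
		vtpr = (*(u32 *)(*vapic_map + APIC_TASKPRI)) >> 4;

so there's no window where a stale hva can be dereferenced.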

> +	}
> +
>  	if ((exec_controls_get(to_vmx(vcpu)) & CPU_BASED_TPR_SHADOW) &&
>  	    nested_cpu_has(vmcs12, CPU_BASED_TPR_SHADOW) &&
>  	    !nested_cpu_has_vid(vmcs12) &&

...

> @@ -3543,7 +3609,16 @@ static bool vmx_get_nested_state_pages(struct kvm_vcpu *vcpu)
>  
>  static bool vmx_is_nested_state_invalid(struct kvm_vcpu *vcpu)
>  {
> -	return false;
> +	struct vcpu_vmx *vmx = to_vmx(vcpu);
> +
> +	/*
> +	 * @vcpu is in IN_GUEST_MODE, eliminating the need for individual gpc

This is straight up wrong.

gpc->active is effectively protected by vcpu->mutex.

And for gpc->valid, it's not completely wrong, but it's partially wrong and
definitely misleading.  It too is protected by vcpu->mutex, but only for %false
=> %true transitions.  And even for %true => %false transitions, IN_GUEST_MODE
doesn't exactly provide safety; rather, it ensures that KVM will kick the vCPU
out of the guest before completing the invalidation.

But that's all moot, because the very existence of this code feels wrong.  KVM
should be sending a "no action" request and then requiring vendor code to check
out-of-band data to proxy that into a "real" request.

Presumably older code that you're reverting used KVM_REQ_OUTSIDE_GUEST_MODE;
that doesn't necessarily mean it was a good idea, it got yanked out for a reason :-)

IMO, the least awful way to deal with this is to tag KVM_REQ_GET_NESTED_STATE_PAGES
with KVM_REQUEST_WAIT | KVM_REQUEST_NO_WAKEUP, and then pass the specific request
that needs to be made when invalidating a vCPU-mapped gpc as part of
__kvm_gpc_init() (though maybe call it vcpu_gpc_init() instead?).  That way KVM
doesn't need this wonky request => kvm_gpc_invalid() => request proxying, and we
should be able to entirely avoid the confusingly-named kvm_gpc_invalid().
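
Roughly (signature and field names are hypothetical):

	void kvm_vcpu_gpc_init(struct gfn_to_pfn_cache *gpc, struct kvm_vcpu *vcpu,
			       u32 invalidate_req)
	{
		kvm_gpc_init(gpc, vcpu->kvm);
		gpc->vcpu = vcpu;
		gpc->invalidate_req = invalidate_req;	/* e.g. KVM_REQ_GET_NESTED_STATE_PAGES */
	}

with the common invalidation path making gpc->invalidate_req on gpc->vcpu
directly, instead of bouncing through is_nested_state_invalid().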


* Re: [PATCH v4 00/10] KVM: nVMX: Improve performance for unmanaged guest memory
From: Sean Christopherson @ 2026-05-11 23:56 UTC
  To: Fred Griffoul
  Cc: kvm, pbonzini, vkuznets, shuah, dwmw, linux-kselftest,
	linux-kernel, Fred Griffoul

On Fri, Jan 02, 2026, Fred Griffoul wrote:
> First, the current approach is missing proper invalidation handling in
> critical scenarios. Enlightened VMCS (eVMCS) pages can become stale when
> memslots are modified, as there is no mechanism to invalidate the cached
> mappings.

This is a non-issue.  Modifying memslots while vCPUs are active will cause problems,
period.  There is no magic on earth that will prevent that.

> Similarly, APIC access and virtual APIC pages can be migrated
> by the host, but without proper notification through mmu_notifier
> callbacks, the mappings become invalid and can lead to incorrect
> behavior.

No, they can't, at least not for unmanaged memory.  For kernel-managed memory,
KVM obtains a long-term pin, which prevents the host from migrating the page.


* Re: [PATCH v4 08/10] KVM: x86: Add nested context management
From: Sean Christopherson @ 2026-05-12  0:13 UTC
  To: Fred Griffoul
  Cc: kvm, pbonzini, vkuznets, shuah, dwmw, linux-kselftest,
	linux-kernel, Fred Griffoul

On Fri, Jan 02, 2026, Fred Griffoul wrote:
> From: Fred Griffoul <fgriffo@amazon.co.uk>
> 
> Add infrastructure to persist nested virtualization state when L2 vCPUs

Please be more transparent with what exactly is being persisted.

> are switched on an L1 vCPU or migrated between L1 vCPUs.
> 
> The nested context table uses a hash table for fast lookup by nested
> control block GPA (VMPTR for VMX, VMCB for SVM) and maintains a free
> list for context management.
>
> The kvm_nested_context_load() function searches for a context indexed by
> the target GPA; if not found, it allocates a new context up to the
> configured maximum. If at capacity, it recycles the oldest context from
> the free list.
> 
> The oversubscription is hardcoded to support up to 8 L2 vCPUs per L1
> vCPU.
> 
> The kvm_nested_context_clear() function moves the context to the free
> list while keeping it in the hash table for potential reuse.
> 
> This allows nested hypervisors to multiplex multiple L2 vCPUs on L1
> vCPUs without losing cached nested state, significantly improving
> performance for workloads with frequent L2 context switches.
> 
> This patch adds the basic infrastructure. Subsequent patches will add
> the nested VMX and SVM specific support to populate and utilize the
> cached nested state.
> 
> Signed-off-by: Fred Griffoul <fgriffo@amazon.co.uk>
> ---
>  arch/x86/include/asm/kvm_host.h |  31 +++++
>  arch/x86/include/uapi/asm/kvm.h |   2 +
>  arch/x86/kvm/Makefile           |   2 +-
>  arch/x86/kvm/nested.c           | 199 ++++++++++++++++++++++++++++++++
>  arch/x86/kvm/x86.c              |   5 +-
>  5 files changed, 237 insertions(+), 2 deletions(-)

Please provide concrete performance numbers.  They need to be isolated from the
switch to gpcs, and need to show how much benefit is provided by a per-VM hash
table vs. (much) simpler approaches, e.g. versus a stupid simple per-vCPU LRU
cache, a la KVM's pgd caching.
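
For reference, the "stupid simple" version is little more than a tiny fixed
array with linear search and LRU eviction (sketch only, all names invented):

	#define NR_NESTED_CTX	4

	struct nested_ctx {
		gpa_t	vmptr;		/* VMPTR (VMX) or VMCB GPA (SVM) */
		u64	last_used;
		/* ... cached nested state ... */
	};

	struct nested_ctx_cache {
		struct nested_ctx ctx[NR_NESTED_CTX];
		u64 use_count;
	};

	/* Find @vmptr's context; on a miss, evict and hand back the LRU slot
	 * (@hit tells the caller whether the slot must be reinitialized). */
	static struct nested_ctx *nested_ctx_get(struct nested_ctx_cache *cache,
						 gpa_t vmptr, bool *hit)
	{
		struct nested_ctx *lru = &cache->ctx[0];
		int i;

		for (i = 0; i < NR_NESTED_CTX; i++) {
			struct nested_ctx *c = &cache->ctx[i];

			if (c->vmptr == vmptr) {
				c->last_used = ++cache->use_count;
				*hit = true;
				return c;
			}
			if (c->last_used < lru->last_used)
				lru = c;
		}

		*hit = false;
		lru->vmptr = vmptr;
		lru->last_used = ++cache->use_count;
		return lru;
	}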

There also needs to be an analysis of the downsides of the performance gains.
If I'm putting the pieces together correctly, quoting a snippet from the cover
letter, the performance benefits come from:

  The pfncache infrastructure maintains persistent mappings as long as the
  page GPA does not change, eliminating the memremap/memunmap overhead on
  every VM entry/exit cycle. 

Which means that this caching effectively eliminates the security value added by
removing memory from the kernel's direct map.  If, in the long term, we're
collectively moving towards guest_memfd (for setups that don't want all of the
overcommit goodness provided by mm/), then the performance provided by this approach
is directly at odds with the efforts to remove guest_memfd memory from the direct
map for added security.

E.g. if the ratio of L2:L1 contexts is pushed high enough, it would be possible
to have the majority of guest memory mapped into the host kernel.

That then raises the question of whether or not we are optimizing the right thing.
E.g. if we can somehow make map+unmap blazing fast for "all" real world usage that
matters, then maybe we don't need this type of caching.

In general, this needs a _lot_ more justification on the design decisions.  A lot,
a lot, a _lot_ more.  This is too much code and complexity for me to even start
reviewing without hard data.

