Linux Confidential Computing Development
* Re: [PATCH v9 14/23] x86/virt/seamldr: Shut down the current TDX module
From: Edgecombe, Rick P @ 2026-05-19  3:00 UTC (permalink / raw)
  To: kvm@vger.kernel.org, linux-coco@lists.linux.dev,
	linux-kernel@vger.kernel.org, Gao, Chao
  Cc: Li, Xiaoyao, Huang, Kai, Zhao, Yan Y, dave.hansen@linux.intel.com,
	kas@kernel.org, Chatre, Reinette, seanjc@google.com,
	pbonzini@redhat.com, binbin.wu@linux.intel.com, Verma, Vishal L,
	nik.borisov@suse.com, mingo@redhat.com, Weiny, Ira,
	tony.lindgren@linux.intel.com, Annapurve, Vishal, Shahar, Sagi,
	djbw@kernel.org, tglx@kernel.org, paulmck@kernel.org,
	hpa@zytor.com, bp@alien8.de, yilun.xu@linux.intel.com,
	x86@kernel.org
In-Reply-To: <20260513151045.1420990-15-chao.gao@intel.com>

On Wed, 2026-05-13 at 08:09 -0700, Chao Gao wrote:
> The first step of TDX module updates is shutting down the current TDX
> module. This step also packs state information that needs to be
> preserved across updates as handoff data, 
> 

Kinda reads like "handoff data" is an existing term, but it's the first reference
in this series.

Maybe: packs state information that needs to be preserved across updates, called
"handoff data". This handoff data is consumed...

> which will be consumed by the
> updated module. The handoff data is stored internally in the SEAM range
> and is hidden from the kernel.
> 
> To ensure a successful update, the new module must be able to consume
> the handoff data generated by the old module.
> 

Is this too obvious a thing to state? Above you already say it's needed.

>  Since handoff data layout
> may change between modules, the handoff data is versioned. Each module
> has a native handoff version and provides backward support for several
> older versions.
> 
> The complete handoff versioning protocol is complex as it supports both
> module upgrades and downgrades. See details in Intel® Trust Domain
> Extensions (Intel® TDX) Module Base Architecture Specification, Chapter
> "Handoff Versioning".
> 
> Ideally, the kernel needs to retrieve the handoff versions supported by
> the current module and the new module and select a version supported by
> both. But, since this implementation chooses to only support module
> upgrades, simply request the current module to generate handoff data
> using its highest supported version, expecting that the new module will
> likely support it.

Hmm, "likely"? Is this trying to justify the kernel's policy? Dunno, stands out
as weird to me. Like "this will mostly work". Sounds incomplete, rather than a
reason of "this policy is the optimal initial implementation" or something like
that.
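
For illustration, the two policies described in the changelog could be sketched
as below. This is only a userspace model: it assumes each module handles a
contiguous [min, max] range of handoff versions, and the function names and
range representation are hypothetical, not taken from the patch or the spec.

```c
#include <stdint.h>

/*
 * "Ideal" policy from the changelog: pick the highest handoff version that
 * the current module can produce and the new module can consume.
 * Assumes contiguous version ranges (an illustrative simplification).
 * Returns -1 when the ranges do not overlap.
 */
static int select_common_hv(uint16_t cur_min, uint16_t cur_max,
			    uint16_t new_min, uint16_t new_max)
{
	uint16_t lo = cur_min > new_min ? cur_min : new_min;
	uint16_t hi = cur_max < new_max ? cur_max : new_max;

	return lo <= hi ? hi : -1;
}

/*
 * Upgrade-only policy as described in the changelog: always request the
 * current module's highest version and let the new module accept or
 * reject it.
 */
static int select_upgrade_only_hv(uint16_t cur_max)
{
	return cur_max;
}
```

In this model the two policies agree whenever the new module's range covers
the current module's highest version, which is the case the changelog is
betting on for upgrades.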

> 
> Retrieve the module's handoff version from TDX global metadata and add an
> update step to shut down the module.
> 

This is a small patch with both things, but it's almost two changes.

>  Module shutdown has global effect, so
> it only needs to run on one CPU.

I wouldn't think having some global effect would necessarily exclude having to
run on multiple CPUs. Or at least I don't follow. Is it a TDX arch thing? I
guess it's ok.

> 
> Note that the handoff information isn't cached in tdx_sysinfo. It is used
> only for module shutdown, and is present only when the TDX module supports
> updates. Caching it in get_tdx_sys_info() would require extra update-support
> guards and refreshing the cached value across module updates.

Instead of being a "note", could this be just an imperative: Don't cache the
handoff information in tdx_sysinfo...

> 
> Signed-off-by: Chao Gao <chao.gao@intel.com>
> Reviewed-by: Tony Lindgren <tony.lindgren@linux.intel.com>
> Reviewed-by: Xu Yilun <yilun.xu@linux.intel.com>
> Reviewed-by: Kai Huang <kai.huang@intel.com>
> Reviewed-by: Kiryl Shutsemau (Meta) <kas@kernel.org>
> ---
> v9:
>  - Use CPU0 as the primary CPU
> ---
>  arch/x86/include/asm/tdx_global_metadata.h  |  4 ++++
>  arch/x86/virt/vmx/tdx/seamldr.c             | 15 ++++++++++++++-
>  arch/x86/virt/vmx/tdx/tdx.c                 | 19 ++++++++++++++++++-
>  arch/x86/virt/vmx/tdx/tdx.h                 |  3 +++
>  arch/x86/virt/vmx/tdx/tdx_global_metadata.c | 13 +++++++++++++
>  5 files changed, 52 insertions(+), 2 deletions(-)
> 
> diff --git a/arch/x86/include/asm/tdx_global_metadata.h b/arch/x86/include/asm/tdx_global_metadata.h
> index 40689c8dc67e..41150d546589 100644
> --- a/arch/x86/include/asm/tdx_global_metadata.h
> +++ b/arch/x86/include/asm/tdx_global_metadata.h
> @@ -40,6 +40,10 @@ struct tdx_sys_info_td_conf {
>  	u64 cpuid_config_values[128][2];
>  };
>  
> +struct tdx_sys_info_handoff {
> +	u16 module_hv;
> +};
> +
>  struct tdx_sys_info {
>  	struct tdx_sys_info_version version;
>  	struct tdx_sys_info_features features;
> diff --git a/arch/x86/virt/vmx/tdx/seamldr.c b/arch/x86/virt/vmx/tdx/seamldr.c
> index 48fe71319fea..6114cab46196 100644
> --- a/arch/x86/virt/vmx/tdx/seamldr.c
> +++ b/arch/x86/virt/vmx/tdx/seamldr.c
> @@ -15,6 +15,7 @@
>  #include <asm/seamldr.h>
>  
>  #include "seamcall_internal.h"
> +#include "tdx.h"
>  
>  /* P-SEAMLDR SEAMCALL leaf function */
>  #define P_SEAMLDR_INFO			0x8000000000000000
> @@ -164,6 +165,7 @@ static int init_seamldr_params(struct seamldr_params *params, const u8 *data, u3
>   */
>  enum module_update_state {
>  	MODULE_UPDATE_START,
> +	MODULE_UPDATE_SHUTDOWN,
>  	MODULE_UPDATE_DONE,
>  };
>  
> @@ -214,8 +216,16 @@ static void init_state(struct update_ctrl *ctrl)
>  static int do_seamldr_install_module(void *seamldr_params)
>  {
>  	enum module_update_state newstate, curstate = MODULE_UPDATE_START;
> +	int cpu = smp_processor_id();
> +	bool primary;
>  	int ret = 0;
>  
> +	/*
> +	 * Use CPU 0 to execute update steps that must run exactly once.
> +	 * Note CPU 0 is always online.
> +	 */
> +	primary = cpu == 0;
> +

Where does the term 'primary' come from? I'm guessing that the global steps must
each be run on the same CPU? Is that right? And we just pick the cpu that we
know must be online? Or can the global steps be run on different CPUs? Or they
*have* to be run on cpu 0? It might be worth some comments explaining, depending
on the answers to those questions.
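
To make the question concrete, here is a minimal userspace model of the "run
exactly once" pattern in the hunk above. It only shows that gating on CPU 0
yields a single execution of the global step no matter how many CPUs reach it.
The function names are hypothetical stand-ins, and the CPUs are modeled
sequentially rather than concurrently.

```c
#include <stdbool.h>

static int shutdown_calls;	/* how many times the global step ran */

/* Stand-in for tdx_module_shutdown(); always succeeds in this model. */
static int model_shutdown(void)
{
	shutdown_calls++;
	return 0;
}

/* What each CPU does when it observes the SHUTDOWN state. */
static int run_shutdown_step(int cpu)
{
	bool primary = cpu == 0;

	return primary ? model_shutdown() : 0;
}

/* Model every CPU reaching the step; returns the number of shutdown calls. */
static int simulate_shutdown_step(int nr_cpus)
{
	shutdown_calls = 0;
	for (int cpu = 0; cpu < nr_cpus; cpu++)
		run_shutdown_step(cpu);
	return shutdown_calls;
}
```

The model says nothing about whether later global steps must run on the same
CPU as the shutdown, which is the part that would benefit from a comment.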

>  	do {
>  		newstate = READ_ONCE(update_ctrl.state);
>  
> @@ -226,7 +236,10 @@ static int do_seamldr_install_module(void *seamldr_params)
>  
>  		curstate = newstate;
>  		switch (curstate) {
> -		/* TODO: add the update steps. */
> +		case MODULE_UPDATE_SHUTDOWN:
> +			if (primary)
> +				ret = tdx_module_shutdown();
> +			break;
>  		default:
>  			break;
>  		}
> diff --git a/arch/x86/virt/vmx/tdx/tdx.c b/arch/x86/virt/vmx/tdx/tdx.c
> index 1621695d7561..da3c1e857b26 100644
> --- a/arch/x86/virt/vmx/tdx/tdx.c
> +++ b/arch/x86/virt/vmx/tdx/tdx.c
> @@ -321,7 +321,7 @@ static __init int build_tdx_memlist(struct list_head *tmb_list)
>  	return ret;
>  }
>  
> -static __init int read_sys_metadata_field(u64 field_id, u64 *data)
> +static int read_sys_metadata_field(u64 field_id, u64 *data)
>  {
>  	struct tdx_module_args args = {};
>  	int ret;
> @@ -1267,6 +1267,23 @@ static __init int tdx_enable(void)
>  }
>  subsys_initcall(tdx_enable);
>  
> +int tdx_module_shutdown(void)
> +{
> +	struct tdx_sys_info_handoff handoff = {};
> +	struct tdx_module_args args = {};
> +	int ret;
> +
> +	ret = get_tdx_sys_info_handoff(&handoff);
> +	WARN_ON_ONCE(ret);

Take or leave it:

  Why not just WARN_ON_ONCE(get_tdx_sys_info_handoff(&handoff));
  And we can drop the ret var. Save 2 LOC.

> +
> +	/*
> +	 * Use the module's handoff version as it is the highest the
> +	 * module can produce and most likely supported by newer modules.
> +	 */
> +	args.rcx = handoff.module_hv;
> +	return seamcall_prerr(TDH_SYS_SHUTDOWN, &args);
> +}
> +
>  static bool is_pamt_page(unsigned long phys)
>  {
>  	struct tdmr_info_list *tdmr_list = &tdx_tdmr_list;
> diff --git a/arch/x86/virt/vmx/tdx/tdx.h b/arch/x86/virt/vmx/tdx/tdx.h
> index 76c5fb1e1ffe..f0c20dea0388 100644
> --- a/arch/x86/virt/vmx/tdx/tdx.h
> +++ b/arch/x86/virt/vmx/tdx/tdx.h
> @@ -46,6 +46,7 @@
>  #define TDH_PHYMEM_PAGE_WBINVD		41
>  #define TDH_VP_WR			43
>  #define TDH_SYS_CONFIG			45
> +#define TDH_SYS_SHUTDOWN		52
>  #define TDH_SYS_DISABLE			69
>  
>  /*
> @@ -108,4 +109,6 @@ struct tdmr_info_list {
>  	int max_tdmrs;	/* How many 'tdmr_info's are allocated */
>  };
>  
> +int tdx_module_shutdown(void);
> +
>  #endif
> diff --git a/arch/x86/virt/vmx/tdx/tdx_global_metadata.c b/arch/x86/virt/vmx/tdx/tdx_global_metadata.c
> index d54d4227990c..e793dec688ab 100644
> --- a/arch/x86/virt/vmx/tdx/tdx_global_metadata.c
> +++ b/arch/x86/virt/vmx/tdx/tdx_global_metadata.c
> @@ -100,6 +100,19 @@ static __init int get_tdx_sys_info_td_conf(struct tdx_sys_info_td_conf *sysinfo_
>  	return ret;
>  }
>  
> +static int get_tdx_sys_info_handoff(struct tdx_sys_info_handoff *sysinfo_handoff)
> +{
> +	int ret;
> +	u64 val;
> +
> +	ret = read_sys_metadata_field(0x8900000100000000, &val);
> +	if (ret)
> +		return ret;
> +
> +	sysinfo_handoff->module_hv = val;
> +	return 0;
> +}
> +
>  static __init int get_tdx_sys_info(struct tdx_sys_info *sysinfo)
>  {
>  	int ret = 0;



* Re: [PATCH v9 13/23] x86/virt/seamldr: Abort updates after a failed step
From: Edgecombe, Rick P @ 2026-05-19  2:34 UTC (permalink / raw)
  To: kvm@vger.kernel.org, linux-coco@lists.linux.dev,
	linux-kernel@vger.kernel.org, Gao, Chao
  Cc: Li, Xiaoyao, Huang, Kai, Zhao, Yan Y, dave.hansen@linux.intel.com,
	kas@kernel.org, Chatre, Reinette, seanjc@google.com,
	pbonzini@redhat.com, binbin.wu@linux.intel.com, Verma, Vishal L,
	nik.borisov@suse.com, mingo@redhat.com, Weiny, Ira,
	tony.lindgren@linux.intel.com, Annapurve, Vishal, Shahar, Sagi,
	djbw@kernel.org, tglx@kernel.org, paulmck@kernel.org,
	hpa@zytor.com, bp@alien8.de, yilun.xu@linux.intel.com,
	x86@kernel.org
In-Reply-To: <20260513151045.1420990-14-chao.gao@intel.com>

On Wed, 2026-05-13 at 08:09 -0700, Chao Gao wrote:
> A TDX module update is a multi-step process, and any step can fail.
> 
> The current update flow continues to later steps after an error.
> Continuing after a failure can leave the TDX module in an unrecoverable
> state.

I get what you are saying here, but "continuing" vs "leaving" is a tiny bit
confusing to me. Maybe: Continuing with subsequent update steps after a failure
can cause the TDX module to enter an unrecoverable state?

> 
> One failure case must remain recoverable: update contention with an ongoing
> TD build. The agreed kernel behavior for this case [1] is to fail the
> update with -EBUSY so userspace can retry later.

The link to the discussion is nice, but just explaining that there was an
agreement doesn't say much. But the reasoning around AVOID_COMPAT_SENSITIVE
*is* handled in a later patch. So can we say future changes will want to return
errors to userspace for certain update failures? Then we can discuss the
specifics when the code for the actual error is added?

And why talk about EBUSY specifically? It is not in this patch. Stale log? 

> 
> Abort the update on any failure. This also makes the TD-build contention
> case recoverable, because that failure occurs before any TDX module state
> is changed. 
> 

Oh, maybe I didn't get what you meant above actually. The contention case is
only recoverable because we detect it at the first step? Does "Continuing after
a failure can leave the TDX module in an unrecoverable" really mean that any
failure after the first step is unrecoverable? Or can we put it in some other
more specific terms like that. Terms which are more specific but still not
overly complex description of TDX module update flows?

> Apply the same rule to all errors instead of special-casing
> -EBUSY.

It seems like actually it is not special cased...? The error returned is
whatever is returned from the step.

> 
> Track per-step failures, stop the update loop once a failure is observed,
> and do not advance the state machine to the next step.

Hmm, so this is actually a bunch of generic handling for each step, that really
only works for the first one? Is the generic handling really needed?

> 
> Signed-off-by: Chao Gao <chao.gao@intel.com>
> Reviewed-by: Xu Yilun <yilun.xu@linux.intel.com>
> Reviewed-by: Tony Lindgren <tony.lindgren@linux.intel.com>
> Reviewed-by: Kai Huang <kai.huang@intel.com>
> Reviewed-by: Kiryl Shutsemau (Meta) <kas@kernel.org>
> Link: https://lore.kernel.org/linux-coco/aQFmOZCdw64z14cJ@google.com/ # [1]
> ---
> v9:
>   - Avoid nested if/else by deferring failure accounting to ack_state().
>   - Reduce indentation of the main flow.
>   - Convert the failed flag into a counter. This avoids a conditional
>     update of the flag; the counter can simply accumulate failures.
> ---
>  arch/x86/virt/vmx/tdx/seamldr.c | 11 +++++++----
>  1 file changed, 7 insertions(+), 4 deletions(-)
> 
> diff --git a/arch/x86/virt/vmx/tdx/seamldr.c b/arch/x86/virt/vmx/tdx/seamldr.c
> index 7befe4a08f33..48fe71319fea 100644
> --- a/arch/x86/virt/vmx/tdx/seamldr.c
> +++ b/arch/x86/virt/vmx/tdx/seamldr.c
> @@ -170,6 +170,7 @@ enum module_update_state {
>  static struct update_ctrl {
>  	enum module_update_state state;
>  	int num_ack;
> +	int num_failed;

Was there past discussion on why it keeps a failed count? All we need to know is
if anything failed right? So a bool is fine too?
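
A sketch of the bool variant, as a single-threaded userspace model. The names
are hypothetical, locking is omitted, and it assumes that advancing the state
also resets the ack count (which __set_target_state() is not shown doing in
this hunk).

```c
#include <stdbool.h>

enum model_state { MODEL_START, MODEL_STEP1, MODEL_STEP2, MODEL_DONE };

struct model_ctrl {
	enum model_state state;
	int num_ack;
	bool failed;	/* a flag instead of a failure counter */
	int nr_cpus;
};

static void model_init(struct model_ctrl *ctrl, int nr_cpus)
{
	ctrl->state = MODEL_START + 1;
	ctrl->num_ack = 0;
	ctrl->failed = false;
	ctrl->nr_cpus = nr_cpus;
}

/* Last CPU to ack advances the state, unless some CPU reported a failure. */
static void model_ack_state(struct model_ctrl *ctrl, int result)
{
	ctrl->failed |= !!result;
	ctrl->num_ack++;
	if (ctrl->num_ack == ctrl->nr_cpus && !ctrl->failed) {
		ctrl->state++;
		ctrl->num_ack = 0;	/* assumed reset on state change */
	}
}

/*
 * Run the whole state machine with all CPUs modeled sequentially. CPU 0
 * fails in 'failing_state' (pass -1 for no failure). Returns the state
 * the machine stopped in.
 */
static enum model_state model_run_update(int nr_cpus, int failing_state)
{
	struct model_ctrl ctrl;

	model_init(&ctrl, nr_cpus);
	while (ctrl.state != MODEL_DONE && !ctrl.failed) {
		enum model_state cur = ctrl.state;

		for (int cpu = 0; cpu < nr_cpus; cpu++) {
			int result = (cpu == 0 && (int)cur == failing_state) ? -1 : 0;

			model_ack_state(&ctrl, result);
		}
	}
	return ctrl.state;
}
```

The unconditional "|=" keeps the same property the v9 notes give for the
counter: no conditional update of the flag is needed.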

>  	/*
>  	 * Protect update_ctrl. Raw spinlock as it will be acquired from
>  	 * interrupt-disabled contexts.
> @@ -187,12 +188,13 @@ static void __set_target_state(struct update_ctrl *ctrl,
>  }
>  
>  /* Last one to ack a state moves to the next state. */
> -static void ack_state(struct update_ctrl *ctrl)
> +static void ack_state(struct update_ctrl *ctrl, int result)
>  {
>  	raw_spin_lock(&ctrl->lock);
>  
> +	ctrl->num_failed += !!result;
>  	ctrl->num_ack++;
> -	if (ctrl->num_ack == num_online_cpus())
> +	if (ctrl->num_ack == num_online_cpus() && !ctrl->num_failed)
>  		__set_target_state(ctrl, ctrl->state + 1);
>  
>  	raw_spin_unlock(&ctrl->lock);
> @@ -202,6 +204,7 @@ static void init_state(struct update_ctrl *ctrl)
>  {
>  	raw_spin_lock_init(&ctrl->lock);
>  	__set_target_state(ctrl, MODULE_UPDATE_START + 1);
> +	ctrl->num_failed = 0;
>  }
>  
>  /*
> @@ -228,8 +231,8 @@ static int do_seamldr_install_module(void *seamldr_params)
>  			break;
>  		}
>  
> -		ack_state(&update_ctrl);
> -	} while (curstate != MODULE_UPDATE_DONE);
> +		ack_state(&update_ctrl, ret);
> +	} while (curstate != MODULE_UPDATE_DONE && !READ_ONCE(update_ctrl.num_failed));
>  
>  	return ret;
>  }



* Re: [PATCH v9 02/23] x86/virt/tdx: Move TDX_FEATURES0 bits to asm/tdx.h
From: Edgecombe, Rick P @ 2026-05-19  1:59 UTC (permalink / raw)
  To: Hansen, Dave, Gao, Chao
  Cc: linux-kernel@vger.kernel.org, linux-coco@lists.linux.dev,
	Huang, Kai, kvm@vger.kernel.org, Li, Xiaoyao, Zhao, Yan Y,
	dave.hansen@linux.intel.com, tony.lindgren@linux.intel.com,
	Chatre, Reinette, seanjc@google.com, pbonzini@redhat.com,
	binbin.wu@linux.intel.com, Weiny, Ira, nik.borisov@suse.com,
	mingo@redhat.com, Verma, Vishal L, kas@kernel.org, Shahar, Sagi,
	Annapurve, Vishal, djbw@kernel.org, tglx@kernel.org,
	paulmck@kernel.org, hpa@zytor.com, bp@alien8.de,
	yilun.xu@linux.intel.com, x86@kernel.org
In-Reply-To: <68e91c7ae1d10bdff73dd178d0e4ee48eaf1cfe1.camel@intel.com>

On Mon, 2026-05-18 at 09:57 -0700, Rick Edgecombe wrote:
> On Mon, 2026-05-18 at 15:52 +0800, Chao Gao wrote:
> > On Fri, May 15, 2026 at 09:15:47AM -0700, Dave Hansen wrote:
> > > On 5/13/26 08:09, Chao Gao wrote:
> > > > This prepares for TDX module update [1] and Dynamic PAMT [2] support. Both
> > > > add new TDX_FEATURES0 capability bits, and both need those capabilities to
> > > > be queried from code outside arch/x86/virt. The corresponding feature-query
> > > > helpers therefore need to live in the public asm/tdx.h header, so move the
> > > > existing bit definitions there first.
> > > 
> > > Please don't add unnecessary changelog cruft. If you need this move for
> > > this series, that's enough.
> > 
> > Sure. Will remove "Dynamic PAMT" stuff from the changelog.
> 
> I think it should not link to old versions of this series to explain the
> preparation. That is very confusing. We can just explain what will come in the
> later patches of *this* series. I'll circle back and propose some verbiage.

How about?

Future changes will add support for new TDX features exposed as TDX_FEATURES0
bits. The presence of these features will need to be checked outside of
arch/x86/virt. So the feature query helpers, and the TDX_FEATURES0 defines they
reference, will need to live in the widely accessible asm/tdx.h helper. Move the
existing TDX_FEATURES0 to asm/tdx.h so that they can all be kept together.


I ended up re-writing the whole thing. Not sure it was entirely necessary, but
let's def lose the links to the old patch versions.


* Re: [PATCH v9 09/23] coco/tdx-host: Don't expose P-SEAMLDR information on CPUs with erratum
From: Edgecombe, Rick P @ 2026-05-19  1:22 UTC (permalink / raw)
  To: Hansen, Dave, Gao, Chao
  Cc: linux-kernel@vger.kernel.org, linux-coco@lists.linux.dev,
	Huang, Kai, kvm@vger.kernel.org, Li, Xiaoyao, Zhao, Yan Y,
	dave.hansen@linux.intel.com, tony.lindgren@linux.intel.com,
	Chatre, Reinette, seanjc@google.com, pbonzini@redhat.com,
	binbin.wu@linux.intel.com, Weiny, Ira, nik.borisov@suse.com,
	mingo@redhat.com, Verma, Vishal L, kas@kernel.org, Shahar, Sagi,
	Annapurve, Vishal, djbw@kernel.org, tglx@kernel.org,
	paulmck@kernel.org, hpa@zytor.com, bp@alien8.de,
	yilun.xu@linux.intel.com, x86@kernel.org
In-Reply-To: <2e082f23-1383-4dba-8cdf-0df612f64ace@intel.com>

On Mon, 2026-05-18 at 08:29 -0700, Dave Hansen wrote:
> On 5/18/26 05:44, Chao Gao wrote:
> > On Fri, May 15, 2026 at 10:26:19AM -0700, Dave Hansen wrote:
> > > On 5/13/26 08:09, Chao Gao wrote:
> > > > Some TDX-capable CPUs have an erratum, as documented in Intel® Trust
> > > > Domain CPU Architectural Extensions (May 2021 edition) Chapter 2.3:
> > > 2021, eh?
> > The TDX ISA document has not been updated since then; the May 2021
> > edition is still the latest revision. See:
> > 
> > https://www.intel.com/content/www/us/en/developer/tools/trust-domain-
> > extensions/documentation.html
> 
> I think you are saying that the CPUs have an erratum.
> 
> That erratum diverges their implementation from the spec: "Intel® Trust
> Domain CPU Architectural Extensions (May 2021 edition) Chapter 2.3".

It actually is documented in that May 2021 spec as the architectural behavior.
But apparently not in earlier editions, because the doc marks it as new
verbiage in that one.

> 
> But when you combine those two things in one sentence, it's incredibly
> confusing.
> 
> The erratum you are talking about is brand new. I just asked for it to
> be created in the last month or two. Thus, my confusion when you say
> there: "an erratum, as documented in ... May 2021".
> 
> Thus, I'm questioning the 2021 date. You probably also want to mention
> that the erratum is, as of today, not publicly documented.
> 
> Can you rephrase this all and make it clearer, please?

So I guess we want to explain:
1. The problematic VMCS clearing behavior
2. That the problematic behavior is only documented in later docs (right?)
3. That it will be documented as an erratum later, and checked via the bit

Maybe something like?

Some TDX-capable CPUs have an erratum where SEAMRET clears the current VMCS
pointer. The behavior relies on the VMM to reload the current VMCS pointer.
However, that is a problem for KVM because clearing the current VMCS pointer
behind KVM's back will break KVM. While the VMCS clearing is documented as the
actual architecture in later versions of the "Intel® Trust Domain CPU
Architectural Extensions"[0] documents, it is not present in the earlier ones. 

Future docs will describe this SEAMRET VMCS clearing behavior as being present
when IA32_VMX_BASIC[60] is set...
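
As a rough sketch of the "checked via the bit" part, going only by the draft
wording above: the bit position comes from that wording, while the macro and
function names are made up for illustration and are not from the patch.

```c
#include <stdbool.h>
#include <stdint.h>

/* Bit 60 per the draft text above; the macro name is hypothetical. */
#define MODEL_VMX_BASIC_SEAMRET_CLEARS_VMCS	(1ULL << 60)

/*
 * Given a raw IA32_VMX_BASIC value, report whether the SEAMRET VMCS
 * clearing behavior would be enumerated under the draft wording.
 */
static bool model_seamret_clears_vmcs(uint64_t vmx_basic)
{
	return !!(vmx_basic & MODEL_VMX_BASIC_SEAMRET_CLEARS_VMCS);
}
```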







* Re: [PATCH v2] KVM: TDX: Fix x2APIC MSR handling in tdx_has_emulated_msr()
From: Sean Christopherson @ 2026-05-19  0:41 UTC (permalink / raw)
  To: Sean Christopherson, kas, kvm, linux-coco, linux-kernel, pbonzini,
	binbin.wu, dmaluka, Rick Edgecombe
In-Reply-To: <20260410232654.3864196-1-rick.p.edgecombe@intel.com>

On Fri, 10 Apr 2026 16:26:54 -0700, Rick Edgecombe wrote:
> Rework tdx_has_emulated_msr() to explicitly enumerate the x2APIC MSRs
> that KVM can emulate, instead of trying to enumerate the MSRs that KVM
> cannot emulate. Drop the inner switch and list the emulatable x2APIC
> registers directly in the outer switch's "return true" block.
> 
> The old code had multiple bugs in the x2APIC range handling.
> X2APIC_MSR(APIC_ISR + APIC_ISR_NR) was incorrect because APIC_ISR_NR is
> 0x8, not 0x80, so the X2APIC_MSR() shift lost the lower bits, collapsing
> each range to a single MSR. IA32_X2APIC_SELF_IPI was also missing from
> the non-emulatable list.
> 
> [...]

Applied to kvm-x86 vmx, with a massaged comment as suggested by Binbin.  Thanks!

[1/1] KVM: TDX: Fix x2APIC MSR handling in tdx_has_emulated_msr()
      https://github.com/kvm-x86/linux/commit/1f3e69af5f93
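
For readers following along, the arithmetic behind the APIC_ISR_NR bug in the
quoted changelog can be checked with a small standalone program. The constants
mirror the kernel's definitions as far as I know them, and the helper
functions are illustrative only.

```c
#include <stdint.h>

/* Mirrors the kernel's definitions; reproduced here for a standalone build. */
#define APIC_BASE_MSR	0x800
#define APIC_ISR	0x100
#define APIC_ISR_NR	0x8	/* number of 32-bit ISR registers */
#define X2APIC_MSR(r)	(APIC_BASE_MSR + ((r) >> 4))

/*
 * Buggy upper bound of the ISR range: the 0x8 is shifted away by the
 * ">> 4", so the "last" MSR equals the first one.
 */
static uint32_t isr_last_msr_buggy(void)
{
	return X2APIC_MSR(APIC_ISR + APIC_ISR_NR);
}

/* MSR of the i-th ISR register: the MMIO registers are 0x10 apart. */
static uint32_t isr_msr(unsigned int i)
{
	return X2APIC_MSR(APIC_ISR + i * 0x10);
}
```

Here X2APIC_MSR(APIC_ISR) and isr_last_msr_buggy() both evaluate to 0x810,
while the eighth ISR register, isr_msr(7), is 0x817.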

--
https://github.com/kvm-x86/linux/tree/next


* Re: [RFC PATCH v5 00/45] TDX: Dynamic PAMT + S-EPT Hugepage
From: Sean Christopherson @ 2026-05-19  0:40 UTC (permalink / raw)
  To: Sean Christopherson, Thomas Gleixner, Ingo Molnar,
	Borislav Petkov, Dave Hansen, x86, Kiryl Shutsemau, Paolo Bonzini
  Cc: linux-kernel, linux-coco, kvm, Kai Huang, Rick Edgecombe,
	Yan Zhao, Vishal Annapurve, Ackerley Tng, Sagi Shahar, Binbin Wu,
	Xiaoyao Li, Isaku Yamahata
In-Reply-To: <20260129011517.3545883-1-seanjc@google.com>

On Wed, 28 Jan 2026 17:14:32 -0800, Sean Christopherson wrote:
> This is a combined series of Dynamic PAMT (from Rick), and S-EPT hugepage
> support (from Yan).  Except for some last minute tweaks to the DPAMT array
> args stuff, a version of this based on a Google-internal kernel has been
> moderately well tested (thanks Vishal!).  But overall it's still firmly RFC
> as I have deliberately NOT addressed others' feedback from v4 of DPAMT and v3
> of S-EPT hugepage (mostly lack of cycles), and there's at least one patch in
> here that shouldn't be merged as-is (the quick-and-dirty switch from struct
> page to raw pfns).
> 
> [...]

Applied 1-4 to kvm-x86 mmu.  Please yell if this was unexpected in any way.
I'm pretty sure this is what we agreed on, but the last few weeks have been a
bit chaotic...

[01/45] x86/tdx: Use pg_level in TDX APIs, not the TDX-Module's 0-based level
        https://github.com/kvm-x86/linux/commit/4487492b92a4
[02/45] KVM: x86/mmu: Update iter->old_spte if cmpxchg64 on mirror SPTE "fails"
        https://github.com/kvm-x86/linux/commit/02eaaffdd865
[03/45] KVM: TDX: Account all non-transient page allocations for per-TD structures
        https://github.com/kvm-x86/linux/commit/a8b2924676ec
[04/45] KVM: x86: Make "external SPTE" ops that can fail RET0 static calls
        https://github.com/kvm-x86/linux/commit/e1a31ca28c9d

--
https://github.com/kvm-x86/linux/tree/next


* Re: [PATCH v2 08/15] KVM: x86: Add mode-aware versions of kvm_<reg>_{read,write}() helpers
From: Huang, Kai @ 2026-05-18 23:44 UTC (permalink / raw)
  To: seanjc@google.com
  Cc: dwmw2@infradead.org, Edgecombe, Rick P, x86@kernel.org,
	kas@kernel.org, binbin.wu@linux.intel.com,
	dave.hansen@linux.intel.com, vkuznets@redhat.com, paul@xen.org,
	yosry@kernel.org, pbonzini@redhat.com, kvm@vger.kernel.org,
	linux-coco@lists.linux.dev, linux-kernel@vger.kernel.org
In-Reply-To: <agt7vnVL5rJJUVzo@google.com>

On Mon, 2026-05-18 at 13:51 -0700, Sean Christopherson wrote:
> On Mon, May 18, 2026, Kai Huang wrote:
> > 
> > > @@ -10413,29 +10413,30 @@ static int complete_hypercall_exit(struct kvm_vcpu *vcpu)
> > >  
> > >  	if (!is_64_bit_hypercall(vcpu))
> > >  		ret = (u32)ret;
> > > -	kvm_rax_write(vcpu, ret);
> > > +	kvm_rax_write_raw(vcpu, ret);
> > >  	return kvm_skip_emulated_instruction(vcpu);
> > >  }
> > > 
> > 
> > Nit:  AFAICT if we use kvm_rax_write(vcpu, ret) instead of the "raw" version
> > here, we can then remove the
> > 
> > 	if (!is_64_bit_hypercall(vcpu))
> > 		ret = (u32)ret;
> 
> No, because sneakily, is_64_bit_hypercall() != is_64_bit_mode(vcpu).  And because
> we also need to avoid calling is_64_bit_mode().  If we use kvm_rax_write(), then
> the unpacked code will be:
> 
> 	WARN_ON_ONCE(vcpu->arch.guest_state_protected);
> 
> 	if (is_long_mode(vcpu))
> 		kvm_x86_call(get_cs_db_l_bits)(vcpu, &cs_db, &cs_l);
> 	else
> 		cs_l = 0;
> 
> 	if (cs_l)
> 		vcpu->arch.regs[VCPU_REGS_RAX] = ret;
> 	else	
> 		vcpu->arch.regs[VCPU_REGS_RAX] = (u32)ret;
> 
> whereas the (correct) behavior here is:
> 
> 	if (vcpu->arch.guest_state_protected)
> 		cs_l = 1;
> 	else if (is_long_mode(vcpu))
> 		kvm_x86_call(get_cs_db_l_bits)(vcpu, &cs_db, &cs_l);
> 	else
> 		cs_l = 0;
> 
> 	if (cs_l)
> 		vcpu->arch.regs[VCPU_REGS_RAX] = ret;
> 	else	
> 		vcpu->arch.regs[VCPU_REGS_RAX] = (u32)ret;
> 
> I.e. using the non-raw version will trigger the WARN_ON_ONCE(), and will incorrectly
> truncate "ret" whenever cs_l is stale (which might be always?).

FWIW, I sanity tested that booting/destroying both TD and VMX guests worked
fine.  I have no environment to test SVM and Xen related parts, though.
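
The distinction Sean describes can be modeled in a few lines of userspace C.
This is a simplified sketch: the function names are hypothetical, the mode
checks are reduced to plain booleans, and the WARN_ON_ONCE() that the non-raw
helper would trigger is only mentioned in a comment.

```c
#include <stdbool.h>
#include <stdint.h>

/* 64-bit mode as KVM would compute it from (possibly stale) register state. */
static bool model_is_64_bit_mode(bool long_mode, bool cs_l)
{
	return long_mode && cs_l;
}

/* Hypercalls from guests with protected state are always treated as 64-bit. */
static bool model_is_64_bit_hypercall(bool guest_state_protected,
				      bool long_mode, bool cs_l)
{
	return guest_state_protected || model_is_64_bit_mode(long_mode, cs_l);
}

/* Behavior with the "raw" write: truncate only for non-64-bit hypercalls. */
static uint64_t model_hypercall_rax(bool guest_state_protected,
				    bool long_mode, bool cs_l, uint64_t ret)
{
	if (!model_is_64_bit_hypercall(guest_state_protected, long_mode, cs_l))
		ret = (uint32_t)ret;
	return ret;
}

/*
 * What a mode-aware write would store instead. For protected guests the
 * real helper would also WARN, and cs_l may be stale.
 */
static uint64_t model_mode_aware_rax(bool long_mode, bool cs_l, uint64_t ret)
{
	return model_is_64_bit_mode(long_mode, cs_l) ? ret : (uint32_t)ret;
}
```

With a protected guest and stale (zero) mode bits, model_hypercall_rax() keeps
all 64 bits while model_mode_aware_rax() truncates, which is the incorrect
truncation described above.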


* Re: [PATCH v3 00/41] x86: Try to wrangle PV clocks vs. TSC
From: David Woodhouse @ 2026-05-18 23:38 UTC (permalink / raw)
  To: Sean Christopherson, Kiryl Shutsemau, Paolo Bonzini,
	K. Y. Srinivasan, Haiyang Zhang, Wei Liu, Dexuan Cui, Long Li,
	Ajay Kaher, Alexey Makhalov, Jan Kiszka, Dave Hansen,
	Andy Lutomirski, Peter Zijlstra, Juergen Gross, Daniel Lezcano,
	Thomas Gleixner, John Stultz
  Cc: Rick Edgecombe, Vitaly Kuznetsov,
	Broadcom internal kernel review list, Boris Ostrovsky,
	Stephen Boyd, x86, linux-coco, kvm, linux-hyperv, virtualization,
	linux-kernel, xen-devel, Michael Kelley, Tom Lendacky,
	Nikunj A Dadhania, Thomas Gleixner
In-Reply-To: <20260515191942.1892718-1-seanjc@google.com>


On Fri, 2026-05-15 at 12:19 -0700, Sean Christopherson wrote:
> Dave/Thomas/Peter/Boris, what's the going rate for bribes to take something
> like this through the tip tree?
> 
> The bulk of the changes are in kvmclock and TSC, but pretty much every
> hypervisor's guest-side code gets touched at some point.  I am reasonably
> confident in the correctness of the KVM changes.  Michael tested Hyper-V in
> v2, and while there were conflicts when rebasing, they were largely
> superficial (and I've just jinxed myself).  For all other hypervisors, assume
> the code is compile-tested only, but those changes are all quite small and
> straightforward.
> 
> The only changes that are questionable/contentious are the last two patches,
> which have KVM-as-a-guest use CPUID 0x16 to get the CPU frequency, even on
> AMD (that's the dubious part).  I very deliberately put them last, so that
> they can be dropped at will (I don't care terribly if those patches land).
> To merge them, I would want explicit Acks from Paolo and David W.
> 
> So, except for the last two patches, to get the stuff I really care about
> landed, I think/hope it's just the TSC and guest-side CoCo changes that need
> reviews/acks?
> 
> The primary goal of this series is (or at least was, when I started) to
> fix flaws with SNP and TDX guests where a PV clock provided by the untrusted
> hypervisor is used instead of the secure/trusted TSC that is controlled by
> trusted firmware.
> 
> The secondary goal is to draft off of the SNP and TDX changes to slightly
> modernize running under KVM.  Currently, KVM guests will use TSC for
> clocksource, but not sched_clock.  And they ignore Intel's CPUID-based TSC
> and CPU frequency enumeration, even when using the TSC instead of kvmclock.
> And if the host provides the core crystal frequency in CPUID.0x15, then KVM
> guests can use that for the APIC timer period instead of manually calibrating
> the frequency.
> 
> The tertiary goal is to clean up all of the PV clock code to deduplicate logic
> across hypervisors, and to hopefully make it all easier to maintain going
> forward.

I booted this in qemu with -cpu host,+invtsc,+vmware-cpuid-freq

I was expecting to see it eschew the kvmclock and use *only* the TSC.
Is there even any need for 'tsc-early' given that it's *told* the TSC
frequency in CPUID? Shouldn't it have detected that the TSC is known
before init_tsc_clocksource() runs?

And then it even spent some time at boot actually using the kvmclock as
clocksource... when ideally I don't think it would even have *enabled*
it at all?

[    0.000000] clocksource: kvm-clock: mask: 0xffffffffffffffff max_cycles: 0x1cd42e4dffb, max_idle_ns: 881590591483 ns
[    0.000000] tsc: Detected 2400.000 MHz processor
[    0.008205] TSC deadline timer available
[    0.008270] clocksource: refined-jiffies: mask: 0xffffffff max_cycles: 0xffffffff, max_idle_ns: 1910969940391419 ns
[    0.159085] clocksource: hpet: mask: 0xffffffff max_cycles: 0xffffffff, max_idle_ns: 19112604467 ns
[    0.164074] clocksource: tsc-early: mask: 0xffffffffffffffff max_cycles: 0x22983777dd9, max_idle_ns: 440795300422 ns
[    0.229087] clocksource: jiffies: mask: 0xffffffff max_cycles: 0xffffffff, max_idle_ns: 1911260446275000 ns
[    0.337095] clocksource: Switched to clocksource kvm-clock
[    0.345246] clocksource: acpi_pm: mask: 0xffffff max_cycles: 0xffffff, max_idle_ns: 2085701024 ns
[    0.356201] clocksource: tsc: mask: 0xffffffffffffffff max_cycles: 0x22983777dd9, max_idle_ns: 440795300422 ns
[    0.360560] clocksource: Switched to clocksource tsc
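
For reference, the CPUID-based enumeration mentioned in the cover letter boils
down to simple arithmetic on CPUID.0x15. The sketch below takes the register
values as parameters so it runs anywhere; the function name is hypothetical,
and returning 0 for a missing field is a choice made for this sketch rather
than a description of what the kernel does.

```c
#include <stdint.h>

/*
 * CPUID.0x15 enumerates the TSC/core-crystal ratio:
 *   EAX = denominator, EBX = numerator, ECX = crystal frequency in Hz.
 * TSC Hz = crystal Hz * EBX / EAX. Returns 0 if any field is not enumerated.
 */
static uint64_t tsc_hz_from_cpuid15(uint32_t eax, uint32_t ebx, uint32_t ecx)
{
	if (!eax || !ebx || !ecx)
		return 0;

	return (uint64_t)ecx * ebx / eax;
}
```

With made-up inputs of a 24 MHz crystal and a 200/2 ratio, this gives
2400000000 Hz. Those inputs are illustrative and were not read from the guest
in the log above.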




* Re: [PATCH v2 0/4] x86/vmware: Hypercall refactoring and improved guest support
From: Alexey Makhalov @ 2026-05-18 22:52 UTC (permalink / raw)
  To: x86, virtualization, bp, hpa, dave.hansen, mingo, tglx
  Cc: ajay.kaher, brennan.lamoreaux, bo.gan, bcm-kernel-feedback-list,
	linux-kernel, kas, rick.p.edgecombe, linux-coco
In-Reply-To: <20260309235250.2611115-1-alexey.makhalov@broadcom.com>

On 3/9/26 4:52 PM, Alexey Makhalov wrote:
> This series improves VMware guest support on x86 by refactoring the
> hypercall infrastructure and adding better crash diagnostics, along
> with encrypted guest support for the steal time clock.
> 
> The first patch introduces a common vmware_hypercall() backend selected
> via static calls. It consolidates the existing hypercall mechanisms
> (backdoor, VMCALL/VMMCALL, and TDX) behind a single interface and
> selects the optimal implementation at boot. This reduces duplication
> and simplifies future extensions.
> 
> Building on top of the new hypercall infrastructure, the next two
> patches improve post-mortem debugging of VMware guests. They export
> panic information to the hypervisor by dumping kernel messages to the
> VM vmware.log on the host and explicitly reporting guest crash event
> to the hypervisor.
> 
> The final patch adds support for encrypted guests by ensuring that the
> shared memory used for the steal time clock is mapped as decrypted
> before being shared with the hypervisor. This enables steal time
> accounting to function correctly when guest memory encryption is
> enabled.
> 
> Patch overview:
> 
> 1. x86/vmware: Introduce common vmware_hypercall
> 
>     * Consolidate hypercall implementations behind a common API
>     * Select backend via static_call at boot
> 
> 2. x86/vmware: Log kmsg dump on panic
> 
>     * Register a kmsg dumper
>     * Export panic logs to the host
> 
> 3. x86/vmware: Report guest crash to the hypervisor
> 
>     * Register a panic notifier
>     * Notify the hypervisor about guest crashes
> 
> 4. x86/vmware: Support steal time clock for encrypted guests
> 
>     * Mark shared steal time memory as decrypted early in boot
> 
> 
> Changelog:
> 
> V1 -> V2
>     * Fix compilation warnings in patch 2 "x86/vmware: Log kmsg dump on panic"
>       reported by kernel test robot <lkp@intel.com>
> 
> 
> Alexey Makhalov (4):
>    x86/vmware: Introduce common vmware_hypercall()
>    x86/vmware: Log kmsg dump on panic
>    x86/vmware: Report guest crash to the hypervisor
>    x86/vmware: Support steal time clock for encrypted guests
> 
>   arch/x86/include/asm/vmware.h | 276 ++++++++------------
>   arch/x86/kernel/cpu/vmware.c  | 470 +++++++++++++++++++++++++---------
>   2 files changed, 463 insertions(+), 283 deletions(-)
> 
> 
> base-commit: 7d08a6ad25f85c9bb7d0382142838cb54713f1a3

Gentle reminder to review this change. Thanks,
--Alexey



^ permalink raw reply

* Re: [PATCH v5 2/7] x86/msr: add wrmsrq_on_cpus helper
From: Dave Hansen @ 2026-05-18 22:38 UTC (permalink / raw)
  To: Kalra, Ashish, tglx, mingo, bp, dave.hansen, x86, hpa, seanjc,
	peterz, thomas.lendacky, herbert, davem, ardb
  Cc: pbonzini, aik, Michael.Roth, KPrateek.Nayak, Tycho.Andersen,
	Nathan.Fontenot, ackerleytng, jackyli, pgonda, rientjes, jacobhxu,
	xin, pawan.kumar.gupta, babu.moger, dyoung, nikunj, john.allen,
	darwi, linux-kernel, linux-crypto, kvm, linux-coco
In-Reply-To: <6e365465-9ecf-416f-9561-67cab6428e15@amd.com>

On 5/18/26 15:09, Kalra, Ashish wrote:
> On 5/18/2026 5:04 PM, Dave Hansen wrote:
>> On 5/18/26 14:42, Ashish Kalra wrote:
>>> Co-developed-by: Dave Hansen <dave.hansen@linux.intel.com>
>>> Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com>
>>> Reviewed-by: Ackerley Tng <ackerleytng@google.com>
>>> Signed-off-by: Ashish Kalra <ashish.kalra@amd.com>
>> Hi Ashish,
>>
>> Sorry if my memory fails me, but I don't remember signing off on this.
>> Could you point me to the place where I gave you my Signed-off-by?
> Sorry about this, I added this accidentally.
> 
> You had suggested the code change, I accidentally took it as a Signed-off.

Hi Ashish,

First, please do me a favor and go back and re-read:

	Documentation/process/submitting-patches.rst

I recommend that everyone do this every once in a while so they remember
what they are constantly signing off on. It's important.

My personal rule for SoB lines is that I don't add them unless I've
explicitly talked to the person. Even then, I vastly prefer that the
person provides it to me *explicitly* (so I just copy and paste
directly) and on a public mailing list. That way, the avenues for
accidents to occur are very narrow.

^ permalink raw reply

* Re: [PATCH v2 08/15] KVM: x86: Add mode-aware versions of kvm_<reg>_{read,write}() helpers
From: Huang, Kai @ 2026-05-18 22:29 UTC (permalink / raw)
  To: seanjc@google.com
  Cc: dwmw2@infradead.org, Edgecombe, Rick P, x86@kernel.org,
	kas@kernel.org, binbin.wu@linux.intel.com,
	dave.hansen@linux.intel.com, vkuznets@redhat.com, paul@xen.org,
	yosry@kernel.org, pbonzini@redhat.com, kvm@vger.kernel.org,
	linux-coco@lists.linux.dev, linux-kernel@vger.kernel.org
In-Reply-To: <agt7vnVL5rJJUVzo@google.com>

On Mon, 2026-05-18 at 13:51 -0700, Sean Christopherson wrote:
> On Mon, May 18, 2026, Kai Huang wrote:
> > 
> > > @@ -10413,29 +10413,30 @@ static int complete_hypercall_exit(struct kvm_vcpu *vcpu)
> > >  
> > >  	if (!is_64_bit_hypercall(vcpu))
> > >  		ret = (u32)ret;
> > > -	kvm_rax_write(vcpu, ret);
> > > +	kvm_rax_write_raw(vcpu, ret);
> > >  	return kvm_skip_emulated_instruction(vcpu);
> > >  }
> > > 
> > 
> > Nit:  AFAICT if we use kvm_rax_write(vcpu, ret) instead of the "raw" version
> > here, we can then remove the
> > 
> > 	if (!is_64_bit_hypercall(vcpu))
> > 		ret = (u32)ret;
> 
> No, because sneakily, is_64_bit_hypercall() != is_64_bit_mode(vcpu).  

Oh I missed this :-(  sorry for the noise.

^ permalink raw reply

* Re: [PATCH v5 2/7] x86/msr: add wrmsrq_on_cpus helper
From: Kalra, Ashish @ 2026-05-18 22:09 UTC (permalink / raw)
  To: Dave Hansen, tglx, mingo, bp, dave.hansen, x86, hpa, seanjc,
	peterz, thomas.lendacky, herbert, davem, ardb
  Cc: pbonzini, aik, Michael.Roth, KPrateek.Nayak, Tycho.Andersen,
	Nathan.Fontenot, ackerleytng, jackyli, pgonda, rientjes, jacobhxu,
	xin, pawan.kumar.gupta, babu.moger, dyoung, nikunj, john.allen,
	darwi, linux-kernel, linux-crypto, kvm, linux-coco
In-Reply-To: <c9f1d4d2-e567-4090-b342-c76125673f61@intel.com>

Hello Dave,

On 5/18/2026 5:04 PM, Dave Hansen wrote:
> On 5/18/26 14:42, Ashish Kalra wrote:
>> Co-developed-by: Dave Hansen <dave.hansen@linux.intel.com>
>> Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com>
>> Reviewed-by: Ackerley Tng <ackerleytng@google.com>
>> Signed-off-by: Ashish Kalra <ashish.kalra@amd.com>
> 
> Hi Ashish,
> 
> Sorry if my memory fails me, but I don't remember signing off on this.
> Could you point me to the place where I gave you my Signed-off-by?

Sorry about this, I added this accidentally.

You had suggested the code change, I accidentally took it as a Signed-off.

Thanks,
Ashish

^ permalink raw reply

* Re: [PATCH v5 2/7] x86/msr: add wrmsrq_on_cpus helper
From: Dave Hansen @ 2026-05-18 22:04 UTC (permalink / raw)
  To: Ashish Kalra, tglx, mingo, bp, dave.hansen, x86, hpa, seanjc,
	peterz, thomas.lendacky, herbert, davem, ardb
  Cc: pbonzini, aik, Michael.Roth, KPrateek.Nayak, Tycho.Andersen,
	Nathan.Fontenot, ackerleytng, jackyli, pgonda, rientjes, jacobhxu,
	xin, pawan.kumar.gupta, babu.moger, dyoung, nikunj, john.allen,
	darwi, linux-kernel, linux-crypto, kvm, linux-coco
In-Reply-To: <c9fe5c2fef063f5006cc9bfa03eec824ac015db7.1779133590.git.ashish.kalra@amd.com>

On 5/18/26 14:42, Ashish Kalra wrote:
> Co-developed-by: Dave Hansen <dave.hansen@linux.intel.com>
> Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com>
> Reviewed-by: Ackerley Tng <ackerleytng@google.com>
> Signed-off-by: Ashish Kalra <ashish.kalra@amd.com>

Hi Ashish,

Sorry if my memory fails me, but I don't remember signing off on this.
Could you point me to the place where I gave you my Signed-off-by?

^ permalink raw reply

* Re: [PATCH v3 02/41] x86/tsc: Add helper to register CPU and TSC freq calibration routines
From: Woodhouse, David @ 2026-05-18 21:59 UTC (permalink / raw)
  To: tglx@kernel.org, longli@microsoft.com, luto@kernel.org,
	alexey.makhalov@broadcom.com, jstultz@google.com,
	dave.hansen@linux.intel.com, ajay.kaher@broadcom.com,
	jan.kiszka@siemens.com, haiyangz@microsoft.com, kas@kernel.org,
	seanjc@google.com, pbonzini@redhat.com, kys@microsoft.com,
	decui@microsoft.com, daniel.lezcano@kernel.org,
	wei.liu@kernel.org, peterz@infradead.org, jgross@suse.com
  Cc: boris.ostrovsky@oracle.com, linux-coco@lists.linux.dev,
	kvm@vger.kernel.org, mhklinux@outlook.com,
	thomas.lendacky@amd.com, linux-kernel@vger.kernel.org,
	bcm-kernel-feedback-list@broadcom.com, tglx@linutronix.de,
	nikunj@amd.com, xen-devel@lists.xenproject.org,
	linux-hyperv@vger.kernel.org, vkuznets@redhat.com,
	rick.p.edgecombe@intel.com, virtualization@lists.linux.dev,
	sboyd@kernel.org, x86@kernel.org
In-Reply-To: <20260515191942.1892718-3-seanjc@google.com>


On Fri, 2026-05-15 at 12:19 -0700, Sean Christopherson wrote:
> 
> --- a/arch/x86/xen/time.c
> +++ b/arch/x86/xen/time.c
> @@ -569,7 +569,7 @@ static void __init xen_init_time_common(void)
>  	static_call_update(pv_steal_clock, xen_steal_clock);
>  	paravirt_set_sched_clock(xen_sched_clock);
>  
> -	x86_platform.calibrate_tsc = xen_tsc_khz;
> +	tsc_register_calibration_routines(xen_tsc_khz, NULL);
>  	x86_platform.get_wallclock = xen_get_wallclock;
>  }
>  

xen_tsc_khz() doesn't use CPUID but really *should*.

Care to pull in
https://lore.kernel.org/all/20260509224824.3264567-31-dwmw2@infradead.org/
to your next round please?

(Without the misplaced changes in kvm/x86.c that should have been in
two different prior commits, and are now folded into those correctly in
my kvmclock5 branch ready for the next posting of that).

I'll drop that patch, and the similar x86/kvm one which you *have*
already taken in this series, from my next posting.

Thanks.


^ permalink raw reply

* [PATCH v5 7/7] x86/sev: Add debugfs support for RMPOPT
From: Ashish Kalra @ 2026-05-18 21:43 UTC (permalink / raw)
  To: tglx, mingo, bp, dave.hansen, x86, hpa, seanjc, peterz,
	thomas.lendacky, herbert, davem, ardb
  Cc: pbonzini, aik, Michael.Roth, KPrateek.Nayak, Tycho.Andersen,
	Nathan.Fontenot, ackerleytng, jackyli, pgonda, rientjes, jacobhxu,
	xin, pawan.kumar.gupta, babu.moger, dyoung, nikunj, john.allen,
	darwi, linux-kernel, linux-crypto, kvm, linux-coco
In-Reply-To: <cover.1779133590.git.ashish.kalra@amd.com>

From: Ashish Kalra <ashish.kalra@amd.com>

Add a debugfs interface to report per-CPU RMPOPT status across all
system RAM.

To dump the per-CPU RMPOPT status for all system RAM:

/sys/kernel/debug/rmpopt# cat rmpopt-table

Memory @  0GB: CPU(s): none
Memory @  1GB: CPU(s): none
Memory @  2GB: CPU(s): 0-1023
Memory @  3GB: CPU(s): 0-1023
Memory @  4GB: CPU(s): none
Memory @  5GB: CPU(s): 0-1023
Memory @  6GB: CPU(s): 0-1023
Memory @  7GB: CPU(s): 0-1023
...
Memory @1025GB: CPU(s): 0-1023
Memory @1026GB: CPU(s): 0-1023
Memory @1027GB: CPU(s): 0-1023
Memory @1028GB: CPU(s): 0-1023
Memory @1029GB: CPU(s): 0-1023
Memory @1030GB: CPU(s): 0-1023
Memory @1031GB: CPU(s): 0-1023
Memory @1032GB: CPU(s): 0-1023
Memory @1033GB: CPU(s): 0-1023
Memory @1034GB: CPU(s): 0-1023
Memory @1035GB: CPU(s): 0-1023
Memory @1036GB: CPU(s): 0-1023
Memory @1037GB: CPU(s): 0-1023
Memory @1038GB: CPU(s): none
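
As a rough user-space illustration of how each row label in the listing above
could be produced (a sketch only, not the code in this patch): the patch
derives the shift as get_order(SZ_1G) + PAGE_SHIFT, which works out to 30 with
4KB pages. The names GB_SHIFT and format_region_label() are hypothetical.

```c
#include <assert.h>
#include <stdint.h>
#include <stdio.h>

/* Assumption: 4KB pages, so get_order(SZ_1G) + PAGE_SHIFT == 18 + 12 == 30. */
#define GB_SHIFT 30

/*
 * Hypothetical helper: format the "Memory @<N>GB: " prefix for the 1GB
 * region that contains 'paddr'. Returns the number of characters that
 * the full label needs, as snprintf() does.
 */
static int format_region_label(char *buf, size_t len, uint64_t paddr)
{
	assert(buf != NULL && len > 0);

	/* %3llu pads small indices to three columns; larger ones just grow. */
	return snprintf(buf, len, "Memory @%3lluGB: ",
			(unsigned long long)(paddr >> GB_SHIFT));
}
```

The field width is a minimum, which is why the rows above stay aligned up to
999GB and then widen by one column from 1000GB onwards.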

Suggested-by: Thomas Lendacky <thomas.lendacky@amd.com>
Signed-off-by: Ashish Kalra <ashish.kalra@amd.com>
---
 arch/x86/virt/svm/sev.c | 121 ++++++++++++++++++++++++++++++++++++++++
 1 file changed, 121 insertions(+)

diff --git a/arch/x86/virt/svm/sev.c b/arch/x86/virt/svm/sev.c
index 7f8bb09844c1..ac414143feed 100644
--- a/arch/x86/virt/svm/sev.c
+++ b/arch/x86/virt/svm/sev.c
@@ -20,6 +20,8 @@
 #include <linux/amd-iommu.h>
 #include <linux/nospec.h>
 #include <linux/workqueue.h>
+#include <linux/debugfs.h>
+#include <linux/seq_file.h>
 
 #include <asm/sev.h>
 #include <asm/processor.h>
@@ -144,6 +146,15 @@ static DEFINE_SPINLOCK(snp_leaked_pages_list_lock);
 
 static unsigned long snp_nr_leaked_pages;
 
+/* All users of rmpopt_report_cpumask must hold rmpopt_show_mutex. */
+static cpumask_t rmpopt_report_cpumask;
+static struct dentry *rmpopt_debugfs;
+static DEFINE_MUTEX(rmpopt_show_mutex);
+
+struct seq_paddr {
+	phys_addr_t next_seq_paddr;
+};
+
 #undef pr_fmt
 #define pr_fmt(fmt)	"SEV-SNP: " fmt
 
@@ -583,6 +594,8 @@ static void rmpopt_cleanup(void)
 
 	cancel_delayed_work_sync(&rmpopt_delayed_work);
 	destroy_workqueue(rmpopt_wq);
+	debugfs_remove_recursive(rmpopt_debugfs);
+	rmpopt_debugfs = NULL;
 
 	cpus_read_lock();
 	wrmsrq_on_cpus(&rmpopt_cpumask, MSR_AMD64_RMPOPT_BASE, 0);
@@ -617,6 +630,10 @@ static inline bool __rmpopt(u64 rax, u64 rcx)
 		     : "a" (rax), "c" (rcx)
 		     : "memory", "cc");
 
+	if (rcx == RMPOPT_FUNC_REPORT_STATUS)
+		assign_cpu(smp_processor_id(), &rmpopt_report_cpumask,
+			   optimized);
+
 	return optimized;
 }
 
@@ -636,6 +653,108 @@ static void rmpopt_smp(void *val)
 	rmpopt((u64)val);
 }
 
+/*
+ * 'val' is a system physical address.
+ */
+static void rmpopt_report_status(void *val)
+{
+	u64 rax = ALIGN_DOWN((u64)val, SZ_1G);
+	u64 rcx = RMPOPT_FUNC_REPORT_STATUS;
+
+	__rmpopt(rax, rcx);
+}
+
+/*
+ * start() can be called multiple times if the allocated buffer has
+ * overflowed and a bigger buffer is allocated.
+ */
+static void *rmpopt_table_seq_start(struct seq_file *seq, loff_t *pos)
+{
+	phys_addr_t end_paddr = rmpopt_pa_end;
+	struct seq_paddr *p = seq->private;
+
+	if (*pos == 0) {
+		p->next_seq_paddr = rmpopt_pa_start;
+		return &p->next_seq_paddr;
+	}
+
+	if (p->next_seq_paddr == end_paddr)
+		return NULL;
+
+	return &p->next_seq_paddr;
+}
+
+static void *rmpopt_table_seq_next(struct seq_file *seq, void *v, loff_t *pos)
+{
+	phys_addr_t end_paddr = rmpopt_pa_end;
+	phys_addr_t *curr_paddr = v;
+
+	(*pos)++;
+	*curr_paddr += SZ_1G;
+	if (*curr_paddr >= end_paddr)
+		return NULL;
+
+	return curr_paddr;
+}
+
+static void rmpopt_table_seq_stop(struct seq_file *seq, void *v)
+{
+}
+
+static int rmpopt_table_seq_show(struct seq_file *seq, void *v)
+{
+	phys_addr_t *curr_paddr = v;
+
+	guard(mutex)(&rmpopt_show_mutex);
+
+	seq_printf(seq, "Memory @%3lluGB: ",
+		   *curr_paddr >> (get_order(SZ_1G) + PAGE_SHIFT));
+
+	/*
+	 * Query all online CPUs rather than just rmpopt_cpumask (primary
+	 * threads only). The RMPOPT instruction only needs to run on one
+	 * thread per core for the optimization to take effect, but debugfs
+	 * reporting requires the RMPOPT status across all CPUs.
+	 * Performance is not a concern for this diagnostic interface.
+	 */
+	cpumask_clear(&rmpopt_report_cpumask);
+	on_each_cpu_mask(cpu_online_mask, rmpopt_report_status,
+			 (void *)*curr_paddr, true);
+
+	if (cpumask_empty(&rmpopt_report_cpumask))
+		seq_puts(seq, "CPU(s): none\n");
+	else
+		seq_printf(seq, "CPU(s): %*pbl\n", cpumask_pr_args(&rmpopt_report_cpumask));
+
+	return 0;
+}
+
+static const struct seq_operations rmpopt_table_seq_ops = {
+	.start = rmpopt_table_seq_start,
+	.next = rmpopt_table_seq_next,
+	.stop = rmpopt_table_seq_stop,
+	.show = rmpopt_table_seq_show
+};
+
+static int rmpopt_table_open(struct inode *inode, struct file *file)
+{
+	return seq_open_private(file, &rmpopt_table_seq_ops, sizeof(struct seq_paddr));
+}
+
+static const struct file_operations rmpopt_table_fops = {
+	.open = rmpopt_table_open,
+	.read = seq_read,
+	.release = seq_release_private,
+};
+
+static void rmpopt_debugfs_setup(void)
+{
+	rmpopt_debugfs = debugfs_create_dir("rmpopt", arch_debugfs_dir);
+
+	debugfs_create_file("rmpopt-table", 0444, rmpopt_debugfs,
+			    NULL, &rmpopt_table_fops);
+}
+
 /*
  * RMPOPT optimizations skip RMP checks at 1GB granularity if this
  * range of memory does not contain any SNP guest memory.
@@ -798,6 +917,8 @@ void snp_setup_rmpopt(void)
 	 * optimizations on all physical memory.
 	 */
 	queue_delayed_work(rmpopt_wq, &rmpopt_delayed_work, 0);
+
+	rmpopt_debugfs_setup();
 }
 EXPORT_SYMBOL_FOR_MODULES(snp_setup_rmpopt, "ccp");
 
-- 
2.43.0


^ permalink raw reply related

* [PATCH v5 6/7] KVM: SEV: Perform RMP optimizations on SNP guest shutdown
From: Ashish Kalra @ 2026-05-18 21:43 UTC (permalink / raw)
  To: tglx, mingo, bp, dave.hansen, x86, hpa, seanjc, peterz,
	thomas.lendacky, herbert, davem, ardb
  Cc: pbonzini, aik, Michael.Roth, KPrateek.Nayak, Tycho.Andersen,
	Nathan.Fontenot, ackerleytng, jackyli, pgonda, rientjes, jacobhxu,
	xin, pawan.kumar.gupta, babu.moger, dyoung, nikunj, john.allen,
	darwi, linux-kernel, linux-crypto, kvm, linux-coco
In-Reply-To: <cover.1779133590.git.ashish.kalra@amd.com>

From: Ashish Kalra <ashish.kalra@amd.com>

Pages are converted from shared to private as SNP guests are launched.
This destroys existing RMPOPT optimizations in the regions where
pages are converted.

Conversely, guest pages are converted back to shared during SNP guest
termination and their region may become eligible for RMPOPT
optimization.

To take advantage of this, perform RMPOPT after guest termination.
Do it after a delay so that a single RMPOPT pass can be done if
multiple guests terminate in a short period of time.
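
The coalescing effect described above can be modelled in user space roughly
as follows. This is a simplified sketch, not kernel code: it assumes the
behaviour of queue_delayed_work(), which leaves an already-pending work item
untouched and returns false. struct delayed_pass, request_pass() and tick()
are hypothetical names; RMPOPT_DELAY_MS mirrors RMPOPT_WORK_TIMEOUT from this
series.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define RMPOPT_DELAY_MS 10000	/* mirrors RMPOPT_WORK_TIMEOUT (10 s) */

/* Hypothetical stand-in for a delayed work item. */
struct delayed_pass {
	bool pending;			/* a pass is queued but has not run */
	uint64_t due_ms;		/* time at which the queued pass runs */
	unsigned int passes_run;	/* completed RMPOPT passes */
};

/*
 * Model of a guest-termination request: queue a pass unless one is
 * already pending. Returns false when the request was coalesced.
 */
static bool request_pass(struct delayed_pass *w, uint64_t now_ms)
{
	assert(w != NULL);

	if (w->pending)
		return false;	/* timer is not re-armed */

	w->pending = true;
	w->due_ms = now_ms + RMPOPT_DELAY_MS;
	return true;
}

/* Model of the timer: run the pending pass once its delay has elapsed. */
static void tick(struct delayed_pass *w, uint64_t now_ms)
{
	assert(w != NULL);

	if (w->pending && now_ms >= w->due_ms) {
		w->pending = false;
		w->passes_run++;
	}
}
```

In this model, three guests terminating at 0 s, 2 s and 9 s trigger a single
pass at 10 s; a guest terminating after that pass queues a new one.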

Acked-by: Dave Hansen <dave.hansen@linux.intel.com>
Reviewed-by: Ackerley Tng <ackerleytng@google.com>
Signed-off-by: Ashish Kalra <ashish.kalra@amd.com>
---
 arch/x86/kvm/svm/sev.c | 2 ++
 1 file changed, 2 insertions(+)

diff --git a/arch/x86/kvm/svm/sev.c b/arch/x86/kvm/svm/sev.c
index e107f368ed2d..29af6f6e603c 100644
--- a/arch/x86/kvm/svm/sev.c
+++ b/arch/x86/kvm/svm/sev.c
@@ -3005,6 +3005,8 @@ void sev_vm_destroy(struct kvm *kvm)
 		 */
 		if (snp_decommission_context(kvm))
 			return;
+
+		snp_rmpopt_all_physmem();
 	} else {
 		sev_unbind_asid(kvm, sev->handle);
 	}
-- 
2.43.0


^ permalink raw reply related

* [PATCH v5 5/7] x86/sev: Add interface to re-enable RMP optimizations.
From: Ashish Kalra @ 2026-05-18 21:43 UTC (permalink / raw)
  To: tglx, mingo, bp, dave.hansen, x86, hpa, seanjc, peterz,
	thomas.lendacky, herbert, davem, ardb
  Cc: pbonzini, aik, Michael.Roth, KPrateek.Nayak, Tycho.Andersen,
	Nathan.Fontenot, ackerleytng, jackyli, pgonda, rientjes, jacobhxu,
	xin, pawan.kumar.gupta, babu.moger, dyoung, nikunj, john.allen,
	darwi, linux-kernel, linux-crypto, kvm, linux-coco
In-Reply-To: <cover.1779133590.git.ashish.kalra@amd.com>

From: Ashish Kalra <ashish.kalra@amd.com>

The RMPOPT table is a per-CPU table that indicates whether each 1GB
region of physical memory is entirely hypervisor-owned.

When performing host memory accesses in hypervisor mode as well as
non-SNP guest mode, the processor may consult the RMPOPT table to
potentially skip an RMP access and improve performance.

Events such as RMPUPDATE can clear RMP optimizations. Add an interface
to re-enable those optimizations.

Reviewed-by: Ackerley Tng <ackerleytng@google.com>
Signed-off-by: Ashish Kalra <ashish.kalra@amd.com>
---
 arch/x86/include/asm/sev.h |  2 ++
 arch/x86/virt/svm/sev.c    | 15 +++++++++++++++
 2 files changed, 17 insertions(+)

diff --git a/arch/x86/include/asm/sev.h b/arch/x86/include/asm/sev.h
index 6fd72a44a51e..09b1c5d33790 100644
--- a/arch/x86/include/asm/sev.h
+++ b/arch/x86/include/asm/sev.h
@@ -662,6 +662,7 @@ static inline void snp_leak_pages(u64 pfn, unsigned int pages)
 	__snp_leak_pages(pfn, pages, true);
 }
 int snp_prepare(void);
+void snp_rmpopt_all_physmem(void);
 void snp_setup_rmpopt(void);
 void snp_shutdown(void);
 #else
@@ -681,6 +682,7 @@ static inline void snp_leak_pages(u64 pfn, unsigned int npages) {}
 static inline void kdump_sev_callback(void) { }
 static inline void snp_fixup_e820_tables(void) {}
 static inline int snp_prepare(void) { return -ENODEV; }
+static inline void snp_rmpopt_all_physmem(void) {}
 static inline void snp_setup_rmpopt(void) {}
 static inline void snp_shutdown(void) {}
 #endif
diff --git a/arch/x86/virt/svm/sev.c b/arch/x86/virt/svm/sev.c
index 8876cac052d5..7f8bb09844c1 100644
--- a/arch/x86/virt/svm/sev.c
+++ b/arch/x86/virt/svm/sev.c
@@ -707,6 +707,21 @@ static void rmpopt_work_handler(struct work_struct *work)
 		cpumask_set_cpu(this_cpu, &rmpopt_cpumask);
 }
 
+void snp_rmpopt_all_physmem(void)
+{
+	if (!cpu_feature_enabled(X86_FEATURE_RMPOPT))
+		return;
+
+	guard(mutex)(&rmpopt_wq_mutex);
+
+	if (!rmpopt_wq)
+		return;
+
+	queue_delayed_work(rmpopt_wq, &rmpopt_delayed_work,
+			   msecs_to_jiffies(RMPOPT_WORK_TIMEOUT));
+}
+EXPORT_SYMBOL_GPL(snp_rmpopt_all_physmem);
+
 void snp_setup_rmpopt(void)
 {
 	u64 rmpopt_base;
-- 
2.43.0


^ permalink raw reply related

* [PATCH v5 4/7] x86/sev: Add support to perform RMP optimizations asynchronously
From: Ashish Kalra @ 2026-05-18 21:42 UTC (permalink / raw)
  To: tglx, mingo, bp, dave.hansen, x86, hpa, seanjc, peterz,
	thomas.lendacky, herbert, davem, ardb
  Cc: pbonzini, aik, Michael.Roth, KPrateek.Nayak, Tycho.Andersen,
	Nathan.Fontenot, ackerleytng, jackyli, pgonda, rientjes, jacobhxu,
	xin, pawan.kumar.gupta, babu.moger, dyoung, nikunj, john.allen,
	darwi, linux-kernel, linux-crypto, kvm, linux-coco
In-Reply-To: <cover.1779133590.git.ashish.kalra@amd.com>

From: Ashish Kalra <ashish.kalra@amd.com>

When SEV-SNP is enabled, all writes to memory are checked to ensure
integrity of SNP guest memory. This imposes performance overhead on the
whole system.

RMPOPT is a new instruction that minimizes the performance overhead of
RMP checks on the hypervisor and on non-SNP guests by allowing RMP
checks to be skipped for 1GB regions of memory that are known not to
contain any SEV-SNP guest memory.

Add support for performing RMP optimizations asynchronously using a
dedicated workqueue.

Enable RMPOPT optimizations globally for all system RAM up to 2TB at
RMP initialization time. RMP checks can initially be skipped for 1GB
memory ranges that do not contain SEV-SNP guest memory (excluding
preassigned pages such as the RMP table and firmware pages). As SNP
guests are launched, RMPUPDATE will disable the corresponding RMPOPT
optimizations.
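
The address range arithmetic implied above (1GB alignment plus the 2TB cap)
can be sketched in user space as follows. This is an illustration under
stated assumptions, not the patch itself: ram_start and ram_end stand in for
PFN_PHYS(min_low_pfn) and PFN_PHYS(max_pfn), and struct rmpopt_range,
rmpopt_compute_range() and rmpopt_nr_regions() are hypothetical names.

```c
#include <assert.h>
#include <stdint.h>

#define SZ_1G (1ULL << 30)
#define SZ_2T (1ULL << 41)

struct rmpopt_range {
	uint64_t start;	/* inclusive, 1GB aligned */
	uint64_t end;	/* exclusive, 1GB aligned */
};

static uint64_t align_down_1g(uint64_t pa)
{
	return pa & ~(SZ_1G - 1);
}

static uint64_t align_up_1g(uint64_t pa)
{
	return (pa + SZ_1G - 1) & ~(SZ_1G - 1);
}

/* Round the RAM span out to 1GB boundaries, then cap it at 2TB. */
static struct rmpopt_range rmpopt_compute_range(uint64_t ram_start,
						uint64_t ram_end)
{
	struct rmpopt_range r;

	assert(ram_end > ram_start);

	r.start = align_down_1g(ram_start);
	r.end = align_up_1g(ram_end);
	if (r.end - r.start > SZ_2T)
		r.end = r.start + SZ_2T;

	return r;
}

/* Number of 1GB regions, i.e. RMPOPT invocations per CPU per pass. */
static uint64_t rmpopt_nr_regions(struct rmpopt_range r)
{
	return (r.end - r.start) / SZ_1G;
}
```

With the cap in place, a pass issues at most 2048 RMPOPT instructions per
participating CPU, regardless of how much RAM is installed.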

Suggested-by: Thomas Lendacky <thomas.lendacky@amd.com>
Suggested-by: Dave Hansen <dave.hansen@linux.intel.com>
Reviewed-by: Ackerley Tng <ackerleytng@google.com>
Signed-off-by: Ashish Kalra <ashish.kalra@amd.com>
---
 arch/x86/virt/svm/sev.c | 167 +++++++++++++++++++++++++++++++++++++++-
 1 file changed, 164 insertions(+), 3 deletions(-)

diff --git a/arch/x86/virt/svm/sev.c b/arch/x86/virt/svm/sev.c
index 82f9dc7a57c3..8876cac052d5 100644
--- a/arch/x86/virt/svm/sev.c
+++ b/arch/x86/virt/svm/sev.c
@@ -19,6 +19,7 @@
 #include <linux/iommu.h>
 #include <linux/amd-iommu.h>
 #include <linux/nospec.h>
+#include <linux/workqueue.h>
 
 #include <asm/sev.h>
 #include <asm/processor.h>
@@ -125,7 +126,18 @@ static void *rmp_bookkeeping __ro_after_init;
 static u64 probed_rmp_base, probed_rmp_size;
 
 static cpumask_t rmpopt_cpumask;
-static phys_addr_t rmpopt_pa_start;
+static phys_addr_t rmpopt_pa_start, rmpopt_pa_end;
+
+enum rmpopt_function {
+	RMPOPT_FUNC_VERIFY_AND_REPORT_STATUS,
+	RMPOPT_FUNC_REPORT_STATUS
+};
+
+#define RMPOPT_WORK_TIMEOUT	10000
+
+static struct workqueue_struct *rmpopt_wq;
+static struct delayed_work rmpopt_delayed_work;
+static DEFINE_MUTEX(rmpopt_wq_mutex);
 
 static LIST_HEAD(snp_leaked_pages_list);
 static DEFINE_SPINLOCK(snp_leaked_pages_list_lock);
@@ -564,12 +576,21 @@ EXPORT_SYMBOL_FOR_MODULES(snp_prepare, "ccp");
 
 static void rmpopt_cleanup(void)
 {
+	guard(mutex)(&rmpopt_wq_mutex);
+
+	if (!rmpopt_wq)
+		return;
+
+	cancel_delayed_work_sync(&rmpopt_delayed_work);
+	destroy_workqueue(rmpopt_wq);
+
 	cpus_read_lock();
 	wrmsrq_on_cpus(&rmpopt_cpumask, MSR_AMD64_RMPOPT_BASE, 0);
 	cpus_read_unlock();
 
 	cpumask_clear(&rmpopt_cpumask);
-	rmpopt_pa_start = 0;
+	rmpopt_pa_start = rmpopt_pa_end = 0;
+	rmpopt_wq = NULL;
 }
 
 void snp_shutdown(void)
@@ -587,6 +608,105 @@ void snp_shutdown(void)
 }
 EXPORT_SYMBOL_FOR_MODULES(snp_shutdown, "ccp");
 
+static inline bool __rmpopt(u64 rax, u64 rcx)
+{
+	bool optimized;
+
+	asm volatile(".byte 0xf2, 0x0f, 0x01, 0xfc"
+		     : "=@ccc" (optimized)
+		     : "a" (rax), "c" (rcx)
+		     : "memory", "cc");
+
+	return optimized;
+}
+
+static void rmpopt(u64 pa)
+{
+	u64 rax = ALIGN_DOWN(pa, SZ_1G);
+	u64 rcx = RMPOPT_FUNC_VERIFY_AND_REPORT_STATUS;
+
+	__rmpopt(rax, rcx);
+}
+
+/*
+ * 'val' is a system physical address.
+ */
+static void rmpopt_smp(void *val)
+{
+	rmpopt((u64)val);
+}
+
+/*
+ * RMPOPT optimizations skip RMP checks at 1GB granularity if this
+ * range of memory does not contain any SNP guest memory.
+ */
+static void rmpopt_work_handler(struct work_struct *work)
+{
+	bool current_cpu_cleared = false;
+	phys_addr_t pa;
+	int this_cpu;
+
+	pr_info("Attempt RMP optimizations on physical address range @1GB alignment [0x%016llx - 0x%016llx]\n",
+		rmpopt_pa_start, rmpopt_pa_end);
+
+	/*
+	 * RMPOPT scans the RMP table and stores the result of the scan in
+	 * reserved processor memory. The RMP scan is the most expensive
+	 * part. If a second RMPOPT occurs, it can skip the expensive scan
+	 * if it can see a cached result in the reserved processor memory.
+	 *
+	 * Do RMPOPT on one CPU alone. Then, follow that up with RMPOPT
+	 * on every other primary thread. This potentially allows the
+	 * followers to use the "cached" scan results to avoid repeating
+	 * full scans.
+	 */
+
+	/*
+	 * Pin the worker to the current CPU for the leader loop so that
+	 * this_cpu remains valid and the RMPOPT instruction executes on
+	 * the CPU that was cleared from the cpumask.  The workqueue is
+	 * WQ_UNBOUND, so without pinning, the scheduler could migrate
+	 * the worker between the cpumask manipulation and the leader
+	 * loop, causing the leader to run on a different CPU while
+	 * this_cpu's core is skipped entirely.
+	 *
+	 * Use migrate_disable() rather than get_cpu() to prevent
+	 * migration while still allowing preemption.
+	 *
+	 * Note: rmpopt_cpumask is modified here without holding
+	 * rmpopt_wq_mutex.  This is safe because the delayed_work
+	 * mechanism guarantees single-threaded execution of this
+	 * handler, and rmpopt_cleanup() calls cancel_delayed_work_sync()
+	 * to ensure handler completion before tearing down the cpumask.
+	 */
+	migrate_disable();
+	this_cpu = smp_processor_id();
+	if (cpumask_test_cpu(this_cpu, &rmpopt_cpumask)) {
+		cpumask_clear_cpu(this_cpu, &rmpopt_cpumask);
+		current_cpu_cleared = true;
+	}
+
+	/* Leader: prime the RMPOPT cache on this CPU */
+	for (pa = rmpopt_pa_start; pa < rmpopt_pa_end; pa += SZ_1G)
+		rmpopt(pa);
+
+	migrate_enable();
+
+	/* Followers: run RMPOPT on all other cores */
+	cpus_read_lock();
+	for (pa = rmpopt_pa_start; pa < rmpopt_pa_end; pa += SZ_1G) {
+		on_each_cpu_mask(&rmpopt_cpumask, rmpopt_smp,
+				 (void *)pa, true);
+
+		/* Give other threads a chance to run */
+		cond_resched();
+	}
+	cpus_read_unlock();
+
+	if (current_cpu_cleared)
+		cpumask_set_cpu(this_cpu, &rmpopt_cpumask);
+}
+
 void snp_setup_rmpopt(void)
 {
 	u64 rmpopt_base;
@@ -595,11 +715,35 @@ void snp_setup_rmpopt(void)
 	if (!cpu_feature_enabled(X86_FEATURE_RMPOPT))
 		return;
 
+	guard(mutex)(&rmpopt_wq_mutex);
+
+	/*
+	 * Guard against re-initialization.  When SNP_SHUTDOWN_EX is issued
+	 * with x86_snp_shutdown=0, snp_shutdown() is not called and
+	 * rmpopt_cleanup() is skipped, but snp_initialized is still cleared.
+	 * A subsequent __sev_snp_init_locked() would call snp_setup_rmpopt()
+	 * again, leaking the existing workqueue, delayed work, debugfs
+	 * entries, and cpumask state.
+	 */
+	if (rmpopt_wq)
+		return;
+
+	/*
+	 * Create an RMPOPT-specific workqueue to avoid scheduling
+	 * RMPOPT workitem on the global system workqueue.
+	 */
+	rmpopt_wq = alloc_workqueue("rmpopt_wq", WQ_UNBOUND, 1);
+	if (!rmpopt_wq) {
+		pr_err("Failed to allocate RMPOPT workqueue\n");
+		return;
+	}
+
 	cpus_read_lock();
 
 	/*
 	 * The RMPOPT_BASE MSR is per-core, so only one thread per core needs
-	 * to set up the RMPOPT_BASE MSR.
+	 * to set up the RMPOPT_BASE MSR. Likewise, only one thread per core
+	 * needs to issue the RMPOPT instruction.
 	 *
 	 * Note: only online primary threads are included.  If a core's
 	 * primary thread is offline, that core is not covered.  CPU hotplug
@@ -622,6 +766,23 @@ void snp_setup_rmpopt(void)
 	wrmsrq_on_cpus(&rmpopt_cpumask, MSR_AMD64_RMPOPT_BASE, rmpopt_base);
 
 	cpus_read_unlock();
+
+	INIT_DELAYED_WORK(&rmpopt_delayed_work, rmpopt_work_handler);
+
+	rmpopt_pa_end = ALIGN(PFN_PHYS(max_pfn), SZ_1G);
+
+	/* Limit memory scanning to 2TB of RAM */
+	if ((rmpopt_pa_end - rmpopt_pa_start) > SZ_2T) {
+		pr_info("RMPOPT coverage limited to 2TB; memory above 0x%llx not optimized\n",
+			rmpopt_pa_start + SZ_2T);
+		rmpopt_pa_end = rmpopt_pa_start + SZ_2T;
+	}
+
+	/*
+	 * Once all per-CPU RMPOPT tables have been configured, enable RMPOPT
+	 * optimizations on all physical memory.
+	 */
+	queue_delayed_work(rmpopt_wq, &rmpopt_delayed_work, 0);
 }
 EXPORT_SYMBOL_FOR_MODULES(snp_setup_rmpopt, "ccp");
 
-- 
2.43.0


^ permalink raw reply related

* [PATCH v5 3/7] x86/sev: Initialize RMPOPT configuration MSRs
From: Ashish Kalra @ 2026-05-18 21:42 UTC (permalink / raw)
  To: tglx, mingo, bp, dave.hansen, x86, hpa, seanjc, peterz,
	thomas.lendacky, herbert, davem, ardb
  Cc: pbonzini, aik, Michael.Roth, KPrateek.Nayak, Tycho.Andersen,
	Nathan.Fontenot, ackerleytng, jackyli, pgonda, rientjes, jacobhxu,
	xin, pawan.kumar.gupta, babu.moger, dyoung, nikunj, john.allen,
	darwi, linux-kernel, linux-crypto, kvm, linux-coco
In-Reply-To: <cover.1779133590.git.ashish.kalra@amd.com>

From: Ashish Kalra <ashish.kalra@amd.com>

The new RMPOPT instruction helps manage per-CPU RMP optimization
structures inside the CPU. It takes a 1GB-aligned physical address
and either returns the status of the optimizations or tries to enable
the optimizations.

Per-CPU RMPOPT tables support at most 2 TB of addressable memory for
RMP optimizations.

Initialize the per-CPU RMPOPT table base to the starting physical
address. This enables RMP optimization for up to 2 TB of system RAM on
all CPUs.

Additionally, add support to set up and enable RMPOPT once SNP is
enabled and initialized.
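
A minimal user-space sketch of how the MSR value is composed, based on the
definitions in this patch (enable flag in bit 0, base rounded down to 1GB).
The helper names rmpopt_base_value() and rmpopt_covers() are hypothetical,
and the [base, base + 2TB) window is an interpretation of the description
above rather than a statement from the hardware documentation.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define SZ_1G		(1ULL << 30)
#define SZ_2T		(1ULL << 41)
#define RMPOPT_ENABLE	(1ULL << 0)	/* MSR_AMD64_RMPOPT_ENABLE in the patch */

/* Compose the RMPOPT_BASE value: 1GB-aligned base address plus enable bit. */
static uint64_t rmpopt_base_value(uint64_t first_ram_paddr)
{
	uint64_t base = first_ram_paddr & ~(SZ_1G - 1);

	/* A 1GB-aligned address never collides with the enable bit. */
	assert((base & RMPOPT_ENABLE) == 0);

	return base | RMPOPT_ENABLE;
}

/* Assumed coverage check: is 'paddr' inside the 2TB window of 'msr'? */
static bool rmpopt_covers(uint64_t msr, uint64_t paddr)
{
	uint64_t base = msr & ~RMPOPT_ENABLE;

	if (!(msr & RMPOPT_ENABLE))
		return false;

	return paddr >= base && paddr - base < SZ_2T;
}
```

Writing 0 to the MSR, as rmpopt_cleanup() does, clears the enable bit, so in
this model no address is covered afterwards.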

Suggested-by: Thomas Lendacky <thomas.lendacky@amd.com>
Suggested-by: Dave Hansen <dave.hansen@linux.intel.com>
Reviewed-by: Dave Hansen <dave.hansen@linux.intel.com>
Signed-off-by: Ashish Kalra <ashish.kalra@amd.com>
---
 arch/x86/coco/core.c             |  1 +
 arch/x86/include/asm/msr-index.h |  3 ++
 arch/x86/include/asm/sev.h       |  2 ++
 arch/x86/virt/svm/sev.c          | 59 +++++++++++++++++++++++++++++++-
 drivers/crypto/ccp/sev-dev.c     |  3 ++
 5 files changed, 67 insertions(+), 1 deletion(-)

diff --git a/arch/x86/coco/core.c b/arch/x86/coco/core.c
index 989ca9f72ba3..7fdef00ca8f2 100644
--- a/arch/x86/coco/core.c
+++ b/arch/x86/coco/core.c
@@ -172,6 +172,7 @@ static void amd_cc_platform_clear(enum cc_attr attr)
 	switch (attr) {
 	case CC_ATTR_HOST_SEV_SNP:
 		cc_flags.host_sev_snp = 0;
+		setup_clear_cpu_cap(X86_FEATURE_RMPOPT);
 		break;
 	default:
 		break;
diff --git a/arch/x86/include/asm/msr-index.h b/arch/x86/include/asm/msr-index.h
index 86554de9a3f5..28540744f1eb 100644
--- a/arch/x86/include/asm/msr-index.h
+++ b/arch/x86/include/asm/msr-index.h
@@ -761,6 +761,9 @@
 #define MSR_AMD64_SEG_RMP_ENABLED_BIT	0
 #define MSR_AMD64_SEG_RMP_ENABLED	BIT_ULL(MSR_AMD64_SEG_RMP_ENABLED_BIT)
 #define MSR_AMD64_RMP_SEGMENT_SHIFT(x)	(((x) & GENMASK_ULL(13, 8)) >> 8)
+#define MSR_AMD64_RMPOPT_BASE		0xc0010139
+#define MSR_AMD64_RMPOPT_ENABLE_BIT	0
+#define MSR_AMD64_RMPOPT_ENABLE		BIT_ULL(MSR_AMD64_RMPOPT_ENABLE_BIT)
 
 #define MSR_SVSM_CAA			0xc001f000
 
diff --git a/arch/x86/include/asm/sev.h b/arch/x86/include/asm/sev.h
index 594cfa19cbd4..6fd72a44a51e 100644
--- a/arch/x86/include/asm/sev.h
+++ b/arch/x86/include/asm/sev.h
@@ -662,6 +662,7 @@ static inline void snp_leak_pages(u64 pfn, unsigned int pages)
 	__snp_leak_pages(pfn, pages, true);
 }
 int snp_prepare(void);
+void snp_setup_rmpopt(void);
 void snp_shutdown(void);
 #else
 static inline bool snp_probe_rmptable_info(void) { return false; }
@@ -680,6 +681,7 @@ static inline void snp_leak_pages(u64 pfn, unsigned int npages) {}
 static inline void kdump_sev_callback(void) { }
 static inline void snp_fixup_e820_tables(void) {}
 static inline int snp_prepare(void) { return -ENODEV; }
+static inline void snp_setup_rmpopt(void) {}
 static inline void snp_shutdown(void) {}
 #endif
 
diff --git a/arch/x86/virt/svm/sev.c b/arch/x86/virt/svm/sev.c
index 8bcdce98f6dc..82f9dc7a57c3 100644
--- a/arch/x86/virt/svm/sev.c
+++ b/arch/x86/virt/svm/sev.c
@@ -124,6 +124,9 @@ static void *rmp_bookkeeping __ro_after_init;
 
 static u64 probed_rmp_base, probed_rmp_size;
 
+static cpumask_t rmpopt_cpumask;
+static phys_addr_t rmpopt_pa_start;
+
 static LIST_HEAD(snp_leaked_pages_list);
 static DEFINE_SPINLOCK(snp_leaked_pages_list_lock);
 
@@ -488,9 +491,13 @@ static bool __init setup_segmented_rmptable(void)
 static bool __init setup_rmptable(void)
 {
 	if (rmp_cfg & MSR_AMD64_SEG_RMP_ENABLED) {
-		if (!setup_segmented_rmptable())
+		if (!setup_segmented_rmptable()) {
+			setup_clear_cpu_cap(X86_FEATURE_RMPOPT);
 			return false;
+		}
 	} else {
+		/* Note that Segmented RMP must be enabled to enable RMPOPT. */
+		setup_clear_cpu_cap(X86_FEATURE_RMPOPT);
 		if (!setup_contiguous_rmptable())
 			return false;
 	}
@@ -555,6 +562,16 @@ int snp_prepare(void)
 }
 EXPORT_SYMBOL_FOR_MODULES(snp_prepare, "ccp");
 
+static void rmpopt_cleanup(void)
+{
+	cpus_read_lock();
+	wrmsrq_on_cpus(&rmpopt_cpumask, MSR_AMD64_RMPOPT_BASE, 0);
+	cpus_read_unlock();
+
+	cpumask_clear(&rmpopt_cpumask);
+	rmpopt_pa_start = 0;
+}
+
 void snp_shutdown(void)
 {
 	u64 syscfg;
@@ -563,11 +580,51 @@ void snp_shutdown(void)
 	if (syscfg & MSR_AMD64_SYSCFG_SNP_EN)
 		return;
 
+	rmpopt_cleanup();
+
 	clear_rmp();
 	on_each_cpu(mfd_reconfigure, NULL, 1);
 }
 EXPORT_SYMBOL_FOR_MODULES(snp_shutdown, "ccp");
 
+void snp_setup_rmpopt(void)
+{
+	u64 rmpopt_base;
+	int cpu;
+
+	if (!cpu_feature_enabled(X86_FEATURE_RMPOPT))
+		return;
+
+	cpus_read_lock();
+
+	/*
+	 * The RMPOPT_BASE MSR is per-core, so only one thread per core needs
+	 * to set up the RMPOPT_BASE MSR.
+	 *
+	 * Note: only online primary threads are included.  If a core's
+	 * primary thread is offline, that core is not covered.  CPU hotplug
+	 * is not currently supported with SNP enabled.
+	 */
+
+	for_each_online_cpu(cpu)
+		if (topology_is_primary_thread(cpu))
+			cpumask_set_cpu(cpu, &rmpopt_cpumask);
+
+	rmpopt_pa_start = ALIGN_DOWN(PFN_PHYS(min_low_pfn), SZ_1G);
+	rmpopt_base = rmpopt_pa_start | MSR_AMD64_RMPOPT_ENABLE;
+
+	/*
+	 * Per-CPU RMPOPT tables support at most 2 TB of addressable memory
+	 * for RMP optimizations. Initialize the per-CPU RMPOPT table base
+	 * to the starting physical address to enable RMP optimizations for
+	 * up to 2 TB of system RAM on all CPUs.
+	 */
+	wrmsrq_on_cpus(&rmpopt_cpumask, MSR_AMD64_RMPOPT_BASE, rmpopt_base);
+
+	cpus_read_unlock();
+}
+EXPORT_SYMBOL_FOR_MODULES(snp_setup_rmpopt, "ccp");
+
 /*
  * Do the necessary preparations which are verified by the firmware as
  * described in the SNP_INIT_EX firmware command description in the SNP
diff --git a/drivers/crypto/ccp/sev-dev.c b/drivers/crypto/ccp/sev-dev.c
index 78f98aee7a66..217b6b19802e 100644
--- a/drivers/crypto/ccp/sev-dev.c
+++ b/drivers/crypto/ccp/sev-dev.c
@@ -1478,6 +1478,9 @@ static int __sev_snp_init_locked(int *error, unsigned int max_snp_asid)
 	}
 
 	snp_hv_fixed_pages_state_update(sev, HV_FIXED);
+
+	snp_setup_rmpopt();
+
 	sev->snp_initialized = true;
 	dev_dbg(sev->dev, "SEV-SNP firmware initialized, SEV-TIO is %s\n",
 		data.tio_en ? "enabled" : "disabled");
-- 
2.43.0


* [PATCH v5 2/7] x86/msr: add wrmsrq_on_cpus helper
From: Ashish Kalra @ 2026-05-18 21:42 UTC (permalink / raw)
  To: tglx, mingo, bp, dave.hansen, x86, hpa, seanjc, peterz,
	thomas.lendacky, herbert, davem, ardb
  Cc: pbonzini, aik, Michael.Roth, KPrateek.Nayak, Tycho.Andersen,
	Nathan.Fontenot, ackerleytng, jackyli, pgonda, rientjes, jacobhxu,
	xin, pawan.kumar.gupta, babu.moger, dyoung, nikunj, john.allen,
	darwi, linux-kernel, linux-crypto, kvm, linux-coco
In-Reply-To: <cover.1779133590.git.ashish.kalra@amd.com>

From: Ashish Kalra <ashish.kalra@amd.com>

The existing wrmsr_on_cpus() takes a per-cpu struct msr array, requiring
callers to allocate and populate per-cpu storage even when every CPU
receives the same value. This is unnecessary overhead for the common
case of writing a single uniform u64 to a per-CPU MSR across multiple
CPUs.

Add wrmsrq_on_cpus() which writes the same u64 value to the specified
MSR on all CPUs in the given cpumask.
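
As a rough illustration of the difference in calling convention, here is a
user-space toy model (not kernel code). The names model_msr,
model_wrmsr_on_cpus() and model_wrmsrq_on_cpus(), the 8-CPU limit, and the
use of a plain bitmask in place of a cpumask are all assumptions made for
the sketch.

```c
#include <assert.h>
#include <stdint.h>

#define MODEL_NR_CPUS 8

/* One modelled 64-bit MSR per CPU; bit N of a mask selects CPU N. */
static uint64_t model_msr[MODEL_NR_CPUS];

/*
 * Per-CPU-array style: the caller has to populate vals[] for every CPU,
 * even when all entries end up identical.
 */
static void model_wrmsr_on_cpus(uint32_t mask,
				const uint64_t vals[MODEL_NR_CPUS])
{
	for (int cpu = 0; cpu < MODEL_NR_CPUS; cpu++)
		if (mask & (1u << cpu))
			model_msr[cpu] = vals[cpu];
}

/* Uniform style: a single u64 is written on every CPU in the mask. */
static void model_wrmsrq_on_cpus(uint32_t mask, uint64_t q)
{
	for (int cpu = 0; cpu < MODEL_NR_CPUS; cpu++)
		if (mask & (1u << cpu))
			model_msr[cpu] = q;
}
```

With the uniform variant, a caller that wants the same value on CPUs 0 and 2
passes the mask 0x5 and the value, with no per-CPU storage to set up first.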

Co-developed-by: Dave Hansen <dave.hansen@linux.intel.com>
Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com>
Reviewed-by: Ackerley Tng <ackerleytng@google.com>
Signed-off-by: Ashish Kalra <ashish.kalra@amd.com>
---
 arch/x86/include/asm/msr.h |  5 +++++
 arch/x86/lib/msr-smp.c     | 20 ++++++++++++++++++++
 2 files changed, 25 insertions(+)

diff --git a/arch/x86/include/asm/msr.h b/arch/x86/include/asm/msr.h
index 9c2ea29e12a9..f5f63b4115c8 100644
--- a/arch/x86/include/asm/msr.h
+++ b/arch/x86/include/asm/msr.h
@@ -260,6 +260,7 @@ int rdmsr_on_cpu(unsigned int cpu, u32 msr_no, u32 *l, u32 *h);
 int wrmsr_on_cpu(unsigned int cpu, u32 msr_no, u32 l, u32 h);
 int rdmsrq_on_cpu(unsigned int cpu, u32 msr_no, u64 *q);
 int wrmsrq_on_cpu(unsigned int cpu, u32 msr_no, u64 q);
+void wrmsrq_on_cpus(const struct cpumask *mask, u32 msr_no, u64 q);
 void rdmsr_on_cpus(const struct cpumask *mask, u32 msr_no, struct msr __percpu *msrs);
 void wrmsr_on_cpus(const struct cpumask *mask, u32 msr_no, struct msr __percpu *msrs);
 int rdmsr_safe_on_cpu(unsigned int cpu, u32 msr_no, u32 *l, u32 *h);
@@ -289,6 +290,10 @@ static inline int wrmsrq_on_cpu(unsigned int cpu, u32 msr_no, u64 q)
 	wrmsrq(msr_no, q);
 	return 0;
 }
+static inline void wrmsrq_on_cpus(const struct cpumask *mask, u32 msr_no, u64 q)
+{
+	wrmsrq_on_cpu(0, msr_no, q);
+}
 static inline void rdmsr_on_cpus(const struct cpumask *m, u32 msr_no,
 				struct msr __percpu *msrs)
 {
diff --git a/arch/x86/lib/msr-smp.c b/arch/x86/lib/msr-smp.c
index b8f63419e6ae..d2c91c9bb47b 100644
--- a/arch/x86/lib/msr-smp.c
+++ b/arch/x86/lib/msr-smp.c
@@ -94,6 +94,26 @@ int wrmsrq_on_cpu(unsigned int cpu, u32 msr_no, u64 q)
 }
 EXPORT_SYMBOL(wrmsrq_on_cpu);
 
+void wrmsrq_on_cpus(const struct cpumask *mask, u32 msr_no, u64 q)
+{
+	struct msr_info rv;
+	int this_cpu;
+
+	memset(&rv, 0, sizeof(rv));
+
+	rv.msr_no = msr_no;
+	rv.reg.q = q;
+
+	this_cpu = get_cpu();
+
+	if (cpumask_test_cpu(this_cpu, mask))
+		__wrmsr_on_cpu(&rv);
+
+	smp_call_function_many(mask, __wrmsr_on_cpu, &rv, 1);
+	put_cpu();
+}
+EXPORT_SYMBOL(wrmsrq_on_cpus);
+
 static void __rwmsr_on_cpus(const struct cpumask *mask, u32 msr_no,
 			    struct msr __percpu *msrs,
 			    void (*msr_func) (void *info))
-- 
2.43.0


* [PATCH v5 1/7] x86/cpufeatures: Add X86_FEATURE_AMD_RMPOPT feature flag
From: Ashish Kalra @ 2026-05-18 21:41 UTC (permalink / raw)
  To: tglx, mingo, bp, dave.hansen, x86, hpa, seanjc, peterz,
	thomas.lendacky, herbert, davem, ardb
  Cc: pbonzini, aik, Michael.Roth, KPrateek.Nayak, Tycho.Andersen,
	Nathan.Fontenot, ackerleytng, jackyli, pgonda, rientjes, jacobhxu,
	xin, pawan.kumar.gupta, babu.moger, dyoung, nikunj, john.allen,
	darwi, linux-kernel, linux-crypto, kvm, linux-coco
In-Reply-To: <cover.1779133590.git.ashish.kalra@amd.com>

From: Ashish Kalra <ashish.kalra@amd.com>

Add a flag indicating whether the RMPOPT instruction is supported.

RMPOPT is a new instruction designed to minimize the performance
overhead of RMP checks on the hypervisor and on non-SNP guests by
allowing RMP checks to be skipped when 1G regions of memory are known
not to contain any SEV-SNP guest memory.

For more information on the RMPOPT instruction, see the AMD64 RMPOPT
technical documentation.

Reviewed-by: Dave Hansen <dave.hansen@linux.intel.com>
Reviewed-by: Ackerley Tng <ackerleytng@google.com>
Signed-off-by: Ashish Kalra <ashish.kalra@amd.com>
---
 arch/x86/include/asm/cpufeatures.h | 2 +-
 arch/x86/kernel/cpu/scattered.c    | 1 +
 2 files changed, 2 insertions(+), 1 deletion(-)

diff --git a/arch/x86/include/asm/cpufeatures.h b/arch/x86/include/asm/cpufeatures.h
index 1d506e5d6f46..794cc96b8493 100644
--- a/arch/x86/include/asm/cpufeatures.h
+++ b/arch/x86/include/asm/cpufeatures.h
@@ -76,7 +76,7 @@
 #define X86_FEATURE_K8			( 3*32+ 4) /* Opteron, Athlon64 */
 #define X86_FEATURE_ZEN5		( 3*32+ 5) /* CPU based on Zen5 microarchitecture */
 #define X86_FEATURE_ZEN6		( 3*32+ 6) /* CPU based on Zen6 microarchitecture */
-/* Free                                 ( 3*32+ 7) */
+#define X86_FEATURE_RMPOPT		( 3*32+ 7) /* Support for AMD RMPOPT instruction */
 #define X86_FEATURE_CONSTANT_TSC	( 3*32+ 8) /* "constant_tsc" TSC ticks at a constant rate */
 #define X86_FEATURE_UP			( 3*32+ 9) /* "up" SMP kernel running on UP */
 #define X86_FEATURE_ART			( 3*32+10) /* "art" Always running timer (ART) */
diff --git a/arch/x86/kernel/cpu/scattered.c b/arch/x86/kernel/cpu/scattered.c
index 937129ce6a96..021c0bf22de2 100644
--- a/arch/x86/kernel/cpu/scattered.c
+++ b/arch/x86/kernel/cpu/scattered.c
@@ -67,6 +67,7 @@ static const struct cpuid_bit cpuid_bits[] = {
 	{ X86_FEATURE_PERFMON_V2,		CPUID_EAX,  0, 0x80000022, 0 },
 	{ X86_FEATURE_AMD_LBR_V2,		CPUID_EAX,  1, 0x80000022, 0 },
 	{ X86_FEATURE_AMD_LBR_PMC_FREEZE,	CPUID_EAX,  2, 0x80000022, 0 },
+	{ X86_FEATURE_RMPOPT,			CPUID_EDX,  0, 0x80000025, 0 },
 	{ X86_FEATURE_AMD_HTR_CORES,		CPUID_EAX, 30, 0x80000026, 0 },
 	{ 0, 0, 0, 0, 0 }
 };
-- 
2.43.0


* [PATCH v5 0/7] Add RMPOPT support.
From: Ashish Kalra @ 2026-05-18 21:41 UTC (permalink / raw)
  To: tglx, mingo, bp, dave.hansen, x86, hpa, seanjc, peterz,
	thomas.lendacky, herbert, davem, ardb
  Cc: pbonzini, aik, Michael.Roth, KPrateek.Nayak, Tycho.Andersen,
	Nathan.Fontenot, ackerleytng, jackyli, pgonda, rientjes, jacobhxu,
	xin, pawan.kumar.gupta, babu.moger, dyoung, nikunj, john.allen,
	darwi, linux-kernel, linux-crypto, kvm, linux-coco

From: Ashish Kalra <ashish.kalra@amd.com>

In the SEV-SNP architecture, hypervisor and non-SNP guests are subject
to RMP checks on writes to provide integrity of SEV-SNP guest memory.

The RMPOPT architecture enables optimizations whereby the RMP checks
can be skipped if 1GB regions of memory are known to not contain any
SNP guest memory.

RMPOPT is a new instruction designed to minimize the performance
overhead of RMP checks for the hypervisor and non-SNP guests.

The RMPOPT instruction currently supports two functions. For the verify
and report status function, the CPU reads the RMP contents and checks
that every RMP entry in the 1GB region starting at the provided SPA is
HV-owned (i.e., not in the assigned state). It then updates the RMPOPT
table to record whether the optimization has been enabled and indicates
to software whether the optimization succeeded.

For the report status function, the CPU returns the optimization
status for the 1GB region.

The RMPOPT table is managed by a combination of software and hardware.
Software uses the RMPOPT instruction to set bits in the table,
indicating that regions of memory are entirely HV-owned.  Hardware
automatically clears bits in the RMPOPT table when RMP contents are
changed by the RMPUPDATE instruction.

For more information on the RMPOPT instruction, see the AMD64 RMPOPT
technical documentation.
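
The software/hardware split described above can be pictured with a small
user-space toy model. This is only a sketch of the behavior as summarized
in this cover letter, not of the hardware or of the kernel code: the
struct, the model_* names, and the per-region assigned-page counter that
stands in for the RMP are all assumptions made for illustration.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>
#include <string.h>

#define SZ_1G			(1ULL << 30)
#define RMPOPT_MAX_REGIONS	2048	/* 2 TB of coverage / 1 GB per region */

/* Toy model: one flag and one assigned-page count per 1 GB region. */
struct rmpopt_model {
	uint64_t base;					/* 1 GB aligned start */
	bool	 optimized[RMPOPT_MAX_REGIONS];		/* stands in for the RMPOPT table */
	uint32_t assigned[RMPOPT_MAX_REGIONS];		/* stands in for RMP entry state */
};

static void model_init(struct rmpopt_model *m, uint64_t first_pa)
{
	memset(m, 0, sizeof(*m));
	m->base = first_pa & ~(SZ_1G - 1);	/* round down to a 1 GB boundary */
}

/* Return the region index for @pa, or -1 if @pa is outside the window. */
static int model_region(const struct rmpopt_model *m, uint64_t pa)
{
	if (pa < m->base || (pa - m->base) / SZ_1G >= RMPOPT_MAX_REGIONS)
		return -1;
	return (int)((pa - m->base) / SZ_1G);
}

/* "Verify and report status": set the flag only if nothing is assigned. */
static bool model_rmpopt_verify(struct rmpopt_model *m, uint64_t pa)
{
	int r = model_region(m, pa);

	if (r < 0)
		return false;
	m->optimized[r] = (m->assigned[r] == 0);
	return m->optimized[r];
}

/* "Report status": return the flag without re-checking the region. */
static bool model_rmpopt_report(const struct rmpopt_model *m, uint64_t pa)
{
	int r = model_region(m, pa);

	return r >= 0 && m->optimized[r];
}

/* RMPUPDATE: change one page's state; any change drops the region's flag. */
static void model_rmpupdate(struct rmpopt_model *m, uint64_t pa, bool assign)
{
	int r = model_region(m, pa);

	assert(r >= 0);
	if (assign) {
		m->assigned[r]++;
	} else {
		assert(m->assigned[r] > 0);
		m->assigned[r]--;
	}
	m->optimized[r] = false;
}
```

In this model a region stays unoptimized after its last assigned page is
released until model_rmpopt_verify() is run on it again, which mirrors why
the series re-runs RMPOPT after guest termination.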

Because SNP is enabled by default, the hypervisor and non-SNP guests are
subject to RMP write checks to provide integrity of SNP guest memory.

This series enables RMP optimizations for up to 2TB of system RAM
across the system and allows RMPUPDATE to disable those optimizations
as SNP guests are launched.

Support for more than 2TB of RAM will be added in a follow-on series.

This series also introduces support to re-enable RMP optimizations
during SNP guest termination, after guest pages have been converted
back to shared.

RMP optimizations are performed asynchronously by queuing work on a
dedicated workqueue after a 10 second delay.

Delaying work allows batching of multiple SNP guest terminations.
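
One way such batching can fall out of a delayed work item is sketched
below as a user-space toy model. It assumes semantics similar to
queue_delayed_work(), where a request made while the work is already
pending is folded into the pending run; whether the series relies on
exactly this is not shown here. The struct and the scan_* names are
hypothetical.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define RMPOPT_DELAY_SECS 10

/* Toy stand-in for a single delayed work item. */
struct delayed_scan {
	bool	 pending;	/* a run is already queued */
	uint64_t due;		/* time (in seconds) at which it may run */
	unsigned runs;		/* number of scans actually performed */
};

/*
 * Request a scan at time @now.  Returns true if a new run was queued,
 * false if the request was absorbed by one that is already pending.
 */
static bool scan_request(struct delayed_scan *s, uint64_t now)
{
	if (s->pending)
		return false;
	s->pending = true;
	s->due = now + RMPOPT_DELAY_SECS;
	return true;
}

/* Advance the model clock to @now and run the scan if it is due. */
static void scan_tick(struct delayed_scan *s, uint64_t now)
{
	if (s->pending && now >= s->due) {
		s->pending = false;
		s->runs++;
	}
}
```

In this model, three guest terminations at t=0, t=3 and t=9 result in a
single scan at t=10 rather than three separate scans.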

Once 1GB hugetlb guest_memfd support is merged, support for
re-enabling RMPOPT optimizations during 1GB page cleanup will be added
in a follow-on series.

Additionally, add a debugfs interface to report per-CPU RMPOPT status
across all system RAM.

v5:
- Introduce rmpopt_cleanup() to tear down workqueue, debugfs, cpumask,
  and MSR state, called from snp_shutdown().
- Introduce rmpopt_wq_mutex to serialize snp_setup_rmpopt(),
  snp_rmpopt_all_physmem(), and rmpopt_cleanup().
- Introduce rmpopt_show_mutex to serialize debugfs reporting of
  rmpopt_report_cpumask.
- Move snp_rmpopt_all_physmem() call after SNP DECOMMISSION during
  guest shutdown.
- Use migrate_disable()/migrate_enable() for CPU pinning in the
  rmpopt_work_handler() leader loop to maintain CPU affinity without
  disabling preemption for the entire RMPOPT scan.
- Add cpus_read_lock()/cpus_read_unlock() around the follower
  on_each_cpu_mask() loop in rmpopt_work_handler().
- Guard snp_setup_rmpopt() against re-initialization when
  SNP_SHUTDOWN_EX with x86_snp_shutdown=0 skips rmpopt_cleanup()
  but clears snp_initialized, preventing workqueue and resource
  leaks on repeated init/shutdown cycles.
- Replace setup_clear_cpu_cap() with pr_err() on alloc_workqueue()
  failure in snp_setup_rmpopt(), as setup_clear_cpu_cap() cannot be
  used after alternatives are patched; callers check rmpopt_wq != NULL
  as the runtime guard instead.
- Add pr_info() when RMPOPT coverage is capped at 2TB.
- Add comments noting CPU hotplug is not supported with SNP enabled
  and only online primary threads are covered by rmpopt_cpumask.
- Add comment in setup_rmptable() noting Segmented RMP must be
  enabled to enable RMPOPT.
- Simplify cpumask setup loop to set if primary thread rather than
  skip if not primary.
- Improve grammar and clarity in snp_setup_rmpopt() comments.
- Added Reviewed-by's.

  Sashiko AI code review identified several of the above issues.

v4:
- Add new wrmsrq_on_cpus() helper to write same u64 value to a
  per-CPU MSR across a cpumask without per-cpu struct allocation
  overhead.
- Rename configure_and_enable_rmpopt() to snp_setup_rmpopt().
- Use wrmsrq_on_cpus() instead of wrmsrq_on_cpu() loop for
  programming RMPOPT_BASE MSRs.
- Add setup_clear_cpu_cap(X86_FEATURE_RMPOPT) if segmented RMP
  setup fails or workqueue allocation fails.
- Add X86_FEATURE_RMPOPT feature clear logic in amd_cc_platform_clear()
  for CC_ATTR_HOST_SEV_SNP.
- All of the above allow checking for only X86_FEATURE_RMPOPT for both
  RMPOPT setup/enable and RMP re-optimizations.
- Rename snp_perform_rmp_optimization() to snp_rmpopt_all_physmem().
- Split rmpopt() into rmpopt() and rmpopt_smp() for SMP callback use.
- Introduce separate rmpopt_report_cpumask for debugfs reporting,
  distinct from rmpopt_cpumask used for primary thread tracking.
- Remove snp_perform_rmp_optimization() call from __sev_snp_init_locked()
  and instead setup and enable RMPOPT after SNP is enabled and
  initialized.

v3:
- Drop all RMPOPT kthread support and introduce a custom, dedicated
  workqueue to schedule delayed and asynchronous RMPOPT work.
- Drop the guest_memfd inode cleanup interface and add support to
  re-enable RMP optimizations during guest shutdown using the
  asynchronous and delayed workqueue interface.
- Introduce new __rmpopt() helper and rmpopt() and
  rmpopt_report_status() wrappers on top which use rax and rcx
  parameters to closely match RMPOPT specs.
- Use new optimized RMPOPT loop to issue RMPOPT instructions on all
  system RAM up to 2TB and all CPUs, by optimizing each range on one CPU
  first, then letting other CPUs execute RMPOPT in parallel so they can
  skip most work as the range has already been optimized.
- Also add support for running the optimized RMPOPT loop only on
  one thread per core.
- Replace all PUD_SIZE references with SZ_1G to conform to 1GB regions
  as specified by RMPOPT specifications and not be dependent on PUD_SIZE
  which makes the RMPOPT patch-set independent of x86 page table sizes.
- Use wrmsrq_on_cpu() to program the RMPOPT_BASE MSR registers on
  all CPUs, which removes the ugly casting needed to use
  on_each_cpu_mask().
- Fix inline comments and patch commit messages.


v2:
- Drop all NUMA and Socket configuration and enablement support and
  enable RMPOPT support for up to 2TB of system RAM.
- Drop get_cpumask_of_primary_threads() and enable per-core RMPOPT
  base MSRs and issue RMPOPT instruction on all CPUs.
- Drop the configfs interface to manually re-enable RMP optimizations.
- Add new guest_memfd cleanup interface to automatically re-enable
  RMP optimizations during guest shutdown.
- Include references to the public RMPOPT documentation.
- Move debugfs directory for RMPOPT under architecture-specific
  parent directory.

Ashish Kalra (7):
  x86/cpufeatures: Add X86_FEATURE_AMD_RMPOPT feature flag
  x86/msr: add wrmsrq_on_cpus helper
  x86/sev: Initialize RMPOPT configuration MSRs
  x86/sev: Add support to perform RMP optimizations asynchronously
  x86/sev: Add interface to re-enable RMP optimizations.
  KVM: SEV: Perform RMP optimizations on SNP guest shutdown
  x86/sev: Add debugfs support for RMPOPT

 arch/x86/coco/core.c               |   1 +
 arch/x86/include/asm/cpufeatures.h |   2 +-
 arch/x86/include/asm/msr-index.h   |   3 +
 arch/x86/include/asm/msr.h         |   5 +
 arch/x86/include/asm/sev.h         |   4 +
 arch/x86/kernel/cpu/scattered.c    |   1 +
 arch/x86/kvm/svm/sev.c             |   2 +
 arch/x86/lib/msr-smp.c             |  20 ++
 arch/x86/virt/svm/sev.c            | 356 ++++++++++++++++++++++++++++-
 drivers/crypto/ccp/sev-dev.c       |   3 +
 10 files changed, 395 insertions(+), 2 deletions(-)

-- 
2.43.0


* Re: [PATCH v3 00/41] x86: Try to wrangle PV clocks vs. TSC
From: Sean Christopherson @ 2026-05-18 21:11 UTC (permalink / raw)
  To: Kiryl Shutsemau, Paolo Bonzini, K. Y. Srinivasan, Haiyang Zhang,
	Wei Liu, Dexuan Cui, Long Li, Ajay Kaher, Alexey Makhalov,
	Jan Kiszka, Dave Hansen, Andy Lutomirski, Peter Zijlstra,
	Juergen Gross, Daniel Lezcano, Thomas Gleixner, John Stultz,
	Rick Edgecombe, Vitaly Kuznetsov,
	Broadcom internal kernel review list, Boris Ostrovsky,
	Stephen Boyd, x86, linux-coco, kvm, linux-hyperv, virtualization,
	linux-kernel, xen-devel, Michael Kelley, Tom Lendacky,
	Nikunj A Dadhania, Thomas Gleixner, David Woodhouse
In-Reply-To: <20260515191942.1892718-1-seanjc@google.com>

On Fri, May 15, 2026, Sean Christopherson wrote:
> Dave/Thomas/Peter/Boris, what's the going rate for bribes to take something
> like this through the tip tree?  
> 
> The bulk of the changes are in kvmclock and TSC, but pretty much every
> hypervisor's guest-side code gets touched at some point.  I am reasonably
> confident in the correctness of the KVM changes.  Michael tested Hyper-V in
> v2, and while there were conflicts when rebasing, they were largely
> superficial (and I've just jinxed myself).  For all other hypervisors, assume
> the code is compile-tested only, but those changes are all quite small and
> straightforward.
> 
> The only changes that are questionable/contentious are the last two patches,
> which have KVM-as-a-guest use CPUID 0x16 to get the CPU frequency, even on
> AMD (that's the dubious part).  I very deliberately put them last, so that
> they can be dropped at will (I don't care terribly if those patches land).
> To merge them, I would want explicit Acks from Paolo and David W.
> 
> So, except for the last two patches, to get the stuff I really care about
> landed, I think/hope it's just the TSC and guest-side CoCo changes that need
> reviews/acks?

FYI, don't bother reviewing this version.  Sashiko found several glaring flaws,
but I just realized that sashiko-bot's emails are only being sent to myself and
linux-hyperv@vger.kernel.org.  I'll make sure to highlight the changes in the
next version.

In the meantime, Sashiko's feedback is archived on lore if you want to see me
get torched by AI :-)

* Re: [PATCH v2 08/15] KVM: x86: Add mode-aware versions of kvm_<reg>_{read,write}() helpers
From: Sean Christopherson @ 2026-05-18 20:51 UTC (permalink / raw)
  To: Kai Huang
  Cc: pbonzini@redhat.com, kas@kernel.org, vkuznets@redhat.com,
	dwmw2@infradead.org, paul@xen.org, Rick P Edgecombe,
	x86@kernel.org, binbin.wu@linux.intel.com,
	dave.hansen@linux.intel.com, linux-kernel@vger.kernel.org,
	yosry@kernel.org, kvm@vger.kernel.org, linux-coco@lists.linux.dev
In-Reply-To: <136d277dba2ac681ed7607a436f55e2fd1975ec5.camel@intel.com>

On Mon, May 18, 2026, Kai Huang wrote:
> 
> > @@ -10413,29 +10413,30 @@ static int complete_hypercall_exit(struct kvm_vcpu *vcpu)
> >  
> >  	if (!is_64_bit_hypercall(vcpu))
> >  		ret = (u32)ret;
> > -	kvm_rax_write(vcpu, ret);
> > +	kvm_rax_write_raw(vcpu, ret);
> >  	return kvm_skip_emulated_instruction(vcpu);
> >  }
> > 
> 
> Nit:  AFAICT if we use kvm_rax_write(vcpu, ret) instead of the "raw" version
> here, we can then remove the
> 
> 	if (!is_64_bit_hypercall(vcpu))
> 		ret = (u32)ret;

No, because sneakily, is_64_bit_hypercall() != is_64_bit_mode(vcpu).  And because
we also need to avoid calling is_64_bit_mode().  If we use kvm_rax_write(), then
the unpacked code will be:

	WARN_ON_ONCE(vcpu->arch.guest_state_protected);

	if (is_long_mode(vcpu))
		kvm_x86_call(get_cs_db_l_bits)(vcpu, &cs_db, &cs_l);
	else
		cs_l = 0;

	if (cs_l)
		vcpu->arch.regs[VCPU_REGS_RAX] = ret;
	else	
		vcpu->arch.regs[VCPU_REGS_RAX] = (u32)ret;

whereas the (correct) behavior here is:

	if (vcpu->arch.guest_state_protected)
		cs_l = 1;
	else if (is_long_mode(vcpu))
		kvm_x86_call(get_cs_db_l_bits)(vcpu, &cs_db, &cs_l);
	else
		cs_l = 0;

	if (cs_l)
		vcpu->arch.regs[VCPU_REGS_RAX] = ret;
	else	
		vcpu->arch.regs[VCPU_REGS_RAX] = (u32)ret;

I.e. using the non-raw version will trigger the WARN_ON_ONCE(), and will incorrectly
truncate "ret" whenever cs_l is stale (which might be always?).
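
To make the difference concrete, here is a user-space toy model of the two
paths above. It is only a sketch: struct vcpu_model and every model_*
function are hypothetical stand-ins, the cs_l field plays the role of
whatever get_cs_db_l_bits would report, and the WARN_ON_ONCE() is left out.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Toy stand-in for the bits of vCPU state that matter here. */
struct vcpu_model {
	bool	 guest_state_protected;
	bool	 long_mode;
	bool	 cs_l;		/* possibly stale for protected guests */
	uint64_t rax;
};

static bool model_is_64_bit_mode(const struct vcpu_model *v)
{
	return v->long_mode && v->cs_l;
}

/* Protected guests are always treated as making 64-bit hypercalls. */
static bool model_is_64_bit_hypercall(const struct vcpu_model *v)
{
	return v->guest_state_protected || model_is_64_bit_mode(v);
}

/* Mode-aware write: truncates unless the vCPU is in 64-bit mode. */
static void model_rax_write(struct vcpu_model *v, uint64_t val)
{
	v->rax = model_is_64_bit_mode(v) ? val : (uint32_t)val;
}

/* Raw write: stores the value as-is. */
static void model_rax_write_raw(struct vcpu_model *v, uint64_t val)
{
	v->rax = val;
}

/* Hypercall completion: truncate based on the hypercall check, then write raw. */
static void model_complete_hypercall(struct vcpu_model *v, uint64_t ret)
{
	if (!model_is_64_bit_hypercall(v))
		ret = (uint32_t)ret;
	model_rax_write_raw(v, ret);
}
```

For a protected guest whose cs_l is not known to be set, the mode-aware
write truncates the return value while the hypercall-completion path keeps
all 64 bits, which is the divergence described above.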

* Re: [PATCH v9 01/23] x86/virt/tdx: Consolidate TDX global initialization states
From: Dave Hansen @ 2026-05-18 18:09 UTC (permalink / raw)
  To: Edgecombe, Rick P, Gao, Chao
  Cc: linux-kernel@vger.kernel.org, linux-coco@lists.linux.dev,
	Huang, Kai, kvm@vger.kernel.org, Li, Xiaoyao, Zhao, Yan Y,
	dave.hansen@linux.intel.com, tony.lindgren@linux.intel.com,
	Chatre, Reinette, seanjc@google.com, pbonzini@redhat.com,
	binbin.wu@linux.intel.com, Weiny, Ira, nik.borisov@suse.com,
	mingo@redhat.com, Verma, Vishal L, kas@kernel.org, Shahar, Sagi,
	Annapurve, Vishal, djbw@kernel.org, tglx@kernel.org,
	paulmck@kernel.org, hpa@zytor.com, bp@alien8.de,
	yilun.xu@linux.intel.com, x86@kernel.org
In-Reply-To: <f6e9a736d15ec41d23156c8e4e75533e8debf908.camel@intel.com>

On 5/18/26 11:00, Edgecombe, Rick P wrote:
> This is a great improvement by itself, irrespective of this series.

Agreed. If it came by itself, I'd probably apply it.
