Linux Confidential Computing Development
 help / color / mirror / Atom feed
* Re: [PATCH v2 10/31] x86/virt/tdx: Add extra memory to TDX Module for Extensions
From: Xu Yilun @ 2026-04-23 16:41 UTC (permalink / raw)
  To: Huang, Kai
  Cc: dan.j.williams@intel.com, linux-pci@vger.kernel.org,
	linux-coco@lists.linux.dev, x86@kernel.org, Gao, Chao,
	Edgecombe, Rick P, Xu, Yilun, Jiang, Dave,
	dave.hansen@linux.intel.com, baolu.lu@linux.intel.com,
	Duan, Zhenzhong, kas@kernel.org, Verma, Vishal L, Li, Xiaoyao,
	kvm@vger.kernel.org, linux-kernel@vger.kernel.org
In-Reply-To: <89bccca9a409e02667278fd83f0b7b9064557dab.camel@intel.com>

On Thu, Apr 23, 2026 at 12:59:55AM +0000, Huang, Kai wrote:
> On Sat, 2026-03-28 at 00:01 +0800, Xu Yilun wrote:
> > +static int tdx_ext_mem_add(struct tdx_page_array *ext_mem)
> > +{
> > +	struct tdx_module_args args = {
> > +		.rcx = hpa_list_info_assign_raw(ext_mem),
> > +	};
> > +	u64 r;
> > +
> > +	tdx_clflush_page_array(ext_mem);
> > +
> > +	do {
> > +		r = seamcall_ret(TDH_EXT_MEM_ADD, &args);
> > +		cond_resched();
> > +	} while (r == TDX_INTERRUPTED_RESUMABLE);
> 
> 
> Ditto here.  I don't think we should introduce any more cond_resched().

Good to me.

> 
> Btw, I think technically we can reuse the seamcall_ir_resched() you introduced
> later, albeit in which a local '_args' is used as a copy of the original 'args',
> but that has no harm for the case where we can just use the 'args' to loop.

No we can't. TDH_EXT_MEM_ADD is designed to use output parameter RCX
to override/update input parameter RCX, so the caller doesn't have to
do manual parameter update on retry call. Using seamcall_ir_resched()
makes each retry use the original RCX, not the updated one.

> 
> I am wondering whether we can just use that here, or we just get rid of that
> helper (then open retry by the callers of these SEAMCALL wrappers), since there
> will be more cases where we need to manually set 'resume=1' in the SEAMCALL
> input 'args' when retrying TDX_INTERRUPTED_RESUMABLE.

I'd like to know why some SEAMCALLs needs resume flag but others don't.
If there is chance we don't introduce too much variants for the same thing,
that's most friendly to OS. And "no resume flag" is my best preference.

For now, I can see only one SEAMCALL with resume flag in mainline,
tdh_phymem_cache_wb(). I'd rather we treat it as an exception and no
resume flag any more if possible.

Then we don't have to make all following efforts, they are complex...

> 
> Unless you have good idea to unify them all?
> 
> E.g., we have something like below in our internal KVM code, using macros to do
> 'resume=1' and retry as the caller wishes.  But my understanding is Dave
> probably won't like macros.  :-)
> 
> (you may see broken indent/text due to text wrapper and sorry for that.) 
> 
> /*
>  * ...
>  *
>  * The retry_func and update_args allow the SEAMCALL to be retried in a loop if
>  * it can still return other error code when there's no race from both KVM and 
>  * vCPUs and can be "retried" until it succeeds.                               
>  */
> #define tdh_do_no_vcpus_retry(tdh_func, kvm, retry_func, update_args, args...)\
> ({                                                                            \
>         struct kvm_tdx *__kvm_tdx = to_kvm_tdx(kvm);                          \
>         u64 __err;                                                            \
>                                                                               \
>         lockdep_assert_held_write(&kvm->mmu_lock);                            \
>                                                                               \
>         __err = retry_func(tdh_func, update_args, args);                      \
>         if (unlikely(tdx_operand_busy(__err))) {                              \
>                 WRITE_ONCE(__kvm_tdx->wait_for_sept_zap, true);               \
>                 kvm_make_all_cpus_request(kvm, KVM_REQ_OUTSIDE_GUEST_MODE);   \
>                                                                               \
>                 __err = retry_func(tdh_func, update_args, args);              \
>                                                                               \
>                 WRITE_ONCE(__kvm_tdx->wait_for_sept_zap, false);              \
>         }                                                                     \
>         __err;                                                                \
> })                                                                             
> 
> #define tdh_intr_retry(tdh_func, update_args, args...)                        \
> ({                                                                            \
>         u64 ____err;                                                          \
>                                                                               \
>         do {                                                                  \
>                 ____err = tdh_func(args);                                     \
>                                                                               \
>                 if ((____err & TDX_SEAMCALL_STATUS_MASK) !=		      \
> 				TDX_INTERRUPTED_RESUMABLE)                   
> \	                       
>                         break;                                                \
>                                                                               \
>                 update_args;                                                  \
>         } while (1);                                                          \
>         ____err;                                                              \
> })
> 
> #define tdh_no_retry(tdh_func, update_args, args...)    tdh_func(args)
> 
> #define tdh_do_no_vcpus(tdh_func, kvm, args...) \
>         tdh_do_no_vcpus_retry(tdh_func, kvm, tdh_no_retry, ;, args)
> 
> #define tdh_do_no_vcpus_intr_retry(tdh_func, kvm, update_args, args...) \
>         tdh_do_no_vcpus_retry(tdh_func, kvm, tdh_intr_retry, update_args, args)

^ permalink raw reply

* Re: [PATCH v2 08/31] x86/virt/tdx: Configure TDX Module with optional TDX Connect feature
From: Xu Yilun @ 2026-04-23 15:49 UTC (permalink / raw)
  To: Dan Williams
  Cc: linux-coco, linux-pci, dan.j.williams, x86, chao.gao, dave.jiang,
	baolu.lu, yilun.xu, zhenzhong.duan, kvm, rick.p.edgecombe,
	dave.hansen, kas, xiaoyao.li, vishal.l.verma, linux-kernel
In-Reply-To: <69e8223f4c8e6_fe08310058@djbw-dev.notmuch>

On Tue, Apr 21, 2026 at 06:19:59PM -0700, Dan Williams wrote:
> Xu Yilun wrote:
> > TDX Module supports optional TDX features (e.g. TDX Connect & TDX Module
> > Extensions) that won't be enabled by default.
> 
> So this is another place where "optional" is misleading. For simplicity
> there is no mechanism to fallback from TDX Connect operation if present,
> at least in the core. The only optional aspects would be the mechanism
> that could be unloaded through the tdx_host driver.

I see. I think I should use 'extra' instead of 'optional' in all places.
I'll rephase the comments.

> 
> .../me notices that other comments on this patch say the same, but do
> read on, another important detail about ktime_get_real_seconds() below.
> 
> > It extends TDH.SYS.CONFIG for host to choose to enable them on bootup.
> > 
> > Call TDH.SYS.CONFIG with a new bitmap input parameter to specify which
> > features to enable. The bitmap uses the same definitions as
> > TDX_FEATURES0. But note not all bits in TDX_FEATURES0 are valid for
> > configuration, e.g. TDX Module Extensions is a service that supports TDX
> > Connect, it is implicitly enabled when TDX Connect is enabled. Setting
> > TDX_FEATURES0_EXT in the bitmap has no effect.
> > 
> > TDX Module advances the version of TDH.SYS.CONFIG for the change, so
> > use the latest version (v1) for optional feature enabling. But
> > supporting existing Modules which only support v0 is still necessary
> > until they are deprecated, enumerate via TDX_FEATURES0 to decide which
> > version to use.
> 
> I would say this differently, it will always be the case that new
> kernels are needed to enable new features, but it is unlikely that
> TDH.SYS.CONFIG ever needs to change again. The v0 -> v1 transition means
> that feature bits are to be used here on out. So there is little value
> in worrying about deprecating v0 to save a couple lines of code in 5-7
> years when these original TDX platforms sunset.

OK. I'll use your rationale.

> 
> > TDX Module updates global metadata when optional features are enabled.
> > Host should update the cached tdx_sysinfo to reflect these changes.
> > 
> > Co-developed-by: Zhenzhong Duan <zhenzhong.duan@intel.com>
> > Signed-off-by: Zhenzhong Duan <zhenzhong.duan@intel.com>
> > Signed-off-by: Xu Yilun <yilun.xu@linux.intel.com>
> > ---
> >  arch/x86/virt/vmx/tdx/tdx.h |  3 ++-
> >  arch/x86/virt/vmx/tdx/tdx.c | 16 +++++++++++++++-
> >  2 files changed, 17 insertions(+), 2 deletions(-)
> > 
> > diff --git a/arch/x86/virt/vmx/tdx/tdx.h b/arch/x86/virt/vmx/tdx/tdx.h
> > index e5a9331df451..870bb75da3ba 100644
> > --- a/arch/x86/virt/vmx/tdx/tdx.h
> > +++ b/arch/x86/virt/vmx/tdx/tdx.h
> > @@ -58,7 +58,8 @@
> >  #define TDH_PHYMEM_CACHE_WB		40
> >  #define TDH_PHYMEM_PAGE_WBINVD		41
> >  #define TDH_VP_WR			43
> > -#define TDH_SYS_CONFIG			45
> > +#define TDH_SYS_CONFIG_V0		45
> > +#define TDH_SYS_CONFIG			SEAMCALL_LEAF_VER(TDH_SYS_CONFIG_V0, 1)
> >  
> >  /* TDX page types */
> >  #define	PT_NDA		0x0
> > diff --git a/arch/x86/virt/vmx/tdx/tdx.c b/arch/x86/virt/vmx/tdx/tdx.c
> > index 130214933c2f..0c5d6bdd810f 100644
> > --- a/arch/x86/virt/vmx/tdx/tdx.c
> > +++ b/arch/x86/virt/vmx/tdx/tdx.c
> > @@ -1353,6 +1353,7 @@ static int construct_tdmrs(struct list_head *tmb_list,
> >  static int config_tdx_module(struct tdmr_info_list *tdmr_list, u64 global_keyid)
> >  {
> >  	struct tdx_module_args args = {};
> > +	u64 seamcall_fn = TDH_SYS_CONFIG_V0;
> >  	u64 *tdmr_pa_array;
> >  	size_t array_sz;
> >  	int i, ret;
> > @@ -1377,7 +1378,15 @@ static int config_tdx_module(struct tdmr_info_list *tdmr_list, u64 global_keyid)
> >  	args.rcx = __pa(tdmr_pa_array);
> >  	args.rdx = tdmr_list->nr_consumed_tdmrs;
> >  	args.r8 = global_keyid;
> > -	ret = seamcall_prerr(TDH_SYS_CONFIG, &args);
> > +
> > +	if (tdx_sysinfo.features.tdx_features0 & TDX_FEATURES0_TDXCONNECT) {
> > +		args.r9 |= TDX_FEATURES0_TDXCONNECT;
> > +		args.r11 = ktime_get_real_seconds();
> 
> Mainline has reason to not entertain this module requirement. The fact
> that passing zero is an error is useful to detect unsupported modules.

Ah, it is "useful"...  :)

> An updated module would accept zero as indicating "VMM requests module
> disable all policy and mechanisms related to untrusted wall clock time".
> Specifically, there are several problems with this:
> 
> 1/ No other TSM implementation requires the VMM to pass in an untrusted time
> 2/ The wall time may change and may require hooks to keep the module time
>    up to date, but see point 1/, this would be a TDX special flower hook.
> 3/ Presumably this allows the module or the guest to do certificate expiration
>    checks, but that is the responsibility of the relying party. The
>    relying party may have reason to accept an "expired" cert as determined
>    by VMM wall clock, and the guest presumably already has mechanisms to
>    determine untrusted wall clock time from the VMM if it wants. Guests
>    do not need TDX ABI for that.
> 
> So I think Linux wants to pass 0 here and wait for modules that accept
> that as the start of TDX Connect support. As you said, given there are
> no released modules with TDX Connect there is time to make that first
> release drop this requirement.

I see. Actually the Module is about to remove the r11 RTC.

https://cdrdv2.intel.com/v1/dl/getContent/871617

I'll remove this r11.



^ permalink raw reply

* Re: [PATCH v2 06/31] x86/virt/tdx: Read global metadata for TDX Module Extensions/Connect
From: Xu Yilun @ 2026-04-23 11:58 UTC (permalink / raw)
  To: Dan Williams
  Cc: linux-coco, linux-pci, x86, chao.gao, dave.jiang, baolu.lu,
	yilun.xu, zhenzhong.duan, kvm, rick.p.edgecombe, dave.hansen, kas,
	xiaoyao.li, vishal.l.verma, linux-kernel
In-Reply-To: <69e7f808e9aa5_fe0831004@djbw-dev.notmuch>

On Tue, Apr 21, 2026 at 03:19:52PM -0700, Dan Williams wrote:
> Xu Yilun wrote:
> > Add reading of the global metadata for TDX Module Extensions & TDX
> > Connect. Add them in a batch as TDX Connect is currently the only user
> > of TDX Module Extensions and no way to initialize TDX Module Extensions
> > without firstly enabling TDX Connect.
> > 
> > TDX Module Extensions & TDX Connect are optional features enumerated by
> > TDX_FEATURES0. Check the TDX_FEATURES0 before reading these metadata to
> > avoid failing the whole TDX initialization.
> 
> I think it is important to distinguish "optional" module features vs
> required Linux features. Linux requires all features that a module
> advertises to succeed at core TDX init time.

Agree. But I want to reduce the scope to only about metadata reading in
this patch. So:

    TDX Module Extensions is an optional features enumerated by
    TDX_FEATURES0. But in the implementation, Linux requires that all
    features that a Module advertises must have a complete, valid set of
    metadata, and the check must succeed at core TDX initialization time.

    Check TDX_FEATURES0 before reading these metadata. If a feature is
    advertised, a failure in reading associated metadata causes the whole
    TDX initialization to fail, otherwise skip.

> 
> Otherwise, this looks ok / consistent with other metadata reading. It 
> sets the precedent that if TDX Connect is advertised it must succeed all
> core initialization.

^ permalink raw reply

* Re: [PATCH v2 05/31] x86/virt/tdx: Extend tdx_page_array to support IOMMU_MT
From: Xu Yilun @ 2026-04-23 11:15 UTC (permalink / raw)
  To: Dan Williams
  Cc: Edgecombe, Rick P, Gao, Chao, Xu, Yilun, x86@kernel.org,
	kas@kernel.org, baolu.lu@linux.intel.com,
	dave.hansen@linux.intel.com, Li, Xiaoyao, Jiang, Dave,
	linux-pci@vger.kernel.org, linux-coco@lists.linux.dev,
	linux-kernel@vger.kernel.org, Duan, Zhenzhong, Verma, Vishal L,
	kvm@vger.kernel.org
In-Reply-To: <69e7f166319a5_fe083100f7@djbw-dev.notmuch>

> A couple recommendations come to mind:
> 
> * s/tdx_page_array_free/tdx_page_array_destroy/
> 
>   ...since "destroy" mirrors create and matches other cases where only
>   metadata is managed.
> 
> * Create a new tdx_page_array_repopulate() helper to make it clear which
>   paths depend on being able to repopulate and move the WARN_ON_ONCE() out of
>   the common path that does not repopulate. "repopulate" can have
>   "realloc" semantics where it allocates on first use, but otherwise
>   "populate" gets to not care about the corner cases. Make the WARN case
>   fail repopulate.

Agree. I end up add a function like that:


/**
 * tdx_page_array_repopulate() - repopulate a tdx_page_array
 * @array: The array descriptor to reallocate for.
 * @pages: Pointer to struct page array for tdx_page_array populating
 * @nr_pages: Size of @pages array.
 * 
 * Re-populate the tdx_page_array. If @array is %NULL, it behaves exactly like
 * tdx_page_array_create().
 *
 * Return: Re-populated tdx_page_array or NULL on failure.
 */
static struct tdx_page_array *
tdx_page_array_repopulate(struct tdx_page_array *array, struct page **pages,
			  unsigned int nr_pages)
{
	struct tdx_page_array *tmp = array;
	int ret;

	if (tmp) {
		/* Don't pass in something partially initialized */
		if (!tmp->root || !tmp->pages || !tmp->nr_pages)
			return NULL;

		/*
		 * When re-populating, the old pages are no longer tracked.
		 * Theoretically they require cache flushing before reclaiming
		 * for other kernel usage, similar to tdx_page_array_destroy().
		 * Since there is no use case to repopulate and then reclaim
		 * old pages yet, just warn to prompt future improvement.
		 */
		if (WARN_ON_ONCE(tmp->need_phymem_page_wbinvd))
			return NULL;
	} else {
		tmp = tdx_page_array_alloc();
		if (!tmp)
			return NULL;
	}

	ret = tdx_page_array_populate(tmp, pages, nr_pages);
	if (ret) {
		/* Only destroy newly allocated object */
		if (!array)
			tdx_page_array_destroy(tmp);

		return NULL;
	}

	return tmp;
}

^ permalink raw reply

* Re: [PATCH v2 10/31] x86/virt/tdx: Add extra memory to TDX Module for Extensions
From: Huang, Kai @ 2026-04-23  0:59 UTC (permalink / raw)
  To: dan.j.williams@intel.com, linux-pci@vger.kernel.org,
	linux-coco@lists.linux.dev, yilun.xu@linux.intel.com,
	x86@kernel.org
  Cc: Gao, Chao, Edgecombe, Rick P, Xu, Yilun, Jiang, Dave,
	dave.hansen@linux.intel.com, baolu.lu@linux.intel.com,
	Duan, Zhenzhong, kas@kernel.org, Verma, Vishal L, Li, Xiaoyao,
	kvm@vger.kernel.org, linux-kernel@vger.kernel.org
In-Reply-To: <20260327160132.2946114-11-yilun.xu@linux.intel.com>

On Sat, 2026-03-28 at 00:01 +0800, Xu Yilun wrote:
> +static int tdx_ext_mem_add(struct tdx_page_array *ext_mem)
> +{
> +	struct tdx_module_args args = {
> +		.rcx = hpa_list_info_assign_raw(ext_mem),
> +	};
> +	u64 r;
> +
> +	tdx_clflush_page_array(ext_mem);
> +
> +	do {
> +		r = seamcall_ret(TDH_EXT_MEM_ADD, &args);
> +		cond_resched();
> +	} while (r == TDX_INTERRUPTED_RESUMABLE);


Ditto here.  I don't think we should introduce any more cond_resched().

Btw, I think technically we can reuse the seamcall_ir_resched() you introduced
later, albeit in which a local '_args' is used as a copy of the original 'args',
but that has no harm for the case where we can just use the 'args' to loop.

I am wondering whether we can just use that here, or we just get rid of that
helper (then open retry by the callers of these SEAMCALL wrappers), since there
will be more cases where we need to manually set 'resume=1' in the SEAMCALL
input 'args' when retrying TDX_INTERRUPTED_RESUMABLE.

Unless you have good idea to unify them all?

E.g., we have something like below in our internal KVM code, using macros to do
'resume=1' and retry as the caller wishes.  But my understanding is Dave
probably won't like macros.  :-)

(you may see broken indent/text due to text wrapper and sorry for that.) 

/*
 * ...
 *
 * The retry_func and update_args allow the SEAMCALL to be retried in a loop if
 * it can still return other error code when there's no race from both KVM and 
 * vCPUs and can be "retried" until it succeeds.                               
 */
#define tdh_do_no_vcpus_retry(tdh_func, kvm, retry_func, update_args, args...)\
({                                                                            \
        struct kvm_tdx *__kvm_tdx = to_kvm_tdx(kvm);                          \
        u64 __err;                                                            \
                                                                              \
        lockdep_assert_held_write(&kvm->mmu_lock);                            \
                                                                              \
        __err = retry_func(tdh_func, update_args, args);                      \
        if (unlikely(tdx_operand_busy(__err))) {                              \
                WRITE_ONCE(__kvm_tdx->wait_for_sept_zap, true);               \
                kvm_make_all_cpus_request(kvm, KVM_REQ_OUTSIDE_GUEST_MODE);   \
                                                                              \
                __err = retry_func(tdh_func, update_args, args);              \
                                                                              \
                WRITE_ONCE(__kvm_tdx->wait_for_sept_zap, false);              \
        }                                                                     \
        __err;                                                                \
})                                                                             

#define tdh_intr_retry(tdh_func, update_args, args...)                        \
({                                                                            \
        u64 ____err;                                                          \
                                                                              \
        do {                                                                  \
                ____err = tdh_func(args);                                     \
                                                                              \
                if ((____err & TDX_SEAMCALL_STATUS_MASK) !=		      \
				TDX_INTERRUPTED_RESUMABLE)                   
\	                       
                        break;                                                \
                                                                              \
                update_args;                                                  \
        } while (1);                                                          \
        ____err;                                                              \
})

#define tdh_no_retry(tdh_func, update_args, args...)    tdh_func(args)

#define tdh_do_no_vcpus(tdh_func, kvm, args...) \
        tdh_do_no_vcpus_retry(tdh_func, kvm, tdh_no_retry, ;, args)

#define tdh_do_no_vcpus_intr_retry(tdh_func, kvm, update_args, args...) \
        tdh_do_no_vcpus_retry(tdh_func, kvm, tdh_intr_retry, update_args, args)

^ permalink raw reply

* Re: [PATCH v2 20/31] x86/virt/tdx: Add a helper to loop on TDX_INTERRUPTED_RESUMABLE
From: Huang, Kai @ 2026-04-23  0:29 UTC (permalink / raw)
  To: dan.j.williams@intel.com, linux-pci@vger.kernel.org,
	linux-coco@lists.linux.dev, yilun.xu@linux.intel.com,
	x86@kernel.org
  Cc: Gao, Chao, Edgecombe, Rick P, Xu, Yilun, Jiang, Dave,
	dave.hansen@linux.intel.com, baolu.lu@linux.intel.com,
	Duan, Zhenzhong, kas@kernel.org, Verma, Vishal L, Li, Xiaoyao,
	kvm@vger.kernel.org, linux-kernel@vger.kernel.org
In-Reply-To: <20260327160132.2946114-21-yilun.xu@linux.intel.com>

On Sat, 2026-03-28 at 00:01 +0800, Xu Yilun wrote:
> +static u64 __maybe_unused __seamcall_ir_resched(sc_func_t sc_func, u64 fn,
> +						struct tdx_module_args *args)
> +{
> +	struct tdx_module_args _args;
> +	u64 r;
> +
> +	while (1) {
> +		_args = *(args);
> +		r = sc_retry(sc_func, fn, &_args);
> +		if (r != TDX_INTERRUPTED_RESUMABLE)
> +			break;
> +
> +		cond_resched();
> +	}
> +
> +	*args = _args;
> +
> +	return r;
> +}

Since commit 7dadeaa6e851e ("sched: Further restrict the preemption modes")
for x86 only PREEMPT (full) and PREEMPT_LAZY are possible, even when
PREEMPT_DYNAMIC is on.

cond_resched() is useful in PREEMPT_NONE and PREEMPT_VOLUNTARY, but it is
basically a RET0 in both PREEMPT_LAZY and PREEMPT.  My understanding is we
shouldn't add any more cond_resched() for x86 now (see the aforementioned
commit changelog for more info).

^ permalink raw reply

* Re: [PATCH v2 00/16] fs,x86/resctrl: Add kernel-mode (e.g., PLZA) support to the resctrl subsystem
From: Moger, Babu @ 2026-04-22 21:42 UTC (permalink / raw)
  To: Reinette Chatre, Babu Moger, corbet@lwn.net, tony.luck@intel.com,
	Dave.Martin@arm.com, james.morse@arm.com, tglx@kernel.org,
	mingo@redhat.com, bp@alien8.de, dave.hansen@linux.intel.com
  Cc: skhan@linuxfoundation.org, x86@kernel.org, hpa@zytor.com,
	peterz@infradead.org, juri.lelli@redhat.com,
	vincent.guittot@linaro.org, dietmar.eggemann@arm.com,
	rostedt@goodmis.org, bsegall@google.com, mgorman@suse.de,
	vschneid@redhat.com, kas@kernel.org, rick.p.edgecombe@intel.com,
	akpm@linux-foundation.org, pmladek@suse.com,
	rdunlap@infradead.org, dapeng1.mi@linux.intel.com,
	kees@kernel.org, elver@google.com, paulmck@kernel.org,
	lirongqing@baidu.com, safinaskar@gmail.com, fvdl@google.com,
	seanjc@google.com, pawan.kumar.gupta@linux.intel.com,
	xin@zytor.com, tiala@microsoft.com, chang.seok.bae@intel.com,
	Lendacky, Thomas, elena.reshetova@intel.com,
	linux-doc@vger.kernel.org, linux-kernel@vger.kernel.org,
	linux-coco@lists.linux.dev, kvm@vger.kernel.org,
	eranian@google.com, peternewman@google.com
In-Reply-To: <ed0d2463-edbf-42df-b345-b83da356e865@intel.com>

Hi Reinette,

On 4/21/2026 9:56 PM, Reinette Chatre wrote:
> Hi Babu,
> 
> On 4/21/26 5:17 PM, Moger, Babu wrote:
>> On 4/21/2026 5:44 PM, Reinette Chatre wrote:
>>> On 4/21/26 3:04 PM, Moger, Babu wrote:
> 
>>>> That said, I agree we need to support this. Without it, we won’t be able to move the group from PLZA to non-PLZA.
>>>>
>>>> # cat info/kernel_mode
>>>>       inherit_ctrl_and_mon:
>>>>       global_assign_ctrl_assign_mon_per_cpu:group=uninitialized
>>>>       [global_assign_ctrl_assign_mon_per_cpu]:group=ctrl1/mon1/
>>>
>>> Like above where the listing is inconsistent. Is this what you mean?
>>
>> I meant the listing of "inherit_ctrl_and_mon" does not have groups while other modes have it.
> 
> I think this is ok since it does not need a group or any other (for now?) property.
> What issues do you foresee here?

Nothing at this point.  Will let you know if something comes up when 
started working on it.

Thanks,
Babu

^ permalink raw reply

* Re: [PATCH v13 00/48] arm64: Support for Arm CCA in KVM
From: Steven Price @ 2026-04-22 15:38 UTC (permalink / raw)
  To: Jiahao zheng
  Cc: alexandru.elisei, alpergun, aneesh.kumar, catalin.marinas,
	christoffer.dall, fj0570is, gankulkarni, gshan, james.morse,
	joey.gouly, kvm, kvmarm, linux-arm-kernel, linux-coco,
	linux-kernel, maz, oliver.upton, sdonthineni, suzuki.poulose,
	tabba, vannapurve, will, yuzenghui
In-Reply-To: <20260421135145.14789-1-jahao.zheng@gmail.com_quarantine>

On 21/04/2026 14:51, Jiahao zheng wrote:
> Hi Steven,
> 
> I've been testing CCA patch series and noticed Realm VM cannot boot successfully when the host is forced to run in nVHE mode (e.g., via `kvm-arm.mode=nvhe`). The kvmtool debug information will be truncated in set_guest_bank_private_gpa. 
> 
> Currently, in `kvm_ioctl_vcpu_run()`, running a Realm VM (REC) bypasses the standard nVHE EL2 stub. `kvm_rec_enter()` directly executes the SMC instruction to transition to the RMM. Upon returning to the EL1 host, the code falls back to `kvm_vgic_sync_hwstate()`, where the VGIC save operation is explicitly skipped for nVHE. Since the EL2 stub was bypassed, `__vgic_v3_save_state()` is never executed, and `ICH_*_EL2` states are lost.
> 
> To resolve this, I have a couple of thoughts:
> 1. If Host nVHE mode is not intended to be supported for Realms:
> Since RME implies ARMv9 which mandates VHE, running a Realm with an nVHE host might just be an unsupported edge case. If so, we should explicitly reject RME initialization or REC creation when `!is_kernel_in_hyp_mode()`. This would cleanly prevent the undefined behavior.
> 2. If Host nVHE mode is intended to be supported:
> Since RMM should remain agnostic to the Non-Secure VGIC states, the burden of saving these states falls strictly on KVM. However, the EL1 host cannot access `ICH_*_EL2`. Therefore, KVM needs to add specific logic for this scenario. We would likely need to route the REC exit through a dedicated nVHE EL2 stub to invoke `__vgic_v3_save_state()` before dropping back to EL1, rather than jumping straight back to `kvm_ioctl_vcpu_run()`.
> 
> I might have missed some documentation or comments regarding nVHE restrictions for CCA. If this is an oversight, it would be great to see a check added in the next iteration of the series.

Thanks for the testing. Yes indeed this is an oversight. For now option
1 is what I'm going to go for. There's nothing stopping nVHE mode being
supported but as you note any platform with RME will have VHE so it's
not an immediate priority to support.

One interesting case of nVHE is of course pKVM and for that there needs
to be some significant work to ensure that the EL2 hypervisor
understands the RMM communication and prevent any confused-deputy style
attacks. E.g. the host must not be able to map a pVM's private memory
into a realm guest.

I don't have any immediate plans to work on nVHE - my focus is getting
the basic support merged. But I know there was some interest to ensure
that pKVM and CCA would be able to co-exist on a platform so I expect it
will come in some form or another.

Thanks,
Steve

^ permalink raw reply

* Re: [PATCH 0/3] arm64/virt: Add Arm CCA measurement register support
From: Jason Gunthorpe @ 2026-04-22 12:40 UTC (permalink / raw)
  To: Sami Mujawar
  Cc: Suzuki Poulose, Dan Williams,
	linux-arm-kernel@lists.infradead.org,
	linux-kernel@vger.kernel.org, Catalin Marinas, will@kernel.org,
	thuth@redhat.com, Steven Price, gshan@redhat.com, Yeoreum Yun,
	cedric.xing@intel.com, Dan Williams, Dionna Glaze,
	Aneesh Kumar K . V, Alexey Kardashevskiy,
	linux-coco@lists.linux.dev
In-Reply-To: <AS8PR08MB680669CBFA9654BC12179D5E842F2@AS8PR08MB6806.eurprd08.prod.outlook.com>

On Wed, Apr 22, 2026 at 08:57:00AM +0000, Sami Mujawar wrote:

> > So this is tying it to the same FW event log that TPM uses.
> > 
> > I think that strengthens my point this should all be uninform. TPM
> > drivers are directly exposing the event log today, but I guess that
> > needs generalization if non-TPM drivers are going to present it as
> > well.
> > 
> > How do you imagine getting and manipulating the EFI event log to use
> > with this?
>
> The event logs from UEFI will be handed off to the OS using CCEL
> ACPI table. The CCEL table spec update can be seen at
> https://github.com/tianocore/edk2/issues/11384

I ment from linux userspace. event log is well establihsed in the aCPI
side for TPM and in Linux today it is only exposed to userspace
through TPM.

This is not TPM, so how do you intend to give the event log to
userspace?

Jason

^ permalink raw reply

* Re: [PATCH v2 30/31] coco/tdx-host: Implement IDE stream setup/teardown
From: Xu Yilun @ 2026-04-22  9:57 UTC (permalink / raw)
  To: Tian, Kevin
  Cc: linux-coco@lists.linux.dev, linux-pci@vger.kernel.org,
	Williams, Dan J, x86@kernel.org, Gao, Chao, Jiang, Dave,
	baolu.lu@linux.intel.com, Xu, Yilun, Duan, Zhenzhong,
	kvm@vger.kernel.org, Edgecombe, Rick P,
	dave.hansen@linux.intel.com, kas@kernel.org, Li, Xiaoyao,
	Verma, Vishal L, linux-kernel@vger.kernel.org
In-Reply-To: <BN9PR11MB5276BB48433E1AF97E5D27298C582@BN9PR11MB5276.namprd11.prod.outlook.com>

On Thu, Apr 09, 2026 at 08:02:33AM +0000, Tian, Kevin wrote:
> > From: Xu Yilun <yilun.xu@linux.intel.com>
> > Sent: Saturday, March 28, 2026 12:02 AM
> > 
> > Implementation for a most straightforward Selective IDE stream setup.
> > Hard code all parameters for Stream Control Register. And no IDE Key
> > Refresh support.
> > 
> 
> 'more straightforward', compared to what?

Actually it is " *most* straightforward", I just mean "very".


^ permalink raw reply

* Re: [PATCH v2 27/31] coco/tdx-host: Implement SPDM session setup
From: Xu Yilun @ 2026-04-22  9:53 UTC (permalink / raw)
  To: Nikolay Borisov
  Cc: linux-coco, linux-pci, dan.j.williams, x86, chao.gao, dave.jiang,
	baolu.lu, yilun.xu, zhenzhong.duan, kvm, rick.p.edgecombe,
	dave.hansen, kas, xiaoyao.li, vishal.l.verma, linux-kernel
In-Reply-To: <51f2be5a-47d2-4a06-92bb-368aaed73530@suse.com>

> > +#define TDISP_FUNC_ID		GENMASK(15, 0)
> > +#define TDISP_FUNC_ID_SEGMENT		GENMASK(23, 16)
> > +#define TDISP_FUNC_ID_SEG_VALID		BIT(24)
> > +
> > +static inline u32 tdisp_func_id(struct pci_dev *pdev)
> > +{
> > +	u32 func_id;
> > +
> > +	func_id = FIELD_PREP(TDISP_FUNC_ID_SEGMENT, pci_domain_nr(pdev->bus));
> > +	if (func_id)
> > +		func_id |= TDISP_FUNC_ID_SEG_VALID;
> 
> This check implies pci_domain_nr returning 0 is considered invalid. Other
> callers in the kernel seem to not care, they just use the domain nr, so is
> this check spurious or intentional ?

This is the func_id format defined in TDISP SPEC, bit 24 is a must. It
is not the linux defined SBDF format.

> 
> > +	func_id |= FIELD_PREP(TDISP_FUNC_ID,
> > +			      PCI_DEVID(pdev->bus->number, pdev->devfn));
> > +
> > +	return func_id;
> > +}
> > +
> > +struct spdm_config_info_t {
> > +	u32 vmm_spdm_cap;
> > +#define SPDM_CAP_HBEAT          BIT(13)
> > +#define SPDM_CAP_KEY_UPD        BIT(14)
> 
> nit: move those defines above the struct definition, they just break the
> reading flow as it is.

Yes.

...

> > +static void *tdx_dup_array_data(struct tdx_page_array *array,
> > +				unsigned int data_size)
> > +{
> > +	unsigned int npages = (data_size + PAGE_SIZE - 1) / PAGE_SIZE;
> 
> nit: There's DIV_ROUND_UP

Yes.

...

> > +DEFINE_FREE(tdx_spdm_session_teardown, struct tdx_tsm_link *,
> > +	    if (!IS_ERR_OR_NULL(_T)) tdx_spdm_session_teardown(_T))
> > +
> >   static int tdx_tsm_link_connect(struct pci_dev *pdev)
> >   {
> > -	return -ENXIO;
> > +	struct tdx_tsm_link *tlink = to_tdx_tsm_link(pdev->tsm);
> > +
> > +	struct tdx_tsm_link *tlink_spdm __free(tdx_spdm_session_teardown) =
> > +		tdx_spdm_session_setup(tlink);
> 
> Is the free() really needed here, either the session is correctly setup and
> tlink_spdm is returned. But if session_setup() files then what about calling
> spdm_session_disconnect() on an unestablished session?

Ah, we have more steps to add, the __free() will take function when the
following steps fail.

We may add __free() when we add more steps, but I think that makes the
diff harder to read, so I want to keep this style.

Thanks.

> 
> 
> > +	if (IS_ERR(tlink_spdm)) {
> > +		pci_err(pdev, "fail to setup spdm session\n");
> > +		return PTR_ERR(tlink_spdm);
> > +	}
> > +
> > +	retain_and_null_ptr(tlink_spdm);
> > +
> > +	return 0;
> >   }
> 
> <snip>
> 

^ permalink raw reply

* Re: [PATCH v2 25/31] x86/virt/tdx: Add SEAMCALL wrappers for SPDM management
From: Xu Yilun @ 2026-04-22  9:46 UTC (permalink / raw)
  To: Tian, Kevin
  Cc: linux-coco@lists.linux.dev, linux-pci@vger.kernel.org,
	Williams, Dan J, x86@kernel.org, Gao, Chao, Jiang, Dave,
	baolu.lu@linux.intel.com, Xu, Yilun, Duan, Zhenzhong,
	kvm@vger.kernel.org, Edgecombe, Rick P,
	dave.hansen@linux.intel.com, kas@kernel.org, Li, Xiaoyao,
	Verma, Vishal L, linux-kernel@vger.kernel.org
In-Reply-To: <BN9PR11MB5276DE573BCA90782AE5BA608C582@BN9PR11MB5276.namprd11.prod.outlook.com>

On Thu, Apr 09, 2026 at 07:59:33AM +0000, Tian, Kevin wrote:
> > From: Xu Yilun <yilun.xu@linux.intel.com>
> > Sent: Saturday, March 28, 2026 12:01 AM
> > 
> 
> here ...
> 
> > - TDH.SPDM.MNG supports three SPDM runtime operations: HEARTBEAT,
> >   KEY_UPDATE and DEV_INFO_RECOLLECTION.
> 
> ... but the actual helper just pass whatever ops to TDX module

Ah, yes. But only tdx-host driver which implements link encryption cares
about these operation code. My preference is only define general the
SEAMCALL wrapper format in TDX core as other SEAMCALL do. And leave the
specific op code definition in tdx-host driver.

I can revise the commit log as:

   - TDH.SPDM.MNG supports various SPDM runtime operations: HEARTBEAT,
     KEY_UPDATE, DEV_INFO_RECOLLECTION... These operation codes are defined
     in tdx-host driver.
 
Thanks.

> 
> > +u64 tdh_exec_spdm_mng(u64 spdm_id, u64 spdm_op, struct page
> > *spdm_param,
> > +		      struct page *spdm_rsp, struct page *spdm_req,
> > +		      struct tdx_page_array *spdm_out,
> > +		      u64 *spdm_req_or_out_len)
> > +{
> > +	struct tdx_module_args args = {
> > +		.rcx = spdm_id,
> > +		.rdx = spdm_op,
> > +		.r8 = spdm_param ? page_to_phys(spdm_param) : -1,
> > +		.r9 = page_to_phys(spdm_rsp),
> > +		.r10 = page_to_phys(spdm_req),
> > +		.r11 = spdm_out ? hpa_array_t_assign_raw(spdm_out) : -1,
> > +	};
> > +	u64 r;
> > +
> > +	r = seamcall_ret_ir_exec(TDH_SPDM_MNG, &args);
> > +
> > +	*spdm_req_or_out_len = args.rcx;
> > +
> > +	return r;
> > +}
> > +EXPORT_SYMBOL_FOR_MODULES(tdh_exec_spdm_mng, "tdx-host");

^ permalink raw reply

* Re: [PATCH v2 24/31] coco/tdx-host: Add a helper to exchange SPDM messages through DOE
From: Xu Yilun @ 2026-04-22  9:41 UTC (permalink / raw)
  To: Tian, Kevin
  Cc: linux-coco@lists.linux.dev, linux-pci@vger.kernel.org,
	Williams, Dan J, x86@kernel.org, Gao, Chao, Jiang, Dave,
	baolu.lu@linux.intel.com, Xu, Yilun, Duan, Zhenzhong,
	kvm@vger.kernel.org, Edgecombe, Rick P,
	dave.hansen@linux.intel.com, kas@kernel.org, Li, Xiaoyao,
	Verma, Vishal L, linux-kernel@vger.kernel.org
In-Reply-To: <BN9PR11MB5276E1C73BFF4EF6A806F69D8C582@BN9PR11MB5276.namprd11.prod.outlook.com>

On Thu, Apr 09, 2026 at 07:56:06AM +0000, Tian, Kevin wrote:
> > From: Xu Yilun <yilun.xu@linux.intel.com>
> > Sent: Saturday, March 28, 2026 12:01 AM
> > +
> > +static int __maybe_unused tdx_spdm_msg_exchange(struct tdx_tsm_link
> > *tlink,
> > +						void *request, size_t
> > request_sz,
> > +						void *response, size_t
> > response_sz)
> > +{
> > +	struct pci_dev *pdev = tlink->pci.base_tsm.pdev;
> 
> call it pci_spdm_msg_exchange() and pass in struct pci_dev directly.

I don't think so. There is kernel managed spdm transfer support WIP,
which is another topic.  We don't want to mix the namespace with that
one.

And we also don't name it tsm_spdm_msg_exchange, cause TSM firmwares
output different blobs for vendor TSM drivers to transfer. E.g. TDX
Module outputs buffers with DOE header & SPDM header, other vendors
(AMD IIRC) outputs buffers with only SPDM header. So this function is
TDX specific.

> 
> there is no other use of tlink in this function. could add a note that
> this should be moved to pci core when a 2nd user of raw frame comes.

^ permalink raw reply

* Re: [PATCH v2 23/31] coco/tdx-host: Setup all trusted IOMMUs on TDX Connect init
From: Xu Yilun @ 2026-04-22  9:27 UTC (permalink / raw)
  To: Tian, Kevin
  Cc: linux-coco@lists.linux.dev, linux-pci@vger.kernel.org,
	Williams, Dan J, x86@kernel.org, Gao, Chao, Jiang, Dave,
	baolu.lu@linux.intel.com, Xu, Yilun, Duan, Zhenzhong,
	kvm@vger.kernel.org, Edgecombe, Rick P,
	dave.hansen@linux.intel.com, kas@kernel.org, Li, Xiaoyao,
	Verma, Vishal L, linux-kernel@vger.kernel.org
In-Reply-To: <BN9PR11MB527640A3D76B66DCE5FFA58C8C582@BN9PR11MB5276.namprd11.prod.outlook.com>

On Thu, Apr 09, 2026 at 07:51:56AM +0000, Tian, Kevin wrote:
> > From: Xu Yilun <yilun.xu@linux.intel.com>
> > Sent: Saturday, March 28, 2026 12:01 AM
> > 
> > Setup all trusted IOMMUs on TDX Connect initialization and clear all on
> > TDX Connect removal.
> > 
> > Trusted IOMMU setup is the pre-condition for all following TDX Connect
> > operations such as SPDM/IDE setup. It is more of a platform
> > configuration than a standalone IOMMU configuration, so put the
> > implementation in tdx-host driver.
> > 
> 
> not sure what above tries to tell. why is it a platform configuration
> when you have seamcalls on each IOMMU?

This is to say the TDH.IOMMU.SETUP relates to PCIe SPDM/IDE, it is not
just about IOMMU. By identifying the

  for_each_iommu(iommu)
	tdh.iommu.setup(iommu)

as a platform configuration, it justifies why we trigger this
configuration at tdx-host driver probe, rather than in some
IOMMU/IOMMUFD API.

^ permalink raw reply

* Re: [PATCH v5 1/2] dma-mapping: introduce DMA_ATTR_CC_SHARED for shared memory
From: Aneesh Kumar K.V @ 2026-04-22  9:18 UTC (permalink / raw)
  To: Jason Gunthorpe, Jiri Pirko
  Cc: dri-devel, linaro-mm-sig, iommu, linux-media, sumit.semwal,
	benjamin.gaignard, Brian.Starkey, jstultz, tjmercier,
	christian.koenig, m.szyprowski, robin.murphy, leon, sean.anderson,
	ptesarik, catalin.marinas, suzuki.poulose, steven.price,
	thomas.lendacky, john.allen, ashish.kalra, suravee.suthikulpanit,
	linux-coco
In-Reply-To: <20260421121004.GA3611611@ziepe.ca>

Jason Gunthorpe <jgg@ziepe.ca> writes:

> On Tue, Apr 21, 2026 at 01:53:31PM +0200, Jiri Pirko wrote:
>> >> You reach there when is_swiotlb_force_bounce(dev) is true and
>> >> DMA_ATTR_CC_SHARED is set. What am I missing?
>> >>
>> >
>> >So a swiotlb_force_bounce will not use swiotlb bouncing if
>> >DMA_ATTR_CC_SHARED is set ? 
>> 
>> Correct. Bouncing does not make sense in this case, as shared memory is
>> already being mapped.
>
> It is a little bit mangled, there are many reasons force_swiotlb can
> be set, but we loose them as it flows through - swiotlb_init()
> just has a simple SWIOTLB_FORCE
>
> Ideally DMA_ATTR_CC_SHARED would skip swiotlb only if it is being
> selected for CC reasons. For instance if you have the swiotlb force
> command line parameter I would still expect it bounce shared memory.
>
> Arguably I think this arch flow is misdesigned, the
> is_swiotlb_force_bounce() should not be used for CC. dma_capable() is
> the correct API to check if the device can DMA to the presented
> address, and it will trigger swiotlb_map() just the same without
> creating this gap.
>
> Jason

Something like this?

static inline dma_addr_t dma_direct_map_phys(struct device *dev,
		phys_addr_t phys, size_t size, enum dma_data_direction dir,
		unsigned long attrs, bool flush)
{
	dma_addr_t dma_addr;

	if (is_swiotlb_force_bounce(dev)) {
		if (attrs & (DMA_ATTR_MMIO | DMA_ATTR_REQUIRE_COHERENT))
			return DMA_MAPPING_ERROR;

		return swiotlb_map(dev, phys, size, dir, attrs);
	}

	if (attrs & DMA_ATTR_MMIO) {
		dma_addr = phys;
		if (unlikely(!dma_capable(dev, dma_addr, size, false, attrs)))
			goto err_overflow;
		goto dma_mapped;
	} else if (attrs & DMA_ATTR_CC_SHARED) {
		dma_addr = phys_to_dma_unencrypted(dev, phys);
	} else {
		dma_addr = phys_to_dma_encrypted(dev, phys);
	}

	if (unlikely(!dma_capable(dev, dma_addr, size, true, attrs)) ||
	    dma_kmalloc_needs_bounce(dev, size, dir)) {
		if (is_swiotlb_active(dev) &&
		    !(attrs & DMA_ATTR_REQUIRE_COHERENT))
			return swiotlb_map(dev, phys, size, dir, attrs);
		goto err_overflow;
	}

dma_mapped:
	if (!dev_is_dma_coherent(dev) &&
	    !(attrs & (DMA_ATTR_SKIP_CPU_SYNC | DMA_ATTR_MMIO))) {
		arch_sync_dma_for_device(phys, size, dir);
		if (flush)
			arch_sync_dma_flush();
	}
	return dma_addr;

and dma_capable() now does
static inline bool dma_capable(struct device *dev, dma_addr_t addr, size_t size,
		bool is_ram, unsigned long attrs)
{
....

	/*
	 * if phys addr attribute is encrypted but the
	 * device is forcing an encrypted dma addr
	 */
	if (!(attrs & DMA_ATTR_CC_SHARED) && force_dma_unencrypted(dev))
		return false;
...

}


-aneesh

^ permalink raw reply

* Re: [PATCH 0/3] arm64/virt: Add Arm CCA measurement register support
From: Sami Mujawar @ 2026-04-22  8:57 UTC (permalink / raw)
  To: Jason Gunthorpe, Suzuki Poulose
  Cc: Dan Williams, linux-arm-kernel@lists.infradead.org,
	linux-kernel@vger.kernel.org, Catalin Marinas, will@kernel.org,
	thuth@redhat.com, Steven Price, gshan@redhat.com, Yeoreum Yun,
	cedric.xing@intel.com, Dan Williams, Dionna Glaze,
	Aneesh Kumar K . V, Alexey Kardashevskiy,
	linux-coco@lists.linux.dev
In-Reply-To: <20260414133525.GA2577880@ziepe.ca>

Hi Jason,

> On Tue, Apr 14, 2026 at 02:26:58PM +0100, Suzuki K Poulose wrote:
> > On 14/04/2026 13:29, Jason Gunthorpe wrote:
> > > On Tue, Apr 14, 2026 at 11:10:51AM +0100, Suzuki K Poulose wrote:
> > >
> > > > > Isn't this also sort of incomplete?  Doesn't anything serious need
> > > > > signed measurements? Isnt't there alot more data that comes out of RMM
> > > > > than just a few measurement registers?
> > > > As mentioned above, this series adds the support for Runtime Extendible
> > > > Measurements (REM in CCA, RTMR on TDX). The RIM+Platform Attestation is
> > > > already provided via the TSM_REPORT
> > >
> > > Okay, but what actual use is this?
> > >
> >
> > Good point. This REMs are planned to be used for EFI_CC_MEASUREMENT_PROTOCOL
> > as described below:
> >
> > https://github.com/tianocore/edk2/issues/11383
> 
> So this is tying it to the same FW event log that TPM uses.
> 
> I think that strengthens my point this should all be uninform. TPM
> drivers are directly exposing the event log today, but I guess that
> needs generalization if non-TPM drivers are going to present it as
> well.
> 
> How do you imagine getting and manipulating the EFI event log to use
> with this?

The event logs from UEFI will be handed off to the OS using CCEL ACPI table. The CCEL table spec update can be seen at  https://github.com/tianocore/edk2/issues/11384 

Regards,

Sami Mujawar
> Jason
>



^ permalink raw reply

* Re: [PATCH v5 1/2] dma-mapping: introduce DMA_ATTR_CC_SHARED for shared memory
From: Petr Tesarik @ 2026-04-22  7:51 UTC (permalink / raw)
  To: Jason Gunthorpe
  Cc: Jiri Pirko, Aneesh Kumar K.V, dri-devel, linaro-mm-sig, iommu,
	linux-media, sumit.semwal, benjamin.gaignard, Brian.Starkey,
	jstultz, tjmercier, christian.koenig, m.szyprowski, robin.murphy,
	leon, sean.anderson, catalin.marinas, suzuki.poulose,
	steven.price, thomas.lendacky, john.allen, ashish.kalra,
	suravee.suthikulpanit, linux-coco
In-Reply-To: <20260421121004.GA3611611@ziepe.ca>

On Tue, 21 Apr 2026 09:10:04 -0300
Jason Gunthorpe <jgg@ziepe.ca> wrote:

> On Tue, Apr 21, 2026 at 01:53:31PM +0200, Jiri Pirko wrote:
> > >> You reach there when is_swiotlb_force_bounce(dev) is true and
> > >> DMA_ATTR_CC_SHARED is set. What am I missing?
> > >>  
> > >
> > >So a swiotlb_force_bounce will not use swiotlb bouncing if
> > >DMA_ATTR_CC_SHARED is set ?   
> > 
> > Correct. Bouncing does not make sense in this case, as shared memory is
> > already being mapped.  
> 
> It is a little bit mangled, there are many reasons force_swiotlb can
> be set, but we loose them as it flows through - swiotlb_init()
> just has a simple SWIOTLB_FORCE
> 
> Ideally DMA_ATTR_CC_SHARED would skip swiotlb only if it is being
> selected for CC reasons. For instance if you have the swiotlb force
> command line parameter I would still expect it bounce shared memory.
> 
> Arguably I think this arch flow is misdesigned, the
> is_swiotlb_force_bounce() should not be used for CC. dma_capable() is
> the correct API to check if the device can DMA to the presented
> address, and it will trigger swiotlb_map() just the same without
> creating this gap.

Seconded.

Then again, the whole DMA mapping logic is extremely convoluted, with
dmaops, direct, CMA, and swiotlb, so I'm no longer sure there is one
undisputable way where CC shared mappings should be added to the mix.

Has anyone considered a cleaner design yet? If yes, I'm volunteering to
help implement it. If not, then please ignore me as a random rant.

Petr T

^ permalink raw reply

* Re: [PATCH v2 22/31] iommu/vt-d: Export a helper to do function for each dmar_drhd_unit
From: Xu Yilun @ 2026-04-22  6:33 UTC (permalink / raw)
  To: Tian, Kevin
  Cc: linux-coco@lists.linux.dev, linux-pci@vger.kernel.org,
	Williams, Dan J, x86@kernel.org, Gao, Chao, Jiang, Dave,
	baolu.lu@linux.intel.com, Xu, Yilun, Duan, Zhenzhong,
	kvm@vger.kernel.org, Edgecombe, Rick P,
	dave.hansen@linux.intel.com, kas@kernel.org, Li, Xiaoyao,
	Verma, Vishal L, linux-kernel@vger.kernel.org
In-Reply-To: <BN9PR11MB527698CD6C350B371C47540F8C582@BN9PR11MB5276.namprd11.prod.outlook.com>

On Thu, Apr 09, 2026 at 07:49:46AM +0000, Tian, Kevin wrote:
> > From: Xu Yilun <yilun.xu@linux.intel.com>
> > Sent: Saturday, March 28, 2026 12:01 AM
> > 
> > @@ -86,6 +86,8 @@ extern struct list_head dmar_drhd_units;
> >  				dmar_rcu_check())			\
> >  		if (i=drhd->iommu, 0) {} else
> > 
> > +int do_for_each_drhd_unit(int (*fn)(struct dmar_drhd_unit *));
> > +
> >  static inline bool dmar_rcu_check(void)
> 
> It's a bit weird to insert it here. Move it to follow for_each_iommu().

Sorry, it is following for_each_iommu(), is it?

> 
> > +
> > +int do_for_each_drhd_unit(int (*fn)(struct dmar_drhd_unit *))
> > +{
> > +	struct dmar_drhd_unit *drhd;
> > +	int ret;
> > +
> > +	guard(rwsem_read)(&dmar_global_lock);
> > +
> > +	for_each_drhd_unit(drhd) {
> > +		ret = fn(drhd);
> > +		if (ret)
> > +			return ret;
> > +	}
> > +	return 0;
> > +}
> 
> use for_each_active_drhd_unit(). or is there need to setup the trusted
> configuration even on ignored iommu?

No, for_each_active_drhd_unit() is good.

^ permalink raw reply

* Re: [PATCH v2 21/31] x86/virt/tdx: Add SEAMCALL wrappers for trusted IOMMU setup and clear
From: Xu Yilun @ 2026-04-22  6:32 UTC (permalink / raw)
  To: Tian, Kevin
  Cc: linux-coco@lists.linux.dev, linux-pci@vger.kernel.org,
	Williams, Dan J, x86@kernel.org, Gao, Chao, Jiang, Dave,
	baolu.lu@linux.intel.com, Xu, Yilun, Duan, Zhenzhong,
	kvm@vger.kernel.org, Edgecombe, Rick P,
	dave.hansen@linux.intel.com, kas@kernel.org, Li, Xiaoyao,
	Verma, Vishal L, linux-kernel@vger.kernel.org
In-Reply-To: <BN9PR11MB52765F4804B8E2223BBB0B1F8C582@BN9PR11MB5276.namprd11.prod.outlook.com>

On Thu, Apr 09, 2026 at 07:30:32AM +0000, Tian, Kevin wrote:
> > From: Xu Yilun <yilun.xu@linux.intel.com>
> > Sent: Saturday, March 28, 2026 12:01 AM
> > 
> > From: Zhenzhong Duan <zhenzhong.duan@intel.com>
> > 
> > Add SEAMCALLs to setup/clear trusted IOMMU for TDX Connect.
> 
> what is 'trusted IOMMU'? a new hardware, or some sensitive resource in
> the IOMMU which is only visible to TDX module?

Some extended HW resources in IOMMU so I think the later.

> 
> If the latter it's clearer to say "trusted configuration in IOMMU".

Yeah. And I realized there are more configuration except IOMMU, so I
would say:

  Add SEAMCALLs to setup/clear the IOMMU device and related I/O stack to
  work in trusted (TDX) mode.

> 
> > 
> > Enable TEE I/O support for a target device requires to setup trusted IOMMU
> > for the related IOMMU device first, even only for enabling physical secure
> > links like SPDM/IDE.
> 
> this series is just about SPDM/IDE. then the first part about TEE I/O is not
> really relevant.

This is truely obscure. I want to clarify some potential concern about
why we need to setup IOMMU when only to enable PCIe link encryption, my
re-phase:

    With the setup SEAMCALL, TDX Module ensures that related resources in
    the IOMMU device & I/O stack are in expected state and protected from
    further untrusted access, so that subsequent SPDM/IDE enabling is
    secure.

> 
> > 
> > TDH.IOMMU.SETUP takes the register base address (VTBAR) to position an
> > IOMMU device, and outputs an IOMMU_ID as the trusted IOMMU identifier.
> > TDH.IOMMU.CLEAR takes the IOMMU_ID to reverse the setup.
> 
> Intel IOMMU is called VT-d. It has a register block but not a PCI device so
> there is no BAR resource related.
> 
> let's just call it 'reg_base'

Yes.

> 
> intel-iommu driver already has its own 'id' definition for each iommu device.
> It's clearer to add a prefix to this new id, e.g. tdx_iommu_id?

Yes.

^ permalink raw reply

* Re: [PATCH v2 20/31] x86/virt/tdx: Add a helper to loop on TDX_INTERRUPTED_RESUMABLE
From: Xu Yilun @ 2026-04-22  6:04 UTC (permalink / raw)
  To: Tian, Kevin
  Cc: linux-coco@lists.linux.dev, linux-pci@vger.kernel.org,
	Williams, Dan J, x86@kernel.org, Gao, Chao, Jiang, Dave,
	baolu.lu@linux.intel.com, Xu, Yilun, Duan, Zhenzhong,
	kvm@vger.kernel.org, Edgecombe, Rick P,
	dave.hansen@linux.intel.com, kas@kernel.org, Li, Xiaoyao,
	Verma, Vishal L, linux-kernel@vger.kernel.org
In-Reply-To: <BN9PR11MB52767E6C22CDF7DAD9D5FF1E8C582@BN9PR11MB5276.namprd11.prod.outlook.com>

On Thu, Apr 09, 2026 at 07:21:48AM +0000, Tian, Kevin wrote:
> > From: Xu Yilun <yilun.xu@linux.intel.com>
> > Sent: Saturday, March 28, 2026 12:01 AM
> > 
> > +static u64 __maybe_unused __seamcall_ir_resched(sc_func_t sc_func, u64
> > fn,
> > +						struct tdx_module_args *args)
> > +{
> 
> 'ir' sounds redundant with the trailing 'resched'?

Mm.. I want to 'ir' to reflect the loop-retry is dedicated for
INTERRUPTED_RESUMABLE in TDX context. When you say not big deal, I
assume I can keep the naming?

> 
> not big deal, just a bit confusing when seeing it in IOMMU side where
> 'ir' also refers to 'interrupt remapping' and is frequently used in 
> irq_remapping.c... :)

^ permalink raw reply

* Re: [PATCH v2 19/31] iommu/vt-d: Reserve the MSB domain ID bit for the TDX module
From: Xu Yilun @ 2026-04-22  6:00 UTC (permalink / raw)
  To: Tian, Kevin
  Cc: linux-coco@lists.linux.dev, linux-pci@vger.kernel.org,
	Williams, Dan J, x86@kernel.org, Gao, Chao, Jiang, Dave,
	baolu.lu@linux.intel.com, Xu, Yilun, Duan, Zhenzhong,
	kvm@vger.kernel.org, Edgecombe, Rick P,
	dave.hansen@linux.intel.com, kas@kernel.org, Li, Xiaoyao,
	Verma, Vishal L, linux-kernel@vger.kernel.org
In-Reply-To: <BN9PR11MB52767F28E9FF692ACE69F3D78C582@BN9PR11MB5276.namprd11.prod.outlook.com>

> Here we need more words to explain the strategy here.
> 
> The comment says "When IOMMU is *enabled*...", but the code here
> just checks the static capability. It's probably a design choice that you
> don't want to add complexity on recycling DIDs when TDX connect
> is actually enabled, but it's worth a note here.

Yes, that's the rationale. I'll add it to comments.

> 
> btw in patch23 commit msg:
> 
> "
> There is no dedicated way to enumerate which IOMMU devices support
> trusted operations. The host has to call TDH.IOMMU.SETUP on all IOMMU
> devices and tell their trusted capability by the return value.
> "
> 
> which implies that ecap_tdxc() alone doesn't really report the capability?

Ah, good catch. Let me explain:

ecap_tdxc does report the capability. This bit is special cause both
trusted part & untrusted part access it.

For IOMMU driver (which now handles the untrusted part), it can directly
query to this bit and decide what to do.

But for tdx-host driver which handles the trusted part, it shouldn't
speculate into the IOMMU for capability enumeration. TDX Module has more
concerns about trusted capability, including the related I/O stack
capabilities e.g. SPDM/IDE cap...  So in patch23 I actually mean we
don't have an enumeration SEAMCALL for trusted capability, I will
refactor that message:

    There is no dedicated *SEAMCALL* to enumerate which IOMMU devices support
    trusted operations...

> 
> anyway all of those need a better explanation here...

^ permalink raw reply

* SVSM Development Call April 22, 2026
From: Jörg Rödel @ 2026-04-22  6:12 UTC (permalink / raw)
  To: coconut-svsm, linux-coco

Hi,

Here is the call for agenda items for this weeks SVSM development call.  Please
send any agenda items you have in mind as a reply to this email or raise them
in the meeting.

We will use the LF Zoom instance. Details of the meeting  can be found in our
governance repository at:

	https://github.com/coconut-svsm/governance

The link to the COCONUT-SVSM calendar is:

	https://zoom-lfx.platform.linuxfoundation.org/meetings/coconut-svsm?view=week

The meeting will be recorded and the recording eventually published.

Regards,

	Jörg

^ permalink raw reply

* Re: [PATCH v2 00/16] fs,x86/resctrl: Add kernel-mode (e.g., PLZA) support to the resctrl subsystem
From: Reinette Chatre @ 2026-04-22  2:56 UTC (permalink / raw)
  To: Moger, Babu, Babu Moger, corbet@lwn.net, tony.luck@intel.com,
	Dave.Martin@arm.com, james.morse@arm.com, tglx@kernel.org,
	mingo@redhat.com, bp@alien8.de, dave.hansen@linux.intel.com
  Cc: skhan@linuxfoundation.org, x86@kernel.org, hpa@zytor.com,
	peterz@infradead.org, juri.lelli@redhat.com,
	vincent.guittot@linaro.org, dietmar.eggemann@arm.com,
	rostedt@goodmis.org, bsegall@google.com, mgorman@suse.de,
	vschneid@redhat.com, kas@kernel.org, rick.p.edgecombe@intel.com,
	akpm@linux-foundation.org, pmladek@suse.com,
	rdunlap@infradead.org, dapeng1.mi@linux.intel.com,
	kees@kernel.org, elver@google.com, paulmck@kernel.org,
	lirongqing@baidu.com, safinaskar@gmail.com, fvdl@google.com,
	seanjc@google.com, pawan.kumar.gupta@linux.intel.com,
	xin@zytor.com, tiala@microsoft.com, chang.seok.bae@intel.com,
	Lendacky, Thomas, elena.reshetova@intel.com,
	linux-doc@vger.kernel.org, linux-kernel@vger.kernel.org,
	linux-coco@lists.linux.dev, kvm@vger.kernel.org,
	eranian@google.com, peternewman@google.com
In-Reply-To: <39da36be-40a3-45cb-8e49-12dbb59aca74@amd.com>

Hi Babu,

On 4/21/26 5:17 PM, Moger, Babu wrote:
> On 4/21/2026 5:44 PM, Reinette Chatre wrote:
>> On 4/21/26 3:04 PM, Moger, Babu wrote:

>>> That said, I agree we need to support this. Without it, we won’t be able to move the group from PLZA to non-PLZA.
>>>
>>> # cat info/kernel_mode
>>>      inherit_ctrl_and_mon:
>>>      global_assign_ctrl_assign_mon_per_cpu:group=uninitialized
>>>      [global_assign_ctrl_assign_mon_per_cpu]:group=ctrl1/mon1/
>>
>> Like above where the listing is inconsistent. Is this what you mean?
> 
> I meant the listing of "inherit_ctrl_and_mon" does not have groups while other modes have it.

I think this is ok since it does not need a group or any other (for now?) property.
What issues do you foresee here?

>>
>> sidenote: Should the last line be "[global_assign_ctrl_assign_mon_per_cpu:group=ctrl1/mon1/]"?
> 
> Yes.
> 
>>
>>>
>>> # echo "inherit_ctrl_and_mon:group=ctrl1/mon1/" > info/kernel_mode
>>
>> This does not look right. Why is a "group" property needed here? Can the mode not just
>> be set by itself? Specifically, why not just:
>>
>>     # echo "inherit_ctrl_and_mon" > info/kernel_mode
> 
> We can go with this based on your another comment below. While changing the mode use the defaults if properties are not provided.
> 
> 
>>
>> This reminds me that there is still an open remaining from
>> https://lore.kernel.org/lkml/71099958-1ddf-40dc-8a3c-aa13d0c56fee@intel.com/
>> Specifically this from that message:
>>     The named fields could be made optional, if group is omitted then it will become the
>>     default resource group, and if cpus/cpus_list is omitted then it will default to all CPUs.
>>     This may not be intuitive since a user may expect that not mentioning a field means
>>     that the field is left untouched. Have you considered this scenario in your proposal?
>>
>> I think this needs some clear description of behavior wrt properties, for example:
>> - Is it required to provide all properties on each write? More specifically, can user expect there
>>    to be "default" values when a property is not provided or is user required to provide a value
>>    for each property? We need to be careful here because we do not want user scripts to fail when a new
>>    property is added in the future. What if resctrl specifies that if user space does not provide
>>    a property then resctrl will pick a default. For example, if user runs:
>>     # echo "global_assign_ctrl_assign_mon_per_cpu" > info/kernel_mode
>>    then resctrl will switch to "global_assign_ctrl_assign_mon_per_cpu" mode initialized to
>>    the default group.
>>    I am not sure if resctrl needs to support re-configuration of modes in the future where the
>>    mode stays the same but a property changes? Consider, for example,
>>
>>     # cat info/kernel_mode
>>     [inherit_ctrl_and_mon:]
>>     global_assign_ctrl_assign_mon_per_cpu:group=uninitialized
>>
>>     # echo "global_assign_ctrl_assign_mon_per_cpu" > info/kernel_mode
>>     /*
>>      * resctrl switches to "global_assign_ctrl_assign_mon_per_cpu" mode and sets
>>      * PLZA group to default group
>>      */
>>     # cat info/kernel_mode
>>     inherit_ctrl_and_mon:
>>     [global_assign_ctrl_assign_mon_per_cpu:group=//]
>>     # echo "global_assign_ctrl_assign_mon_per_cpu:group=ctrl1/mon1/" > info/kernel_mode
>>     /*
>>      * resctrl stays in "global_assign_ctrl_assign_mon_per_cpu" mode and sets
>>      * PLZA group to default group
>>      */
> 
> I think you meant "PLZA group to ctrl1/mon1/" here.

Indeed, yes. Thank you.

> 
>>     # cat info/kernel_mode
>>     inherit_ctrl_and_mon:
>>     [global_assign_ctrl_assign_mon_per_cpu:group=ctrl1/mon1/]
>>     # echo "global_assign_ctrl_assign_mon_per_cpu" > info/kernel_mode
>>     /*
>>      * TBD: should resctrl switch back to default group or just keep
>>      * group as ctrl1/mon1/ ?
>>      */
>>
>>    resctrl could thus specify different behavior for switching to a mode where all properties
>>    not specified obtains default values and re-configuring a mode where only specified
>>    properties are changed. That means, the "TBD" above would be that the group stays
>>    as ctrl1/mon1/. So,
>>     # cat info/kernel_mode
>>     inherit_ctrl_and_mon:
>>     [global_assign_ctrl_assign_mon_per_cpu:group=ctrl1/mon1/]
>>
>>    What do you think?
> 
> Yes. Sure. We can do that. We only have 2 properties now (mode and group). We should be able to handle that.

Thank you for considering.

Reinette

^ permalink raw reply

* Re: [PATCH v2 08/31] x86/virt/tdx: Configure TDX Module with optional TDX Connect feature
From: Dan Williams @ 2026-04-22  1:19 UTC (permalink / raw)
  To: Xu Yilun, linux-coco, linux-pci, dan.j.williams, x86
  Cc: chao.gao, dave.jiang, baolu.lu, yilun.xu, yilun.xu,
	zhenzhong.duan, kvm, rick.p.edgecombe, dave.hansen, kas,
	xiaoyao.li, vishal.l.verma, linux-kernel
In-Reply-To: <20260327160132.2946114-9-yilun.xu@linux.intel.com>

Xu Yilun wrote:
> TDX Module supports optional TDX features (e.g. TDX Connect & TDX Module
> Extensions) that won't be enabled by default.

So this is another place where "optional" is misleading. For simplicity
there is no mechanism to fallback from TDX Connect operation if present,
at least in the core. The only optional aspects would be the mechanism
that could be unloaded through the tdx_host driver.

.../me notices that other comments on this patch say the same, but do
read on, another important detail about ktime_get_real_seconds() below.

> It extends TDH.SYS.CONFIG for host to choose to enable them on bootup.
> 
> Call TDH.SYS.CONFIG with a new bitmap input parameter to specify which
> features to enable. The bitmap uses the same definitions as
> TDX_FEATURES0. But note not all bits in TDX_FEATURES0 are valid for
> configuration, e.g. TDX Module Extensions is a service that supports TDX
> Connect, it is implicitly enabled when TDX Connect is enabled. Setting
> TDX_FEATURES0_EXT in the bitmap has no effect.
> 
> TDX Module advances the version of TDH.SYS.CONFIG for the change, so
> use the latest version (v1) for optional feature enabling. But
> supporting existing Modules which only support v0 is still necessary
> until they are deprecated, enumerate via TDX_FEATURES0 to decide which
> version to use.

I would say this differently, it will always be the case that new
kernels are needed to enable new features, but it is unlikely that
TDH.SYS.CONFIG ever needs to change again. The v0 -> v1 transition means
that feature bits are to be used here on out. So there is little value
in worrying about deprecating v0 to save a couple lines of code in 5-7
years when these original TDX platforms sunset.

> TDX Module updates global metadata when optional features are enabled.
> Host should update the cached tdx_sysinfo to reflect these changes.
> 
> Co-developed-by: Zhenzhong Duan <zhenzhong.duan@intel.com>
> Signed-off-by: Zhenzhong Duan <zhenzhong.duan@intel.com>
> Signed-off-by: Xu Yilun <yilun.xu@linux.intel.com>
> ---
>  arch/x86/virt/vmx/tdx/tdx.h |  3 ++-
>  arch/x86/virt/vmx/tdx/tdx.c | 16 +++++++++++++++-
>  2 files changed, 17 insertions(+), 2 deletions(-)
> 
> diff --git a/arch/x86/virt/vmx/tdx/tdx.h b/arch/x86/virt/vmx/tdx/tdx.h
> index e5a9331df451..870bb75da3ba 100644
> --- a/arch/x86/virt/vmx/tdx/tdx.h
> +++ b/arch/x86/virt/vmx/tdx/tdx.h
> @@ -58,7 +58,8 @@
>  #define TDH_PHYMEM_CACHE_WB		40
>  #define TDH_PHYMEM_PAGE_WBINVD		41
>  #define TDH_VP_WR			43
> -#define TDH_SYS_CONFIG			45
> +#define TDH_SYS_CONFIG_V0		45
> +#define TDH_SYS_CONFIG			SEAMCALL_LEAF_VER(TDH_SYS_CONFIG_V0, 1)
>  
>  /* TDX page types */
>  #define	PT_NDA		0x0
> diff --git a/arch/x86/virt/vmx/tdx/tdx.c b/arch/x86/virt/vmx/tdx/tdx.c
> index 130214933c2f..0c5d6bdd810f 100644
> --- a/arch/x86/virt/vmx/tdx/tdx.c
> +++ b/arch/x86/virt/vmx/tdx/tdx.c
> @@ -1353,6 +1353,7 @@ static int construct_tdmrs(struct list_head *tmb_list,
>  static int config_tdx_module(struct tdmr_info_list *tdmr_list, u64 global_keyid)
>  {
>  	struct tdx_module_args args = {};
> +	u64 seamcall_fn = TDH_SYS_CONFIG_V0;
>  	u64 *tdmr_pa_array;
>  	size_t array_sz;
>  	int i, ret;
> @@ -1377,7 +1378,15 @@ static int config_tdx_module(struct tdmr_info_list *tdmr_list, u64 global_keyid)
>  	args.rcx = __pa(tdmr_pa_array);
>  	args.rdx = tdmr_list->nr_consumed_tdmrs;
>  	args.r8 = global_keyid;
> -	ret = seamcall_prerr(TDH_SYS_CONFIG, &args);
> +
> +	if (tdx_sysinfo.features.tdx_features0 & TDX_FEATURES0_TDXCONNECT) {
> +		args.r9 |= TDX_FEATURES0_TDXCONNECT;
> +		args.r11 = ktime_get_real_seconds();

Mainline has reason to not entertain this module requirement. The fact
that passing zero is an error is useful to detect unsupported modules.
An updated module would accept zero as indicating "VMM requests module
disable all policy and mechanisms related to untrusted wall clock time".
Specifically, there are several problems with this:

1/ No other TSM implementation requires the VMM to pass in an untrusted time
2/ The wall time may change and may require hooks to keep the module time
   up to date, but see point 1/, this would be a TDX special flower hook.
3/ Presumably this allows the module or the guest to do certificate expiration
   checks, but that is the responsibility of the relying party. The
   relying party may have reason to accept an "expired" cert as determined
   by VMM wall clock, and the guest presumably already has mechanisms to
   determine untrusted wall clock time from the VMM if it wants. Guests
   do not need TDX ABI for that.

So I think Linux wants to pass 0 here and wait for modules that accept
that as the start of TDX Connect support. As you said, given there are
no released modules with TDX Connect there is time to make that first
release drop this requirement.

^ permalink raw reply

* Re: [PATCH v2 00/16] fs,x86/resctrl: Add kernel-mode (e.g., PLZA) support to the resctrl subsystem
From: Moger, Babu @ 2026-04-22  0:17 UTC (permalink / raw)
  To: Reinette Chatre, Babu Moger, corbet@lwn.net, tony.luck@intel.com,
	Dave.Martin@arm.com, james.morse@arm.com, tglx@kernel.org,
	mingo@redhat.com, bp@alien8.de, dave.hansen@linux.intel.com
  Cc: skhan@linuxfoundation.org, x86@kernel.org, hpa@zytor.com,
	peterz@infradead.org, juri.lelli@redhat.com,
	vincent.guittot@linaro.org, dietmar.eggemann@arm.com,
	rostedt@goodmis.org, bsegall@google.com, mgorman@suse.de,
	vschneid@redhat.com, kas@kernel.org, rick.p.edgecombe@intel.com,
	akpm@linux-foundation.org, pmladek@suse.com,
	rdunlap@infradead.org, dapeng1.mi@linux.intel.com,
	kees@kernel.org, elver@google.com, paulmck@kernel.org,
	lirongqing@baidu.com, safinaskar@gmail.com, fvdl@google.com,
	seanjc@google.com, pawan.kumar.gupta@linux.intel.com,
	xin@zytor.com, tiala@microsoft.com, chang.seok.bae@intel.com,
	Lendacky, Thomas, elena.reshetova@intel.com,
	linux-doc@vger.kernel.org, linux-kernel@vger.kernel.org,
	linux-coco@lists.linux.dev, kvm@vger.kernel.org,
	eranian@google.com, peternewman@google.com
In-Reply-To: <de608041-bc45-4ca0-81fe-423a5167d7d0@intel.com>

Hi Reinette,

On 4/21/2026 5:44 PM, Reinette Chatre wrote:
> Hi Babu,
> 
> On 4/21/26 3:04 PM, Moger, Babu wrote:
>> My bad. My only motivation was to keep the mode listing display consistent.
> 
> The listing display is already inconsistent since the different modes have different
> global properties, no?
> 

Yes. That is true.

>>
>> That said, I agree we need to support this. Without it, we won’t be able to move the group from PLZA to non-PLZA.
>>
>> # cat info/kernel_mode
>>      inherit_ctrl_and_mon:
>>      global_assign_ctrl_assign_mon_per_cpu:group=uninitialized
>>      [global_assign_ctrl_assign_mon_per_cpu]:group=ctrl1/mon1/
> 
> Like above where the listing is inconsistent. Is this what you mean?

I meant the listing of "inherit_ctrl_and_mon" does not have groups while 
other modes have it.

> 
> sidenote: Should the last line be "[global_assign_ctrl_assign_mon_per_cpu:group=ctrl1/mon1/]"?

Yes.

> 
>>
>> # echo "inherit_ctrl_and_mon:group=ctrl1/mon1/" > info/kernel_mode
> 
> This does not look right. Why is a "group" property needed here? Can the mode not just
> be set by itself? Specifically, why not just:
> 
> 	# echo "inherit_ctrl_and_mon" > info/kernel_mode

We can go with this based on your another comment below. While changing 
the mode use the defaults if properties are not provided.


> 
> This reminds me that there is still an open remaining from
> https://lore.kernel.org/lkml/71099958-1ddf-40dc-8a3c-aa13d0c56fee@intel.com/
> Specifically this from that message:
> 	The named fields could be made optional, if group is omitted then it will become the
> 	default resource group, and if cpus/cpus_list is omitted then it will default to all CPUs.
> 	This may not be intuitive since a user may expect that not mentioning a field means
> 	that the field is left untouched. Have you considered this scenario in your proposal?
> 
> I think this needs some clear description of behavior wrt properties, for example:
> - Is it required to provide all properties on each write? More specifically, can user expect there
>    to be "default" values when a property is not provided or is user required to provide a value
>    for each property? We need to be careful here because we do not want user scripts to fail when a new
>    property is added in the future. What if resctrl specifies that if user space does not provide
>    a property then resctrl will pick a default. For example, if user runs:
> 	# echo "global_assign_ctrl_assign_mon_per_cpu" > info/kernel_mode
>    then resctrl will switch to "global_assign_ctrl_assign_mon_per_cpu" mode initialized to
>    the default group.
>    I am not sure if resctrl needs to support re-configuration of modes in the future where the
>    mode stays the same but a property changes? Consider, for example,
> 
> 	# cat info/kernel_mode
> 	[inherit_ctrl_and_mon:]
> 	global_assign_ctrl_assign_mon_per_cpu:group=uninitialized
> 
> 	# echo "global_assign_ctrl_assign_mon_per_cpu" > info/kernel_mode
> 	/*
> 	 * resctrl switches to "global_assign_ctrl_assign_mon_per_cpu" mode and sets
> 	 * PLZA group to default group
> 	 */
> 	# cat info/kernel_mode
> 	inherit_ctrl_and_mon:
> 	[global_assign_ctrl_assign_mon_per_cpu:group=//]
> 	# echo "global_assign_ctrl_assign_mon_per_cpu:group=ctrl1/mon1/" > info/kernel_mode
> 	/*
> 	 * resctrl stays in "global_assign_ctrl_assign_mon_per_cpu" mode and sets
> 	 * PLZA group to default group
> 	 */

I think you meant "PLZA group to ctrl1/mon1/" here.

> 	# cat info/kernel_mode
> 	inherit_ctrl_and_mon:
> 	[global_assign_ctrl_assign_mon_per_cpu:group=ctrl1/mon1/]
> 	# echo "global_assign_ctrl_assign_mon_per_cpu" > info/kernel_mode
> 	/*
> 	 * TBD: should resctrl switch back to default group or just keep
> 	 * group as ctrl1/mon1/ ?
> 	 */
> 
>    resctrl could thus specify different behavior for switching to a mode where all properties
>    not specified obtains default values and re-configuring a mode where only specified
>    properties are changed. That means, the "TBD" above would be that the group stays
>    as ctrl1/mon1/. So,
> 	# cat info/kernel_mode
> 	inherit_ctrl_and_mon:
> 	[global_assign_ctrl_assign_mon_per_cpu:group=ctrl1/mon1/]
> 
>    What do you think?

Yes. Sure. We can do that. We only have 2 properties now (mode and 
group). We should be able to handle that.


> 
>> # cat info/kernel_mode
>>      inherit_ctrl_and_mon:
>>      global_assign_ctrl_assign_mon_per_cpu:group=uninitialized
>>      [global_assign_ctrl_assign_mon_per_cpu]:group=uninitialized
> This does not look right. After switching the kernel_mode to inherit_ctrl_and_mon
> I expect inherit_ctrl_and_mon to be the active mode?

Yes. inherit_ctrl_and_mon should be active here.

Thanks
Babu



^ permalink raw reply


This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox