* Re: [PATCH v7 08/22] x86/virt/seamldr: Allocate and populate a module update request
From: Edgecombe, Rick P @ 2026-04-11 0:33 UTC (permalink / raw)
To: Hansen, Dave, Gao, Chao
Cc: linux-kernel@vger.kernel.org, linux-coco@lists.linux.dev,
Huang, Kai, kvm@vger.kernel.org, Li, Xiaoyao, Zhao, Yan Y,
dave.hansen@linux.intel.com, tony.lindgren@linux.intel.com,
Chatre, Reinette, seanjc@google.com, pbonzini@redhat.com,
binbin.wu@linux.intel.com, Weiny, Ira, nik.borisov@suse.com,
mingo@redhat.com, Verma, Vishal L, kas@kernel.org,
sagis@google.com, Annapurve, Vishal, hpa@zytor.com,
tglx@kernel.org, paulmck@kernel.org, bp@alien8.de,
yilun.xu@linux.intel.com, dan.j.williams@intel.com,
x86@kernel.org
In-Reply-To: <aczW7MZBqsCXw4gk@intel.com>
On Wed, 2026-04-01 at 16:27 +0800, Chao Gao wrote:
> And I assume we don't need WARN_ON_ONCE(PAGE_SIZE != SZ_4K) since this is
> unlikely to break soon and shouldn't be very hard to debug and fix if it
> does.
+1
^ permalink raw reply
* Re: [PATCH v7 07/22] coco/tdx-host: Implement firmware upload sysfs ABI for TDX module updates
From: Edgecombe, Rick P @ 2026-04-11 0:26 UTC (permalink / raw)
To: kvm@vger.kernel.org, linux-coco@lists.linux.dev,
linux-kernel@vger.kernel.org, Gao, Chao
Cc: Li, Xiaoyao, Huang, Kai, Zhao, Yan Y, dave.hansen@linux.intel.com,
kas@kernel.org, seanjc@google.com, binbin.wu@linux.intel.com,
pbonzini@redhat.com, Chatre, Reinette, Verma, Vishal L,
nik.borisov@suse.com, mingo@redhat.com, Weiny, Ira,
tony.lindgren@linux.intel.com, Annapurve, Vishal,
sagis@google.com, hpa@zytor.com, tglx@kernel.org,
paulmck@kernel.org, bp@alien8.de, yilun.xu@linux.intel.com,
dan.j.williams@intel.com, x86@kernel.org
In-Reply-To: <20260331124214.117808-8-chao.gao@intel.com>
On Tue, 2026-03-31 at 05:41 -0700, Chao Gao wrote:
> +static enum fw_upload_err tdx_fw_write(struct fw_upload *fwl, const u8 *data,
> + u32 offset, u32 size, u32 *written)
> +{
> + int ret;
> +
> + /*
> + * tdx_fw_write() always processes all data on the first call with
> + * offset == 0. Since it never returns partial success (it either
> + * succeeds completely or fails), there is no subsequent call with
> + * non-zero offsets.
> + */
> + WARN_ON_ONCE(offset);
> + ret = seamldr_install_module(data, size);
> + switch (ret) {
> + case 0:
> + *written = size;
> + return FW_UPLOAD_ERR_NONE;
> + case -EBUSY:
> + return FW_UPLOAD_ERR_BUSY;
> + case -EIO:
> + return FW_UPLOAD_ERR_HW_ERROR;
> + case -ENOMEM:
> + return FW_UPLOAD_ERR_RW_ERROR;
> + default:
> + return FW_UPLOAD_ERR_FW_INVALID;
It's hard to review whether these error codes match because the function doesn't
return them yet. Why isn't this patch just done later in the series after
everything is in place?
> + }
> +}
> +
^ permalink raw reply
* Re: [PATCH v7 05/22] x86/virt/seamldr: Add a helper to retrieve P-SEAMLDR information
From: Edgecombe, Rick P @ 2026-04-11 0:13 UTC (permalink / raw)
To: kvm@vger.kernel.org, linux-coco@lists.linux.dev,
linux-kernel@vger.kernel.org, Gao, Chao
Cc: Li, Xiaoyao, Huang, Kai, Zhao, Yan Y, dave.hansen@linux.intel.com,
kas@kernel.org, seanjc@google.com, binbin.wu@linux.intel.com,
pbonzini@redhat.com, Chatre, Reinette, Verma, Vishal L,
nik.borisov@suse.com, mingo@redhat.com, Weiny, Ira,
tony.lindgren@linux.intel.com, Annapurve, Vishal,
sagis@google.com, hpa@zytor.com, tglx@kernel.org,
paulmck@kernel.org, bp@alien8.de, yilun.xu@linux.intel.com,
dan.j.williams@intel.com, x86@kernel.org
In-Reply-To: <20260331124214.117808-6-chao.gao@intel.com>
On Tue, 2026-03-31 at 05:41 -0700, Chao Gao wrote:
> P-SEAMLDR returns its information such as version number, in response to
> the SEAMLDR.INFO SEAMCALL.
>
> This information is useful for userspace. For example, the admin can decide
> which TDX module versions are compatible with the P-SEAMLDR according to
> the P-SEAMLDR version.
We know what this will be used for, so it would have been nice to say it. Like:
This information is useful for userspace admins to decide.... Future changes
will expose it to userspace...
Plus the remaining updates part is not hinted at.
>
> Add a helper to retrieve P-SEAMLDR information in preparation for
> exposing P-SEAMLDR version and other necessary information to userspace.
> Export the new kAPI for use by tdx-host.ko.
>
> Note that there are two distinct P-SEAMLDR APIs with similar names:
>
> SEAMLDR.INFO: Returns a SEAMLDR_INFO structure containing SEAMLDR
> information such as version and remaining updates.
>
> SEAMLDR.SEAMINFO: Returns a SEAMLDR_SEAMINFO structure containing SEAM
> and system information such as Convertible Memory
> Regions (CMRs) and number of CPUs and sockets.
>
> The former is used here.
>
> For details, see "Intel® Trust Domain Extensions - SEAM Loader (SEAMLDR)
> Interface Specification".
>
> Signed-off-by: Chao Gao <chao.gao@intel.com>
> Reviewed-by: Kai Huang <kai.huang@intel.com>
> Reviewed-by: Kiryl Shutsemau (Meta) <kas@kernel.org>
> Reviewed-by: Xiaoyao Li <xiaoyao.li@intel.com>
Either way,
Reviewed-by: Rick Edgecombe <rick.p.edgecombe@intel.com>
^ permalink raw reply
* Re: [PATCH v7 04/22] x86/virt/seamldr: Introduce a wrapper for P-SEAMLDR SEAMCALLs
From: Edgecombe, Rick P @ 2026-04-10 23:58 UTC (permalink / raw)
To: kvm@vger.kernel.org, linux-coco@lists.linux.dev,
linux-kernel@vger.kernel.org, Gao, Chao,
linux-rt-devel@lists.linux.dev
Cc: bigeasy@linutronix.de, Li, Xiaoyao, Huang, Kai, Zhao, Yan Y,
dave.hansen@linux.intel.com, rostedt@goodmis.org, kas@kernel.org,
seanjc@google.com, binbin.wu@linux.intel.com, pbonzini@redhat.com,
Chatre, Reinette, Verma, Vishal L, nik.borisov@suse.com,
mingo@redhat.com, Weiny, Ira, clrkwllms@kernel.org,
tony.lindgren@linux.intel.com, Annapurve, Vishal,
sagis@google.com, hpa@zytor.com, tglx@kernel.org,
paulmck@kernel.org, bp@alien8.de, yilun.xu@linux.intel.com,
dan.j.williams@intel.com, x86@kernel.org
In-Reply-To: <20260331124214.117808-5-chao.gao@intel.com>
On Tue, 2026-03-31 at 05:41 -0700, Chao Gao wrote:
> The TDX architecture uses the "SEAMCALL" instruction to communicate with
> SEAM mode software. Right now, the only SEAM mode software that the kernel
> communicates with is the TDX module. But, there is actually another
> component that runs in SEAM mode but it is separate from the TDX module:
> the persistent SEAM loader or "P-SEAMLDR". Right now, the only component
> that communicates with it is the BIOS which loads the TDX module itself at
> boot. But, to support updating the TDX module, the kernel now needs to be
> able to talk to it.
>
> P-SEAMLDR SEAMCALLs differ from TDX module SEAMCALLs in areas such as
> concurrency requirements. Add a P-SEAMLDR wrapper to handle these
> differences and prepare for implementing concrete functions.
>
> Use seamcall_prerr() (not '_ret') because current P-SEAMLDR calls do not
> use any output registers other than RAX.
>
> Note that unlike P-SEAMLDR, there is also a non-persistent SEAM loader
> ("NP-SEAMLDR"). This is an authenticated code module (ACM) that is not
> callable at runtime. Only BIOS launches it to load P-SEAMLDR at boot;
> the kernel does not need to interact with it for runtime update.
>
> Signed-off-by: Chao Gao <chao.gao@intel.com>
> Reviewed-by: Binbin Wu <binbin.wu@linux.intel.com>
> Reviewed-by: Kai Huang <kai.huang@intel.com>
> Reviewed-by: Kiryl Shutsemau (Meta) <kas@kernel.org>
> Reviewed-by: Xiaoyao Li <xiaoyao.li@intel.com>
> Link: https://cdrdv2.intel.com/v1/dl/getContent/733582 # [1]
Reviewed-by: Rick Edgecombe <rick.p.edgecombe@intel.com>
^ permalink raw reply
* Re: [PATCH v7 01/22] x86/virt/tdx: Move low level SEAMCALL helpers out of <asm/tdx.h>
From: Edgecombe, Rick P @ 2026-04-10 23:42 UTC (permalink / raw)
To: kvm@vger.kernel.org, linux-coco@lists.linux.dev,
linux-kernel@vger.kernel.org, Gao, Chao
Cc: Li, Xiaoyao, Huang, Kai, Zhao, Yan Y, dave.hansen@linux.intel.com,
kas@kernel.org, seanjc@google.com, binbin.wu@linux.intel.com,
pbonzini@redhat.com, Chatre, Reinette, Verma, Vishal L,
nik.borisov@suse.com, mingo@redhat.com, Weiny, Ira,
tony.lindgren@linux.intel.com, Annapurve, Vishal,
sagis@google.com, Duan, Zhenzhong, tglx@kernel.org,
paulmck@kernel.org, hpa@zytor.com, bp@alien8.de,
yilun.xu@linux.intel.com, dan.j.williams@intel.com,
x86@kernel.org
In-Reply-To: <20260331124214.117808-2-chao.gao@intel.com>
On Tue, 2026-03-31 at 05:41 -0700, Chao Gao wrote:
> From: Kai Huang <kai.huang@intel.com>
>
> TDX host core code implements three seamcall*() helpers to make SEAMCALL
This patch already has Dave's ack. So nits are not really in order here. But it
bothers me that the very first sentence of the series has what looks like a
typo. Not the best first impression. Should be "SEAMCALLs"?
> to the TDX module. Currently, they are implemented in <asm/tdx.h> and
> are exposed to other kernel code which includes <asm/tdx.h>.
^ permalink raw reply
* [PATCH v2] KVM: TDX: Fix x2APIC MSR handling in tdx_has_emulated_msr()
From: Rick Edgecombe @ 2026-04-10 23:26 UTC (permalink / raw)
To: kas, kvm, linux-coco, linux-kernel, pbonzini, seanjc, binbin.wu,
dmaluka
Cc: rick.p.edgecombe
Rework tdx_has_emulated_msr() to explicitly enumerate the x2APIC MSRs
that KVM can emulate, instead of trying to enumerate the MSRs that KVM
cannot emulate. Drop the inner switch and list the emulatable x2APIC
registers directly in the outer switch's "return true" block.
The old code had multiple bugs in the x2APIC range handling.
X2APIC_MSR(APIC_ISR + APIC_ISR_NR) was incorrect because APIC_ISR_NR is
0x8, not 0x80, so the X2APIC_MSR() shift lost the lower bits, collapsing
each range to a single MSR. IA32_X2APIC_SELF_IPI was also missing from
the non-emulatable list.
KVM has no visibility into whether or not a guest has enabled #VE
reduction, which changes which MSRs the TDX-Module handles itself versus
triggering a #VE for the guest to make a TDVMCALL. So maintaining a list
of non-emulatable MSRs is fragile. Listing only the MSRs KVM can always
emulate sidesteps the problem.
Suggested-by: Sean Christopherson <seanjc@google.com>
Reported-by: Dmytro Maluka <dmaluka@chromium.org>
Fixes: dd50294f3e3c ("KVM: TDX: Implement callbacks for MSR operations")
Assisted-by: Claude:claude-opus-4-6
[based on a diff from Sean, but added missed LVTCMCI case, log]
Signed-off-by: Rick Edgecombe <rick.p.edgecombe@intel.com>
---
Thanks to Dmytro for finding this. They said to feel free to take this
over, so here is another version with Sean's suggestions. Tested in the
TDX CI.
In Sean's suggestion LVTCMCI was missed, so it's added here.
arch/x86/kvm/vmx/tdx.c | 36 ++++++++++++++++++++----------------
1 file changed, 20 insertions(+), 16 deletions(-)
diff --git a/arch/x86/kvm/vmx/tdx.c b/arch/x86/kvm/vmx/tdx.c
index 1e47c194af53..76ab6805ab29 100644
--- a/arch/x86/kvm/vmx/tdx.c
+++ b/arch/x86/kvm/vmx/tdx.c
@@ -2116,23 +2116,27 @@ bool tdx_has_emulated_msr(u32 index)
case MSR_IA32_MC0_CTL2 ... MSR_IA32_MCx_CTL2(KVM_MAX_MCE_BANKS) - 1:
/* MSR_IA32_MCx_{CTL, STATUS, ADDR, MISC, CTL2} */
case MSR_KVM_POLL_CONTROL:
+ /*
+ * x2APIC registers that are virtualized by the CPU can't be
+ * emulated, KVM doesn't have access to the virtual APIC page.
+ */
+ case X2APIC_MSR(APIC_ID):
+ case X2APIC_MSR(APIC_LVR):
+ case X2APIC_MSR(APIC_LDR):
+ case X2APIC_MSR(APIC_SPIV):
+ case X2APIC_MSR(APIC_ESR):
+ case X2APIC_MSR(APIC_LVTCMCI):
+ case X2APIC_MSR(APIC_ICR):
+ case X2APIC_MSR(APIC_LVTT):
+ case X2APIC_MSR(APIC_LVTTHMR):
+ case X2APIC_MSR(APIC_LVTPC):
+ case X2APIC_MSR(APIC_LVT0):
+ case X2APIC_MSR(APIC_LVT1):
+ case X2APIC_MSR(APIC_LVTERR):
+ case X2APIC_MSR(APIC_TMICT):
+ case X2APIC_MSR(APIC_TMCCT):
+ case X2APIC_MSR(APIC_TDCR):
return true;
- case APIC_BASE_MSR ... APIC_BASE_MSR + 0xff:
- /*
- * x2APIC registers that are virtualized by the CPU can't be
- * emulated, KVM doesn't have access to the virtual APIC page.
- */
- switch (index) {
- case X2APIC_MSR(APIC_TASKPRI):
- case X2APIC_MSR(APIC_PROCPRI):
- case X2APIC_MSR(APIC_EOI):
- case X2APIC_MSR(APIC_ISR) ... X2APIC_MSR(APIC_ISR + APIC_ISR_NR):
- case X2APIC_MSR(APIC_TMR) ... X2APIC_MSR(APIC_TMR + APIC_ISR_NR):
- case X2APIC_MSR(APIC_IRR) ... X2APIC_MSR(APIC_IRR + APIC_ISR_NR):
- return false;
- default:
- return true;
- }
default:
return false;
}
--
2.53.0
^ permalink raw reply related
* Re: [PATCH v2 00/16] fs,x86/resctrl: Add kernel-mode (e.g., PLZA) support to the resctrl subsystem
From: Moger, Babu @ 2026-04-10 22:52 UTC (permalink / raw)
To: Reinette Chatre, Babu Moger, corbet@lwn.net, tony.luck@intel.com,
Dave.Martin@arm.com, james.morse@arm.com, tglx@kernel.org,
mingo@redhat.com, bp@alien8.de, dave.hansen@linux.intel.com
Cc: skhan@linuxfoundation.org, x86@kernel.org, hpa@zytor.com,
peterz@infradead.org, juri.lelli@redhat.com,
vincent.guittot@linaro.org, dietmar.eggemann@arm.com,
rostedt@goodmis.org, bsegall@google.com, mgorman@suse.de,
vschneid@redhat.com, kas@kernel.org, rick.p.edgecombe@intel.com,
akpm@linux-foundation.org, pmladek@suse.com,
rdunlap@infradead.org, dapeng1.mi@linux.intel.com,
kees@kernel.org, elver@google.com, paulmck@kernel.org,
lirongqing@baidu.com, safinaskar@gmail.com, fvdl@google.com,
seanjc@google.com, pawan.kumar.gupta@linux.intel.com,
xin@zytor.com, tiala@microsoft.com, chang.seok.bae@intel.com,
Lendacky, Thomas, elena.reshetova@intel.com,
linux-doc@vger.kernel.org, linux-kernel@vger.kernel.org,
linux-coco@lists.linux.dev, kvm@vger.kernel.org,
eranian@google.com, peternewman@google.com
In-Reply-To: <68a551ea-d9f0-436a-9bef-e35fd027bb95@intel.com>
Hi Reinette,
On 4/9/2026 10:41 PM, Reinette Chatre wrote:
> Hi Babu,
>
> On 4/9/26 4:42 PM, Moger, Babu wrote:
>> Hi Reinette,
>>
>> On 4/9/2026 3:50 PM, Reinette Chatre wrote:
>>> Hi Babu,
>>>
>>> On 4/9/26 11:05 AM, Moger, Babu wrote:
>>>> On 4/9/2026 12:26 PM, Reinette Chatre wrote:
>>>>> On 4/9/26 10:19 AM, Moger, Babu wrote:
>>>>>> On 4/8/2026 6:41 PM, Reinette Chatre wrote:
>>>>>
>>>>>>> When the user switches to either "global_assign_ctrl_inherit_mon_per_cpu" or
>>>>>>> 'global_assign_ctrl_assign_mon_per_cpu" then "info/kernel_mode_assignment" is created
>>>>>>> (or made visible to user space) and is expected to point to default group.
>>>>>>> User can change the group using "info/kernel_mode_assignment" at this point.
>>>>>>>
>>>>>>> If the current scenario is below ...
>>>>>>> # cat info/kernel_mode
>>>>>>> [global_assign_ctrl_inherit_mon_per_cpu]
>>>>>>> inherit_ctrl_and_mon
>>>>>>> global_assign_ctrl_assign_mon_per_cpu
>>>>>>>
>>>>>>> ... then "info/kernel_mode_assignment" will exist but what it should contain if
>>>>>>> user switches mode at this point may be up for discussion.
>>>>>>>
>>>>>>> option 1)
>>>>>>> When user switches mode to "global_assign_ctrl_assign_mon_per_cpu" then
>>>>>>> the resource group in "info/kernel_mode_assignment" is reset to the
>>>>>>> default group and all CPUs PLZA state reset to match. The kernel_mode_cpus
>>>>>>> and kernel_mode_cpuslist files become visible in default resource group
>>>>>>> and they contain "all online CPUs".
>>>>>>>
>>>>>>> option 2)
>>>>>>> When user switches mode to "global_assign_ctrl_assign_mon_per_cpu" then
>>>>>>> the resource group in "info/kernel_mode_assignment" is kept and all
>>>>>>> CPUs PLZA state set to match it while also keeping the current
>>>>>>> values of that resource group's kernel_mode_cpus and kernel_mode_cpuslist
>>>>>>> files.
>>>>>>>
>>>>>>> I am leaning towards "option 1" to keep it consistent with a switch from
>>>>>>> "inherit_ctrl_and_mon" and being deterministic about how a mode is started with
>>>>>>
>>>>>> Yes. The "option 1" seems appropriate.
>>>>>>
>>>>>>> a clean slate. What are your thoughts? What would be use case where a user would
>>>>>>> want to switch between "global_assign_ctrl_inherit_mon_per_cpu" and
>>>>>>> "global_assign_ctrl_assign_mon_per_cpu" to just switch rmid_en on and off?
>>>>>>
>>>>>>
>>>>>> This is a bit tricky.
>>>>>>
>>>>>> Currently, our requirement is to have a CTRL_MON group for
>>>>>> global_assign_ctrl_inherit_mon_per_cpu. In this scenario, we use the
>>>>>> group’s CLOSID for PLZA configuration, and RMID is not used (rmid_en
>>>>>> = 0) when setting up PLZA.
>>>>>>
>>>>>> Our requirement is also to have a CTRL_MON/MON group for
>>>>>> global_assign_ctrl_assign_mon_per_cpu. In this case as well, the
>>>>>> group’s CLOSID and RMID (rmid_en = 1) both are used configure PLZA.
>>>>>
>>>>> ah, right. Good catch.
>>>>>
>>>>>>
>>>>>> Actually, we should not allow these changes from
>>>>>> global_assign_ctrl_inherit_mon_per_cpu to
>>>>>> global_assign_ctrl_assign_mon_per_cpu or visa versa.
>>>>>
>>>>> resctrl could allow it but as part of the switch it resets the "kernel mode group" to
>>>>> be the default group every time? This would be the "option 1" above.
>>>>
>>>> Other options.
>>>>
>>>> Allow global_assign_ctrl_inherit_mon_per_cpu -> global_assign_ctrl_assign_mon_per_cpu. As part of the switch, reset the "kernel mode group" to the default group.
>>>>
>>>> Allow global_assign_ctrl_assign_mon_per_cpu -> global_assign_ctrl_inherit_mon_per_cpu. In this case switch
>>>> to CTRL_MON/MON -> CTRL_MON.
>>>>
>>>
>>> ok. Could you please return the courtesy of providing feedback on the
>>> suggestion you are responding to and also include the motivation why your
>>> suggestion is the better option?
>>
>> Yea. Sure.
>>
>> We need to allow the switch between the modes. Otherwise only way to reset is to remount the resctrl filesystem. That is not a good option.
>>
>> Allow global_assign_ctrl_inherit_mon_per_cpu -> global_assign_ctrl_assign_mon_per_cpu. As part of the switch, reset the "kernel mode group" to the default group.
>>
>> This option is same as you suggested.
>>
>> Allow global_assign_ctrl_assign_mon_per_cpu -> global_assign_ctrl_inherit_mon_per_cpu. In this case switch
>> to CTRL_MON/MON -> CTRL_MON. This option basically disables monitor (rmid_en=0). It is less disruptive. Move is between child group to parent group.
>
> ok. I am concerned that this creates an inconsistent interface. Specifically, sometimes
> when switching the mode the kernel group will reset and sometimes it won't. This inconsistency
> may be more apparent when writing the user documentation as part of this work. If you are
> able to clearly explain how this resctrl fs interface behaves (this cannot be about PLZA
> internals as above) then this could work.
>
Yes, certainly. I’ll begin work on v3, and we can continue refining it
as we move forward.
Thanks
Babu
^ permalink raw reply
* Re: [PATCH v2 1/6] KVM: x86: Add dedicated storage for guest RIP
From: Chang S. Bae @ 2026-04-10 18:43 UTC (permalink / raw)
To: Sean Christopherson, Paolo Bonzini, Kiryl Shutsemau
Cc: kvm, x86, linux-coco, linux-kernel
In-Reply-To: <20260409224236.2021562-2-seanjc@google.com>
On 4/9/2026 3:42 PM, Sean Christopherson wrote:
> Add kvm_vcpu_arch.rip to track guest RIP instead of including it in the
> generic regs[] array. Decoupling RIP from regs[] will allow using a
> *completely* arbitrary index for RIP, as opposed to the mostly-arbitrary
> index that is currently used. That in turn will allow using indices
> 16-31 to track R16-R31 that are coming with APX....
> - unsigned long regs[NR_VCPU_REGS];
> + unsigned long regs[NR_VCPU_GENERAL_PURPOSE_REGS];
> + unsigned long rip;
Digging the history, this effectively reverts part of changes in
commit 5fdbf9765b7b ("KVM: x86: accessors for guest registers")
which had
- unsigned long regs[NR_VCPU_REGS];
- unsigned long rip; /* needs vcpu_load_rsp_rip() */
+ /*
+ * rip and regs accesses must go through
+ * kvm_{register,rip}_{read,write} functions.
+ */
+ unsigned long regs[NR_VCPU_REGS];
But its changelog didn't go into much detail about this change. I could
only relate to vcpu_load_rsp_rip() which might establish perception
coupling RSP with RIP back then.
In any case, it doesn't matter. I think this patch makes a clear
improvement - for example, now aligns with _regs[NR_EMULATOR_GPRS] in
struct x86_emulate_ctxt for general consistency.
Indeed, this and the whole series paves the way for APX. Appreciate for
the time and effort!
Reviewed-by: Chang S. Bae <chang.seok.bae@intel.com>
^ permalink raw reply
* Re: [PATCH v13 46/48] KVM: arm64: Expose KVM_ARM_VCPU_REC to user space
From: Steven Price @ 2026-04-10 15:12 UTC (permalink / raw)
To: Suzuki K Poulose, kvm, kvmarm
Cc: Catalin Marinas, Marc Zyngier, Will Deacon, James Morse,
Oliver Upton, Zenghui Yu, linux-arm-kernel, linux-kernel,
Joey Gouly, Alexandru Elisei, Christoffer Dall, Fuad Tabba,
linux-coco, Ganapatrao Kulkarni, Gavin Shan, Shanker Donthineni,
Alper Gun, Aneesh Kumar K . V, Emi Kisanuki, Vishal Annapurve
In-Reply-To: <79564d59-032e-40c9-b4eb-f79f805b8238@arm.com>
On 19/03/2026 17:36, Suzuki K Poulose wrote:
> On 18/03/2026 15:54, Steven Price wrote:
>> Increment KVM_VCPU_MAX_FEATURES to expose the new capability to user
>> space.
>>
>> Signed-off-by: Steven Price <steven.price@arm.com>
>> Reviewed-by: Gavin Shan <gshan@redhat.com>
>
> Not needed any more as we don't need the VCPU feature.
This patch caused so much bother with rebasing in the past, I'd
completely forgotten it isn't actually needed! Thanks for spotting!
Thanks,
Steve
> Cheers
> Suzuki
>
>
>
>> ---
>> Changes since v8:
>> * Since NV is now merged and enabled, this no longer conflicts with it.
>> ---
>> arch/arm64/include/asm/kvm_host.h | 2 +-
>> 1 file changed, 1 insertion(+), 1 deletion(-)
>>
>> diff --git a/arch/arm64/include/asm/kvm_host.h b/arch/arm64/include/
>> asm/kvm_host.h
>> index 1d5fb001408c..b02f97de4436 100644
>> --- a/arch/arm64/include/asm/kvm_host.h
>> +++ b/arch/arm64/include/asm/kvm_host.h
>> @@ -40,7 +40,7 @@
>> #define KVM_MAX_VCPUS VGIC_V3_MAX_CPUS
>> -#define KVM_VCPU_MAX_FEATURES 9
>> +#define KVM_VCPU_MAX_FEATURES 10
>> #define KVM_VCPU_VALID_FEATURES (BIT(KVM_VCPU_MAX_FEATURES) - 1)
>> #define KVM_REQ_SLEEP \
>
^ permalink raw reply
* Re: [PATCH v13 37/48] arm64: RMI: Prevent Device mappings for Realms
From: Steven Price @ 2026-04-10 15:12 UTC (permalink / raw)
To: Joey Gouly
Cc: kvm, kvmarm, Catalin Marinas, Marc Zyngier, Will Deacon,
James Morse, Oliver Upton, Suzuki K Poulose, Zenghui Yu,
linux-arm-kernel, linux-kernel, Alexandru Elisei,
Christoffer Dall, Fuad Tabba, linux-coco, Ganapatrao Kulkarni,
Gavin Shan, Shanker Donthineni, Alper Gun, Aneesh Kumar K . V,
Emi Kisanuki, Vishal Annapurve
In-Reply-To: <20260319102734.GC3942350@e124191.cambridge.arm.com>
On 19/03/2026 10:27, Joey Gouly wrote:
> On Wed, Mar 18, 2026 at 03:54:01PM +0000, Steven Price wrote:
>> Physical device assignment is not supported by RMM v1.0, so it
>
> But we're targetting 2.0 now!
Whoops ;) In my head it's still "in the future" but I guess the "future"
is now!
I'll update this to say:
Physical device assignment is not yet supported. RMM v2.0 does add the
relevant APIs, but device assignment is a big topic so will be handled
in a future patch series. For now prevent device mappings when the guest
is a realm.
Thanks,
Steve
> I guess just change it to something about device support being a later feature.
>
> Thanks,
> Joey
>
>> doesn't make much sense to allow device mappings within the realm.
>> Prevent them when the guest is a realm.
>>
>> Signed-off-by: Steven Price <steven.price@arm.com>
>> ---
>> Changes from v6:
>> * Fix the check in user_mem_abort() to prevent all pages that are not
>> guest_memfd() from being mapped into the protected half of the IPA.
>> Changes from v5:
>> * Also prevent accesses in user_mem_abort()
>> ---
>> arch/arm64/kvm/mmu.c | 13 +++++++++++++
>> 1 file changed, 13 insertions(+)
>>
>> diff --git a/arch/arm64/kvm/mmu.c b/arch/arm64/kvm/mmu.c
>> index ad1300f366df..7d7caab8f573 100644
>> --- a/arch/arm64/kvm/mmu.c
>> +++ b/arch/arm64/kvm/mmu.c
>> @@ -1222,6 +1222,10 @@ int kvm_phys_addr_ioremap(struct kvm *kvm, phys_addr_t guest_ipa,
>> if (is_protected_kvm_enabled())
>> return -EPERM;
>>
>> + /* We don't support mapping special pages into a Realm */
>> + if (kvm_is_realm(kvm))
>> + return -EPERM;
>> +
>> size += offset_in_page(guest_ipa);
>> guest_ipa &= PAGE_MASK;
>>
>> @@ -1965,6 +1969,15 @@ static int user_mem_abort(struct kvm_vcpu *vcpu, phys_addr_t fault_ipa,
>> return 1;
>> }
>>
>> + /*
>> + * For now we shouldn't be hitting protected addresses because they are
>> + * handled in private_memslot_fault(). In the future this check may be
>> + * relaxed to support e.g. protected devices.
>> + */
>> + if (vcpu_is_rec(vcpu) &&
>> + kvm_gpa_from_fault(kvm, fault_ipa) == fault_ipa)
>> + return -EINVAL;
>> +
>> if (nested)
>> adjust_nested_fault_perms(nested, &prot, &writable);
>>
>> --
>> 2.43.0
>>
>>
^ permalink raw reply
* Re: [PATCH v13 32/48] arm64: Don't expose stolen time for realm guests
From: Steven Price @ 2026-04-10 15:12 UTC (permalink / raw)
To: Suzuki K Poulose, kvm, kvmarm
Cc: Catalin Marinas, Marc Zyngier, Will Deacon, James Morse,
Oliver Upton, Zenghui Yu, linux-arm-kernel, linux-kernel,
Joey Gouly, Alexandru Elisei, Christoffer Dall, Fuad Tabba,
linux-coco, Ganapatrao Kulkarni, Gavin Shan, Shanker Donthineni,
Alper Gun, Aneesh Kumar K . V, Emi Kisanuki, Vishal Annapurve
In-Reply-To: <4a2ab5b0-dec6-4ee2-8790-f27c2501f653@arm.com>
On 30/03/2026 11:52, Suzuki K Poulose wrote:
> On 18/03/2026 15:53, Steven Price wrote:
>> It doesn't make much sense as a realm guest wouldn't want to trust the
>> host. It will also need some extra work to ensure that KVM will only
>> attempt to write into a shared memory region. So for now just disable
>> it.
>>
>> Reviewed-by: Suzuki K Poulose <suzuki.poulose@arm.com>
>> Reviewed-by: Gavin Shan <gshan@redhat.com>
>> Signed-off-by: Steven Price <steven.price@arm.com>
>> ---
>> Changes since v7:
>> * Update the documentation to add a note about stolen time being
>> unavailable in a realm.
>> ---
>> Documentation/virt/kvm/api.rst | 3 +++
>> arch/arm64/kvm/arm.c | 5 ++++-
>> 2 files changed, 7 insertions(+), 1 deletion(-)
>>
>> diff --git a/Documentation/virt/kvm/api.rst b/Documentation/virt/kvm/
>> api.rst
>> index bc180c853faf..70911fe6d435 100644
>> --- a/Documentation/virt/kvm/api.rst
>> +++ b/Documentation/virt/kvm/api.rst
>> @@ -9240,6 +9240,9 @@ is supported, than the other should as well and
>> vice versa. For arm64
>> see Documentation/virt/kvm/devices/vcpu.rst "KVM_ARM_VCPU_PVTIME_CTRL".
>> For x86 see Documentation/virt/kvm/x86/msr.rst "MSR_KVM_STEAL_TIME".
>> +Note that steal time accounting is not available when a guest is
>> running
>> +within a Arm CCA realm (machine type KVM_VM_TYPE_ARM_REALM).
>> +
>> 8.25 KVM_CAP_S390_DIAG318
>> -------------------------
>> diff --git a/arch/arm64/kvm/arm.c b/arch/arm64/kvm/arm.c
>> index 61182eb0cf70..7d92ddb06460 100644
>> --- a/arch/arm64/kvm/arm.c
>> +++ b/arch/arm64/kvm/arm.c
>> @@ -469,7 +469,10 @@ int kvm_vm_ioctl_check_extension(struct kvm *kvm,
>> long ext)
>> r = system_supports_mte();
>> break;
>> case KVM_CAP_STEAL_TIME:
>> - r = kvm_arm_pvtime_supported();
>> + if (kvm_is_realm(kvm))
>> + r = 0;
>> + else
>> + r = kvm_arm_pvtime_supported();
>
> Could this be handled in kvm_realm_ext_allowed() ?
Indeed it is already handled there. I'm not sure how I missed that, but
this patch is completely unnecessary now. Stolen time was an extension
that I knew about (having added it in the first place) and needed
disabling because it's implemented with the assumption that the host can
write into the guest.
In theory with some extra work it could be supported in a realm guest,
but it requires some extra plumbing to ensure the structures end up in
shared memory. My intention is that this can be revisited once the basic
CCA support is in.
Thanks,
Steve
> Suzuki
>
>
>> break;
>> case KVM_CAP_ARM_EL1_32BIT:
>> r = cpus_have_final_cap(ARM64_HAS_32BIT_EL1);
>
^ permalink raw reply
* Re: [PATCH v13 24/48] arm64: RMI: Allow populating initial contents
From: Steven Price @ 2026-04-10 15:12 UTC (permalink / raw)
To: Suzuki K Poulose, kvm, kvmarm
Cc: Catalin Marinas, Marc Zyngier, Will Deacon, James Morse,
Oliver Upton, Zenghui Yu, linux-arm-kernel, linux-kernel,
Joey Gouly, Alexandru Elisei, Christoffer Dall, Fuad Tabba,
linux-coco, Ganapatrao Kulkarni, Gavin Shan, Shanker Donthineni,
Alper Gun, Aneesh Kumar K . V, Emi Kisanuki, Vishal Annapurve
In-Reply-To: <7ab3dcd2-23b9-45a6-84ef-9617c4614c0a@arm.com>
On 23/03/2026 11:32, Suzuki K Poulose wrote:
> On 18/03/2026 15:53, Steven Price wrote:
>> The VMM needs to populate the realm with some data before starting (e.g.
>> a kernel and initrd). This is measured by the RMM and used as part of
>> the attestation later on.
>>
>> Signed-off-by: Steven Price <steven.price@arm.com>
>> ---
>> Changes since v12:
>> * The ioctl now updates the structure with the amount populated rather
>> than returning this through the ioctl return code.
>> * Use the new RMM v2.0 range based RMI calls.
>> * Adapt to upstream changes in kvm_gmem_populate().
>> Changes since v11:
>> * The multiplex CAP is gone and there's a new ioctl which makes use of
>> the generic kvm_gmem_populate() functionality.
>> Changes since v7:
>> * Improve the error codes.
>> * Other minor changes from review.
>> Changes since v6:
>> * Handle host potentially having a larger page size than the RMM
>> granule.
>> * Drop historic "par" (protected address range) from
>> populate_par_region() - it doesn't exist within the current
>> architecture.
>> * Add a cond_resched() call in kvm_populate_realm().
>> Changes since v5:
>> * Refactor to use PFNs rather than tracking struct page in
>> realm_create_protected_data_page().
>> * Pull changes from a later patch (in the v5 series) for accessing
>> pages from a guest memfd.
>> * Do the populate in chunks to avoid holding locks for too long and
>> triggering RCU stall warnings.
>> ---
>> arch/arm64/include/asm/kvm_rmi.h | 4 ++
>> arch/arm64/kvm/Kconfig | 1 +
>> arch/arm64/kvm/arm.c | 13 ++++
>> arch/arm64/kvm/rmi.c | 111 +++++++++++++++++++++++++++++++
>> 4 files changed, 129 insertions(+)
>>
>> diff --git a/arch/arm64/include/asm/kvm_rmi.h b/arch/arm64/include/
>> asm/kvm_rmi.h
>> index 46b0cbe6c202..bf663bb240c4 100644
>> --- a/arch/arm64/include/asm/kvm_rmi.h
>> +++ b/arch/arm64/include/asm/kvm_rmi.h
>> @@ -96,6 +96,10 @@ int kvm_rec_enter(struct kvm_vcpu *vcpu);
>> int kvm_rec_pre_enter(struct kvm_vcpu *vcpu);
>> int handle_rec_exit(struct kvm_vcpu *vcpu, int rec_run_status);
>> +struct kvm_arm_rmi_populate;
>> +
>> +int kvm_arm_rmi_populate(struct kvm *kvm,
>> + struct kvm_arm_rmi_populate *arg);
>> void kvm_realm_unmap_range(struct kvm *kvm,
>> unsigned long ipa,
>> unsigned long size,
>> diff --git a/arch/arm64/kvm/Kconfig b/arch/arm64/kvm/Kconfig
>> index 1cac6dfc0972..b495dfd3a8b4 100644
>> --- a/arch/arm64/kvm/Kconfig
>> +++ b/arch/arm64/kvm/Kconfig
>> @@ -39,6 +39,7 @@ menuconfig KVM
>> select GUEST_PERF_EVENTS if PERF_EVENTS
>> select KVM_GUEST_MEMFD
>> select KVM_GENERIC_MEMORY_ATTRIBUTES
>> + select HAVE_KVM_ARCH_GMEM_POPULATE
>> help
>> Support hosting virtualized guest machines.
>> diff --git a/arch/arm64/kvm/arm.c b/arch/arm64/kvm/arm.c
>> index badb94b398bc..43d05da7e694 100644
>> --- a/arch/arm64/kvm/arm.c
>> +++ b/arch/arm64/kvm/arm.c
>> @@ -2089,6 +2089,19 @@ int kvm_arch_vm_ioctl(struct file *filp,
>> unsigned int ioctl, unsigned long arg)
>> return -EFAULT;
>> return kvm_vm_ioctl_get_reg_writable_masks(kvm, &range);
>> }
>> + case KVM_ARM_RMI_POPULATE: {
>> + struct kvm_arm_rmi_populate req;
>> + int ret;
>> +
>> + if (!kvm_is_realm(kvm))
>> + return -ENXIO;
>> + if (copy_from_user(&req, argp, sizeof(req)))
>> + return -EFAULT;
>> + ret = kvm_arm_rmi_populate(kvm, &req);
>> + if (copy_to_user(argp, &req, sizeof(req)))
>> + return -EFAULT;
>> + return ret;
>> + }
>> default:
>> return -EINVAL;
>> }
>> diff --git a/arch/arm64/kvm/rmi.c b/arch/arm64/kvm/rmi.c
>> index 13eed6f0b9eb..b48f4e12e4e0 100644
>> --- a/arch/arm64/kvm/rmi.c
>> +++ b/arch/arm64/kvm/rmi.c
>> @@ -718,6 +718,80 @@ void kvm_realm_unmap_range(struct kvm *kvm,
>> unsigned long start,
>> realm_unmap_private_range(kvm, start, end, may_block);
>> }
>> +static int realm_create_protected_data_page(struct kvm *kvm,
>
> minor nit: To align with the RMM ABI, could we rename this to :
>
> realm_data_map_init() ?
Makes sense - I have to admit there is a bit of cruft in some of the
names because the spec keeps changing ;)
>> + unsigned long ipa,
>> + kvm_pfn_t dst_pfn,
>> + kvm_pfn_t src_pfn,
>> + unsigned long flags)
>> +{
>> + struct realm *realm = &kvm->arch.realm;
>> + phys_addr_t rd = virt_to_phys(realm->rd);
>> + phys_addr_t dst_phys, src_phys;
>> + int ret;
>> +
>> + dst_phys = __pfn_to_phys(dst_pfn);
>> + src_phys = __pfn_to_phys(src_pfn);
>> +
>> + if (delegate_page(dst_phys))
>> + return -ENXIO;
>> +
>> + ret = rmi_rtt_data_map_init(rd, dst_phys, ipa, src_phys, flags);
>> + if (RMI_RETURN_STATUS(ret) == RMI_ERROR_RTT) {
>> + /* Create missing RTTs and retry */
>> + int level = RMI_RETURN_INDEX(ret);
>> +
>> + KVM_BUG_ON(level == RMM_RTT_MAX_LEVEL, kvm);
>
> A buggy VMM can trigger this by calling RMI_POPULATE twice ? Should we
> return -ENXIO here rather ? The delegate_page() above could prevent
> normal cases, but is the VMM allowed to somehow trigger a "pfn" change
> backing the KVM ? Either way, this need not be Fatal ?
I believe KVM_BUG_ON is only fatal to the VM - it won't take down the
host (KVM_BUG_ON_DATA_CORRUPTION() is for that). AFAICT the
delegate_page() check should catch this case, but in the event that it
doesn't then this will kill the VM but otherwise recover.
Of course if it turns out there is a way of a VMM hitting this we might
need to tone down the error handling to something a bit quieter. But I
think it would also be good to get an understanding of how this can happen.
Thanks,
Steve
> Otherwise looks good to me.
>
> Suzuki
>
>
>> +
>> + ret = realm_create_rtt_levels(realm, ipa, level,
>> + RMM_RTT_MAX_LEVEL, NULL);
>> + if (!ret) {
>> + ret = rmi_rtt_data_map_init(rd, dst_phys, ipa, src_phys,
>> + flags);
>> + }
>> + }
>> +
>> + if (ret) {
>> + if (WARN_ON(undelegate_page(dst_phys))) {
>> + /* Undelegate failed, so we leak the page */
>> + get_page(pfn_to_page(dst_pfn));
>> + }
>> + }
>> +
>> + return ret;
>> +}
>> +
>> +static int populate_region_cb(struct kvm *kvm, gfn_t gfn, kvm_pfn_t pfn,
>> + struct page *src_page, void *opaque)
>> +{
>> + unsigned long data_flags = *(unsigned long *)opaque;
>> + phys_addr_t ipa = gfn_to_gpa(gfn);
>> +
>> + if (!src_page)
>> + return -EOPNOTSUPP;
>> +
>> + return realm_create_protected_data_page(kvm, ipa, pfn,
>> + page_to_pfn(src_page),
>> + data_flags);
>> +}
>> +
>> +static long populate_region(struct kvm *kvm,
>> + gfn_t base_gfn,
>> + unsigned long pages,
>> + u64 uaddr,
>> + unsigned long data_flags)
>> +{
>> + long ret = 0;
>> +
>> + mutex_lock(&kvm->slots_lock);
>> + mmap_read_lock(current->mm);
>> + ret = kvm_gmem_populate(kvm, base_gfn, u64_to_user_ptr(uaddr),
>> pages,
>> + populate_region_cb, &data_flags);
>> + mmap_read_unlock(current->mm);
>> + mutex_unlock(&kvm->slots_lock);
>> +
>> + return ret;
>> +}
>> +
>> enum ripas_action {
>> RIPAS_INIT,
>> RIPAS_SET,
>> @@ -815,6 +889,43 @@ static int realm_ensure_created(struct kvm *kvm)
>> return -ENXIO;
>> }
>> +int kvm_arm_rmi_populate(struct kvm *kvm,
>> + struct kvm_arm_rmi_populate *args)
>> +{
>> + unsigned long data_flags = 0;
>> + unsigned long ipa_start = args->base;
>> + unsigned long ipa_end = ipa_start + args->size;
>> + long pages_populated;
>> + int ret;
>> +
>> + if (args->reserved ||
>> + (args->flags & ~KVM_ARM_RMI_POPULATE_FLAGS_MEASURE) ||
>> + !IS_ALIGNED(ipa_start, PAGE_SIZE) ||
>> + !IS_ALIGNED(ipa_end, PAGE_SIZE) ||
>> + !IS_ALIGNED(args->source_uaddr, PAGE_SIZE))
>> + return -EINVAL;
>> +
>> + ret = realm_ensure_created(kvm);
>> + if (ret)
>> + return ret;
>> +
>> + if (args->flags & KVM_ARM_RMI_POPULATE_FLAGS_MEASURE)
>> + data_flags |= RMI_MEASURE_CONTENT;
>> +
>> + pages_populated = populate_region(kvm, gpa_to_gfn(ipa_start),
>> + args->size >> PAGE_SHIFT,
>> + args->source_uaddr, data_flags);
>> +
>> + if (pages_populated < 0)
>> + return pages_populated;
>> +
>> + args->size -= pages_populated << PAGE_SHIFT;
>> + args->source_uaddr += pages_populated << PAGE_SHIFT;
>> + args->base += pages_populated << PAGE_SHIFT;
>> +
>> + return 0;
>> +}
>> +
>> static void kvm_complete_ripas_change(struct kvm_vcpu *vcpu)
>> {
>> struct kvm *kvm = vcpu->kvm;
>
^ permalink raw reply
* Re: [PATCH v13 21/48] arm64: RMI: Handle RMI_EXIT_RIPAS_CHANGE
From: Steven Price @ 2026-04-10 15:12 UTC (permalink / raw)
To: Suzuki K Poulose, kvm, kvmarm
Cc: Catalin Marinas, Marc Zyngier, Will Deacon, James Morse,
Oliver Upton, Zenghui Yu, linux-arm-kernel, linux-kernel,
Joey Gouly, Alexandru Elisei, Christoffer Dall, Fuad Tabba,
linux-coco, Ganapatrao Kulkarni, Gavin Shan, Shanker Donthineni,
Alper Gun, Aneesh Kumar K . V, Emi Kisanuki, Vishal Annapurve
In-Reply-To: <59c83ace-c8be-4c71-99b6-cd5f085a3063@arm.com>
On 20/03/2026 11:15, Suzuki K Poulose wrote:
> Hi Steven
>
> On 18/03/2026 15:53, Steven Price wrote:
>> The guest can request that a region of it's protected address space is
>> switched between RIPAS_RAM and RIPAS_EMPTY (and back) using
>> RSI_IPA_STATE_SET. This causes a guest exit with the
>> RMI_EXIT_RIPAS_CHANGE code. We treat this as a request to convert a
>> protected region to unprotected (or back), exiting to the VMM to make
>> the necessary changes to the guest_memfd and memslot mappings. On the
>> next entry the RIPAS changes are committed by making RMI_RTT_SET_RIPAS
>> calls.
>>
>> The VMM may wish to reject the RIPAS change requested by the guest. For
>> now it can only do this by no longer scheduling the VCPU as we don't
>> currently have a usecase for returning that rejection to the guest, but
>> by postponing the RMI_RTT_SET_RIPAS changes to entry we leave the door
>> open for adding a new ioctl in the future for this purpose.
>
> I have been thinking about this. Today we do a KVM_MEMORY_FAULT_EXIT
> to the VMM to handle the request. The other option is to make this
> a KVM_EXIT_HYPERCALL with SMC_RSI_SET_RIPAS. But this would leak RSI
> implementation to the VMM. The advantage is that the VMM can provide
> a clear response RSI_ACCEPT vs RSI_REJECT (including accepting a partial
> range) and KVM can satisfy the RMI_RTT_SET_RIPAS.
Faking up hypercalls based on a very narrow interface seems like a bad
API design. The guest is of course perfectly allowed to have another
interface directly to the VMM to communicate intent.
> We may end up doing something similar for Device assignment too, where
> the VMM gets a chance to reject any inconsistent device mappings.
>
> Like you mentioned, the VMM can stop the Realm today as an alternate
> approach.
The intention here is that we start with a very basic interface and can
then extend it when we work out how to handle rejection. The ordering in
the kernel is explicitly to allow that extension. I get the impression
that device assignment has a much better justification for handling the
reject flow so it's probably best to leave that until device assignment
is implemented.
Thanks,
Steve
^ permalink raw reply
* Re: [PATCH v13 20/48] arm64: RMI: Handle realm enter/exit
From: Steven Price @ 2026-04-10 15:11 UTC (permalink / raw)
To: Suzuki K Poulose, kvm, kvmarm
Cc: Catalin Marinas, Marc Zyngier, Will Deacon, James Morse,
Oliver Upton, Zenghui Yu, linux-arm-kernel, linux-kernel,
Joey Gouly, Alexandru Elisei, Christoffer Dall, Fuad Tabba,
linux-coco, Ganapatrao Kulkarni, Gavin Shan, Shanker Donthineni,
Alper Gun, Aneesh Kumar K . V, Emi Kisanuki, Vishal Annapurve
In-Reply-To: <c4e30ad1-6fbe-448f-a9fe-6a2480581082@arm.com>
On 23/03/2026 10:03, Suzuki K Poulose wrote:
> On 20/03/2026 16:32, Steven Price wrote:
>> On 20/03/2026 14:08, Suzuki K Poulose wrote:
>>> On 18/03/2026 15:53, Steven Price wrote:
[...]
>>>> +int noinstr kvm_rec_enter(struct kvm_vcpu *vcpu)
>>>> +{
>>>> + struct realm_rec *rec = &vcpu->arch.rec;
>>>> + int ret;
>>>> +
>>>> + guest_state_enter_irqoff();
>>>> + ret = rmi_rec_enter(virt_to_phys(rec->rec_page),
>>>> + virt_to_phys(rec->run));
>>>
>>> In the normal VM case, we try to fixup some of the exits (e.g., GIC
>>> CPUIF register accesses) which may be applicable to Realms. Do we
>>> need such fixups here ? Given the cost of world switch, it is
>>> debatable whether it matters or not.
>>
>> I'm not really sure what you are referring to here. Can you point me at
>> the normal VM case? This function is the equivalent of
>> kvm_arm_vcpu_enter_exit().
>
> This happens via fixup_guest_exit() in either vhe/nvhe cases. The VGIC
> registers are emulated in the fast path for normal VMs (when trapping is
> enabled)
Ah, I see what you mean. Yes the VGIC emulation in theory could be
shortcut and speeded up slightly. I'd prefer to leave this sort of pure
optimisation until the basic support is merged to keep the size of the
series (vaguely) under control.
I think it would also be worth doing some benchmarking on a real
platform to see whether it really makes a meaningful difference (or
whether we need to push for an architectural change moving more
processing into the RMM).
Thanks,
Steve
^ permalink raw reply
* Re: [PATCH v13 15/48] arm64: RMI: RTT tear down
From: Steven Price @ 2026-04-10 15:11 UTC (permalink / raw)
To: Wei-Lin Chang, kvm, kvmarm
Cc: Catalin Marinas, Marc Zyngier, Will Deacon, James Morse,
Oliver Upton, Suzuki K Poulose, Zenghui Yu, linux-arm-kernel,
linux-kernel, Joey Gouly, Alexandru Elisei, Christoffer Dall,
Fuad Tabba, linux-coco, Ganapatrao Kulkarni, Gavin Shan,
Shanker Donthineni, Alper Gun, Aneesh Kumar K . V, Emi Kisanuki,
Vishal Annapurve
In-Reply-To: <5chegrtlkmet4n5u53wmjbpyflul2dy5o6lpevvxo3hycvyszx@3ogzwuvsbvr3>
On 21/03/2026 13:04, Wei-Lin Chang wrote:
> On Fri, Mar 20, 2026 at 04:12:48PM +0000, Steven Price wrote:
>> On 19/03/2026 17:35, Wei-Lin Chang wrote:
>>> On Wed, Mar 18, 2026 at 03:53:39PM +0000, Steven Price wrote:
>>>> The RMM owns the stage 2 page tables for a realm, and KVM must request
>>>> that the RMM creates/destroys entries as necessary. The physical pages
>>>> to store the page tables are delegated to the realm as required, and can
>>>> be undelegated when no longer used.
>>>>
>>>> Creating new RTTs is the easy part, tearing down is a little more
>>>> tricky. The result of realm_rtt_destroy() can be used to effectively
>>>> walk the tree and destroy the entries (undelegating pages that were
>>>> given to the realm).
>>>>
>>>> Signed-off-by: Steven Price <steven.price@arm.com>
>>>> ---
>>>> Changes since v12:
>>>> * Simplify some functions now we know RMM page size is the same as the
>>>> host's.
>>>> Changes since v11:
>>>> * Moved some code from earlier in the series to this one so that it's
>>>> added when it's first used.
>>>> Changes since v10:
>>>> * RME->RMI rename.
>>>> * Some code to handle freeing stage 2 PGD moved into this patch where
>>>> it belongs.
>>>> Changes since v9:
>>>> * Add a comment clarifying that root level RTTs are not destroyed until
>>>> after the RD is destroyed.
>>>> Changes since v8:
>>>> * Introduce free_rtt() wrapper which calls free_delegated_granule()
>>>> followed by kvm_account_pgtable_pages(). This makes it clear where an
>>>> RTT is being freed rather than just a delegated granule.
>>>> Changes since v6:
>>>> * Move rme_rtt_level_mapsize() and supporting defines from kvm_rme.h
>>>> into rme.c as they are only used in that file.
>>>> Changes since v5:
>>>> * Rename some RME_xxx defines to do with page sizes as RMM_xxx - they are
>>>> a property of the RMM specification not the RME architecture.
>>>> Changes since v2:
>>>> * Moved {alloc,free}_delegated_page() and ensure_spare_page() to a
>>>> later patch when they are actually used.
>>>> * Some simplifications now rmi_xxx() functions allow NULL as an output
>>>> parameter.
>>>> * Improved comments and code layout.
>>>> ---
>>>> arch/arm64/include/asm/kvm_rmi.h | 7 ++
>>>> arch/arm64/kvm/mmu.c | 15 +++-
>>>> arch/arm64/kvm/rmi.c | 145 +++++++++++++++++++++++++++++++
>>>> 3 files changed, 166 insertions(+), 1 deletion(-)
>>>>
>>>> diff --git a/arch/arm64/include/asm/kvm_rmi.h b/arch/arm64/include/asm/kvm_rmi.h
>>>> index 0ada525af18f..16a297f3091a 100644
>>>> --- a/arch/arm64/include/asm/kvm_rmi.h
>>>> +++ b/arch/arm64/include/asm/kvm_rmi.h
>>>> @@ -68,5 +68,12 @@ u32 kvm_realm_ipa_limit(void);
>>>>
>>>> int kvm_init_realm_vm(struct kvm *kvm);
>>>> void kvm_destroy_realm(struct kvm *kvm);
>>>> +void kvm_realm_destroy_rtts(struct kvm *kvm);
>>>> +
>>>> +static inline bool kvm_realm_is_private_address(struct realm *realm,
>>>> + unsigned long addr)
>>>> +{
>>>> + return !(addr & BIT(realm->ia_bits - 1));
>>>> +}
>>>>
>>>> #endif /* __ASM_KVM_RMI_H */
>>>> diff --git a/arch/arm64/kvm/mmu.c b/arch/arm64/kvm/mmu.c
>>>> index 9dc242c3b9c8..41152abf55b2 100644
>>>> --- a/arch/arm64/kvm/mmu.c
>>>> +++ b/arch/arm64/kvm/mmu.c
>>>> @@ -1098,10 +1098,23 @@ void stage2_unmap_vm(struct kvm *kvm)
>>>> void kvm_free_stage2_pgd(struct kvm_s2_mmu *mmu)
>>>> {
>>>> struct kvm *kvm = kvm_s2_mmu_to_kvm(mmu);
>>>> - struct kvm_pgtable *pgt = NULL;
>>>> + struct kvm_pgtable *pgt;
>>>>
>>>> write_lock(&kvm->mmu_lock);
>>>> pgt = mmu->pgt;
>>>> + if (kvm_is_realm(kvm) &&
>>>> + (kvm_realm_state(kvm) != REALM_STATE_DEAD &&
>>>> + kvm_realm_state(kvm) != REALM_STATE_NONE)) {
>>>> + write_unlock(&kvm->mmu_lock);
>>>> + kvm_realm_destroy_rtts(kvm);
>>>> +
>>>> + /*
>>>> + * The PGD pages can be reclaimed only after the realm (RD) is
>>>> + * destroyed. We call this again from kvm_destroy_realm() after
>>>> + * the RD is destroyed.
>>>> + */
>>>> + return;
>>>> + }
>>>
>>> Hi,
>>>
>>> I see that kvm_free_stage2_pgd() will be called twice:
>>>
>>> kvm_destroy_vm()
>>> mmu_notifier_unregister()
>>> kvm_mmu_notifier_release()
>>> kvm_flush_shadow_all()
>>> kvm_arch_flush_shadow_all()
>>> kvm_uninit_stage2_mmu()
>>> kvm_free_stage2_pgd()
>>> kvm_arch_destroy_vm()
>>> kvm_destroy_realm()
>>> kvm_free_stage2_pgd()
>>>
>>> At the first call the realm state is REALM_STATE_ACTIVE, at the second
>>> it is REALM_STATE_DEAD. Reading the comment added to
>>> kvm_free_stage2_pgd() here, does it mean this function is called twice
>>> on purpose? If so do you think it's better to extract this and create
>>> another function instead, then use kvm_is_realm() to choose which to
>>> run? I think it is confusing to have this function run twice for a
>>> realm.
>>
>> So the issue here is that the RMM requires we do things in a different
>> order to a normal VM. For a realm the PGD cannot be destroyed until the
>> realm itself is destroyed - the RMM revent the host undelegating them.
>>
>> So the first call cannot actually do the free - this is the
>> REALM_STATE_ACTIVE case.
>>
>> In kvm_destroy_realm() we tear down the actual realm and undelegate the
>> granules. We then need to actually free the PGD - the "obvious" way of
>> doing that is calling kvm_free_stage2_pgd() as that handles the KVM
>> intricacies - e.g. updating the mmu object.
>>
>> I'm not sure how to structure the code better without causing
>> duplication - I don't want another copy of the cleanup from
>> kvm_free_stage2_pgd() in a CCA specific file because it will most likely
>> get out of sync with the normal VM case. There is a comment added
>> explaining "we call this again" which I hoped would make it less confusing.
>>
>
> Oh, I see, thanks for letting me know!
>
> During this I found in the first call of kvm_free_stage2_pgd() it's doing
> kvm_stage2_unmap_range() and kvm_realm_destroy_rtts(), but they are also
> called in kvm_destroy_realm(), is that intentional?
> If they can be called at kvm_destroy_realm() time, could we just not do
> kvm_free_stage2_pgd() in kvm_uninit_stage2_mmu() for realms?
> And if they should be called in kvm_free_stage2_pgd(), could we refactor
> it to something like:
> (just showing the idea, didn't try compiling or anything)
>
> diff --git a/arch/arm64/kvm/mmu.c b/arch/arm64/kvm/mmu.c
> index 7d7caab8f573..280d2bef8492 100644
> --- a/arch/arm64/kvm/mmu.c
> +++ b/arch/arm64/kvm/mmu.c
> @@ -1030,9 +1030,25 @@ int kvm_init_stage2_mmu(struct kvm *kvm, struct kvm_s2_mmu *mmu, unsigned long t
> return err;
> }
>
> +static void kvm_realm_uninit_stage2(struct kvm_s2_mmu *mmu)
> +{
> + struct kvm *kvm = kvm_s2_mmu_to_kvm(mmu);
> + struct realm *realm = &kvm->arch.realm;
> +
> + WARN_ON(kvm_realm_state(kvm) != REALM_STATE_ACTIVE);
> + write_lock(&kvm->mmu_lock);
> + kvm_stage2_unmap_range(mmu, 0, BIT(realm->ia_bits - 1), true);
> + write_unlock(&kvm->mmu_lock);
> + kvm_realm_destroy_rtts(kvm);
> +}
> +
> void kvm_uninit_stage2_mmu(struct kvm *kvm)
> {
> - kvm_free_stage2_pgd(&kvm->arch.mmu);
> + if (kvm_is_realm(kvm))
> + kvm_realm_uninit_stage2(&kvm->arch.mmu);
> + else
> + kvm_free_stage2_pgd(&kvm->arch.mmu);
> +
> kvm_mmu_free_memory_cache(&kvm->arch.mmu.split_page_cache);
> }
>
> @@ -1117,22 +1133,7 @@ void kvm_free_stage2_pgd(struct kvm_s2_mmu *mmu)
>
> write_lock(&kvm->mmu_lock);
> pgt = mmu->pgt;
> - if (kvm_is_realm(kvm) &&
> - (kvm_realm_state(kvm) != REALM_STATE_DEAD &&
> - kvm_realm_state(kvm) != REALM_STATE_NONE)) {
> - struct realm *realm = &kvm->arch.realm;
> -
> - kvm_stage2_unmap_range(mmu, 0, BIT(realm->ia_bits - 1), true);
> - write_unlock(&kvm->mmu_lock);
> - kvm_realm_destroy_rtts(kvm);
>
> - /*
> - * The PGD pages can be reclaimed only after the realm (RD) is
> - * destroyed. We call this again from kvm_destroy_realm() after
> - * the RD is destroyed.
> - */
> - return;
> - }
> if (pgt) {
> mmu->pgd_phys = 0;
> mmu->pgt = NULL;
>
> Sorry if I missed anything!
No I don't think you've missed anything, that actually does look nicer.
Thanks for the suggestion.
Thanks,
Steve
^ permalink raw reply
* Re: [PATCH v2] dma-buf: heaps: system: document system_cc_shared heap
From: Marek Szyprowski @ 2026-04-10 12:20 UTC (permalink / raw)
To: Sumit Semwal, Jiri Pirko
Cc: dri-devel, linaro-mm-sig, iommu, linux-media, benjamin.gaignard,
Brian.Starkey, jstultz, tjmercier, christian.koenig, robin.murphy,
jgg, leon, ptesarik, catalin.marinas, aneesh.kumar,
suzuki.poulose, steven.price, thomas.lendacky, john.allen,
ashish.kalra, suravee.suthikulpanit, linux-coco
In-Reply-To: <CAO_48GFt21rv0PJd2Csa0O4OEpN053_p__4Zux+m7jQdHSagEg@mail.gmail.com>
On 10.04.2026 14:14, Sumit Semwal wrote:
> On Tue, 7 Apr 2026 at 14:56, Jiri Pirko <jiri@resnulli.us> wrote:
>> From: Jiri Pirko <jiri@nvidia.com>
>>
>> Document the system_cc_shared dma-buf heap that was introduced
>> recently. Describe its purpose, availability conditions and
>> relation to confidential computing VMs.
>>
>> Signed-off-by: Jiri Pirko <jiri@nvidia.com>
>> Reviewed-by: T.J.Mercier <tjmercier@google.com>
> Thank you for the patch!
>
> Marek: Since you're taking the dependent patches through your tree,
> could you please use:
> Acked-by: Sumit Semwal <sumit.semwal@linaro.org>
>
> and take this as well?
Yes, sure. Applied to dma-mapping-for-next. Thanks!
Best regards
--
Marek Szyprowski, PhD
Samsung R&D Institute Poland
^ permalink raw reply
* Re: [PATCH v2] dma-buf: heaps: system: document system_cc_shared heap
From: Sumit Semwal @ 2026-04-10 12:14 UTC (permalink / raw)
To: Jiri Pirko
Cc: dri-devel, linaro-mm-sig, iommu, linux-media, benjamin.gaignard,
Brian.Starkey, jstultz, tjmercier, christian.koenig, m.szyprowski,
robin.murphy, jgg, leon, ptesarik, catalin.marinas, aneesh.kumar,
suzuki.poulose, steven.price, thomas.lendacky, john.allen,
ashish.kalra, suravee.suthikulpanit, linux-coco
In-Reply-To: <20260407092617.635223-1-jiri@resnulli.us>
Hello Jiri,
On Tue, 7 Apr 2026 at 14:56, Jiri Pirko <jiri@resnulli.us> wrote:
>
> From: Jiri Pirko <jiri@nvidia.com>
>
> Document the system_cc_shared dma-buf heap that was introduced
> recently. Describe its purpose, availability conditions and
> relation to confidential computing VMs.
>
> Signed-off-by: Jiri Pirko <jiri@nvidia.com>
> Reviewed-by: T.J.Mercier <tjmercier@google.com>
Thank you for the patch!
Marek: Since you're taking the dependent patches through your tree,
could you please use:
Acked-by: Sumit Semwal <sumit.semwal@linaro.org>
and take this as well?
Thanks and Best regards,
Sumit.
> ---
> Documentation/userspace-api/dma-buf-heaps.rst | 7 +++++++
> 1 file changed, 7 insertions(+)
>
> diff --git a/Documentation/userspace-api/dma-buf-heaps.rst b/Documentation/userspace-api/dma-buf-heaps.rst
> index 05445c83b79a..f56b743cdb36 100644
> --- a/Documentation/userspace-api/dma-buf-heaps.rst
> +++ b/Documentation/userspace-api/dma-buf-heaps.rst
> @@ -16,6 +16,13 @@ following heaps:
>
> - The ``system`` heap allocates virtually contiguous, cacheable, buffers.
>
> + - The ``system_cc_shared`` heap allocates virtually contiguous, cacheable,
> + buffers using shared (decrypted) memory. It is only present on
> + confidential computing (CoCo) VMs where memory encryption is active
> + (e.g., AMD SEV, Intel TDX). The allocated pages have the encryption
> + bit cleared, making them accessible for device DMA without TDISP
> + support. On non-CoCo VM configurations, this heap is not registered.
> +
> - The ``default_cma_region`` heap allocates physically contiguous,
> cacheable, buffers. Only present if a CMA region is present. Such a
> region is usually created either through the kernel commandline
> --
> 2.51.1
>
^ permalink raw reply
* Re: [PATCH v2 07/19] PCI/TSM: Add Device Security (TVM Guest) ACCEPT operation support
From: Lai, Yi @ 2026-04-10 8:53 UTC (permalink / raw)
To: Dan Williams
Cc: linux-coco, linux-pci, gregkh, aik, aneesh.kumar, yilun.xu,
bhelgaas, alistair23, lukas, jgg
In-Reply-To: <20260303000207.1836586-8-dan.j.williams@intel.com>
On Mon, Mar 02, 2026 at 04:01:55PM -0800, Dan Williams wrote:
> The final operation of the PCIe Trusted Execution Environment (TEE) Device
> Interface Security Protocol (TDISP) is asking the TEE Security Manager
> (TEE) to enable private DMA and MMIO.
>
> The story so far in the security lifecycle of the device is that the VMM
> setup an SPDM session and link encryption with the device's physical
> function0. The VMM then assigned either that physical function or other
> virtual function of that device to a VM. The VM asked the TSM to transition
> the device from TDISP UNLOCKED->LOCKED. With the device LOCKED the VM
> validated signed fresh device evidence and expected MMIO mappings.
>
> The VM now accepts the device to transition it from LOCKED to RUN and tell
> the TSM to unblock DMA to VM private memory.
>
> Implement a sysfs trigger to flip the device to private operation and plumb
> that to a 'struct pci_tsm_ops::accept()' operation.
>
> Co-developed-by: Xu Yilun <yilun.xu@linux.intel.com>
> Signed-off-by: Xu Yilun <yilun.xu@linux.intel.com>
> Co-developed-by: Aneesh Kumar K.V (Arm) <aneesh.kumar@kernel.org>
> Signed-off-by: Aneesh Kumar K.V (Arm) <aneesh.kumar@kernel.org>
> Signed-off-by: Dan Williams <dan.j.williams@intel.com>
> ---
> drivers/pci/Kconfig | 2 +
> Documentation/ABI/testing/sysfs-bus-pci | 13 +++++
> include/linux/pci-tsm.h | 7 ++-
> drivers/pci/tsm.c | 69 ++++++++++++++++++++++++-
> 4 files changed, 88 insertions(+), 3 deletions(-)
>
> diff --git a/drivers/pci/Kconfig b/drivers/pci/Kconfig
> index e3f848ffb52a..c45c6b978e1d 100644
> --- a/drivers/pci/Kconfig
> +++ b/drivers/pci/Kconfig
> @@ -127,6 +127,8 @@ config PCI_IDE
>
> config PCI_TSM
> bool "PCI TSM: Device security protocol support"
> + depends on ARCH_HAS_CC_PLATFORM
> + select CONFIDENTIAL_DEVICES
> select PCI_IDE
> select PCI_DOE
> select TSM
> diff --git a/Documentation/ABI/testing/sysfs-bus-pci b/Documentation/ABI/testing/sysfs-bus-pci
> index 1ed77b9402a6..c2a5c4fe9373 100644
> --- a/Documentation/ABI/testing/sysfs-bus-pci
> +++ b/Documentation/ABI/testing/sysfs-bus-pci
> @@ -732,3 +732,16 @@ Description:
> 'lock' to teardown the connection. Writes fail with EBUSY if
> this device is bound to a driver. This is a "devsec" TSM
> attribute, see Documentation/ABI/testing/sysfs-class-tsm.
> +
> +What: /sys/bus/pci/devices/.../tsm/accept
> +Contact: linux-coco@lists.linux.dev
> +Description:
> + (RW) Write "1" (or any boolean "true" string) to this file to
> + request that TSM transition the device from the TDISP LOCKED
> + state to the RUN state and arrange the for the secure IOMMU to
> + accept requests with T=1 in the PCIe packet header (TLP)
> + targeting private memory. Per TDISP the only exits from the RUN
> + state are via an explicit unlock request or an event that
> + transitions the device to the ERROR state. Writes fail with
> + EBUSY if this device is bound to a driver. This is a "devsec"
> + TSM attribute, see Documentation/ABI/testing/sysfs-class-tsm.
> diff --git a/include/linux/pci-tsm.h b/include/linux/pci-tsm.h
> index 2a896b83bff9..176d214cd0da 100644
> --- a/include/linux/pci-tsm.h
> +++ b/include/linux/pci-tsm.h
> @@ -66,15 +66,18 @@ struct pci_tsm_ops {
> * pci_tsm') for follow-on security state transitions from the
> * LOCKED state
> * @unlock: destroy TSM context and return device to UNLOCKED state
> + * @accept: accept a locked TDI for use, move it to RUN state
> *
> * Context: @lock and @unlock run under pci_tsm_rwsem held for write to
> - * sync with TSM unregistration and each other. All operations run under
> - * the device lock for mutual exclusion with driver attach and detach.
> + * sync with TSM unregistration and each other. @accept runs under
> + * pci_tsm_rwsem held for read. All operations run under the device lock
> + * for mutual exclusion with driver attach and detach.
> */
> struct_group_tagged(pci_tsm_devsec_ops, devsec_ops,
> struct pci_tsm *(*lock)(struct tsm_dev *tsm_dev,
> struct pci_dev *pdev);
> void (*unlock)(struct pci_tsm *tsm);
> + int (*accept)(struct pci_dev *pdev);
> );
> };
>
> diff --git a/drivers/pci/tsm.c b/drivers/pci/tsm.c
> index 259e75092618..aa93a59d2720 100644
> --- a/drivers/pci/tsm.c
> +++ b/drivers/pci/tsm.c
> @@ -557,6 +557,71 @@ static ssize_t dsm_show(struct device *dev, struct device_attribute *attr,
> }
> static DEVICE_ATTR_RO(dsm);
>
> +/**
> + * pci_tsm_accept() - accept a device for private MMIO+DMA operation
> + * @pdev: PCI device to accept
> + *
> + * "Accept" transitions a device to the run state, it is only suitable to make
> + * that transition from a known DMA-idle (no active mappings) state. The "driver
> + * detached" state is a coarse way to assert that requirement.
> + */
> +static int pci_tsm_accept(struct pci_dev *pdev)
> +{
> + int rc;
> +
> + ACQUIRE(rwsem_read_intr, lock)(&pci_tsm_rwsem);
> + if ((rc = ACQUIRE_ERR(rwsem_read_intr, &lock)))
> + return rc;
> +
> + if (!pdev->tsm)
> + return -EINVAL;
> +
> + ACQUIRE(device_intr, dev_lock)(&pdev->dev);
> + if ((rc = ACQUIRE_ERR(device_intr, &dev_lock)))
> + return rc;
> +
> + if (pdev->dev.driver)
> + return -EBUSY;
> +
> + rc = to_pci_tsm_ops(pdev->tsm)->accept(pdev);
> + if (rc)
> + return rc;
> +
> + return device_cc_accept(&pdev->dev);
> +}
> +
# Re-send to Dan's kernel.org address. Sorry if you receive the same
# email twice.
Repeated accept on a device that is already in RUN state is not rejected
by the PCI TSM core, and multiple encrypted MMIO resources for the same
BAR range can be created. Furthermore, a later request to move the
device to UNLOCKED state only removes the most recently tracked
encrypted range.
Reproduce steps:
1. echo tsmX > /sys/bus/pci/devices/<bdf>/tsm/lock
2. echo 1 > /sys/bus/pci/devices/<bdf>/tsm/accept
3. echo 1 > /sys/bus/pci/devices/<bdf>/tsm/accept
4. cat /proc/iomem | grep "PCI MMIO Encrypted"
5. echo tsmX > /sys/bus/pci/devices/<bdf>/tsm/unlock
6. cat /proc/iomem | grep "PCI MMIO Encrypted"
Observed results after step4 (duplicate BAR range):
380002000000-3800021fffff : PCI MMIO Encrypted
380002000000-3800021fffff : PCI MMIO Encrypted
Observed results after step 6 (leaked resource):
380002000000-3800021fffff : PCI MMIO Encrypted
Regards,
Yi Lai
> +static ssize_t accept_store(struct device *dev, struct device_attribute *attr,
> + const char *buf, size_t len)
> +{
> + struct pci_dev *pdev = to_pci_dev(dev);
> + bool accept;
> + int rc;
> +
> + rc = kstrtobool(buf, &accept);
> + if (rc)
> + return rc;
> +
> + /*
> + * TDISP can only go from RUN to UNLOCKED/ERROR, so there is no
> + * 'unaccept' verb.
> + */
> + if (!accept)
> + return -EINVAL;
> +
> + rc = pci_tsm_accept(pdev);
> + if (rc)
> + return rc;
> +
> + return len;
> +}
> +
> +static ssize_t accept_show(struct device *dev, struct device_attribute *attr,
> + char *buf)
> +{
> + return sysfs_emit(buf, "%d\n", device_cc_accepted(dev));
> +}
> +static DEVICE_ATTR_RW(accept);
> +
> /**
> * pci_tsm_unlock() - Transition TDI from LOCKED/RUN to UNLOCKED
> * @pdev: TDI device to unlock
> @@ -740,7 +805,8 @@ static umode_t pci_tsm_attr_visible(struct kobject *kobj,
> }
>
> if (pci_tsm_devsec_group_visible(kobj)) {
> - if (attr == &dev_attr_lock.attr ||
> + if (attr == &dev_attr_accept.attr ||
> + attr == &dev_attr_lock.attr ||
> attr == &dev_attr_unlock.attr)
> return attr->mode;
> }
> @@ -760,6 +826,7 @@ static struct attribute *pci_tsm_attrs[] = {
> &dev_attr_disconnect.attr,
> &dev_attr_bound.attr,
> &dev_attr_dsm.attr,
> + &dev_attr_accept.attr,
> &dev_attr_lock.attr,
> &dev_attr_unlock.attr,
> NULL
> --
> 2.52.0
>
^ permalink raw reply
* Re: [PATCH v2 07/19] PCI/TSM: Add Device Security (TVM Guest) ACCEPT operation support
From: Lai, Yi @ 2026-04-10 8:44 UTC (permalink / raw)
To: Dan Williams, djbw
Cc: linux-coco, linux-pci, gregkh, aik, aneesh.kumar, yilun.xu,
bhelgaas, alistair23, lukas, jgg, djbw
In-Reply-To: <20260303000207.1836586-8-dan.j.williams@intel.com>
On Mon, Mar 02, 2026 at 04:01:55PM -0800, Dan Williams wrote:
> The final operation of the PCIe Trusted Execution Environment (TEE) Device
> Interface Security Protocol (TDISP) is asking the TEE Security Manager
> (TEE) to enable private DMA and MMIO.
>
> The story so far in the security lifecycle of the device is that the VMM
> setup an SPDM session and link encryption with the device's physical
> function0. The VMM then assigned either that physical function or other
> virtual function of that device to a VM. The VM asked the TSM to transition
> the device from TDISP UNLOCKED->LOCKED. With the device LOCKED the VM
> validated signed fresh device evidence and expected MMIO mappings.
>
> The VM now accepts the device to transition it from LOCKED to RUN and tell
> the TSM to unblock DMA to VM private memory.
>
> Implement a sysfs trigger to flip the device to private operation and plumb
> that to a 'struct pci_tsm_ops::accept()' operation.
>
> Co-developed-by: Xu Yilun <yilun.xu@linux.intel.com>
> Signed-off-by: Xu Yilun <yilun.xu@linux.intel.com>
> Co-developed-by: Aneesh Kumar K.V (Arm) <aneesh.kumar@kernel.org>
> Signed-off-by: Aneesh Kumar K.V (Arm) <aneesh.kumar@kernel.org>
> Signed-off-by: Dan Williams <dan.j.williams@intel.com>
> ---
> drivers/pci/Kconfig | 2 +
> Documentation/ABI/testing/sysfs-bus-pci | 13 +++++
> include/linux/pci-tsm.h | 7 ++-
> drivers/pci/tsm.c | 69 ++++++++++++++++++++++++-
> 4 files changed, 88 insertions(+), 3 deletions(-)
>
> diff --git a/drivers/pci/Kconfig b/drivers/pci/Kconfig
> index e3f848ffb52a..c45c6b978e1d 100644
> --- a/drivers/pci/Kconfig
> +++ b/drivers/pci/Kconfig
> @@ -127,6 +127,8 @@ config PCI_IDE
>
> config PCI_TSM
> bool "PCI TSM: Device security protocol support"
> + depends on ARCH_HAS_CC_PLATFORM
> + select CONFIDENTIAL_DEVICES
> select PCI_IDE
> select PCI_DOE
> select TSM
> diff --git a/Documentation/ABI/testing/sysfs-bus-pci b/Documentation/ABI/testing/sysfs-bus-pci
> index 1ed77b9402a6..c2a5c4fe9373 100644
> --- a/Documentation/ABI/testing/sysfs-bus-pci
> +++ b/Documentation/ABI/testing/sysfs-bus-pci
> @@ -732,3 +732,16 @@ Description:
> 'lock' to teardown the connection. Writes fail with EBUSY if
> this device is bound to a driver. This is a "devsec" TSM
> attribute, see Documentation/ABI/testing/sysfs-class-tsm.
> +
> +What: /sys/bus/pci/devices/.../tsm/accept
> +Contact: linux-coco@lists.linux.dev
> +Description:
> + (RW) Write "1" (or any boolean "true" string) to this file to
> + request that TSM transition the device from the TDISP LOCKED
> + state to the RUN state and arrange the for the secure IOMMU to
> + accept requests with T=1 in the PCIe packet header (TLP)
> + targeting private memory. Per TDISP the only exits from the RUN
> + state are via an explicit unlock request or an event that
> + transitions the device to the ERROR state. Writes fail with
> + EBUSY if this device is bound to a driver. This is a "devsec"
> + TSM attribute, see Documentation/ABI/testing/sysfs-class-tsm.
> diff --git a/include/linux/pci-tsm.h b/include/linux/pci-tsm.h
> index 2a896b83bff9..176d214cd0da 100644
> --- a/include/linux/pci-tsm.h
> +++ b/include/linux/pci-tsm.h
> @@ -66,15 +66,18 @@ struct pci_tsm_ops {
> * pci_tsm') for follow-on security state transitions from the
> * LOCKED state
> * @unlock: destroy TSM context and return device to UNLOCKED state
> + * @accept: accept a locked TDI for use, move it to RUN state
> *
> * Context: @lock and @unlock run under pci_tsm_rwsem held for write to
> - * sync with TSM unregistration and each other. All operations run under
> - * the device lock for mutual exclusion with driver attach and detach.
> + * sync with TSM unregistration and each other. @accept runs under
> + * pci_tsm_rwsem held for read. All operations run under the device lock
> + * for mutual exclusion with driver attach and detach.
> */
> struct_group_tagged(pci_tsm_devsec_ops, devsec_ops,
> struct pci_tsm *(*lock)(struct tsm_dev *tsm_dev,
> struct pci_dev *pdev);
> void (*unlock)(struct pci_tsm *tsm);
> + int (*accept)(struct pci_dev *pdev);
> );
> };
>
> diff --git a/drivers/pci/tsm.c b/drivers/pci/tsm.c
> index 259e75092618..aa93a59d2720 100644
> --- a/drivers/pci/tsm.c
> +++ b/drivers/pci/tsm.c
> @@ -557,6 +557,71 @@ static ssize_t dsm_show(struct device *dev, struct device_attribute *attr,
> }
> static DEVICE_ATTR_RO(dsm);
>
> +/**
> + * pci_tsm_accept() - accept a device for private MMIO+DMA operation
> + * @pdev: PCI device to accept
> + *
> + * "Accept" transitions a device to the run state, it is only suitable to make
> + * that transition from a known DMA-idle (no active mappings) state. The "driver
> + * detached" state is a coarse way to assert that requirement.
> + */
Hi Dan,
Repeated accept on a device that is already in RUN state is not rejected
by the PCI TSM core, and multiple encrypted MMIO resources for the same
BAR range can be created. Furthermore, a later request to move the
device to UNLOCKED state only removes the most recently tracked
encrypted range.
Reproduce steps:
1. echo tsmX > /sys/bus/pci/devices/<bdf>/tsm/lock
2. echo 1 > /sys/bus/pci/devices/<bdf>/tsm/accept
3. echo 1 > /sys/bus/pci/devices/<bdf>/tsm/accept
4. cat /proc/iomem | grep "PCI MMIO Encrypted"
5. echo tsmX > /sys/bus/pci/devices/<bdf>/tsm/unlock
6. cat /proc/iomem | grep "PCI MMIO Encrypted"
Observed results after step4 (duplicate BAR range):
380002000000-3800021fffff : PCI MMIO Encrypted
380002000000-3800021fffff : PCI MMIO Encrypted
Observed results after step 6 (leaked resource):
380002000000-3800021fffff : PCI MMIO Encrypted
Regards,
Yi Lai
> +static int pci_tsm_accept(struct pci_dev *pdev)
> +{
> + int rc;
> +
> + ACQUIRE(rwsem_read_intr, lock)(&pci_tsm_rwsem);
> + if ((rc = ACQUIRE_ERR(rwsem_read_intr, &lock)))
> + return rc;
> +
> + if (!pdev->tsm)
> + return -EINVAL;
> +
> + ACQUIRE(device_intr, dev_lock)(&pdev->dev);
> + if ((rc = ACQUIRE_ERR(device_intr, &dev_lock)))
> + return rc;
> +
> + if (pdev->dev.driver)
> + return -EBUSY;
> +
> + rc = to_pci_tsm_ops(pdev->tsm)->accept(pdev);
> + if (rc)
> + return rc;
> +
> + return device_cc_accept(&pdev->dev);
> +}
> +
> +static ssize_t accept_store(struct device *dev, struct device_attribute *attr,
> + const char *buf, size_t len)
> +{
> + struct pci_dev *pdev = to_pci_dev(dev);
> + bool accept;
> + int rc;
> +
> + rc = kstrtobool(buf, &accept);
> + if (rc)
> + return rc;
> +
> + /*
> + * TDISP can only go from RUN to UNLOCKED/ERROR, so there is no
> + * 'unaccept' verb.
> + */
> + if (!accept)
> + return -EINVAL;
> +
> + rc = pci_tsm_accept(pdev);
> + if (rc)
> + return rc;
> +
> + return len;
> +}
> +
> +static ssize_t accept_show(struct device *dev, struct device_attribute *attr,
> + char *buf)
> +{
> + return sysfs_emit(buf, "%d\n", device_cc_accepted(dev));
> +}
> +static DEVICE_ATTR_RW(accept);
> +
> /**
> * pci_tsm_unlock() - Transition TDI from LOCKED/RUN to UNLOCKED
> * @pdev: TDI device to unlock
> @@ -740,7 +805,8 @@ static umode_t pci_tsm_attr_visible(struct kobject *kobj,
> }
>
> if (pci_tsm_devsec_group_visible(kobj)) {
> - if (attr == &dev_attr_lock.attr ||
> + if (attr == &dev_attr_accept.attr ||
> + attr == &dev_attr_lock.attr ||
> attr == &dev_attr_unlock.attr)
> return attr->mode;
> }
> @@ -760,6 +826,7 @@ static struct attribute *pci_tsm_attrs[] = {
> &dev_attr_disconnect.attr,
> &dev_attr_bound.attr,
> &dev_attr_dsm.attr,
> + &dev_attr_accept.attr,
> &dev_attr_lock.attr,
> &dev_attr_unlock.attr,
> NULL
> --
> 2.52.0
>
^ permalink raw reply
* Re: [PATCH 2/2] x86/tdx: Accept hotplugged memory before online
From: David Hildenbrand (Arm) @ 2026-04-10 7:49 UTC (permalink / raw)
To: Duan, Zhenzhong, Marc-André Lureau
Cc: Edgecombe, Rick P, Reshetova, Elena, pbonzini@redhat.com,
prsampat@amd.com, x86@kernel.org, kas@kernel.org,
dave.hansen@linux.intel.com, linux-kernel@vger.kernel.org,
mingo@redhat.com, bp@alien8.de, Qiang, Chenyi, tglx@kernel.org,
hpa@zytor.com, kvm@vger.kernel.org, linux-coco@lists.linux.dev
In-Reply-To: <IA3PR11MB9136BA4C593C5254EDD9D62092592@IA3PR11MB9136.namprd11.prod.outlook.com>
On 4/10/26 03:05, Duan, Zhenzhong wrote:
>
>
>> -----Original Message-----
>> From: Marc-André Lureau <marcandre.lureau@redhat.com>
>> Subject: Re: [PATCH 2/2] x86/tdx: Accept hotplugged memory before online
>>
>> Hi
>>
>> On Thu, Apr 9, 2026 at 5:36 AM Duan, Zhenzhong <zhenzhong.duan@intel.com>
>> wrote:
>>>
>>>
>>>
>> how
>>>
>>> For that solution, analog to hotplug, TDX Connect needs a hot-unplug handler to
>>> use "release" seamcall to unaccept private memory before unplug, that's it. But
>>> if the zapping S-EPT will not happen in host, I think this "release" seamcall is also
>>> unnecessary for TDX Connect.
>>>
>>> I also have a silly question which I looked over this thread and didn't find answer.
>>> Do we have to support private memory hotplug, what benefit we get to support
>> it?
>>> If we only allow shared memory plug/unplug to TD, then we don't need this
>> series.
>>> Guest decides to convert shared memory to private after plug and do the
>> opposite before unplug.
>>> This works for both TDX connect and memory unplug as memory release is
>> implicitly triggered
>>> in memory convert.
>>
>> I did some successful experiments with modified QEMU & kernel, this
>> seems to work.
>
> Good to see, thanks for verifying.
>
>>
>> On virtio-mem plug, set_memory_encrypted() makes the memory private +
>> accepted. On unplug, make it return to shared with
>> set_memory_decrypted(). QEMU handles REQ_UNPLUG and can punch both
>> shared & guest_memfd planes (which will TDH.MEM.PAGE.REMOVE).
>> Re-plugging also works fine.
>
> If guest called set_memory_decrypted() on unplug, QEMU punching
> guest_memfd in REQ_UNPLUG is unnecessary as it's already taken during
> memory convert. So just to confirm, you want QEMU to take cover the case
> when guest failed on set_memory_decrypted() or never called it?
Once we have in-place conversion with guest_memfd, the punching will be
required, though.
--
Cheers,
David
^ permalink raw reply
* Re: [PATCH v2 00/16] fs,x86/resctrl: Add kernel-mode (e.g., PLZA) support to the resctrl subsystem
From: Reinette Chatre @ 2026-04-10 3:41 UTC (permalink / raw)
To: Moger, Babu, Babu Moger, corbet@lwn.net, tony.luck@intel.com,
Dave.Martin@arm.com, james.morse@arm.com, tglx@kernel.org,
mingo@redhat.com, bp@alien8.de, dave.hansen@linux.intel.com
Cc: skhan@linuxfoundation.org, x86@kernel.org, hpa@zytor.com,
peterz@infradead.org, juri.lelli@redhat.com,
vincent.guittot@linaro.org, dietmar.eggemann@arm.com,
rostedt@goodmis.org, bsegall@google.com, mgorman@suse.de,
vschneid@redhat.com, kas@kernel.org, rick.p.edgecombe@intel.com,
akpm@linux-foundation.org, pmladek@suse.com,
rdunlap@infradead.org, dapeng1.mi@linux.intel.com,
kees@kernel.org, elver@google.com, paulmck@kernel.org,
lirongqing@baidu.com, safinaskar@gmail.com, fvdl@google.com,
seanjc@google.com, pawan.kumar.gupta@linux.intel.com,
xin@zytor.com, tiala@microsoft.com, chang.seok.bae@intel.com,
Lendacky, Thomas, elena.reshetova@intel.com,
linux-doc@vger.kernel.org, linux-kernel@vger.kernel.org,
linux-coco@lists.linux.dev, kvm@vger.kernel.org,
eranian@google.com, peternewman@google.com
In-Reply-To: <90f4a692-1c27-4967-bf12-ec3cb597681d@amd.com>
Hi Babu,
On 4/9/26 4:42 PM, Moger, Babu wrote:
> Hi Reinette,
>
> On 4/9/2026 3:50 PM, Reinette Chatre wrote:
>> Hi Babu,
>>
>> On 4/9/26 11:05 AM, Moger, Babu wrote:
>>> On 4/9/2026 12:26 PM, Reinette Chatre wrote:
>>>> On 4/9/26 10:19 AM, Moger, Babu wrote:
>>>>> On 4/8/2026 6:41 PM, Reinette Chatre wrote:
>>>>
>>>>>> When the user switches to either "global_assign_ctrl_inherit_mon_per_cpu" or
>>>>>> 'global_assign_ctrl_assign_mon_per_cpu" then "info/kernel_mode_assignment" is created
>>>>>> (or made visible to user space) and is expected to point to default group.
>>>>>> User can change the group using "info/kernel_mode_assignment" at this point.
>>>>>>
>>>>>> If the current scenario is below ...
>>>>>> # cat info/kernel_mode
>>>>>> [global_assign_ctrl_inherit_mon_per_cpu]
>>>>>> inherit_ctrl_and_mon
>>>>>> global_assign_ctrl_assign_mon_per_cpu
>>>>>>
>>>>>> ... then "info/kernel_mode_assignment" will exist but what it should contain if
>>>>>> user switches mode at this point may be up for discussion.
>>>>>>
>>>>>> option 1)
>>>>>> When user switches mode to "global_assign_ctrl_assign_mon_per_cpu" then
>>>>>> the resource group in "info/kernel_mode_assignment" is reset to the
>>>>>> default group and all CPUs PLZA state reset to match. The kernel_mode_cpus
>>>>>> and kernel_mode_cpuslist files become visible in default resource group
>>>>>> and they contain "all online CPUs".
>>>>>>
>>>>>> option 2)
>>>>>> When user switches mode to "global_assign_ctrl_assign_mon_per_cpu" then
>>>>>> the resource group in "info/kernel_mode_assignment" is kept and all
>>>>>> CPUs PLZA state set to match it while also keeping the current
>>>>>> values of that resource group's kernel_mode_cpus and kernel_mode_cpuslist
>>>>>> files.
>>>>>>
>>>>>> I am leaning towards "option 1" to keep it consistent with a switch from
>>>>>> "inherit_ctrl_and_mon" and being deterministic about how a mode is started with
>>>>>
>>>>> Yes. The "option 1" seems appropriate.
>>>>>
>>>>>> a clean slate. What are your thoughts? What would be use case where a user would
>>>>>> want to switch between "global_assign_ctrl_inherit_mon_per_cpu" and
>>>>>> "global_assign_ctrl_assign_mon_per_cpu" to just switch rmid_en on and off?
>>>>>
>>>>>
>>>>> This is a bit tricky.
>>>>>
>>>>> Currently, our requirement is to have a CTRL_MON group for
>>>>> global_assign_ctrl_inherit_mon_per_cpu. In this scenario, we use the
>>>>> group’s CLOSID for PLZA configuration, and RMID is not used (rmid_en
>>>>> = 0) when setting up PLZA.
>>>>>
>>>>> Our requirement is also to have a CTRL_MON/MON group for
>>>>> global_assign_ctrl_assign_mon_per_cpu. In this case as well, the
>>>>> group’s CLOSID and RMID (rmid_en = 1) both are used configure PLZA.
>>>>
>>>> ah, right. Good catch.
>>>>
>>>>>
>>>>> Actually, we should not allow these changes from
>>>>> global_assign_ctrl_inherit_mon_per_cpu to
>>>>> global_assign_ctrl_assign_mon_per_cpu or visa versa.
>>>>
>>>> resctrl could allow it but as part of the switch it resets the "kernel mode group" to
>>>> be the default group every time? This would be the "option 1" above.
>>>
>>> Other options.
>>>
>>> Allow global_assign_ctrl_inherit_mon_per_cpu -> global_assign_ctrl_assign_mon_per_cpu. As part of the switch, reset the "kernel mode group" to the default group.
>>>
>>> Allow global_assign_ctrl_assign_mon_per_cpu -> global_assign_ctrl_inherit_mon_per_cpu. In this case switch
>>> to CTRL_MON/MON -> CTRL_MON.
>>>
>>
>> ok. Could you please return the courtesy of providing feedback on the
>> suggestion you are responding to and also include the motivation why your
>> suggestion is the better option?
>
> Yea. Sure.
>
> We need to allow the switch between the modes. Otherwise only way to reset is to remount the resctrl filesystem. That is not a good option.
>
> Allow global_assign_ctrl_inherit_mon_per_cpu -> global_assign_ctrl_assign_mon_per_cpu. As part of the switch, reset the "kernel mode group" to the default group.
>
> This option is same as you suggested.
>
> Allow global_assign_ctrl_assign_mon_per_cpu -> global_assign_ctrl_inherit_mon_per_cpu. In this case switch
> to CTRL_MON/MON -> CTRL_MON. This option basically disables monitor (rmid_en=0). It is less disruptive. Move is between child group to parent group.
ok. I am concerned that this creates an inconsistent interface. Specifically, sometimes
when switching the mode the kernel group will reset and sometimes it won't. This inconsistency
may be more apparent when writing the user documentation as part of this work. If you are
able to clearly explain how this resctrl fs interface behaves (this cannot be about PLZA
internals as above) then this could work.
Reinette
^ permalink raw reply
* RE: [PATCH 2/2] x86/tdx: Accept hotplugged memory before online
From: Duan, Zhenzhong @ 2026-04-10 1:05 UTC (permalink / raw)
To: Marc-André Lureau, David Hildenbrand
Cc: Edgecombe, Rick P, Reshetova, Elena, pbonzini@redhat.com,
prsampat@amd.com, x86@kernel.org, kas@kernel.org,
dave.hansen@linux.intel.com, linux-kernel@vger.kernel.org,
mingo@redhat.com, bp@alien8.de, Qiang, Chenyi, tglx@kernel.org,
hpa@zytor.com, kvm@vger.kernel.org, linux-coco@lists.linux.dev
In-Reply-To: <CAMxuvawVL7-9+8U=Yf7_pJcsbCivgJqFStrXAtAaiBEqtZnc1A@mail.gmail.com>
>-----Original Message-----
>From: Marc-André Lureau <marcandre.lureau@redhat.com>
>Subject: Re: [PATCH 2/2] x86/tdx: Accept hotplugged memory before online
>
>Hi
>
>On Thu, Apr 9, 2026 at 5:36 AM Duan, Zhenzhong <zhenzhong.duan@intel.com>
>wrote:
>>
>>
>>
>> >-----Original Message-----
>> >From: Edgecombe, Rick P <rick.p.edgecombe@intel.com>
>> >Subject: Re: [PATCH 2/2] x86/tdx: Accept hotplugged memory before online
>> >
>> >On Fri, 2026-04-03 at 10:37 +0000, Reshetova, Elena wrote:
>> >> > > > So the part about whether a triggered accept succeeds or returns an
>> >> > > > already accepted error is already under the control of the host. > >
>> >> > > > I.e., if we don't have the zeroing behavior, the host can already > >
>> >> > > > cause the page to get zeroed. So I don't think anything is > >
>> >> > > > regressed. Both come down to how careful the guest is about what it > >
>> >> > > > accepts.
>> >> >
>> >> > Yes, and my point is that we should not allow guest to freely double
>> >> > accepting ever.
>> >> > For any use case that requires releasing memory and accepting it > back, it
>> >> > should be explicit action by the guest to track that memory > has been
>> >> > "released" (under correct and safe conditions) and then it > is ok to accept
>> >> > it back (even if it doesnt mean physically accepting > it) and in this case
>> >> > it is ok (and even strongly desired) to zero the > page to simulate the
>> >> > normal accept behaviour.
>> >
>> >Hmm, it doesn't seem like you engaged with my point. Or at least I'm not
>> >following what is exposed?
>> >
>> >So I'm going to assume you agree that this procedure would not open up any
>> >specific new capabilities for the host that don't exist today. And instead you
>> >are just saying that the guest should have infrastructure to not double accept
>> >memory in the first place.
>> >
>> >But the problem here is not that the guest losing track of the accept state
>> >actually. It is that the guest relies on the host to actually zap the S-EPT
>> >before re-plugging memory at the same physical address space. So the guest is
>> >tracking that the memory is released correctly. Better tracking will not help.
>> >It relies on host behavior to not hit a double accept.
>> >
>> >TDX connect will use this "unaccept" seamcall, so I asked Zhenzhong (Cced)
>how
>> >much of what we need for that solution will just get added for TDX connect
>> >anyway. It seems like we should make sure the same solution will work for both
>> >SNP and TDX and keep the options open at this stage.
>>
>> For that solution, analog to hotplug, TDX Connect needs a hot-unplug handler to
>> use "release" seamcall to unaccept private memory before unplug, that's it. But
>> if the zapping S-EPT will not happen in host, I think this "release" seamcall is also
>> unnecessary for TDX Connect.
>>
>> I also have a silly question which I looked over this thread and didn't find answer.
>> Do we have to support private memory hotplug, what benefit we get to support
>it?
>> If we only allow shared memory plug/unplug to TD, then we don't need this
>series.
>> Guest decides to convert shared memory to private after plug and do the
>opposite before unplug.
>> This works for both TDX connect and memory unplug as memory release is
>implicitly triggered
>> in memory convert.
>
>I did some successful experiments with modified QEMU & kernel, this
>seems to work.
Good to see, thanks for verifying.
>
>On virtio-mem plug, set_memory_encrypted() makes the memory private +
>accepted. On unplug, make it return to shared with
>set_memory_decrypted(). QEMU handles REQ_UNPLUG and can punch both
>shared & guest_memfd planes (which will TDH.MEM.PAGE.REMOVE).
>Re-plugging also works fine.
If guest called set_memory_decrypted() on unplug, QEMU punching
guest_memfd in REQ_UNPLUG is unnecessary as it's already taken during
memory convert. So just to confirm, you want QEMU to take cover the case
when guest failed on set_memory_decrypted() or never called it?
>
>The virtio spec should probably be updated to explicitly define the
>shared state on unplug and the private state on plug, driven by the
>guest/driver. Those are KVM memory attributes, I suppose this is
>generic enough.
Yes.
Thanks
Zhenzhong
^ permalink raw reply
* Re: [PATCH v2 00/16] fs,x86/resctrl: Add kernel-mode (e.g., PLZA) support to the resctrl subsystem
From: Moger, Babu @ 2026-04-09 23:42 UTC (permalink / raw)
To: Reinette Chatre, Babu Moger, corbet@lwn.net, tony.luck@intel.com,
Dave.Martin@arm.com, james.morse@arm.com, tglx@kernel.org,
mingo@redhat.com, bp@alien8.de, dave.hansen@linux.intel.com
Cc: skhan@linuxfoundation.org, x86@kernel.org, hpa@zytor.com,
peterz@infradead.org, juri.lelli@redhat.com,
vincent.guittot@linaro.org, dietmar.eggemann@arm.com,
rostedt@goodmis.org, bsegall@google.com, mgorman@suse.de,
vschneid@redhat.com, kas@kernel.org, rick.p.edgecombe@intel.com,
akpm@linux-foundation.org, pmladek@suse.com,
rdunlap@infradead.org, dapeng1.mi@linux.intel.com,
kees@kernel.org, elver@google.com, paulmck@kernel.org,
lirongqing@baidu.com, safinaskar@gmail.com, fvdl@google.com,
seanjc@google.com, pawan.kumar.gupta@linux.intel.com,
xin@zytor.com, tiala@microsoft.com, chang.seok.bae@intel.com,
Lendacky, Thomas, elena.reshetova@intel.com,
linux-doc@vger.kernel.org, linux-kernel@vger.kernel.org,
linux-coco@lists.linux.dev, kvm@vger.kernel.org,
eranian@google.com, peternewman@google.com
In-Reply-To: <973067bf-6e6c-446a-a81a-713840d701a9@intel.com>
Hi Reinette,
On 4/9/2026 3:50 PM, Reinette Chatre wrote:
> Hi Babu,
>
> On 4/9/26 11:05 AM, Moger, Babu wrote:
>> On 4/9/2026 12:26 PM, Reinette Chatre wrote:
>>> On 4/9/26 10:19 AM, Moger, Babu wrote:
>>>> On 4/8/2026 6:41 PM, Reinette Chatre wrote:
>>>
>>>>> When the user switches to either "global_assign_ctrl_inherit_mon_per_cpu" or
>>>>> 'global_assign_ctrl_assign_mon_per_cpu" then "info/kernel_mode_assignment" is created
>>>>> (or made visible to user space) and is expected to point to default group.
>>>>> User can change the group using "info/kernel_mode_assignment" at this point.
>>>>>
>>>>> If the current scenario is below ...
>>>>> # cat info/kernel_mode
>>>>> [global_assign_ctrl_inherit_mon_per_cpu]
>>>>> inherit_ctrl_and_mon
>>>>> global_assign_ctrl_assign_mon_per_cpu
>>>>>
>>>>> ... then "info/kernel_mode_assignment" will exist but what it should contain if
>>>>> user switches mode at this point may be up for discussion.
>>>>>
>>>>> option 1)
>>>>> When user switches mode to "global_assign_ctrl_assign_mon_per_cpu" then
>>>>> the resource group in "info/kernel_mode_assignment" is reset to the
>>>>> default group and all CPUs PLZA state reset to match. The kernel_mode_cpus
>>>>> and kernel_mode_cpuslist files become visible in default resource group
>>>>> and they contain "all online CPUs".
>>>>>
>>>>> option 2)
>>>>> When user switches mode to "global_assign_ctrl_assign_mon_per_cpu" then
>>>>> the resource group in "info/kernel_mode_assignment" is kept and all
>>>>> CPUs PLZA state set to match it while also keeping the current
>>>>> values of that resource group's kernel_mode_cpus and kernel_mode_cpuslist
>>>>> files.
>>>>>
>>>>> I am leaning towards "option 1" to keep it consistent with a switch from
>>>>> "inherit_ctrl_and_mon" and being deterministic about how a mode is started with
>>>>
>>>> Yes. The "option 1" seems appropriate.
>>>>
>>>>> a clean slate. What are your thoughts? What would be use case where a user would
>>>>> want to switch between "global_assign_ctrl_inherit_mon_per_cpu" and
>>>>> "global_assign_ctrl_assign_mon_per_cpu" to just switch rmid_en on and off?
>>>>
>>>>
>>>> This is a bit tricky.
>>>>
>>>> Currently, our requirement is to have a CTRL_MON group for
>>>> global_assign_ctrl_inherit_mon_per_cpu. In this scenario, we use the
>>>> group’s CLOSID for PLZA configuration, and RMID is not used (rmid_en
>>>> = 0) when setting up PLZA.
>>>>
>>>> Our requirement is also to have a CTRL_MON/MON group for
>>>> global_assign_ctrl_assign_mon_per_cpu. In this case as well, the
>>>> group’s CLOSID and RMID (rmid_en = 1) both are used configure PLZA.
>>>
>>> ah, right. Good catch.
>>>
>>>>
>>>> Actually, we should not allow these changes from
>>>> global_assign_ctrl_inherit_mon_per_cpu to
>>>> global_assign_ctrl_assign_mon_per_cpu or visa versa.
>>>
>>> resctrl could allow it but as part of the switch it resets the "kernel mode group" to
>>> be the default group every time? This would be the "option 1" above.
>>
>> Other options.
>>
>> Allow global_assign_ctrl_inherit_mon_per_cpu -> global_assign_ctrl_assign_mon_per_cpu. As part of the switch, reset the "kernel mode group" to the default group.
>>
>> Allow global_assign_ctrl_assign_mon_per_cpu -> global_assign_ctrl_inherit_mon_per_cpu. In this case switch
>> to CTRL_MON/MON -> CTRL_MON.
>>
>
> ok. Could you please return the courtesy of providing feedback on the
> suggestion you are responding to and also include the motivation why your
> suggestion is the better option?
Yea. Sure.
We need to allow the switch between the modes. Otherwise only way to
reset is to remount the resctrl filesystem. That is not a good option.
Allow global_assign_ctrl_inherit_mon_per_cpu ->
global_assign_ctrl_assign_mon_per_cpu. As part of the switch, reset the
"kernel mode group" to the default group.
This option is same as you suggested.
Allow global_assign_ctrl_assign_mon_per_cpu ->
global_assign_ctrl_inherit_mon_per_cpu. In this case switch
to CTRL_MON/MON -> CTRL_MON. This option basically disables monitor
(rmid_en=0). It is less disruptive. Move is between child group to
parent group.
Thanks
Babu
^ permalink raw reply
* [PATCH v2 6/6] KVM: x86: Use a proper bitmap for tracking available/dirty registers
From: Sean Christopherson @ 2026-04-09 22:42 UTC (permalink / raw)
To: Sean Christopherson, Paolo Bonzini, Kiryl Shutsemau
Cc: kvm, x86, linux-coco, linux-kernel, Chang S . Bae
In-Reply-To: <20260409224236.2021562-1-seanjc@google.com>
Define regs_{avail,dirty} as bitmaps instead of U32s to harden against
overflow, and to allow for dynamically sizing the bitmaps when APX comes
along, which will add 16 more GPRs (R16-R31) and thus increase the total
number of registers beyond 32.
Open code writes in the "reset" APIs, as the writes are hot paths and
bitmap_write() is complete overkill for what KVM needs. Even better,
hardcoding writes to entry '0' in the array is a perfect excuse to assert
that the array contains exactly one entry, e.g. to effectively add guard
against defining R16-R31 in 32-bit kernels.
For all intents and purposes, no functional change intended even though
using bitmap_fill() will mean "undefined" registers are no longer marked
available and dirty (KVM should never be querying those bits).
Signed-off-by: Sean Christopherson <seanjc@google.com>
---
arch/x86/include/asm/kvm_host.h | 6 ++++--
arch/x86/kvm/kvm_cache_regs.h | 20 ++++++++++++--------
arch/x86/kvm/x86.c | 4 ++--
3 files changed, 18 insertions(+), 12 deletions(-)
diff --git a/arch/x86/include/asm/kvm_host.h b/arch/x86/include/asm/kvm_host.h
index c47eb294c066..ef0c368676c5 100644
--- a/arch/x86/include/asm/kvm_host.h
+++ b/arch/x86/include/asm/kvm_host.h
@@ -211,6 +211,8 @@ enum kvm_reg {
VCPU_REG_SEGMENTS,
VCPU_REG_EXIT_INFO_1,
VCPU_REG_EXIT_INFO_2,
+
+ NR_VCPU_TOTAL_REGS,
};
enum {
@@ -802,8 +804,8 @@ struct kvm_vcpu_arch {
*/
unsigned long regs[NR_VCPU_GENERAL_PURPOSE_REGS];
unsigned long rip;
- unsigned long regs_avail;
- unsigned long regs_dirty;
+ DECLARE_BITMAP(regs_avail, NR_VCPU_TOTAL_REGS);
+ DECLARE_BITMAP(regs_dirty, NR_VCPU_TOTAL_REGS);
unsigned long cr0;
unsigned long cr0_guest_owned_bits;
diff --git a/arch/x86/kvm/kvm_cache_regs.h b/arch/x86/kvm/kvm_cache_regs.h
index 171e6bc2e169..2ae492ad6412 100644
--- a/arch/x86/kvm/kvm_cache_regs.h
+++ b/arch/x86/kvm/kvm_cache_regs.h
@@ -67,29 +67,29 @@ static inline bool kvm_register_is_available(struct kvm_vcpu *vcpu,
enum kvm_reg reg)
{
kvm_assert_register_caching_allowed(vcpu);
- return test_bit(reg, (unsigned long *)&vcpu->arch.regs_avail);
+ return test_bit(reg, vcpu->arch.regs_avail);
}
static inline bool kvm_register_is_dirty(struct kvm_vcpu *vcpu,
enum kvm_reg reg)
{
kvm_assert_register_caching_allowed(vcpu);
- return test_bit(reg, (unsigned long *)&vcpu->arch.regs_dirty);
+ return test_bit(reg, vcpu->arch.regs_dirty);
}
static inline void kvm_register_mark_available(struct kvm_vcpu *vcpu,
enum kvm_reg reg)
{
kvm_assert_register_caching_allowed(vcpu);
- __set_bit(reg, (unsigned long *)&vcpu->arch.regs_avail);
+ __set_bit(reg, vcpu->arch.regs_avail);
}
static inline void kvm_register_mark_dirty(struct kvm_vcpu *vcpu,
enum kvm_reg reg)
{
kvm_assert_register_caching_allowed(vcpu);
- __set_bit(reg, (unsigned long *)&vcpu->arch.regs_avail);
- __set_bit(reg, (unsigned long *)&vcpu->arch.regs_dirty);
+ __set_bit(reg, vcpu->arch.regs_avail);
+ __set_bit(reg, vcpu->arch.regs_dirty);
}
/*
@@ -102,12 +102,15 @@ static __always_inline bool kvm_register_test_and_mark_available(struct kvm_vcpu
enum kvm_reg reg)
{
kvm_assert_register_caching_allowed(vcpu);
- return arch___test_and_set_bit(reg, (unsigned long *)&vcpu->arch.regs_avail);
+ return arch___test_and_set_bit(reg, vcpu->arch.regs_avail);
}
static __always_inline void kvm_clear_available_registers(struct kvm_vcpu *vcpu,
unsigned long clear_mask)
{
+ BUILD_BUG_ON(sizeof(clear_mask) != sizeof(vcpu->arch.regs_avail[0]));
+ BUILD_BUG_ON(ARRAY_SIZE(vcpu->arch.regs_avail) != 1);
+
/*
* Note the bitwise-AND! In practice, a straight write would also work
* as KVM initializes the mask to all ones and never clears registers
@@ -115,12 +118,13 @@ static __always_inline void kvm_clear_available_registers(struct kvm_vcpu *vcpu,
* sanity checking as incorrectly marking an eagerly sync'd register
* unavailable will generate a WARN due to an unexpected cache request.
*/
- vcpu->arch.regs_avail &= ~clear_mask;
+ vcpu->arch.regs_avail[0] &= ~clear_mask;
}
static __always_inline void kvm_reset_dirty_registers(struct kvm_vcpu *vcpu)
{
- vcpu->arch.regs_dirty = 0;
+ BUILD_BUG_ON(ARRAY_SIZE(vcpu->arch.regs_dirty) != 1);
+ vcpu->arch.regs_dirty[0] = 0;
}
/*
diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
index ac05cc289b56..b8a91feec8e1 100644
--- a/arch/x86/kvm/x86.c
+++ b/arch/x86/kvm/x86.c
@@ -12836,8 +12836,8 @@ int kvm_arch_vcpu_create(struct kvm_vcpu *vcpu)
int r;
vcpu->arch.last_vmentry_cpu = -1;
- vcpu->arch.regs_avail = ~0;
- vcpu->arch.regs_dirty = ~0;
+ bitmap_fill(vcpu->arch.regs_avail, NR_VCPU_TOTAL_REGS);
+ bitmap_fill(vcpu->arch.regs_dirty, NR_VCPU_TOTAL_REGS);
kvm_gpc_init(&vcpu->arch.pv_time, vcpu->kvm);
--
2.53.0.1213.gd9a14994de-goog
^ permalink raw reply related
* [PATCH v2 5/6] KVM: x86: Track available/dirty register masks as "unsigned long" values
From: Sean Christopherson @ 2026-04-09 22:42 UTC (permalink / raw)
To: Sean Christopherson, Paolo Bonzini, Kiryl Shutsemau
Cc: kvm, x86, linux-coco, linux-kernel, Chang S . Bae
In-Reply-To: <20260409224236.2021562-1-seanjc@google.com>
Convert regs_{avail,dirty} and all related masks to "unsigned long" values
as an intermediate step towards declaring the fields as actual bitmaps, and
as a step toward support APX, which will push the total number of registers
beyond 32 on 64-bit kernels.
Opportunistically convert TDX's ULL bitmask to a UL to match everything
else (TDX is 64-bit only, so it's a nop in the end).
No functional change intended.
Signed-off-by: Sean Christopherson <seanjc@google.com>
---
arch/x86/include/asm/kvm_host.h | 4 ++--
arch/x86/kvm/kvm_cache_regs.h | 2 +-
arch/x86/kvm/svm/svm.h | 2 +-
arch/x86/kvm/vmx/tdx.c | 36 ++++++++++++++++-----------------
arch/x86/kvm/vmx/vmx.h | 20 +++++++++---------
5 files changed, 32 insertions(+), 32 deletions(-)
diff --git a/arch/x86/include/asm/kvm_host.h b/arch/x86/include/asm/kvm_host.h
index b1eae1e7b04f..c47eb294c066 100644
--- a/arch/x86/include/asm/kvm_host.h
+++ b/arch/x86/include/asm/kvm_host.h
@@ -802,8 +802,8 @@ struct kvm_vcpu_arch {
*/
unsigned long regs[NR_VCPU_GENERAL_PURPOSE_REGS];
unsigned long rip;
- u32 regs_avail;
- u32 regs_dirty;
+ unsigned long regs_avail;
+ unsigned long regs_dirty;
unsigned long cr0;
unsigned long cr0_guest_owned_bits;
diff --git a/arch/x86/kvm/kvm_cache_regs.h b/arch/x86/kvm/kvm_cache_regs.h
index 7f71d468178c..171e6bc2e169 100644
--- a/arch/x86/kvm/kvm_cache_regs.h
+++ b/arch/x86/kvm/kvm_cache_regs.h
@@ -106,7 +106,7 @@ static __always_inline bool kvm_register_test_and_mark_available(struct kvm_vcpu
}
static __always_inline void kvm_clear_available_registers(struct kvm_vcpu *vcpu,
- u32 clear_mask)
+ unsigned long clear_mask)
{
/*
* Note the bitwise-AND! In practice, a straight write would also work
diff --git a/arch/x86/kvm/svm/svm.h b/arch/x86/kvm/svm/svm.h
index 677d268ae9c7..7b46a3f13de1 100644
--- a/arch/x86/kvm/svm/svm.h
+++ b/arch/x86/kvm/svm/svm.h
@@ -474,7 +474,7 @@ static inline bool svm_is_vmrun_failure(u64 exit_code)
* KVM_REQ_LOAD_MMU_PGD is always requested when the cached vcpu->arch.cr3
* is changed. svm_load_mmu_pgd() then syncs the new CR3 value into the VMCB.
*/
-#define SVM_REGS_LAZY_LOAD_SET (1 << VCPU_REG_PDPTR)
+#define SVM_REGS_LAZY_LOAD_SET (BIT(VCPU_REG_PDPTR))
static inline void __vmcb_set_intercept(unsigned long *intercepts, u32 bit)
{
diff --git a/arch/x86/kvm/vmx/tdx.c b/arch/x86/kvm/vmx/tdx.c
index c9ab7902151f..85f28363e4cc 100644
--- a/arch/x86/kvm/vmx/tdx.c
+++ b/arch/x86/kvm/vmx/tdx.c
@@ -1013,23 +1013,23 @@ static fastpath_t tdx_exit_handlers_fastpath(struct kvm_vcpu *vcpu)
return EXIT_FASTPATH_NONE;
}
-#define TDX_REGS_AVAIL_SET (BIT_ULL(VCPU_REG_EXIT_INFO_1) | \
- BIT_ULL(VCPU_REG_EXIT_INFO_2) | \
- BIT_ULL(VCPU_REGS_RAX) | \
- BIT_ULL(VCPU_REGS_RBX) | \
- BIT_ULL(VCPU_REGS_RCX) | \
- BIT_ULL(VCPU_REGS_RDX) | \
- BIT_ULL(VCPU_REGS_RBP) | \
- BIT_ULL(VCPU_REGS_RSI) | \
- BIT_ULL(VCPU_REGS_RDI) | \
- BIT_ULL(VCPU_REGS_R8) | \
- BIT_ULL(VCPU_REGS_R9) | \
- BIT_ULL(VCPU_REGS_R10) | \
- BIT_ULL(VCPU_REGS_R11) | \
- BIT_ULL(VCPU_REGS_R12) | \
- BIT_ULL(VCPU_REGS_R13) | \
- BIT_ULL(VCPU_REGS_R14) | \
- BIT_ULL(VCPU_REGS_R15))
+#define TDX_REGS_AVAIL_SET (BIT(VCPU_REG_EXIT_INFO_1) | \
+ BIT(VCPU_REG_EXIT_INFO_2) | \
+ BIT(VCPU_REGS_RAX) | \
+ BIT(VCPU_REGS_RBX) | \
+ BIT(VCPU_REGS_RCX) | \
+ BIT(VCPU_REGS_RDX) | \
+ BIT(VCPU_REGS_RBP) | \
+ BIT(VCPU_REGS_RSI) | \
+ BIT(VCPU_REGS_RDI) | \
+ BIT(VCPU_REGS_R8) | \
+ BIT(VCPU_REGS_R9) | \
+ BIT(VCPU_REGS_R10) | \
+ BIT(VCPU_REGS_R11) | \
+ BIT(VCPU_REGS_R12) | \
+ BIT(VCPU_REGS_R13) | \
+ BIT(VCPU_REGS_R14) | \
+ BIT(VCPU_REGS_R15))
static void tdx_load_host_xsave_state(struct kvm_vcpu *vcpu)
{
@@ -1098,7 +1098,7 @@ fastpath_t tdx_vcpu_run(struct kvm_vcpu *vcpu, u64 run_flags)
tdx_load_host_xsave_state(vcpu);
- kvm_clear_available_registers(vcpu, ~(u32)TDX_REGS_AVAIL_SET);
+ kvm_clear_available_registers(vcpu, ~TDX_REGS_AVAIL_SET);
if (unlikely(tdx->vp_enter_ret == EXIT_REASON_EPT_MISCONFIG))
return EXIT_FASTPATH_NONE;
diff --git a/arch/x86/kvm/vmx/vmx.h b/arch/x86/kvm/vmx/vmx.h
index 9fb76ea48caf..48447fa983f4 100644
--- a/arch/x86/kvm/vmx/vmx.h
+++ b/arch/x86/kvm/vmx/vmx.h
@@ -620,16 +620,16 @@ BUILD_CONTROLS_SHADOW(tertiary_exec, TERTIARY_VM_EXEC_CONTROL, 64)
* cache on demand. Other registers not listed here are synced to
* the cache immediately after VM-Exit.
*/
-#define VMX_REGS_LAZY_LOAD_SET ((1 << VCPU_REG_RIP) | \
- (1 << VCPU_REGS_RSP) | \
- (1 << VCPU_REG_RFLAGS) | \
- (1 << VCPU_REG_PDPTR) | \
- (1 << VCPU_REG_SEGMENTS) | \
- (1 << VCPU_REG_CR0) | \
- (1 << VCPU_REG_CR3) | \
- (1 << VCPU_REG_CR4) | \
- (1 << VCPU_REG_EXIT_INFO_1) | \
- (1 << VCPU_REG_EXIT_INFO_2))
+#define VMX_REGS_LAZY_LOAD_SET (BIT(VCPU_REGS_RSP) | \
+ BIT(VCPU_REG_RIP) | \
+ BIT(VCPU_REG_RFLAGS) | \
+ BIT(VCPU_REG_PDPTR) | \
+ BIT(VCPU_REG_SEGMENTS) | \
+ BIT(VCPU_REG_CR0) | \
+ BIT(VCPU_REG_CR3) | \
+ BIT(VCPU_REG_CR4) | \
+ BIT(VCPU_REG_EXIT_INFO_1) | \
+ BIT(VCPU_REG_EXIT_INFO_2))
static inline unsigned long vmx_l1_guest_owned_cr0_bits(void)
{
--
2.53.0.1213.gd9a14994de-goog
^ permalink raw reply related
page: next (older) | prev (newer) | latest
- recent:[subjects (threaded)|topics (new)|topics (active)]
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox