Linux Confidential Computing Development

Linux Confidential Computing Development
 help / color / mirror / Atom feed

* Re: [PATCH v14 07/44] arm64: RMI: Configure the RMM with the host's page size
From: Suzuki K Poulose @ 2026-05-21 14:53 UTC (permalink / raw)
  To: Marc Zyngier, Steven Price
  Cc: kvm, kvmarm, Catalin Marinas, Will Deacon, James Morse,
	Oliver Upton, Zenghui Yu, linux-arm-kernel, linux-kernel,
	Joey Gouly, Alexandru Elisei, Christoffer Dall, Fuad Tabba,
	linux-coco, Ganapatrao Kulkarni, Gavin Shan, Shanker Donthineni,
	Alper Gun, Aneesh Kumar K . V, Emi Kisanuki, Vishal Annapurve,
	WeiLin.Chang, Lorenzo.Pieralisi2
In-Reply-To: <86a4tsx536.wl-maz@kernel.org>

On 21/05/2026 14:30, Marc Zyngier wrote:
> On Wed, 13 May 2026 14:17:15 +0100,
> Steven Price <steven.price@arm.com> wrote:
>>
>> RMM v2.0 brings the ability to set the RMM's granule size. Check the
>> feature registers and configure the RMM so that it matches the host's
>> page size. This means that operations can be done with a granulatity
>> equal to PAGE_SIZE.
>>
>> Signed-off-by: Steven Price <steven.price@arm.com>
>> ---
>> Changes since v13:
>>   * Moved out of KVM.
>> ---
>>   arch/arm64/kernel/rmi.c | 42 +++++++++++++++++++++++++++++++++++++++++
>>   1 file changed, 42 insertions(+)
>>
>> diff --git a/arch/arm64/kernel/rmi.c b/arch/arm64/kernel/rmi.c
>> index 99c1ccc35c11..a14ead5dedda 100644
>> --- a/arch/arm64/kernel/rmi.c
>> +++ b/arch/arm64/kernel/rmi.c
>> @@ -49,6 +49,45 @@ static int rmi_check_version(void)
>>   	return 0;
>>   }
>>   
>> +static int rmi_configure(void)
>> +{
>> +	struct rmm_config *config __free(free_page) = NULL;
>> +	unsigned long ret;
>> +
>> +	config = (struct rmm_config *)get_zeroed_page(GFP_KERNEL);
>> +	if (!config)
>> +		return -ENOMEM;
> 
> This is the sort of buggy construct that is highlighted in
> include/linux/cleanup.h: initialising the object for cleanup with
> NULL, and only later assigning the expected value.
> 
> It may not matter here, but it will catch you (or more probably me) in
> the future.
> 
>> +
>> +	switch (PAGE_SIZE) {
>> +	case SZ_4K:
>> +		config->rmi_granule_size = RMI_GRANULE_SIZE_4KB;
>> +		break;
>> +	case SZ_16K:
>> +		config->rmi_granule_size = RMI_GRANULE_SIZE_16KB;
>> +		break;
>> +	case SZ_64K:
>> +		config->rmi_granule_size = RMI_GRANULE_SIZE_64KB;
>> +		break;
>> +	default:
>> +		pr_err("Unsupported PAGE_SIZE for RMM\n");
> 
> Do you really anticipate PAGE_SIZE being any other value? This is 100%
> dead code. If you want to be extra cautious, have a BUILD_BUg_ON().
> 
>> +		return -EINVAL;
>> +	}
>> +
>> +	ret = rmi_rmm_config_set(virt_to_phys(config));
>> +	if (ret) {
>> +		pr_err("RMM config set failed\n");
>> +		return -EINVAL;
>> +	}
> 
> What is the live cycle of the page when the call succeeds? Is it
> switched back to the NS PAS and allowed to be freed?

It always remains in the NS world. We relay some information in the
NS PAS page, which the RMM consumes. The checks are performed on
the values consumed by the RMM.

Kind regards
Suzuki

> 
>> +
>> +	ret = rmi_rmm_activate();
>> +	if (ret) {
>> +		pr_err("RMM activate failed\n");
>> +		return -ENXIO;
>> +	}
>> +
>> +	return 0;
>> +}
>> +
>>   static int __init arm64_init_rmi(void)
>>   {
>>   	/* Continue without realm support if we can't agree on a version */
>> @@ -60,6 +99,9 @@ static int __init arm64_init_rmi(void)
>>   	if (WARN_ON(rmi_features(1, &rmm_feat_reg1)))
>>   		return 0;
>>   
>> +	if (rmi_configure())
>> +		return 0;
>> +
>>   	return 0;
>>   }
>>   subsys_initcall(arm64_init_rmi);
> 
> Thanks,
> 
> 	M.
> 


^ permalink raw reply

* Re: [PATCH v14 01/44] kvm: arm64: Include kvm_emulate.h in kvm/arm_psci.h
From: Steven Price @ 2026-05-21 15:11 UTC (permalink / raw)
  To: Marc Zyngier
  Cc: kvm, kvmarm, Suzuki K Poulose, Catalin Marinas, Will Deacon,
	James Morse, Oliver Upton, Zenghui Yu, linux-arm-kernel,
	linux-kernel, Joey Gouly, Alexandru Elisei, Christoffer Dall,
	Fuad Tabba, linux-coco, Ganapatrao Kulkarni, Gavin Shan,
	Shanker Donthineni, Alper Gun, Aneesh Kumar K . V, Emi Kisanuki,
	Vishal Annapurve, WeiLin.Chang, Lorenzo.Pieralisi2
In-Reply-To: <86jysxvze2.wl-maz@kernel.org>

On 21/05/2026 11:19, Marc Zyngier wrote:
> On Wed, 13 May 2026 14:17:09 +0100,
> Steven Price <steven.price@arm.com> wrote:
>>
>> From: Suzuki K Poulose <suzuki.poulose@arm.com>
>>
>> Fix a potential build error (like below, when asm/kvm_emulate.h gets
>> included after the kvm/arm_psci.h) by including the missing header file
>> in kvm/arm_psci.h:
>>
>> ./include/kvm/arm_psci.h: In function ‘kvm_psci_version’:
>> ./include/kvm/arm_psci.h:29:13: error: implicit declaration of function
>>    ‘vcpu_has_feature’; did you mean ‘cpu_have_feature’? [-Werror=implicit-function-declaration]
>>    29 |         if (vcpu_has_feature(vcpu, KVM_ARM_VCPU_PSCI_0_2)) {
>> 	         |             ^~~~~~~~~~~~~~~~
>> 			       |             cpu_have_feature
>>
>> Reviewed-by: Gavin Shan <gshan@redhat.com>
>> Signed-off-by: Suzuki K Poulose <suzuki.poulose@arm.com>
>> Signed-off-by: Steven Price <steven.price@arm.com>
> 
> Unrelated to this patch, but really easy to fix: the standard prefix
> for patches targeting KVM/arm64 is:
> 
> "KVM: arm64: [opt subsys:] Something starting with a capital letter"
> 
> where "opt subsys" could be "CCA" where applicable.
> 
> It'd be good to have some consistency.

Sure, I think back when I started this there wasn't great consistency so
I picked up something from git log. I'm happy to change this for the
next posting.

Thanks,
Steve

> 
> Thanks,
> 
> 	M.
> 


^ permalink raw reply

* Re: [PATCH v14 02/44] kvm: arm64: Avoid including linux/kvm_host.h in kvm_pgtable.h
From: Steven Price @ 2026-05-21 15:11 UTC (permalink / raw)
  To: Marc Zyngier
  Cc: kvm, kvmarm, Catalin Marinas, Will Deacon, James Morse,
	Oliver Upton, Suzuki K Poulose, Zenghui Yu, linux-arm-kernel,
	linux-kernel, Joey Gouly, Alexandru Elisei, Christoffer Dall,
	Fuad Tabba, linux-coco, Ganapatrao Kulkarni, Gavin Shan,
	Shanker Donthineni, Alper Gun, Aneesh Kumar K . V, Emi Kisanuki,
	Vishal Annapurve, WeiLin.Chang, Lorenzo.Pieralisi2
In-Reply-To: <86ik8hvz2f.wl-maz@kernel.org>

On 21/05/2026 11:26, Marc Zyngier wrote:
> On Wed, 13 May 2026 14:17:10 +0100,
> Steven Price <steven.price@arm.com> wrote:
>>
>> To avoid future include cycles, drop the linux/kvm_host.h include in
>> kvm_pgtable.h and include two _types.h headers for the types that are
>> actually used. Additionally provide a forward declaration for struct
>> kvm_s2_mmu as it's only used as a pointer in this file.
>>
>> Both pgtable.c and kvm_pkvm.h relied on the indirect inclusion of
>> kvm_host.h, so make that explicit.
>>
>> Signed-off-by: Steven Price <steven.price@arm.com>
>> ---
>> New patch in v13
>> ---
>>  arch/arm64/include/asm/kvm_pgtable.h | 5 ++++-
>>  arch/arm64/include/asm/kvm_pkvm.h    | 2 +-
>>  arch/arm64/kvm/hyp/pgtable.c         | 1 +
>>  3 files changed, 6 insertions(+), 2 deletions(-)
>>
>> diff --git a/arch/arm64/include/asm/kvm_pgtable.h b/arch/arm64/include/asm/kvm_pgtable.h
>> index 41a8687938eb..e4770ce2ccf6 100644
>> --- a/arch/arm64/include/asm/kvm_pgtable.h
>> +++ b/arch/arm64/include/asm/kvm_pgtable.h
>> @@ -8,9 +8,12 @@
>>  #define __ARM64_KVM_PGTABLE_H__
>>  
>>  #include <linux/bits.h>
>> -#include <linux/kvm_host.h>
>> +#include <linux/kvm_types.h>
>> +#include <linux/rbtree_types.h>
> 
> I'm surprised by this. Where is the rbtree_type.h requirement coming
> from?

struct kvm_pgtable has a "struct rb_root_cached" for pkvm_mappings.
There's definitely an argument that that's a bit ugly - but this seemed
the cleanest fix from a include perspective.

Thanks,
Steve

> 
> Thanks,
> 
> 	M.
> 


^ permalink raw reply

* Re: [PATCH v14 03/44] arm64: RME: Handle Granule Protection Faults (GPFs)
From: Steven Price @ 2026-05-21 15:15 UTC (permalink / raw)
  To: Marc Zyngier
  Cc: kvm, kvmarm, Catalin Marinas, Will Deacon, James Morse,
	Oliver Upton, Suzuki K Poulose, Zenghui Yu, linux-arm-kernel,
	linux-kernel, Joey Gouly, Alexandru Elisei, Christoffer Dall,
	Fuad Tabba, linux-coco, Ganapatrao Kulkarni, Gavin Shan,
	Shanker Donthineni, Alper Gun, Aneesh Kumar K . V, Emi Kisanuki,
	Vishal Annapurve, WeiLin.Chang, Lorenzo.Pieralisi2
In-Reply-To: <86fr3lvtk3.wl-maz@kernel.org>

On 21/05/2026 13:25, Marc Zyngier wrote:
> On Wed, 13 May 2026 14:17:11 +0100,
> Steven Price <steven.price@arm.com> wrote:
>>
>> If the host attempts to access granules that have been delegated for use
>> in a realm these accesses will be caught and will trigger a Granule
>> Protection Fault (GPF).
>>
>> A fault during a page walk signals a bug in the kernel and is handled by
>> oopsing the kernel. A non-page walk fault could be caused by user space
>> having access to a page which has been delegated to the kernel and will
>> trigger a SIGBUS to allow debugging why user space is trying to access a
>> delegated page.
>>
>> Reviewed-by: Suzuki K Poulose <suzuki.poulose@arm.com>
>> Reviewed-by: Gavin Shan <gshan@redhat.com>
>> Signed-off-by: Steven Price <steven.price@arm.com>
>> ---
>> Changes since v10:
>>  * Don't call arm64_notify_die() in do_gpf() but simply return 1.
>> Changes since v2:
>>  * Include missing "Granule Protection Fault at level -1"
>> ---
>>  arch/arm64/mm/fault.c | 28 ++++++++++++++++++++++------
>>  1 file changed, 22 insertions(+), 6 deletions(-)
>>
>> diff --git a/arch/arm64/mm/fault.c b/arch/arm64/mm/fault.c
>> index 0f3c5c7ca054..6358ea4787ba 100644
>> --- a/arch/arm64/mm/fault.c
>> +++ b/arch/arm64/mm/fault.c
>> @@ -905,6 +905,22 @@ static int do_tag_check_fault(unsigned long far, unsigned long esr,
>>  	return 0;
>>  }
>>  
>> +static int do_gpf_ptw(unsigned long far, unsigned long esr, struct pt_regs *regs)
>> +{
>> +	const struct fault_info *inf = esr_to_fault_info(esr);
>> +
>> +	die_kernel_fault(inf->name, far, esr, regs);
>> +	return 0;
>> +}
>> +
>> +static int do_gpf(unsigned long far, unsigned long esr, struct pt_regs *regs)
>> +{
>> +	if (!is_el1_instruction_abort(esr) && fixup_exception(regs, esr))
>> +		return 0;
>> +
>> +	return 1;
>> +}
>> +
>>  static const struct fault_info fault_info[] = {
>>  	{ do_bad,		SIGKILL, SI_KERNEL,	"ttbr address size fault"	},
>>  	{ do_bad,		SIGKILL, SI_KERNEL,	"level 1 address size fault"	},
>> @@ -941,12 +957,12 @@ static const struct fault_info fault_info[] = {
>>  	{ do_bad,		SIGKILL, SI_KERNEL,	"unknown 32"			},
>>  	{ do_alignment_fault,	SIGBUS,  BUS_ADRALN,	"alignment fault"		},
>>  	{ do_bad,		SIGKILL, SI_KERNEL,	"unknown 34"			},
>> -	{ do_bad,		SIGKILL, SI_KERNEL,	"unknown 35"			},
>> -	{ do_bad,		SIGKILL, SI_KERNEL,	"unknown 36"			},
>> -	{ do_bad,		SIGKILL, SI_KERNEL,	"unknown 37"			},
>> -	{ do_bad,		SIGKILL, SI_KERNEL,	"unknown 38"			},
>> -	{ do_bad,		SIGKILL, SI_KERNEL,	"unknown 39"			},
>> -	{ do_bad,		SIGKILL, SI_KERNEL,	"unknown 40"			},
>> +	{ do_gpf_ptw,		SIGKILL, SI_KERNEL,	"Granule Protection Fault at level -1" },
>> +	{ do_gpf_ptw,		SIGKILL, SI_KERNEL,	"Granule Protection Fault at level 0" },
>> +	{ do_gpf_ptw,		SIGKILL, SI_KERNEL,	"Granule Protection Fault at level 1" },
>> +	{ do_gpf_ptw,		SIGKILL, SI_KERNEL,	"Granule Protection Fault at level 2" },
>> +	{ do_gpf_ptw,		SIGKILL, SI_KERNEL,	"Granule Protection Fault at level 3" },
>> +	{ do_gpf,		SIGBUS,  SI_KERNEL,	"Granule Protection Fault not on table walk" },
> 
> It wouldn't hurt to align the textual description with what we have
> for other fault syndromes:
> 
> 	"level X granule protection fault (translation table walk)"
> 
> for the PTW-trigger faults, and
> 
> 	"granule protection fault"
> 
> for the non PTW case.

Sure, no problem.

Thanks,
Steve

> 
> Thanks,
> 
> 	M.
> 


^ permalink raw reply

* Re: [PATCH v14 04/44] arm64: RMI: Add SMC definitions for calling the RMM
From: Steven Price @ 2026-05-21 15:33 UTC (permalink / raw)
  To: Marc Zyngier
  Cc: kvm, kvmarm, Catalin Marinas, Will Deacon, James Morse,
	Oliver Upton, Suzuki K Poulose, Zenghui Yu, linux-arm-kernel,
	linux-kernel, Joey Gouly, Alexandru Elisei, Christoffer Dall,
	Fuad Tabba, linux-coco, Ganapatrao Kulkarni, Gavin Shan,
	Shanker Donthineni, Alper Gun, Aneesh Kumar K . V, Emi Kisanuki,
	Vishal Annapurve, WeiLin.Chang, Lorenzo.Pieralisi2
In-Reply-To: <86ecj5vsu4.wl-maz@kernel.org>

On 21/05/2026 13:40, Marc Zyngier wrote:
> On Wed, 13 May 2026 14:17:12 +0100,
> Steven Price <steven.price@arm.com> wrote:
>>
>> The RMM (Realm Management Monitor) provides functionality that can be
>> accessed by SMC calls from the host.
>>
>> The SMC definitions are based on DEN0137[1] version 2.0-bet1
>>
>> [1] https://developer.arm.com/documentation/den0137/2-0bet1/
>>
>> Signed-off-by: Steven Price <steven.price@arm.com>
>> ---
>> Changes since v13:
>>  * Updated to RMM spec v2.0-bet1
>> Changes since v12:
>>  * Updated to RMM spec v2.0-bet0
>> Changes since v9:
>>  * Corrected size of 'ripas_value' in struct rec_exit. The spec states
>>    this is an 8-bit type with padding afterwards (rather than a u64).
>> Changes since v8:
>>  * Added RMI_PERMITTED_GICV3_HCR_BITS to define which bits the RMM
>>    permits to be modified.
>> Changes since v6:
>>  * Renamed REC_ENTER_xxx defines to include 'FLAG' to make it obvious
>>    these are flag values.
>> Changes since v5:
>>  * Sorted the SMC #defines by value.
>>  * Renamed SMI_RxI_CALL to SMI_RMI_CALL since the macro is only used for
>>    RMI calls.
>>  * Renamed REC_GIC_NUM_LRS to REC_MAX_GIC_NUM_LRS since the actual
>>    number of available list registers could be lower.
>>  * Provided a define for the reserved fields of FeatureRegister0.
>>  * Fix inconsistent names for padding fields.
>> Changes since v4:
>>  * Update to point to final released RMM spec.
>>  * Minor rearrangements.
>> Changes since v3:
>>  * Update to match RMM spec v1.0-rel0-rc1.
>> Changes since v2:
>>  * Fix specification link.
>>  * Rename rec_entry->rec_enter to match spec.
>>  * Fix size of pmu_ovf_status to match spec.
>> ---
>>  arch/arm64/include/asm/rmi_smc.h | 448 +++++++++++++++++++++++++++++++
>>  1 file changed, 448 insertions(+)
>>  create mode 100644 arch/arm64/include/asm/rmi_smc.h
>>
>> diff --git a/arch/arm64/include/asm/rmi_smc.h b/arch/arm64/include/asm/rmi_smc.h
>> new file mode 100644
>> index 000000000000..a09b7a631fef
>> --- /dev/null
>> +++ b/arch/arm64/include/asm/rmi_smc.h
>> @@ -0,0 +1,448 @@
>> +/* SPDX-License-Identifier: GPL-2.0 */
>> +/*
>> + * Copyright (C) 2023-2026 ARM Ltd.
>> + *
>> + * The values and structures in this file are from the Realm Management Monitor
>> + * specification (DEN0137) version 2.0-bet1:
>> + * https://developer.arm.com/documentation/den0137/2-0bet1/
> 
> How long is this spec going to be available on the ARM web site, which
> has a tendency of being reorganised every other week? And there is
> already a beta2.

Obviously I can't predict the next reorganisation - but at least it's a
link that could be fed into archive.org or similar.

There is a beta2 - that was released just after this series. I'll
obviously be updating to that shortly. Sadly the spec is still a bit of
a moving target, but hopefully all the major changes have already happened.

> 
>> + */
>> +
>> +#ifndef __ASM_RMI_SMC_H
>> +#define __ASM_RMI_SMC_H
>> +
>> +#include <linux/arm-smccc.h>
>> +
>> +#define SMC_RMI_CALL(func)				\
>> +	ARM_SMCCC_CALL_VAL(ARM_SMCCC_FAST_CALL,		\
>> +			   ARM_SMCCC_SMC_64,		\
>> +			   ARM_SMCCC_OWNER_STANDARD,	\
>> +			   (func))
>> +
>> +#define SMC_RMI_VERSION				SMC_RMI_CALL(0x0150)
>> +
>> +#define SMC_RMI_RTT_DATA_MAP_INIT		SMC_RMI_CALL(0x0153)
>> +
>> +#define SMC_RMI_REALM_ACTIVATE			SMC_RMI_CALL(0x0157)
>> +#define SMC_RMI_REALM_CREATE			SMC_RMI_CALL(0x0158)
>> +#define SMC_RMI_REALM_DESTROY			SMC_RMI_CALL(0x0159)
>> +#define SMC_RMI_REC_CREATE			SMC_RMI_CALL(0x015a)
>> +#define SMC_RMI_REC_DESTROY			SMC_RMI_CALL(0x015b)
>> +#define SMC_RMI_REC_ENTER			SMC_RMI_CALL(0x015c)
>> +#define SMC_RMI_RTT_CREATE			SMC_RMI_CALL(0x015d)
>> +#define SMC_RMI_RTT_DESTROY			SMC_RMI_CALL(0x015e)
>> +
>> +#define SMC_RMI_RTT_READ_ENTRY			SMC_RMI_CALL(0x0161)
>> +
>> +#define SMC_RMI_RTT_DEV_VALIDATE		SMC_RMI_CALL(0x0163)
>> +#define SMC_RMI_PSCI_COMPLETE			SMC_RMI_CALL(0x0164)
>> +#define SMC_RMI_FEATURES			SMC_RMI_CALL(0x0165)
>> +#define SMC_RMI_RTT_FOLD			SMC_RMI_CALL(0x0166)
>> +
>> +#define SMC_RMI_RTT_INIT_RIPAS			SMC_RMI_CALL(0x0168)
>> +#define SMC_RMI_RTT_SET_RIPAS			SMC_RMI_CALL(0x0169)
>> +#define SMC_RMI_VSMMU_CREATE			SMC_RMI_CALL(0x016a)
>> +#define SMC_RMI_VSMMU_DESTROY			SMC_RMI_CALL(0x016b)
>> +#define SMC_RMI_RMM_CONFIG_SET			SMC_RMI_CALL(0x016e)
>> +#define SMC_RMI_PSMMU_IRQ_NOTIFY		SMC_RMI_CALL(0x016f)
>> +
>> +#define SMC_RMI_PDEV_ABORT			SMC_RMI_CALL(0x0174)
>> +#define SMC_RMI_PDEV_COMMUNICATE		SMC_RMI_CALL(0x0175)
>> +#define SMC_RMI_PDEV_CREATE			SMC_RMI_CALL(0x0176)
>> +#define SMC_RMI_PDEV_DESTROY			SMC_RMI_CALL(0x0177)
>> +#define SMC_RMI_PDEV_GET_STATE			SMC_RMI_CALL(0x0178)
>> +
>> +#define SMC_RMI_PDEV_STREAM_KEY_REFRESH		SMC_RMI_CALL(0x017a)
>> +#define SMC_RMI_PDEV_SET_PUBKEY			SMC_RMI_CALL(0x017b)
>> +#define SMC_RMI_PDEV_STOP			SMC_RMI_CALL(0x017c)
>> +#define SMC_RMI_RTT_AUX_CREATE			SMC_RMI_CALL(0x017d)
>> +#define SMC_RMI_RTT_AUX_DESTROY			SMC_RMI_CALL(0x017e)
>> +#define SMC_RMI_RTT_AUX_FOLD			SMC_RMI_CALL(0x017f)
>> +
>> +#define SMC_RMI_VDEV_ABORT			SMC_RMI_CALL(0x0185)
>> +#define SMC_RMI_VDEV_COMMUNICATE		SMC_RMI_CALL(0x0186)
>> +#define SMC_RMI_VDEV_CREATE			SMC_RMI_CALL(0x0187)
>> +#define SMC_RMI_VDEV_DESTROY			SMC_RMI_CALL(0x0188)
>> +#define SMC_RMI_VDEV_GET_STATE			SMC_RMI_CALL(0x0189)
>> +#define SMC_RMI_VDEV_UNLOCK			SMC_RMI_CALL(0x018a)
>> +#define SMC_RMI_RTT_SET_S2AP			SMC_RMI_CALL(0x018b)
>> +#define SMC_RMI_VDEV_COMPLETE			SMC_RMI_CALL(0x018e)
>> +
>> +#define SMC_RMI_VDEV_GET_INTERFACE_REPORT	SMC_RMI_CALL(0x01d0)
>> +#define SMC_RMI_VDEV_GET_MEASUREMENTS		SMC_RMI_CALL(0x01d1)
>> +#define SMC_RMI_VDEV_LOCK			SMC_RMI_CALL(0x01d2)
>> +#define SMC_RMI_VDEV_START			SMC_RMI_CALL(0x01d3)
>> +
>> +#define SMC_RMI_VSMMU_EVENT_NOTIFY		SMC_RMI_CALL(0x01d6)
>> +#define SMC_RMI_PSMMU_ACTIVATE			SMC_RMI_CALL(0x01d7)
>> +#define SMC_RMI_PSMMU_DEACTIVATE		SMC_RMI_CALL(0x01d8)
>> +
>> +#define SMC_RMI_PSMMU_ST_L2_CREATE		SMC_RMI_CALL(0x01db)
>> +#define SMC_RMI_PSMMU_ST_L2_DESTROY		SMC_RMI_CALL(0x01dc)
>> +#define SMC_RMI_DPT_L0_CREATE			SMC_RMI_CALL(0x01dd)
>> +#define SMC_RMI_DPT_L0_DESTROY			SMC_RMI_CALL(0x01de)
>> +#define SMC_RMI_DPT_L1_CREATE			SMC_RMI_CALL(0x01df)
>> +#define SMC_RMI_DPT_L1_DESTROY			SMC_RMI_CALL(0x01e0)
>> +#define SMC_RMI_GRANULE_TRACKING_GET		SMC_RMI_CALL(0x01e1)
>> +
>> +#define SMC_RMI_GRANULE_TRACKING_SET		SMC_RMI_CALL(0x01e3)
>> +
>> +#define SMC_RMI_RMM_CONFIG_GET			SMC_RMI_CALL(0x01ec)
>> +
>> +#define SMC_RMI_RMM_STATE_GET			SMC_RMI_CALL(0x01ee)
>> +
>> +#define SMC_RMI_PSMMU_EVENT_CONSUME		SMC_RMI_CALL(0x01f0)
>> +#define SMC_RMI_GRANULE_RANGE_DELEGATE		SMC_RMI_CALL(0x01f1)
>> +#define SMC_RMI_GRANULE_RANGE_UNDELEGATE	SMC_RMI_CALL(0x01f2)
>> +#define SMC_RMI_GPT_L1_CREATE			SMC_RMI_CALL(0x01f3)
>> +#define SMC_RMI_GPT_L1_DESTROY			SMC_RMI_CALL(0x01f4)
>> +#define SMC_RMI_RTT_DATA_MAP			SMC_RMI_CALL(0x01f5)
>> +#define SMC_RMI_RTT_DATA_UNMAP			SMC_RMI_CALL(0x01f6)
>> +#define SMC_RMI_RTT_DEV_MAP			SMC_RMI_CALL(0x01f7)
>> +#define SMC_RMI_RTT_DEV_UNMAP			SMC_RMI_CALL(0x01f8)
>> +#define SMC_RMI_RTT_ARCH_DEV_MAP		SMC_RMI_CALL(0x01f9)
>> +#define SMC_RMI_RTT_ARCH_DEV_UNMAP		SMC_RMI_CALL(0x01fa)
>> +#define SMC_RMI_RTT_UNPROT_MAP			SMC_RMI_CALL(0x01fb)
>> +#define SMC_RMI_RTT_UNPROT_UNMAP		SMC_RMI_CALL(0x01fc)
>> +#define SMC_RMI_RTT_AUX_PROT_MAP		SMC_RMI_CALL(0x01fd)
>> +#define SMC_RMI_RTT_AUX_PROT_UNMAP		SMC_RMI_CALL(0x01fe)
>> +#define SMC_RMI_RTT_AUX_UNPROT_MAP		SMC_RMI_CALL(0x01ff)
>> +#define SMC_RMI_RTT_AUX_UNPROT_UNMAP		SMC_RMI_CALL(0x0200)
>> +#define SMC_RMI_REALM_TERMINATE			SMC_RMI_CALL(0x0201)
>> +#define SMC_RMI_RMM_ACTIVATE			SMC_RMI_CALL(0x0202)
>> +#define SMC_RMI_OP_CONTINUE			SMC_RMI_CALL(0x0203)
>> +#define SMC_RMI_PDEV_STREAM_CONNECT		SMC_RMI_CALL(0x0204)
>> +#define SMC_RMI_PDEV_STREAM_DISCONNECT		SMC_RMI_CALL(0x0205)
>> +#define SMC_RMI_PDEV_STREAM_COMPLETE		SMC_RMI_CALL(0x0206)
>> +#define SMC_RMI_PDEV_STREAM_KEY_PURGE		SMC_RMI_CALL(0x0207)
>> +#define SMC_RMI_OP_MEM_DONATE			SMC_RMI_CALL(0x0208)
>> +#define SMC_RMI_OP_MEM_RECLAIM			SMC_RMI_CALL(0x0209)
>> +#define SMC_RMI_OP_CANCEL			SMC_RMI_CALL(0x020a)
>> +#define SMC_RMI_VSMMU_FEATURES			SMC_RMI_CALL(0x020b)
>> +#define SMC_RMI_VSMMU_CMD_GET			SMC_RMI_CALL(0x020c)
>> +#define SMC_RMI_VSMMU_CMD_COMPLETE		SMC_RMI_CALL(0x020d)
>> +#define SMC_RMI_PSMMU_INFO			SMC_RMI_CALL(0x020e)
>> +
>> +#define RMI_ABI_MAJOR_VERSION	2
>> +#define RMI_ABI_MINOR_VERSION	0
>> +
>> +#define RMI_ABI_VERSION_GET_MAJOR(version) ((version) >> 16)
>> +#define RMI_ABI_VERSION_GET_MINOR(version) ((version) & 0xFFFF)
>> +#define RMI_ABI_VERSION(major, minor)      (((major) << 16) | (minor))
>> +
>> +#define RMI_UNASSIGNED			0
>> +#define RMI_ASSIGNED			1
>> +#define RMI_TABLE			2
>> +
>> +#define RMI_RETURN_STATUS(ret)		((ret) & 0xFF)
>> +#define RMI_RETURN_INDEX(ret)		(((ret) >> 8) & 0xFF)
>> +#define RMI_RETURN_MEMREQ(ret)		(((ret) >> 8) & 0x3)
>> +#define RMI_RETURN_CAN_CANCEL(ret)	(((ret) >> 10) & 0x1)
> 
> Use FIELD_GET() and specify masks that define the actual fields.

Sure, that makes sense.

>> +
>> +#define RMI_SUCCESS			0
>> +#define RMI_ERROR_INPUT			1
>> +#define RMI_ERROR_REALM			2
>> +#define RMI_ERROR_REC			3
>> +#define RMI_ERROR_RTT			4
>> +#define RMI_ERROR_NOT_SUPPORTED		5
>> +#define RMI_ERROR_DEVICE		6
>> +#define RMI_ERROR_RTT_AUX		7
>> +#define RMI_ERROR_PSMMU_ST		8
>> +#define RMI_ERROR_DPT			9
>> +#define RMI_BUSY			10
>> +#define RMI_ERROR_GLOBAL		11
>> +#define RMI_ERROR_TRACKING		12
>> +#define RMI_INCOMPLETE			13
>> +#define RMI_BLOCKED			14
>> +#define RMI_ERROR_GPT			15
>> +#define RMI_ERROR_GRANULE		16
>> +
>> +#define RMI_OP_MEM_REQ_NONE		0
>> +#define RMI_OP_MEM_REQ_DONATE		1
>> +#define RMI_OP_MEM_REQ_RECLAIM		2
>> +
>> +#define RMI_DONATE_SIZE(req)		((req) & 0x3)
>> +#define RMI_DONATE_COUNT_MASK		GENMASK(15, 2)
>> +#define RMI_DONATE_COUNT(req)		(((req) & RMI_DONATE_COUNT_MASK) >> 2)
>> +#define RMI_DONATE_CONTIG(req)		(!!((req) & BIT(16)))
>> +#define RMI_DONATE_STATE(req)		(!!((req) & BIT(17)))
> 
> FIELD_GET().
> 
>> +
>> +#define RMI_OP_MEM_DELEGATED		0
>> +#define RMI_OP_MEM_UNDELEGATED		1
>> +
>> +#define RMI_ADDR_TYPE_NONE		0
>> +#define RMI_ADDR_TYPE_SINGLE		1
>> +#define RMI_ADDR_TYPE_LIST		2
>> +
>> +#define RMI_ADDR_RANGE_SIZE_MASK	GENMASK(1, 0)
>> +#define RMI_ADDR_RANGE_COUNT_MASK	GENMASK(PAGE_SHIFT - 1, 2)
>> +#define RMI_ADDR_RANGE_ADDR_MASK	(PAGE_MASK & GENMASK(51, 0))
>> +#define RMI_ADDR_RANGE_STATE_MASK	BIT(63)
>> +
>> +#define RMI_ADDR_RANGE_SIZE(ar)		(FIELD_GET(RMI_ADDR_RANGE_SIZE_MASK, \
>> +						   (ar)))
>> +#define RMI_ADDR_RANGE_COUNT(ar)	(FIELD_GET(RMI_ADDR_RANGE_COUNT_MASK, \
>> +						   (ar)))
>> +#define RMI_ADDR_RANGE_ADDR(ar)		((ar) & RMI_ADDR_RANGE_ADDR_MASK)
>> +#define RMI_ADDR_RANGE_STATE(ar)	(FIELD_GET(RMI_ADDR_RANGE_STATE_MASK, \
>> +						   (ar)))
>> +
>> +enum rmi_ripas {
>> +	RMI_EMPTY = 0,
>> +	RMI_RAM = 1,
>> +	RMI_DESTROYED = 2,
>> +	RMI_DEV = 3,
>> +};
>> +
>> +#define RMI_NO_MEASURE_CONTENT	0
>> +#define RMI_MEASURE_CONTENT	1
>> +
>> +#define RMI_FEATURE_REGISTER_0_S2SZ		GENMASK(7, 0)
>> +#define RMI_FEATURE_REGISTER_0_LPA2		BIT(8)
>> +#define RMI_FEATURE_REGISTER_0_SVE		BIT(9)
>> +#define RMI_FEATURE_REGISTER_0_SVE_VL		GENMASK(13, 10)
>> +#define RMI_FEATURE_REGISTER_0_NUM_BPS		GENMASK(19, 14)
>> +#define RMI_FEATURE_REGISTER_0_NUM_WPS		GENMASK(25, 20)
>> +#define RMI_FEATURE_REGISTER_0_PMU		BIT(26)
>> +#define RMI_FEATURE_REGISTER_0_PMU_NUM_CTRS	GENMASK(31, 27)
>> +
>> +#define RMI_FEATURE_REGISTER_1_RMI_GRAN_SZ_4KB	BIT(0)
>> +#define RMI_FEATURE_REGISTER_1_RMI_GRAN_SZ_16KB	BIT(1)
>> +#define RMI_FEATURE_REGISTER_1_RMI_GRAN_SZ_64KB	BIT(2)
>> +#define RMI_FEATURE_REGISTER_1_HASH_SHA_256	BIT(3)
>> +#define RMI_FEATURE_REGISTER_1_HASH_SHA_384	BIT(4)
>> +#define RMI_FEATURE_REGISTER_1_HASH_SHA_512	BIT(5)
>> +#define RMI_FEATURE_REGISTER_1_MAX_RECS_ORDER	GENMASK(9, 6)
>> +#define RMI_FEATURE_REGISTER_1_L0GPTSZ		GENMASK(13, 10)
>> +#define RMI_FEATURE_REGISTER_1_PPS		GENMASK(16, 14)
>> +
>> +#define RMI_FEATURE_REGISTER_2_DA		BIT(0)
>> +#define RMI_FEATURE_REGISTER_2_DA_COH		BIT(1)
>> +#define RMI_FEATURE_REGISTER_2_VSMMU		BIT(2)
>> +#define RMI_FEATURE_REGISTER_2_ATS		BIT(3)
>> +#define RMI_FEATURE_REGISTER_2_MAX_VDEVS_ORDER	GENMASK(7, 4)
>> +#define RMI_FEATURE_REGISTER_2_VDEV_KROU	BIT(8)
>> +#define RMI_FEATURE_REGISTER_2_NON_TEE_STREAM	BIT(9)
>> +
>> +#define RMI_FEATURE_REGISTER_3_MAX_NUM_AUX_PLANES	GENMASK(3, 0)
>> +#define RMI_FEATURE_REGISTER_3_RTT_PLAN			GENMASK(5, 4)
>> +#define RMI_FEATURE_REGISTER_3_RTT_S2AP_INDIRECT	BIT(6)
>> +
>> +#define RMI_FEATURE_REGISTER_4_MEC_COUNT		GENMASK(63, 0)
>> +
>> +#define RMI_MEM_CATEGORY_CONVENTIONAL		0
>> +#define RMI_MEM_CATEGORY_DEV_NCOH		1
>> +#define RMI_MEM_CATEGORY_DEV_COH		2
>> +
>> +#define RMI_TRACKING_RESERVED			0
>> +#define RMI_TRACKING_NONE			1
>> +#define RMI_TRACKING_FINE			2
>> +#define RMI_TRACKING_COARSE			3
>> +
>> +#define RMI_GRANULE_SIZE_4KB	0
>> +#define RMI_GRANULE_SIZE_16KB	1
>> +#define RMI_GRANULE_SIZE_64KB	2
>> +
>> +/*
>> + * Note many of these fields are smaller than u64 but all fields have u64
>> + * alignment, so use u64 to ensure correct alignment.
>> + */
>> +struct rmm_config {
>> +	union { /* 0x0 */
>> +		struct {
>> +			u64 tracking_region_size;
>> +			u64 rmi_granule_size;
>> +		};
>> +		u8 sizer[0x1000];
> 
> SZ_4K?
> 
>> +	};
>> +};
>> +
>> +#define RMI_REALM_PARAM_FLAG_LPA2		BIT(0)
>> +#define RMI_REALM_PARAM_FLAG_SVE		BIT(1)
>> +#define RMI_REALM_PARAM_FLAG_PMU		BIT(2)
>> +
>> +struct realm_params {
>> +	union { /* 0x0 */
>> +		struct {
>> +			u64 flags;
>> +			u64 s2sz;
>> +			u64 sve_vl;
>> +			u64 num_bps;
>> +			u64 num_wps;
>> +			u64 pmu_num_ctrs;
>> +			u64 hash_algo;
>> +			u64 num_aux_planes;
>> +		};
>> +		u8 padding0[0x400];
> 
> SZ_1K? And similarly all over the shop?

I'm a bit less sure that makes the code more readable - these structures
are a bit of a pain because they are somewhat sparse. I've left a
comment where the beginning of each union is, and personally I find it
easier to see 0x0 + 0x400 == 0x400 rather than trying to work out what
SZ_1K is in hex. This is particularly the case in terms of:

> struct rec_params {
> 	union { /* 0x0 */
> 		u64 flags;
> 		u8 padding0[0x100];
> 	};
> 	union { /* 0x100 */
> 		u64 mpidr;
> 		u8 padding1[0x100];
> 	};
> 	union { /* 0x200 */
> 		u64 pc;
> 		u8 padding2[0x100];
> 	};
> 	union { /* 0x300 */
> 		u64 gprs[REC_CREATE_NR_GPRS];
> 		u8 padding3[0xd00];
> 	};
> };

Where 0xd00 doesn't even have a correspoding SZ_ define.

The RMM deals with this with macro magic:

> struct rmi_rec_params {
>         /* Flags */
>         SET_MEMBER_RMI(unsigned long flags, 0, 0x100);  /* Offset 0 */
>         /* MPIDR of the REC */
>         SET_MEMBER_RMI(unsigned long mpidr, 0x100, 0x200);      /* 0x100 */
>         /* Program counter */
>         SET_MEMBER_RMI(unsigned long pc, 0x200, 0x300); /* 0x200 */
>         /* General-purpose registers */
>         SET_MEMBER_RMI(unsigned long gprs[REC_CREATE_NR_GPRS], 0x300, 0x1000); /* 0x300 */
> };

where the offsets are just directly encoded in the macro - but it's not
an especially robust macro and I'm not convinced it's more readable.

I'm happy to hear other suggestions on how to encode this neatly.

> I haven't checked the details of the encodings (life is too short),
> but I wonder how much of this exists as an MRS and could be
> automatically generated?

Automatically generating this would be good - I'm not sure whether we
have a (public) source available to generate from at the moment. I have
tried to methodically work through the spec when updating this file, but
as Gavin has already pointed out there was at least one mistake (in
currently unused definitions) this time.

Thanks,
Steve


^ permalink raw reply

* Re: [PATCH v5 1/2] dma-mapping: introduce DMA_ATTR_CC_SHARED for shared memory
From: Aneesh Kumar K.V @ 2026-05-21 15:35 UTC (permalink / raw)
  To: Jason Gunthorpe
  Cc: Jiri Pirko, dri-devel, linaro-mm-sig, iommu, linux-media,
	sumit.semwal, benjamin.gaignard, Brian.Starkey, jstultz,
	tjmercier, christian.koenig, m.szyprowski, robin.murphy, leon,
	sean.anderson, ptesarik, catalin.marinas, suzuki.poulose,
	steven.price, thomas.lendacky, john.allen, ashish.kalra,
	suravee.suthikulpanit, linux-coco
In-Reply-To: <20260426130531.GF804026@ziepe.ca>

Jason Gunthorpe <jgg@ziepe.ca> writes:

>> > static inline dma_addr_t dma_direct_map_phys(struct device *dev,
>> > 		phys_addr_t phys, size_t size, enum dma_data_direction dir,
>> > 		unsigned long attrs, bool flush)
>> > {
>> > 	dma_addr_t dma_addr;
>> > 
>> > 	if (is_swiotlb_force_bounce(dev)) {
>> > 		if (attrs & (DMA_ATTR_MMIO | DMA_ATTR_REQUIRE_COHERENT))
>> > 			return DMA_MAPPING_ERROR;
>> > 
>> > 		return swiotlb_map(dev, phys, size, dir, attrs);
>> > 	}
>> > 
>> > 	if (attrs & DMA_ATTR_MMIO) {
>> > 		dma_addr = phys;
>> > 		if (unlikely(!dma_capable(dev, dma_addr, size, false, attrs)))
>> > 			goto err_overflow;
>> > 		goto dma_mapped;
>> 
>> I suspect P2P is probably broken on CC because this doesn't make
>> sense..
>
> Actually, I suppose it is fully broken because it will jump to swiotlb
> and then should fail.
>
>> This should flow into the
>> phys_to_dma_unencrypted/phys_to_dma_encrypted block as well AFAICT, it
>> shouldn't just assign phys. Assigning phys to dma on a CC system is
>> always wrong, right?
>> 
>> It is is more like
>> 
>>         /* To be updated, callers should specify MMIO | CC_SHARED instead of
>> 	   * implying it. */
>>         if (attrs & DMA_ATTR_MMIO)
>> 	   attrs |= DMA_ATTR_CC_SHARED;
>
> So no need for this if, we can go directly to marking the MMIO callers
> with DMA_ATTR_CC_SHARED once this is fixed for mmio:
>
>>         if (attrs & DMA_ATTR_CC_SHARED) {
>>  		dma_addr = phys_to_dma_unencrypted(dev, phys);
>>  	} else {
>>  		dma_addr = phys_to_dma_encrypted(dev, phys);
>>  	}
>
> Jasn

I am wondering whether this is better

static inline dma_addr_t dma_direct_map_phys(struct device *dev,
		phys_addr_t phys, size_t size, enum dma_data_direction dir,
		unsigned long attrs, bool flush)
{
	dma_addr_t dma_addr;

	/*
	 * For a device requiring unencrypted DMA, MMIO memory is treated
	 * as shared.
	 */
	if (force_dma_unencrypted(dev) && (attrs & DMA_ATTR_MMIO))
		attrs |= DMA_ATTR_CC_SHARED;

.....

	if (attrs & DMA_ATTR_CC_SHARED)
		dma_addr = phys_to_dma_unencrypted(dev, phys);
	else
		dma_addr = phys_to_dma_encrypted(dev, phys);

	if (attrs & DMA_ATTR_MMIO) {
		if (unlikely(!dma_capable(dev, dma_addr, size, false, attrs)))
			goto err_overflow;
		goto dma_mapped;
	}


....
-aneesh

^ permalink raw reply

* Re: [PATCH v4 07/13] dma-direct: make dma_direct_map_phys() honor DMA_ATTR_CC_SHARED
From: Aneesh Kumar K.V @ 2026-05-21 15:37 UTC (permalink / raw)
  To: iommu, linux-arm-kernel, linux-kernel, linux-coco
  Cc: Robin Murphy, Marek Szyprowski, Will Deacon, Marc Zyngier,
	Steven Price, Suzuki K Poulose, Catalin Marinas, Jiri Pirko,
	Jason Gunthorpe, Mostafa Saleh, Petr Tesarik,
	Alexey Kardashevskiy, Dan Williams, Xu Yilun, linuxppc-dev,
	linux-s390, Madhavan Srinivasan, Michael Ellerman,
	Nicholas Piggin, Christophe Leroy (CS GROUP), Alexander Gordeev,
	Gerald Schaefer, Heiko Carstens, Vasily Gorbik,
	Christian Borntraeger, Sven Schnelle, x86
In-Reply-To: <20260512090408.794195-8-aneesh.kumar@kernel.org>

"Aneesh Kumar K.V (Arm)" <aneesh.kumar@kernel.org> writes:

> diff --git a/kernel/dma/direct.h b/kernel/dma/direct.h
> index e05dc7649366..4e35264ab6f8 100644
> --- a/kernel/dma/direct.h
> +++ b/kernel/dma/direct.h
> @@ -89,36 +89,32 @@ static inline dma_addr_t dma_direct_map_phys(struct device *dev,
>  	dma_addr_t dma_addr;
>  
>  	if (is_swiotlb_force_bounce(dev)) {
> -		if (!(attrs & DMA_ATTR_CC_SHARED)) {
> -			if (attrs & (DMA_ATTR_MMIO | DMA_ATTR_REQUIRE_COHERENT))
> -				return DMA_MAPPING_ERROR;
> +		if (attrs & (DMA_ATTR_MMIO | DMA_ATTR_REQUIRE_COHERENT))
> +			return DMA_MAPPING_ERROR;
>  
> -			return swiotlb_map(dev, phys, size, dir, attrs);
> -		}
> -	} else if (attrs & DMA_ATTR_CC_SHARED) {
> -		return DMA_MAPPING_ERROR;
> +		return swiotlb_map(dev, phys, size, dir, attrs);
>  	}
>  
> -	if (attrs & DMA_ATTR_MMIO) {
> -		dma_addr = phys;
> -		if (unlikely(!dma_capable(dev, dma_addr, size, false, attrs)))
> -			goto err_overflow;
> -	} else if (attrs & DMA_ATTR_CC_SHARED) {
> +	if (attrs & DMA_ATTR_CC_SHARED)
>  		dma_addr = phys_to_dma_unencrypted(dev, phys);
> +	else
> +		dma_addr = phys_to_dma_encrypted(dev, phys);
> +
> +	if (attrs & DMA_ATTR_MMIO) {
>  		if (unlikely(!dma_capable(dev, dma_addr, size, false, attrs)))
>  			goto err_overflow;
> -	} else {
> -		dma_addr = phys_to_dma(dev, phys);
> -		if (unlikely(!dma_capable(dev, dma_addr, size, true, attrs)) ||
> -		    dma_kmalloc_needs_bounce(dev, size, dir)) {
> -			if (is_swiotlb_active(dev) &&
> -			    !(attrs & DMA_ATTR_REQUIRE_COHERENT))
> -				return swiotlb_map(dev, phys, size, dir, attrs);
> +		goto dma_mapped;
> +	}
>  
> -			goto err_overflow;
> -		}
> +	if (unlikely(!dma_capable(dev, dma_addr, size, true, attrs)) ||
> +	    dma_kmalloc_needs_bounce(dev, size, dir)) {
> +		if (is_swiotlb_active(dev) &&
> +		    !(attrs & DMA_ATTR_REQUIRE_COHERENT))
> +			return swiotlb_map(dev, phys, size, dir, attrs);
> +		goto err_overflow;
>  	}
>  
> +dma_mapped:
>  	if (!dev_is_dma_coherent(dev) &&
>  	    !(attrs & (DMA_ATTR_SKIP_CPU_SYNC | DMA_ATTR_MMIO))) {
>  		arch_sync_dma_for_device(phys, size, dir);
> -- 
> 2.43.0

I guess we need this change on top of the above

modified   kernel/dma/direct.h
@@ -88,6 +88,13 @@ static inline dma_addr_t dma_direct_map_phys(struct device *dev,
 {
 	dma_addr_t dma_addr;
 
+	/*
+	 * For a device requiring unencrypted DMA, MMIO memory is treated
+	 * as shared by default.
+	 */
+	if (force_dma_unencrypted(dev) && (attrs & DMA_ATTR_MMIO))
+		attrs |= DMA_ATTR_CC_SHARED;
+
 	if (is_swiotlb_force_bounce(dev)) {
 		if (attrs & (DMA_ATTR_MMIO | DMA_ATTR_REQUIRE_COHERENT))
 			return DMA_MAPPING_ERROR;

^ permalink raw reply

* Re: [PATCH v14 08/44] arm64: RMI: Ensure that the RMM has GPT entries for memory
From: Suzuki K Poulose @ 2026-05-21 15:39 UTC (permalink / raw)
  To: Marc Zyngier, Steven Price
  Cc: kvm, kvmarm, Catalin Marinas, Will Deacon, James Morse,
	Oliver Upton, Zenghui Yu, linux-arm-kernel, linux-kernel,
	Joey Gouly, Alexandru Elisei, Christoffer Dall, Fuad Tabba,
	linux-coco, Ganapatrao Kulkarni, Gavin Shan, Shanker Donthineni,
	Alper Gun, Aneesh Kumar K . V, Emi Kisanuki, Vishal Annapurve,
	WeiLin.Chang, Lorenzo.Pieralisi2
In-Reply-To: <868q9cx4ac.wl-maz@kernel.org>

On 21/05/2026 14:47, Marc Zyngier wrote:
> On Wed, 13 May 2026 14:17:16 +0100,
> Steven Price <steven.price@arm.com> wrote:
>>
>> The RMM maintains the state of all the granules in the system to make
>> sure that the host is abiding by the rules. This state can be maintained
>> at different granularity, per page (TRACKING_FINE) or per region
>> (TRACKING_COARSE). The region size depends on the underlying
>> "RMI_GRANULE_SIZE". For a "coarse" region all pages in the region must
>> be of the same state, this implies we need to have "fine" tracking for
>> DRAM, so that we can delegated individual pages.
>>
>> For now we only support a statically carved out memory for tracking
>> granules for the "fine" regions. This can be extended in the future to
>> allow modifying the tracking granularity and remove the need for a
>> static allocation.
>>
>> Similarly, the firmware may create L0 GPT entries describing the total
>> address space. But if we change the "PAS" (Physical Address Space) of a
>> granule then the firmware may need to create L1 tables to track the PAS
>> at a finer granularity.
>>
>> Note: support is currently missing for SROs which means that if the RMM
>> needs memory donating this will fail (and render CCA unusable in Linux).
>> This effectively means that the L1 GPT tables must be created before
>> Linux starts.
>>
>> Signed-off-by: Steven Price <steven.price@arm.com>
>> ---
>> Changes since v13:
>>   * Moved out of KVM
>> ---
>>   arch/arm64/include/asm/rmi_cmds.h |   2 +
>>   arch/arm64/kernel/rmi.c           | 103 ++++++++++++++++++++++++++++++
>>   2 files changed, 105 insertions(+)
>>
>> diff --git a/arch/arm64/include/asm/rmi_cmds.h b/arch/arm64/include/asm/rmi_cmds.h
>> index 9179934925c5..9078a2920a7c 100644
>> --- a/arch/arm64/include/asm/rmi_cmds.h
>> +++ b/arch/arm64/include/asm/rmi_cmds.h
>> @@ -33,6 +33,8 @@ struct rmi_sro_state {
>>   } while (RMI_RETURN_STATUS(res.a0) == RMI_BUSY ||			\
>>   	 RMI_RETURN_STATUS(res.a0) == RMI_BLOCKED)
>>   
>> +bool rmi_is_available(void);
>> +
>>   unsigned long rmi_sro_execute(struct rmi_sro_state *sro, gfp_t gfp);
>>   void rmi_sro_free(struct rmi_sro_state *sro);
>>   
>> diff --git a/arch/arm64/kernel/rmi.c b/arch/arm64/kernel/rmi.c
>> index a14ead5dedda..52a415e99500 100644
>> --- a/arch/arm64/kernel/rmi.c
>> +++ b/arch/arm64/kernel/rmi.c
>> @@ -7,6 +7,8 @@
>>   
>>   #include <asm/rmi_cmds.h>
>>   
>> +static bool arm64_rmi_is_available;
>> +
>>   unsigned long rmm_feat_reg0;
>>   unsigned long rmm_feat_reg1;
>>   
>> @@ -88,6 +90,102 @@ static int rmi_configure(void)
>>   	return 0;
>>   }
>>   
>> +/*
>> + * For now we set the tracking_region_size to 0 for RMI_RMM_CONFIG_SET().
>> + * TODO: Support other tracking sizes (via Kconfig option).
>> + */
>> +#ifdef CONFIG_PAGE_SIZE_4KB
>> +#define RMM_GRANULE_TRACKING_SIZE	SZ_1G
>> +#elif defined(CONFIG_PAGE_SIZE_16KB)
>> +#define RMM_GRANULE_TRACKING_SIZE	SZ_32M
>> +#elif defined(CONFIG_PAGE_SIZE_64KB)
>> +#define RMM_GRANULE_TRACKING_SIZE	SZ_512M
>> +#endif
> 
> Basically, a level 2 mapping. Which means this whole block really is:
> 
> #define RMM_GRANULE_TRAKING_SIZE	(2 * PAGE_SHIFT - 3)
> 
> (adjust for D128 as needed).

True,

> 
>> +
>> +/*
>> + * Make sure the area is tracked by RMM at FINE granularity.
>> + * We do not support changing the tracking yet.
>> + */
>> +static int rmi_verify_memory_tracking(phys_addr_t start, phys_addr_t end)
>> +{
>> +	while (start < end) {
>> +		unsigned long ret, category, state, next;
>> +
>> +		ret = rmi_granule_tracking_get(start, end, &category, &state, &next);
>> +		if (ret != RMI_SUCCESS ||
>> +		    state != RMI_TRACKING_FINE ||
>> +		    category != RMI_MEM_CATEGORY_CONVENTIONAL) {
>> +			/* TODO: Set granule tracking in this case */
>> +			pr_err("Granule tracking for region isn't fine/conventional: %llx",
>> +			       start);
>> +			return -ENODEV;
> 
> How is this triggered? Do we really need to spam the console with
> this? A PA doesn't mean much, and there is no context (stack trace).

This could be triggered if the RMM doesn't have static carveout
for tracking the DRAM granules. (state != RMI_TRACKING_FINE).
This not worth WARN_ONCE(), we could simply not enable KVM.
We plan to add support for donating memory to the RMM in
the future. (Primarily we don't yet have an RMM implementation
that does dynamic management via SRO. This can be added later
as a separate series)

> 
> If that's not expected, turn this into a WARN_ONCE().




> 
>> +		}
>> +		start = next;
>> +	}
>> +
>> +	return 0;
>> +}
>> +
>> +static unsigned long rmi_l0gpt_size(void)
>> +{
>> +	return 1UL << (30 + FIELD_GET(RMI_FEATURE_REGISTER_1_L0GPTSZ,
>> +				      rmm_feat_reg1));
>> +}
>> +
>> +static int rmi_create_gpts(phys_addr_t start, phys_addr_t end)
>> +{
>> +	unsigned long l0gpt_sz = rmi_l0gpt_size();
>> +
>> +	start = ALIGN_DOWN(start, l0gpt_sz);
>> +	end = ALIGN(end, l0gpt_sz);
>> +
>> +	while (start < end) {
>> +		int ret = rmi_gpt_l1_create(start);
>> +
>> +		/*
>> +		 * Make sure the L1 GPT tables are created for the region.
>> +		 * RMI_ERROR_GPT indicates the L1 table already exists.
>> +		 */
>> +		if (ret && ret != RMI_ERROR_GPT) {
>> +			/*
>> +			 * FIXME: Handle SRO so that memory can be donated for
>> +			 * the tables.
>> +			 */
>> +			pr_err("GPT Level1 table missing for %llx\n", start);
>> +			return -ENOMEM;
> 
> If any of this fails, where is the cleanup done? Is that part of the
> missing SRO support that's indicated in the commit message?
> 

For now, there is no cleanup required. What we essentially do here is
making sure that the GPT tables have been created upto L1 (i.e.,
by checking ret == RMI_ERROR_GPT).

We do not donate any memory now, but only support RMMs with static 
memory carved out for L1 GPT. Support for dynamic RMMs could be added as
a separate series, at which point, we could defer the table creation to
the actual use case (e.g, RMI_GRANULE_DELEGATE).

Clean up would be required when we donate memory to the RMM.

>> +		}
>> +		start += l0gpt_sz;
>> +	}
>> +
>> +	return 0;
>> +}
>> +
>> +static int rmi_init_metadata(void)
>> +{
>> +	phys_addr_t start, end;
>> +	const struct memblock_region *r;
>> +
>> +	for_each_mem_region(r) {
>> +		int ret;
>> +
>> +		start = memblock_region_memory_base_pfn(r) << PAGE_SHIFT;
>> +		end = memblock_region_memory_end_pfn(r) << PAGE_SHIFT;
>> +		ret = rmi_verify_memory_tracking(start, end);
>> +		if (ret)
>> +			return ret;
>> +		ret = rmi_create_gpts(start, end);
>> +		if (ret)
>> +			return ret;
>> +	}
> 
> How does this work with, say, memory hotplug?

Good point, we need a hook for hotpug to make sure this is taken care
of. As mentioned above, when we add support for RMM with support for
dynamic Tracking/GPT with SRO, this could be deferred to the actual
use (handling RMI return codes, RMI_ERROR_TRACKING/RMI_ERROR_GPT)

Suzuki


> 
>> +
>> +	return 0;
>> +}
>> +
>> +bool rmi_is_available(void)
>> +{
>> +	return arm64_rmi_is_available;
>> +}
>> +
>>   static int __init arm64_init_rmi(void)
>>   {
>>   	/* Continue without realm support if we can't agree on a version */
>> @@ -101,6 +199,11 @@ static int __init arm64_init_rmi(void)
>>   
>>   	if (rmi_configure())
>>   		return 0;
>> +	if (rmi_init_metadata())
>> +		return 0;
>> +
>> +	arm64_rmi_is_available = true;
>> +	pr_info("RMI configured");
>>   
>>   	return 0;
>>   }
> 
> Thanks,
> 
> 	M.
> 


^ permalink raw reply

* Re: [PATCH v14 05/44] arm64: RMI: Add wrappers for RMI calls
From: Steven Price @ 2026-05-21 15:44 UTC (permalink / raw)
  To: Aneesh Kumar K.V, kvm, kvmarm
  Cc: Catalin Marinas, Marc Zyngier, Will Deacon, James Morse,
	Oliver Upton, Suzuki K Poulose, Zenghui Yu, linux-arm-kernel,
	linux-kernel, Joey Gouly, Alexandru Elisei, Christoffer Dall,
	Fuad Tabba, linux-coco, Ganapatrao Kulkarni, Gavin Shan,
	Shanker Donthineni, Alper Gun, Emi Kisanuki, Vishal Annapurve,
	WeiLin.Chang, Lorenzo.Pieralisi2
In-Reply-To: <yq5aecj8t10l.fsf@kernel.org>

On 19/05/2026 06:35, Aneesh Kumar K.V wrote:
> Steven Price <steven.price@arm.com> writes:
> 
>> The wrappers make the call sites easier to read and deal with the
>> boiler plate of handling the error codes from the RMM.
>>
>> Signed-off-by: Steven Price <steven.price@arm.com>
>> +#define rmi_smccc(...) do {						\
>> +	arm_smccc_1_1_invoke(__VA_ARGS__);				\
>> +} while (RMI_RETURN_STATUS(res.a0) == RMI_BUSY ||			\
>> +	 RMI_RETURN_STATUS(res.a0) == RMI_BLOCKED)
>> +
> 
> I guess this is not used. Also, that would require the call site to have a struct arm_smccc_res res.

Ah good spot - yes this was replaced with a proper static inline
rmi_smccc_invoke() function. I missed removing this macro.

Thanks,
Steve

^ permalink raw reply

* Re: [PATCH v14 05/44] arm64: RMI: Add wrappers for RMI calls
From: Steven Price @ 2026-05-21 15:44 UTC (permalink / raw)
  To: Gavin Shan, kvm, kvmarm
  Cc: Catalin Marinas, Marc Zyngier, Will Deacon, James Morse,
	Oliver Upton, Suzuki K Poulose, Zenghui Yu, linux-arm-kernel,
	linux-kernel, Joey Gouly, Alexandru Elisei, Christoffer Dall,
	Fuad Tabba, linux-coco, Ganapatrao Kulkarni, Shanker Donthineni,
	Alper Gun, Aneesh Kumar K . V, Emi Kisanuki, Vishal Annapurve,
	WeiLin.Chang, Lorenzo.Pieralisi2
In-Reply-To: <da87fa25-d979-4d22-99f1-3ba1d81cff23@redhat.com>

On 21/05/2026 01:21, Gavin Shan wrote:
> Hi Steven,
> 
> On 5/13/26 11:17 PM, Steven Price wrote:
>> The wrappers make the call sites easier to read and deal with the
>> boiler plate of handling the error codes from the RMM.
>>
>> Signed-off-by: Steven Price <steven.price@arm.com>
>> ---
>> Changes from v13:
>>   * Update to RMM v2.0-bet1 spec including some SRO support (there still
>>     some FIXMEs where SRO support is incomplete).
>> Changes from v12:
>>   * Update to RMM v2.0 specification
>> Changes from v8:
>>   * Switch from arm_smccc_1_2_smc() to arm_smccc_1_2_invoke() in
>>     rmi_rtt_read_entry() for consistency.
>> Changes from v7:
>>   * Minor renaming of parameters and updated comments
>> Changes from v5:
>>   * Further improve comments
>> Changes from v4:
>>   * Improve comments
>> Changes from v2:
>>   * Make output arguments optional.
>>   * Mask RIPAS value rmi_rtt_read_entry()
>>   * Drop unused rmi_rtt_get_phys()
>> ---
>>   arch/arm64/include/asm/rmi_cmds.h | 661 ++++++++++++++++++++++++++++++
>>   1 file changed, 661 insertions(+)
>>   create mode 100644 arch/arm64/include/asm/rmi_cmds.h
>>
>> diff --git a/arch/arm64/include/asm/rmi_cmds.h b/arch/arm64/include/
>> asm/rmi_cmds.h
>> new file mode 100644
>> index 000000000000..04f7066894e9
>> --- /dev/null
>> +++ b/arch/arm64/include/asm/rmi_cmds.h
>> @@ -0,0 +1,661 @@
>> +/* SPDX-License-Identifier: GPL-2.0 */
>> +/*
>> + * Copyright (C) 2023 ARM Ltd.
>> + */
>> +
>> +#ifndef __ASM_RMI_CMDS_H
>> +#define __ASM_RMI_CMDS_H
>> +
>> +#include <linux/arm-smccc.h>
>> +
> 
> [...]
> 
>> +
>> +/**
>> + * rmi_rtt_destroy() - Destroy an RTT
>> + * @rd: PA of the RD
>> + * @ipa: Base of the IPA range described by the RTT
>> + * @level: Depth of the RTT within the tree
>> + * @out_rtt: Pointer to write the PA of the RTT which was destroyed
>> + * @out_top: Pointer to write the top IPA of non-live RTT entries
>> + *
> 
> In most cases, the parameters are well explained in RMM-v2.0-bet1 spec,
> I think
> it's nice to keep the code and the spec synchronized. For those specific
> parameters
> of this function, they're well explained in RMM-v2.0-bet1 spec as below.
> 
>    @rd: PA of the RD for the target realm
>    @ipa: Base of the IPA range described by the RTT
>    @level: RTT level
>    @out_rtt: PA of the RTT which was destroyed
>    @out_top: Top IPA of non-live RTT entries, from entry at which the
> RTT walk terminated

I have attempted to keep the descriptions consistent with the spec - I'm
not quite sure what you think the issue is here. The @rd parameter gains
a "for the target realm" - which isn't really very informative (clearly
rmi_rtt_destroy() is targetting the realm which is being passed into the
function). @level is less informative. @out_xxx are prefixed with
"Pointer to write the" because the C function does indeed take a pointer
for the output parameter to be written.

But fair enough I can align them more precisely. In some cases I've
written the code before the final spec wording has been available which
might explain some differences.

Thanks,
Steve

>> + * Destroys an RTT. The RTT must be non-live, i.e. none of the
>> entries in the
>> + * table are in ASSIGNED or TABLE state.
>> + *
>> + * Return: RMI return code.
>> + */
>> +static inline int rmi_rtt_destroy(unsigned long rd,
>> +                  unsigned long ipa,
>> +                  long level,
>> +                  unsigned long *out_rtt,
>> +                  unsigned long *out_top)
>> +{
>> +    struct arm_smccc_res res;
>> +
>> +    arm_smccc_1_1_invoke(SMC_RMI_RTT_DESTROY, rd, ipa, level, &res);
>> +
>> +    if (out_rtt)
>> +        *out_rtt = res.a1;
>> +    if (out_top)
>> +        *out_top = res.a2;
>> +
>> +    return res.a0;
>> +}
>> +
> 
> [...]
> 
> Thanks,
> Gavin
> 


^ permalink raw reply

* Re: [PATCH v14 05/44] arm64: RMI: Add wrappers for RMI calls
From: Steven Price @ 2026-05-21 15:44 UTC (permalink / raw)
  To: Marc Zyngier
  Cc: kvm, kvmarm, Catalin Marinas, Will Deacon, James Morse,
	Oliver Upton, Suzuki K Poulose, Zenghui Yu, linux-arm-kernel,
	linux-kernel, Joey Gouly, Alexandru Elisei, Christoffer Dall,
	Fuad Tabba, linux-coco, Ganapatrao Kulkarni, Gavin Shan,
	Shanker Donthineni, Alper Gun, Aneesh Kumar K . V, Emi Kisanuki,
	Vishal Annapurve, WeiLin.Chang, Lorenzo.Pieralisi2
In-Reply-To: <86cxypvsfy.wl-maz@kernel.org>

On 21/05/2026 13:49, Marc Zyngier wrote:
> On Wed, 13 May 2026 14:17:13 +0100,
> Steven Price <steven.price@arm.com> wrote:
>>
>> The wrappers make the call sites easier to read and deal with the
>> boiler plate of handling the error codes from the RMM.
>>
>> Signed-off-by: Steven Price <steven.price@arm.com>
>> ---
>> Changes from v13:
>>  * Update to RMM v2.0-bet1 spec including some SRO support (there still
>>    some FIXMEs where SRO support is incomplete).
>> Changes from v12:
>>  * Update to RMM v2.0 specification
>> Changes from v8:
>>  * Switch from arm_smccc_1_2_smc() to arm_smccc_1_2_invoke() in
>>    rmi_rtt_read_entry() for consistency.
>> Changes from v7:
>>  * Minor renaming of parameters and updated comments
>> Changes from v5:
>>  * Further improve comments
>> Changes from v4:
>>  * Improve comments
>> Changes from v2:
>>  * Make output arguments optional.
>>  * Mask RIPAS value rmi_rtt_read_entry()
>>  * Drop unused rmi_rtt_get_phys()
>> ---
>>  arch/arm64/include/asm/rmi_cmds.h | 661 ++++++++++++++++++++++++++++++
>>  1 file changed, 661 insertions(+)
>>  create mode 100644 arch/arm64/include/asm/rmi_cmds.h
>>
>> diff --git a/arch/arm64/include/asm/rmi_cmds.h b/arch/arm64/include/asm/rmi_cmds.h
>> new file mode 100644
>> index 000000000000..04f7066894e9
>> --- /dev/null
>> +++ b/arch/arm64/include/asm/rmi_cmds.h
>> @@ -0,0 +1,661 @@
>> +/* SPDX-License-Identifier: GPL-2.0 */
>> +/*
>> + * Copyright (C) 2023 ARM Ltd.
>> + */
>> +
>> +#ifndef __ASM_RMI_CMDS_H
>> +#define __ASM_RMI_CMDS_H
>> +
>> +#include <linux/arm-smccc.h>
>> +
>> +#include <asm/rmi_smc.h>
>> +
>> +struct rtt_entry {
>> +	unsigned long walk_level;
>> +	unsigned long desc;
>> +	int state;
>> +	int ripas;
>> +};
>> +
>> +#define RMI_MAX_ADDR_LIST	256
>> +
>> +struct rmi_sro_state {
>> +	struct arm_smccc_1_2_regs regs;
>> +	unsigned long addr_count;
>> +	unsigned long addr_list[RMI_MAX_ADDR_LIST];
>> +};
>> +
>> +#define rmi_smccc(...) do {						\
>> +	arm_smccc_1_1_invoke(__VA_ARGS__);				\
>> +} while (RMI_RETURN_STATUS(res.a0) == RMI_BUSY ||			\
>> +	 RMI_RETURN_STATUS(res.a0) == RMI_BLOCKED)
>> +
>> +unsigned long rmi_sro_execute(struct rmi_sro_state *sro, gfp_t gfp);
>> +void rmi_sro_free(struct rmi_sro_state *sro);
>> +
>> +/**
>> + * rmi_rmm_config_set() - Configure the RMM
>> + * @cfg_ptr: PA of a struct rmm_config
>> + *
>> + * Sets configuration options on the RMM.
>> + *
>> + * Return: RMI return code
>> + */
>> +static inline int rmi_rmm_config_set(unsigned long cfg_ptr)
>> +{
>> +	struct arm_smccc_res res;
>> +
>> +	arm_smccc_1_1_invoke(SMC_RMI_RMM_CONFIG_SET, cfg_ptr, &res);
>> +
>> +	return res.a0;
>> +}
>> +
>> +/**
>> + * rmi_rmm_activate() - Activate the RMM
>> + *
>> + * Return: RMI return code
>> + */
>> +static inline int rmi_rmm_activate(void)
>> +{
>> +	struct arm_smccc_res res;
>> +
>> +	arm_smccc_1_1_invoke(SMC_RMI_RMM_ACTIVATE, &res);
>> +
>> +	return res.a0;
>> +}
>> +
>> +/**
>> + * rmi_granule_tracking_get() - Get configuration of a Granule tracking region
>> + * @start: Base PA of the tracking region
>> + * @end: End of the PA region
>> + * @out_category: Memory category
>> + * @out_state: Tracking region state
>> + * @out_top: Top of the memory region
>> + *
>> + * Return: RMI return code
>> + */
>> +static inline int rmi_granule_tracking_get(unsigned long start,
>> +					   unsigned long end,
>> +					   unsigned long *out_category,
>> +					   unsigned long *out_state,
>> +					   unsigned long *out_top)
>> +{
>> +	struct arm_smccc_res res;
>> +
>> +	arm_smccc_1_1_invoke(SMC_RMI_GRANULE_TRACKING_GET, start, end, &res);
>> +
>> +	if (out_category)
>> +		*out_category = res.a1;
>> +	if (out_state)
>> +		*out_state = res.a2;
>> +	if (out_top)
>> +		*out_top = res.a3;
>> +
>> +	return res.a0;
>> +}
>> +
>> +/**
>> + * rmi_gpt_l1_create() - Create a Level 1 GPT
>> + * @addr: Base of physical address region described by the L1GPT
>> + *
>> + * Return: RMI return code
>> + */
>> +static inline int rmi_gpt_l1_create(unsigned long addr)
>> +{
>> +	struct arm_smccc_res res;
>> +
>> +	arm_smccc_1_1_invoke(SMC_RMI_GPT_L1_CREATE, addr, &res);
>> +
>> +	if (RMI_RETURN_STATUS(res.a0) == RMI_INCOMPLETE) {
>> +		/* FIXME */
> 
> Is that part of the SRO stuff you're talking about in the notes?
> What is the ETA for fixing all these FIXMEs?

Yes, RMI_INCOMPLETE is the return for SRO. Fixing all this up is on the
plan for my next posting which I expect to be after 7.2-rc1 (so July).
There were some changes in the beta 2 spec and the RMM doesn't implement
most of this yet so I didn't want to rush out completely untested code
which might change.

Thanks,
Steve

> Thanks,
> 
> 	M.
> 


^ permalink raw reply

* Re: [PATCH v14 06/44] arm64: RMI: Check for RMI support at init
From: Steven Price @ 2026-05-21 15:49 UTC (permalink / raw)
  To: Gavin Shan, kvm, kvmarm
  Cc: Catalin Marinas, Marc Zyngier, Will Deacon, James Morse,
	Oliver Upton, Suzuki K Poulose, Zenghui Yu, linux-arm-kernel,
	linux-kernel, Joey Gouly, Alexandru Elisei, Christoffer Dall,
	Fuad Tabba, linux-coco, Ganapatrao Kulkarni, Shanker Donthineni,
	Alper Gun, Aneesh Kumar K . V, Emi Kisanuki, Vishal Annapurve,
	WeiLin.Chang, Lorenzo.Pieralisi2
In-Reply-To: <ee494ecd-8979-40f2-896e-82137abbf440@redhat.com>

On 21/05/2026 01:39, Gavin Shan wrote:
> Hi Steven,
> 
> On 5/13/26 11:17 PM, Steven Price wrote:
>> Query the RMI version number and check if it is a compatible version.
>> The first two feature registers are read and exposed for future code to
>> use.
>>
>> Signed-off-by: Steven Price <steven.price@arm.com>
>> ---
>> v14:
>>   * This moves the basic RMI setup into the 'kernel' directory. This is
>>     because RMI will be used for some features outside of KVM so should
>>     be available even if KVM isn't compiled in.
>> ---
>>   arch/arm64/include/asm/rmi_cmds.h |  3 ++
>>   arch/arm64/kernel/Makefile        |  2 +-
>>   arch/arm64/kernel/cpufeature.c    |  1 +
>>   arch/arm64/kernel/rmi.c           | 65 +++++++++++++++++++++++++++++++
>>   4 files changed, 70 insertions(+), 1 deletion(-)
>>   create mode 100644 arch/arm64/kernel/rmi.c
>>
> 
> [...]
> 
>> diff --git a/arch/arm64/kernel/rmi.c b/arch/arm64/kernel/rmi.c
>> new file mode 100644
>> index 000000000000..99c1ccc35c11
>> --- /dev/null
>> +++ b/arch/arm64/kernel/rmi.c
>> @@ -0,0 +1,65 @@
>> +// SPDX-License-Identifier: GPL-2.0
>> +/*
>> + * Copyright (C) 2023-2025 ARM Ltd.
>> + */
>> +
>> +#include <linux/memblock.h>
>> +
>> +#include <asm/rmi_cmds.h>
>> +
>> +unsigned long rmm_feat_reg0;
>> +unsigned long rmm_feat_reg1;
>> +
>> +static int rmi_check_version(void)
>> +{
>> +    struct arm_smccc_res res;
>> +    unsigned short version_major, version_minor;
>> +    unsigned long host_version = RMI_ABI_VERSION(RMI_ABI_MAJOR_VERSION,
>> +                             RMI_ABI_MINOR_VERSION);
>> +    unsigned long aa64pfr0 =
>> read_sanitised_ftr_reg(SYS_ID_AA64PFR0_EL1);
>> +
>> +    /* If RME isn't supported, then RMI can't be */
>> +    if (cpuid_feature_extract_unsigned_field(aa64pfr0,
>> ID_AA64PFR0_EL1_RME_SHIFT) == 0)
>> +        return -ENXIO;
>> +
>> +    arm_smccc_1_1_invoke(SMC_RMI_VERSION, host_version, &res);
>> +
>> +    if (res.a0 == SMCCC_RET_NOT_SUPPORTED)
>> +        return -ENXIO;
>> +
>> +    version_major = RMI_ABI_VERSION_GET_MAJOR(res.a1);
>> +    version_minor = RMI_ABI_VERSION_GET_MINOR(res.a1);
>> +
>> +    if (res.a0 != RMI_SUCCESS) {
>> +        unsigned short high_version_major, high_version_minor;
>> +
>> +        high_version_major = RMI_ABI_VERSION_GET_MAJOR(res.a2);
>> +        high_version_minor = RMI_ABI_VERSION_GET_MINOR(res.a2);
>> +
>> +        pr_err("Unsupported RMI ABI (v%d.%d - v%d.%d) we want v%d.%d\n",
>> +               version_major, version_minor,
>> +               high_version_major, high_version_minor,
>> +               RMI_ABI_MAJOR_VERSION,
>> +               RMI_ABI_MINOR_VERSION);
>> +        return -ENXIO;
>> +    }
>> +
>> +    pr_info("RMI ABI version %d.%d\n", version_major, version_minor);
>> +
>> +    return 0;
>> +}
>> +
>> +static int __init arm64_init_rmi(void)
>> +{
>> +    /* Continue without realm support if we can't agree on a version */
>> +    if (rmi_check_version())
>> +        return 0;
> 
> Is this still a valid point that we have to return zero on errors returned
> from rmi_check_version() or other other function calls like rmi_features()?
> arm64_init_rmi() is triggered by subsys_initcall() where the return value
> needs to indicate success or failure. It's fine to return error code from
> arm64_init_rmi() in the path.

Hmm, I guess now this is moved to arm64 code this indeed doesn't need
to. Within a module I believe an error return can fail the module loading.

I'm not sure it really makes much difference though - if this
initialisation fails then it's not really an error - it just means the
feature is unavailable.

Thanks,
Steve

>> +
>> +    if (WARN_ON(rmi_features(0, &rmm_feat_reg0)))
>> +        return 0;
>> +    if (WARN_ON(rmi_features(1, &rmm_feat_reg1)))
>> +        return 0;
>> +
>> +    return 0;
>> +}
>> +subsys_initcall(arm64_init_rmi);
> 
> Thanks,
> Gavin
> 


^ permalink raw reply

* Re: [PATCH v14 09/44] arm64: RMI: Provide functions to delegate/undelegate ranges of memory
From: Suzuki K Poulose @ 2026-05-21 16:01 UTC (permalink / raw)
  To: Marc Zyngier, Steven Price
  Cc: kvm, kvmarm, Catalin Marinas, Will Deacon, James Morse,
	Oliver Upton, Zenghui Yu, linux-arm-kernel, linux-kernel,
	Joey Gouly, Alexandru Elisei, Christoffer Dall, Fuad Tabba,
	linux-coco, Ganapatrao Kulkarni, Gavin Shan, Shanker Donthineni,
	Alper Gun, Aneesh Kumar K . V, Emi Kisanuki, Vishal Annapurve,
	WeiLin.Chang, Lorenzo.Pieralisi2
In-Reply-To: <867bowx3qx.wl-maz@kernel.org>

On 21/05/2026 14:59, Marc Zyngier wrote:
> On Wed, 13 May 2026 14:17:17 +0100,
> Steven Price <steven.price@arm.com> wrote:
>>
>> The RMM requires memory is 'delegated' to it so that it can be used
>> either for a realm guest or for various tracking purposes within the RMM
>> (e.g. for metadata or page tables). Memory that has been delegated
>> cannot be accessed by the host (it will result in a Granule Protection
>> Fault).
>>
>> Undelegation may fail if the memory is still in use by the RMM. This
>> shouldn't happen (Linux should ensure it has destroyed the RMM objects
>> before attempting to undelegate). In the event that it does happen this
>> points to a programming bug and the only reasonable approach is for the
>> physical pages to be leaked - it is up to the caller of
>> rmi_undelegate_range() to handle this.
>>
>> Signed-off-by: Steven Price <steven.price@arm.com>
>> ---
>> v14:
>>   * Split into separate patch and moved out of KVM
>> ---
>>   arch/arm64/include/asm/rmi_cmds.h | 13 +++++++++++
>>   arch/arm64/kernel/rmi.c           | 36 +++++++++++++++++++++++++++++++
>>   2 files changed, 49 insertions(+)
>>
>> diff --git a/arch/arm64/include/asm/rmi_cmds.h b/arch/arm64/include/asm/rmi_cmds.h
>> index 9078a2920a7c..eb213c8e6f26 100644
>> --- a/arch/arm64/include/asm/rmi_cmds.h
>> +++ b/arch/arm64/include/asm/rmi_cmds.h
>> @@ -33,6 +33,19 @@ struct rmi_sro_state {
>>   } while (RMI_RETURN_STATUS(res.a0) == RMI_BUSY ||			\
>>   	 RMI_RETURN_STATUS(res.a0) == RMI_BLOCKED)
>>   
>> +int rmi_delegate_range(phys_addr_t phys, unsigned long size);
>> +int rmi_undelegate_range(phys_addr_t phys, unsigned long size);
>> +
>> +static inline int rmi_delegate_page(phys_addr_t phys)
>> +{
>> +	return rmi_delegate_range(phys, PAGE_SIZE);
>> +}
>> +
>> +static inline int rmi_undelegate_page(phys_addr_t phys)
>> +{
>> +	return rmi_undelegate_range(phys, PAGE_SIZE);
>> +}
>> +
>>   bool rmi_is_available(void);
>>   
>>   unsigned long rmi_sro_execute(struct rmi_sro_state *sro, gfp_t gfp);
>> diff --git a/arch/arm64/kernel/rmi.c b/arch/arm64/kernel/rmi.c
>> index 52a415e99500..08cef54acadb 100644
>> --- a/arch/arm64/kernel/rmi.c
>> +++ b/arch/arm64/kernel/rmi.c
>> @@ -12,6 +12,42 @@ static bool arm64_rmi_is_available;
>>   unsigned long rmm_feat_reg0;
>>   unsigned long rmm_feat_reg1;
>>   
>> +int rmi_delegate_range(phys_addr_t phys, unsigned long size)
>> +{
>> +	unsigned long ret = 0;
>> +	unsigned long top = phys + size;
>> +	unsigned long out_top;
>> +
>> +	while (phys < top) {
>> +		ret = rmi_granule_range_delegate(phys, top, &out_top);
>> +		if (ret == RMI_SUCCESS)
>> +			phys = out_top;
>> +		else if (ret != RMI_BUSY && ret != RMI_BLOCKED)
>> +			return ret;
>> +	}
>> +
>> +	return ret;
>> +}
>> +
>> +int rmi_undelegate_range(phys_addr_t phys, unsigned long size)
>> +{
>> +	unsigned long ret = 0;
>> +	unsigned long top = phys + size;
>> +	unsigned long out_top;
>> +
>> +	WARN_ON(size == 0);
> 
> I find it odd to warn on size = 0. After all, free(NULL) is not an
> error. But even then, you continue feeding this to the RMM.
> 
> You also don't seem to be bothered with that on the delegation side...
> 
>> +
>> +	while (phys < top) {
>> +		ret = rmi_granule_range_undelegate(phys, top, &out_top);
>> +		if (ret == RMI_SUCCESS)
>> +			phys = out_top;
> 
> and size==0 doesn't violate any of the failure conditions listed in
> B4.5.18.2 (beta2). Will you end-up looping around forever?

That is not true ? It triggers, top_bound error condition, for both.


pre: UInt(top) <= UInt(base)
post: result.status == RMI_ERROR_INPUT


Suzuki
> 
> Same questions for the delegation, obviously.
> 
> 	M.
> 


^ permalink raw reply

* Re: [PATCH v4 04/13] dma: swiotlb: track pool encryption state and honor DMA_ATTR_CC_SHARED
From: Mostafa Saleh @ 2026-05-21 17:06 UTC (permalink / raw)
  To: Aneesh Kumar K.V (Arm)
  Cc: iommu, linux-arm-kernel, linux-kernel, linux-coco, Robin Murphy,
	Marek Szyprowski, Will Deacon, Marc Zyngier, Steven Price,
	Suzuki K Poulose, Catalin Marinas, Jiri Pirko, Jason Gunthorpe,
	Petr Tesarik, Alexey Kardashevskiy, Dan Williams, Xu Yilun,
	linuxppc-dev, linux-s390, Madhavan Srinivasan, Michael Ellerman,
	Nicholas Piggin, Christophe Leroy (CS GROUP), Alexander Gordeev,
	Gerald Schaefer, Heiko Carstens, Vasily Gorbik,
	Christian Borntraeger, Sven Schnelle, x86
In-Reply-To: <20260512090408.794195-5-aneesh.kumar@kernel.org>

On Tue, May 12, 2026 at 10:05 AM Aneesh Kumar K.V (Arm)
<aneesh.kumar@kernel.org> wrote:
> @@ -1411,6 +1436,16 @@ phys_addr_t swiotlb_tbl_map_single(struct device *dev, phys_addr_t orig_addr,
>         if (cc_platform_has(CC_ATTR_MEM_ENCRYPT))
>                 pr_warn_once("Memory encryption is active and system is using DMA bounce buffers\n");
>
> +       /*
> +        * if we are trying to swiotlb map a decrypted paddr or the paddr is encrypted
> +        * but the device is forcing decryption, use decrypted io_tlb_mem
> +        */
> +       if ((attrs & DMA_ATTR_CC_SHARED) || force_dma_unencrypted(dev))

I don't think swiotlb needs to know about force_dma_unencrypted(), the
dma/direct caller should have all the information to pass the
appropriate flags.

Thanks.
Mostafa

> +               require_decrypted = true;
> +
> +       if (require_decrypted != mem->unencrypted)
> +               return (phys_addr_t)DMA_MAPPING_ERROR;
> +

^ permalink raw reply

* Re: [PATCH v4 04/13] dma: swiotlb: track pool encryption state and honor DMA_ATTR_CC_SHARED
From: Aneesh Kumar K.V @ 2026-05-21 17:20 UTC (permalink / raw)
  To: Mostafa Saleh
  Cc: iommu, linux-arm-kernel, linux-kernel, linux-coco, Robin Murphy,
	Marek Szyprowski, Will Deacon, Marc Zyngier, Steven Price,
	Suzuki K Poulose, Catalin Marinas, Jiri Pirko, Jason Gunthorpe,
	Petr Tesarik, Alexey Kardashevskiy, Dan Williams, Xu Yilun,
	linuxppc-dev, linux-s390, Madhavan Srinivasan, Michael Ellerman,
	Nicholas Piggin, Christophe Leroy (CS GROUP), Alexander Gordeev,
	Gerald Schaefer, Heiko Carstens, Vasily Gorbik,
	Christian Borntraeger, Sven Schnelle, x86
In-Reply-To: <CAFgf54o4ZnvnJ3369bHb10tvJJVP+5YWq=ec4Jh5K6S6U9uNEA@mail.gmail.com>

Mostafa Saleh <smostafa@google.com> writes:

> On Tue, May 12, 2026 at 10:05 AM Aneesh Kumar K.V (Arm)
> <aneesh.kumar@kernel.org> wrote:
>> @@ -1411,6 +1436,16 @@ phys_addr_t swiotlb_tbl_map_single(struct device *dev, phys_addr_t orig_addr,
>>         if (cc_platform_has(CC_ATTR_MEM_ENCRYPT))
>>                 pr_warn_once("Memory encryption is active and system is using DMA bounce buffers\n");
>>
>> +       /*
>> +        * if we are trying to swiotlb map a decrypted paddr or the paddr is encrypted
>> +        * but the device is forcing decryption, use decrypted io_tlb_mem
>> +        */
>> +       if ((attrs & DMA_ATTR_CC_SHARED) || force_dma_unencrypted(dev))
>
> I don't think swiotlb needs to know about force_dma_unencrypted(), the
> dma/direct caller should have all the information to pass the
> appropriate flags.
>
> Thanks.
> Mostafa
>
>> +               require_decrypted = true;
>> +
>> +       if (require_decrypted != mem->unencrypted)
>> +               return (phys_addr_t)DMA_MAPPING_ERROR;
>> +

Based on other email threads, this is now updated to

@@ -1372,9 +1417,19 @@ static unsigned long mem_used(struct io_tlb_mem *mem)
  *			any pre- or post-padding for alignment
  * @alloc_align_mask:	Required start and end alignment of the allocated buffer
  * @dir:		DMA direction
- * @attrs:		Optional DMA attributes for the map operation
+ * @attrs:		Optional DMA attributes for the map operation, updated
+ *			to match the selected SWIOTLB pool
  *
  * Find and allocate a suitable sequence of IO TLB slots for the request.
+ * The device's SWIOTLB pool must match the device's current DMA encryption
+ * requirements. If the device requires decrypted DMA, bouncing is done through
+ * an unencrypted pool and the mapping is marked shared. If the device can DMA
+ * to encrypted memory, bouncing is done through an encrypted pool even when the
+ * original DMA address was unencrypted. Enabling encrypted DMA for a device is
+ * therefore expected to update its default io_tlb_mem to an encrypted pool, so
+ * later bounce mappings for both encrypted and decrypted original memory use
+ * that encrypted pool.
+ *
  * The allocated space starts at an alignment specified by alloc_align_mask,
  * and the size of the allocated space is rounded up so that the total amount
  * of allocated space is a multiple of (alloc_align_mask + 1). If
@@ -1411,6 +1466,16 @@ phys_addr_t swiotlb_tbl_map_single(struct device *dev, phys_addr_t orig_addr,
 	if (cc_platform_has(CC_ATTR_MEM_ENCRYPT))
 		pr_warn_once("Memory encryption is active and system is using DMA bounce buffers\n");
 
+	/* swiotlb pool is incorrect for this device */
+	if (unlikely(mem->unencrypted != force_dma_unencrypted(dev)))
+		return (phys_addr_t)DMA_MAPPING_ERROR;
+
+	/* Force attrs to match the kind of memory in the pool */
+	if (mem->unencrypted)
+		*attrs |= DMA_ATTR_CC_SHARED;
+	else
+		*attrs &= ~DMA_ATTR_CC_SHARED;
+
 	/*
 	 * The default swiotlb memory pool is allocated with PAGE_SIZE
 	 * alignment. If a mapping is requested with larger alignment,
@@ -1608,8 +1673,11 @@ dma_addr_t swiotlb_map(struct device *dev, phys_addr_t paddr, size_t size,
 	if (swiotlb_addr == (phys_addr_t)DMA_MAPPING_ERROR)
 		return DMA_MAPPING_ERROR;
 
-	/* Ensure that the address returned is DMA'ble */
-	dma_addr = phys_to_dma_unencrypted(dev, swiotlb_addr);
+	if (attrs & DMA_ATTR_CC_SHARED)
+		dma_addr = phys_to_dma_unencrypted(dev, swiotlb_addr);
+	else
+		dma_addr = phys_to_dma_encrypted(dev, swiotlb_addr);
+
 	if (unlikely(!dma_capable(dev, dma_addr, size, true))) {
 		__swiotlb_tbl_unmap_single(dev, swiotlb_addr, size, dir,
 			attrs | DMA_ATTR_SKIP_CPU_SYNC,
@@ -1773,7 +1841,7 @@ static inline void swiotlb_create_debugfs_files(struct io_tlb_mem *mem,

^ permalink raw reply

* Re: [PATCH v5 1/2] dma-mapping: introduce DMA_ATTR_CC_SHARED for shared memory
From: Jason Gunthorpe @ 2026-05-21 17:54 UTC (permalink / raw)
  To: Aneesh Kumar K.V
  Cc: Jiri Pirko, dri-devel, linaro-mm-sig, iommu, linux-media,
	sumit.semwal, benjamin.gaignard, Brian.Starkey, jstultz,
	tjmercier, christian.koenig, m.szyprowski, robin.murphy, leon,
	sean.anderson, ptesarik, catalin.marinas, suzuki.poulose,
	steven.price, thomas.lendacky, john.allen, ashish.kalra,
	suravee.suthikulpanit, linux-coco
In-Reply-To: <yq5azf1s6aic.fsf@kernel.org>

On Thu, May 21, 2026 at 09:05:39PM +0530, Aneesh Kumar K.V wrote:
> I am wondering whether this is better
> 
> static inline dma_addr_t dma_direct_map_phys(struct device *dev,
> 		phys_addr_t phys, size_t size, enum dma_data_direction dir,
> 		unsigned long attrs, bool flush)
> {
> 	dma_addr_t dma_addr;
> 
> 	/*
> 	 * For a device requiring unencrypted DMA, MMIO memory is treated
> 	 * as shared.
> 	 */
> 	if (force_dma_unencrypted(dev) && (attrs & DMA_ATTR_MMIO))
> 		attrs |= DMA_ATTR_CC_SHARED;

It is an option, I would be happier if we went and fixed the few
callers to properly pass the shared. CC did this with the
pgprot_decrypted() stuff, same reasoning:

diff --git a/block/blk-mq-dma.c b/block/blk-mq-dma.c
index bfdb9ed7074116..e77f6404caa3db 100644
--- a/block/blk-mq-dma.c
+++ b/block/blk-mq-dma.c
@@ -90,7 +90,7 @@ static bool blk_dma_map_direct(struct request *req, struct device *dma_dev,
 	unsigned int attrs = 0;
 
 	if (iter->p2pdma.map == PCI_P2PDMA_MAP_THRU_HOST_BRIDGE)
-		attrs |= DMA_ATTR_MMIO;
+		attrs |= iter->p2pdma.mem->dma_mapping_flags;
 
 	iter->addr = dma_map_phys(dma_dev, vec->paddr, vec->len,
 			rq_dma_dir(req), attrs);
@@ -115,7 +115,7 @@ static bool blk_rq_dma_map_iova(struct request *req, struct device *dma_dev,
 	iter->len = dma_iova_size(state);
 
 	if (iter->p2pdma.map == PCI_P2PDMA_MAP_THRU_HOST_BRIDGE)
-		attrs |= DMA_ATTR_MMIO;
+		attrs |= iter->p2pdma.mem->dma_mapping_flags;
 
 	do {
 		error = dma_iova_link(dma_dev, state, vec->paddr, mapped,
diff --git a/drivers/dma-buf/dma-buf-mapping.c b/drivers/dma-buf/dma-buf-mapping.c
index 794acff2546a34..96022fadc48245 100644
--- a/drivers/dma-buf/dma-buf-mapping.c
+++ b/drivers/dma-buf/dma-buf-mapping.c
@@ -147,7 +147,7 @@ struct sg_table *dma_buf_phys_vec_to_sgt(struct dma_buf_attachment *attach,
 			ret = dma_iova_link(attach->dev, dma->state,
 					    phys_vec[i].paddr, 0,
 					    phys_vec[i].len, dir,
-					    DMA_ATTR_MMIO);
+					    provider->dma_mapping_flags);
 			if (ret)
 				goto err_unmap_dma;
 
@@ -155,7 +155,7 @@ struct sg_table *dma_buf_phys_vec_to_sgt(struct dma_buf_attachment *attach,
 		} else {
 			addr = dma_map_phys(attach->dev, phys_vec[i].paddr,
 					    phys_vec[i].len, dir,
-					    DMA_ATTR_MMIO);
+					    provider->dma_mapping_flags);
 			ret = dma_mapping_error(attach->dev, addr);
 			if (ret)
 				goto err_unmap_dma;
diff --git a/drivers/pci/p2pdma.c b/drivers/pci/p2pdma.c
index 7c898542af8d5e..e4229b4d35c767 100644
--- a/drivers/pci/p2pdma.c
+++ b/drivers/pci/p2pdma.c
@@ -282,6 +282,8 @@ int pcim_p2pdma_init(struct pci_dev *pdev)
 			continue;
 
 		p2p->mem[i].owner = &pdev->dev;
+		p2p->mem[i].dma_mapping_flags =
+			DMA_ATTR_MMIO | DMA_ATTR_CC_SHARED;
 		p2p->mem[i].bus_offset =
 			pci_bus_address(pdev, i) - pci_resource_start(pdev, i);
 	}
diff --git a/include/linux/pci-p2pdma.h b/include/linux/pci-p2pdma.h
index 873de20a224759..402dc5e5d62b0a 100644
--- a/include/linux/pci-p2pdma.h
+++ b/include/linux/pci-p2pdma.h
@@ -21,10 +21,12 @@ struct scatterlist;
  *
  * A p2pdma provider is a range of MMIO address space available to the CPU.
  * @owner: Device to which this provider belongs.
+ * @dma_mapping_flags: DMA attributes to use for host bridge mappings.
  * @bus_offset: Bus offset for p2p communication.
  */
 struct p2pdma_provider {
 	struct device *owner;
+	unsigned long dma_mapping_flags;
 	u64 bus_offset;
 };
 
diff --git a/mm/hmm.c b/mm/hmm.c
index 5955f2f0c83db1..c3f445acddf873 100644
--- a/mm/hmm.c
+++ b/mm/hmm.c
@@ -811,7 +811,7 @@ dma_addr_t hmm_dma_map_pfn(struct device *dev, struct hmm_dma_map *map,
 	case PCI_P2PDMA_MAP_NONE:
 		break;
 	case PCI_P2PDMA_MAP_THRU_HOST_BRIDGE:
-		attrs |= DMA_ATTR_MMIO;
+		attrs |= p2pdma_state->mem->dma_mapping_flags;
 		pfns[idx] |= HMM_PFN_P2PDMA;
 		break;
 	case PCI_P2PDMA_MAP_BUS_ADDR:

^ permalink raw reply related

* Re: [PATCH v2 15/15] KVM: x86: Move the bulk of register specific code from x86.c to regs.c
From: Sean Christopherson @ 2026-05-21 18:47 UTC (permalink / raw)
  To: Kai Huang
  Cc: dwmw2@infradead.org, Rick P Edgecombe, x86@kernel.org,
	binbin.wu@linux.intel.com, kas@kernel.org,
	dave.hansen@linux.intel.com, vkuznets@redhat.com, paul@xen.org,
	yosry@kernel.org, pbonzini@redhat.com, kvm@vger.kernel.org,
	linux-coco@lists.linux.dev, linux-kernel@vger.kernel.org
In-Reply-To: <0ed747418eaef45a8c161ab5a9e28a12c604f9ea.camel@intel.com>

On Wed, May 20, 2026, Kai Huang wrote:
> On Wed, 2026-05-20 at 11:11 -0700, Sean Christopherson wrote:
> > On Wed, May 20, 2026, Kai Huang wrote:
> > > But if we want to hide KVM internal structures, I don't see any other options
> > > except virt/kvm/include/ is the place to go?
> > 
> > arch/$(ARCH)/kvm/kvm_arch.h is the obvious approach.  Code in virt/kvm can reach
> > arch/$(ARCH)/kvm, we just need to add it to the include path.  That's why I was
> > working on unifying the include definitions.
> 
> Yeah, for asm/kvm_host.h.
> 
> But if I am still following you we still need a place for linux/kvm_host.h, for
> which I thought virt/kvm/include/ would be the place at first glance.

Oh, yes, the KVM-internal pieces of linuy/kvm_host.h would live in virt/kvm.

^ permalink raw reply

* Re: [PATCH v3 36/41] x86/kvmclock: Get local APIC bus frequency from PV CPUID Timing Info
From: Sean Christopherson @ 2026-05-21 20:34 UTC (permalink / raw)
  To: David Woodhouse
  Cc: tglx@kernel.org, longli@microsoft.com, luto@kernel.org,
	alexey.makhalov@broadcom.com, jstultz@google.com,
	dave.hansen@linux.intel.com, ajay.kaher@broadcom.com,
	jan.kiszka@siemens.com, haiyangz@microsoft.com, kas@kernel.org,
	pbonzini@redhat.com, kys@microsoft.com, decui@microsoft.com,
	daniel.lezcano@kernel.org, wei.liu@kernel.org,
	peterz@infradead.org, jgross@suse.com, boris.ostrovsky@oracle.com,
	linux-coco@lists.linux.dev, kvm@vger.kernel.org,
	mhklinux@outlook.com, thomas.lendacky@amd.com,
	linux-kernel@vger.kernel.org,
	bcm-kernel-feedback-list@broadcom.com, tglx@linutronix.de,
	nikunj@amd.com, xen-devel@lists.xenproject.org,
	linux-hyperv@vger.kernel.org, vkuznets@redhat.com,
	rick.p.edgecombe@intel.com, virtualization@lists.linux.dev,
	sboyd@kernel.org, x86@kernel.org
In-Reply-To: <7489ff3cc1ff402bf0ade38272fc52dcbcc75fc1.camel@amazon.co.uk>

On Wed, May 20, 2026, David Woodhouse wrote:
> On Fri, 2026-05-15 at 12:19 -0700, Sean Christopherson wrote:
> > When running as a KVM guest with kvmclock support enabled, stuff the APIC
> > timer period/frequency with the local APIC bus frequency reported in
> > CPUID.0x40000010.EBX instead of trying to calibrate/guess the frequency.
> > 
> > See Documentation/virt/kvm/x86/cpuid.rst for details.
> > 
> > Signed-off-by: Sean Christopherson <seanjc@google.com>
> 
> I still don't much like the way this is done inside kvm_get_tsc_khz().

Yeah, I don't like it either (understatement).  Aha!  native_calibrate_tsc() is
the oddball, all of the PV flows stuff lapic_timer_period when parsing the initial
timing info.  I'll just do that.  Blindly writing a global is all kinds of fugly,
but that's a future
problem to solve.

> We also probably ought to be looking for the timing leaf on other
> hypervisors including VMware 

VMware gets the frequency via hypercall.  Why, I have no idea.  I'll let the
VMware folks deal with that.

	eax = vmware_hypercall3(VMWARE_CMD_GETHZ, UINT_MAX, &ebx, &ecx);

> and probably Bhyve too.  Should it be done somewhere else?

I'm not opposed to that, though I don't know that it'd be a net positive. The
"hard" part of getting the info is finding the CPUID base and checking if the
leaf is available.  Unlike the native CPUID leaf, no math is necessary, and so
once the leaf is obtained, getting the frequency is trivial.

Regardless, I definitely don't want to take it on in this series. :-)


^ permalink raw reply

* Re: [PATCH v3 29/41] x86/paravirt: Plumb a return code into __paravirt_set_sched_clock()
From: Sean Christopherson @ 2026-05-21 20:35 UTC (permalink / raw)
  To: David Woodhouse
  Cc: Kiryl Shutsemau, Paolo Bonzini, K. Y. Srinivasan, Haiyang Zhang,
	Wei Liu, Dexuan Cui, Long Li, Ajay Kaher, Alexey Makhalov,
	Jan Kiszka, Dave Hansen, Andy Lutomirski, Peter Zijlstra,
	Juergen Gross, Daniel Lezcano, Thomas Gleixner, John Stultz,
	Rick Edgecombe, Vitaly Kuznetsov,
	Broadcom internal kernel review list, Boris Ostrovsky,
	Stephen Boyd, x86, linux-coco, kvm, linux-hyperv, virtualization,
	linux-kernel, xen-devel, Michael Kelley, Tom Lendacky,
	Nikunj A Dadhania, Thomas Gleixner
In-Reply-To: <13d79ba1e0450068c9573ccd8deb3ec007aea8d6.camel@infradead.org>

On Thu, May 21, 2026, David Woodhouse wrote:
> On Fri, 2026-05-15 at 12:19 -0700, Sean Christopherson wrote:
> > Add a return code to __paravirt_set_sched_clock() so that the kernel can
> > reject attempts to use a PV sched_clock without breaking the caller.  E.g.
> > when running as a CoCo VM with a secure TSC, using a PV clock is generally
> > undesirable.
> > 
> > Note, kvmclock is the only PV clock that does anything "extra" beyond
> > simply registering itself as sched_clock, i.e. is the only caller that
> > needs to check the new return value.
> > 
> > Signed-off-by: Sean Christopherson <seanjc@google.com>
> 
> Oooh... can we use this to reject the kvmclock when we have a stable
> and reliable TSC even for non-CoCo guests?

Yes, but I would much rather "fix" kvmclock to not even attempt to register itself
as the sched_clock (which this series does).

^ permalink raw reply

* Re: [PATCH v3 02/41] x86/tsc: Add helper to register CPU and TSC freq calibration routines
From: Sean Christopherson @ 2026-05-21 20:53 UTC (permalink / raw)
  To: David Woodhouse
  Cc: Kiryl Shutsemau, Paolo Bonzini, K. Y. Srinivasan, Haiyang Zhang,
	Wei Liu, Dexuan Cui, Long Li, Ajay Kaher, Alexey Makhalov,
	Jan Kiszka, Dave Hansen, Andy Lutomirski, Peter Zijlstra,
	Juergen Gross, Daniel Lezcano, Thomas Gleixner, John Stultz,
	Rick Edgecombe, Vitaly Kuznetsov,
	Broadcom internal kernel review list, Boris Ostrovsky,
	Stephen Boyd, x86, linux-coco, kvm, linux-hyperv, virtualization,
	linux-kernel, xen-devel, Michael Kelley, Tom Lendacky,
	Nikunj A Dadhania, Thomas Gleixner
In-Reply-To: <44e0d60548d317fd59895f18bd17220dfb2f834b.camel@infradead.org>

On Wed, May 20, 2026, David Woodhouse wrote:
> On Fri, 2026-05-15 at 12:19 -0700, Sean Christopherson wrote:
> > Add a helper to register non-native, i.e. PV and CoCo, CPU and TSC
> > frequency calibration routines.  This will allow consolidating handling
> > of common TSC properties that are forced by hypervisor (PV routines),
> > and will also allow adding sanity checks to guard against overriding a
> > TSC calibration routine with a routine that is less robust/trusted.
> > 
> > Make the CPU calibration routine optional, as Xen (very sanely) doesn't
> > assume the CPU runs as the same frequency as the TSC.
> > 
> > Wrap the helper in an #ifdef to document that the kernel overrides
> > the native routines when running as a VM, and to guard against unwanted
> > usage.  Add a TODO to call out that AMD_MEM_ENCRYPT is a mess and doesn't
> > depend on HYPERVISOR_GUEST because it gates both guest and host code.
> > 
> > No functional change intended.
> > 
> > Reviewed-by: Michael Kelley <mhklinux@outlook.com>
> > Tested-by: Michael Kelley <mhklinux@outlook.com>
> > Signed-off-by: Sean Christopherson <seanjc@google.com>
> 
> Mildly concerned that we might want to support multiple options — does
> it have CPUID 0x15? Does it have 0x40000x10? Does it have a pvclock?
> There are various permutations of those which are perhaps best handled
> by *trying* each one, in some order, and populating a struct with the
> answers?
> 
> But on the basis that perfect is the enemy of good,

This has been bothering me too.

Aha!  AHA!  Idea.

... 4 hours later ...

Mhahahaahah, victory is mine!!!!

TL;DR: Overriding x86_platform_ops hooks is dumb.

To your point about making an informed decision, that's essentialy what this series
is already doing, just in a very roundabout way:

  1. x86_platform.calibrate_{cpu,tsc}() are initialized to "native" versions
  2. Hypervisor init code runs and conditionally overrides calibrate_{cpu,tsc}()
  3. CoCo init code runs and conditionally overrides calibrate_{cpu,tsc}()

So the ordering you want is already there, as is "trying" each source to some
extent, in the form of steps #2 and #3 overriding the hooks if and only if their
source of information is valid.  For all intents and purposes, the hardening I
was adding by formalizing the calibration overrides was to enforce the above ordering.

But that's obviously all but impossible to follow, _and_ it's pointless.

For every PV case, including TDX and SNP, "calibration" is simply information
retrieval, i.e. it never changes (barring broken hypervisors/firmware), and the
information is always available during early boot.

Contrast that with the pre-CPUID CPU frequency calibration, where the frequency
might change, the kernel is making a best guest based on other timekeeping sources,
and not all timekeeping sources are available during early boot.

And so overriding x86_platform.calibrate_{cpu,tsc}() for PV code is completely
unecessary, because steps #2 and #3 already know the frequency when they override
the hooks, and "success" is guaranteed, i.e. the kernel won't have to switch to a
"late" calibration flow.

If we provide x86_hyper_init hooks:

	unsigned int (*get_tsc_khz)(void);
	unsigned int (*get_cpu_khz)(void);

then we can kill off x86_platform.calibrate_{cpu,tsc}() entirely, explicitly
define the preferred ordering (user-forced => CoCo => Hypervisor => native), and
depup some of the hypervisor code.

E.g. this is what I've got for the early flow.  Testing now. 

  void __init tsc_early_init(void)
  {
	unsigned int known_cpu_khz = 0, known_tsc_khz = 0;

	if (!boot_cpu_has(X86_FEATURE_TSC))
		return;
	/* Don't change UV TSC multi-chassis synchronization */
	if (is_early_uv_system())
		return;

	if (x86_init.hyper.get_cpu_khz)
		known_cpu_khz = x86_init.hyper.get_cpu_khz();

	if (tsc_early_khz)
		known_tsc_khz = tsc_early_khz;
	else if (cc_platform_has(CC_ATTR_GUEST_SNP_SECURE_TSC))
		known_tsc_khz = snp_secure_tsc_init();
	else if (boot_cpu_has(X86_FEATURE_TDX_GUEST))
		known_tsc_khz = tdx_tsc_init();

	/*
	 * If the TSC frequency is still unknown, i.e. not provided by the user
	 * or by trusted firmware, try to get it from the hypervisor (which is
	 * untrusted when running as a CoCo guest).
	 */
	if (!known_tsc_khz && x86_init.hyper.get_tsc_khz)
		known_tsc_khz = x86_init.hyper.get_tsc_khz();

	if (known_tsc_khz)
		setup_force_cpu_cap(X86_FEATURE_TSC_KNOWN_FREQ);

	if (!determine_cpu_tsc_frequencies(true, known_cpu_khz, known_tsc_khz))
		return;
	tsc_enable_sched_clock();
  }

^ permalink raw reply

* Re: [PATCH v3 37/41] x86/kvmclock: Use TSC for sched_clock if it's constant and non-stop
From: Sean Christopherson @ 2026-05-21 21:01 UTC (permalink / raw)
  To: Dongli Zhang
  Cc: kvm, Rick Edgecombe, Vitaly Kuznetsov,
	Broadcom internal kernel review list, Boris Ostrovsky,
	Stephen Boyd, x86, linux-coco, linux-hyperv, virtualization,
	linux-kernel, xen-devel, Kiryl Shutsemau, Paolo Bonzini,
	K. Y. Srinivasan, Haiyang Zhang, Wei Liu, Dexuan Cui, Long Li,
	Ajay Kaher, Alexey Makhalov, Jan Kiszka, Dave Hansen,
	Andy Lutomirski, Peter Zijlstra, Juergen Gross, Daniel Lezcano,
	Thomas Gleixner, John Stultz, Michael Kelley, Tom Lendacky,
	Nikunj A Dadhania, Thomas Gleixner, David Woodhouse
In-Reply-To: <c54fd01b-fe22-4c9c-8d5f-5b317de07a40@oracle.com>

On Thu, May 21, 2026, Dongli Zhang wrote:
> On 2026-05-15 12:19 PM, Sean Christopherson wrote:
> > Prefer the TSC over kvmclock for sched_clock if the TSC is constant,
> > nonstop, and not marked unstable via command line.  I.e. use the same
> > criteria as tweaking the clocksource rating so that TSC is preferred over
> > kvmclock.  Per the below comment from native_sched_clock(), sched_clock
> > is more tolerant of slop than clocksource; using TSC for clocksource but
> > not sched_clock makes little to no sense, especially now that KVM CoCo
> > guests with a trusted TSC use TSC, not kvmclock.
> > 
> >         /*
> >          * Fall back to jiffies if there's no TSC available:
> >          * ( But note that we still use it if the TSC is marked
> >          *   unstable. We do this because unlike Time Of Day,
> >          *   the scheduler clock tolerates small errors and it's
> >          *   very important for it to be as fast as the platform
> >          *   can achieve it. )
> >          */
> > 
> > The only advantage of using kvmclock is that doing so allows for early
> > and common detection of PVCLOCK_GUEST_STOPPED, but that code has been
> > broken for over two years with nary a complaint, i.e. it can't be
> > _that_ valuable.  And as above, certain types of KVM guests are losing
> > the functionality regardless, i.e. acknowledging PVCLOCK_GUEST_STOPPED
> > needs to be decoupled from sched_clock() no matter what.
> 
> Has it been broken for two years because of pvclock_clocksource_read_nowd()?

Yep.  Because pvclock_clocksource_read_nowd() ignores PVCLOCK_GUEST_STOPPED, the
flag only ever gets recognized when the kernel reads WALL_CLOCK, which AFAICT
only happens at initial boot, and during suspend and resume.

^ permalink raw reply

* Re: [PATCH v3 02/41] x86/tsc: Add helper to register CPU and TSC freq calibration routines
From: David Woodhouse @ 2026-05-21 21:01 UTC (permalink / raw)
  To: Sean Christopherson
  Cc: Kiryl Shutsemau, Paolo Bonzini, K. Y. Srinivasan, Haiyang Zhang,
	Wei Liu, Dexuan Cui, Long Li, Ajay Kaher, Alexey Makhalov,
	Jan Kiszka, Dave Hansen, Andy Lutomirski, Peter Zijlstra,
	Juergen Gross, Daniel Lezcano, Thomas Gleixner, John Stultz,
	Rick Edgecombe, Vitaly Kuznetsov,
	Broadcom internal kernel review list, Boris Ostrovsky,
	Stephen Boyd, x86, linux-coco, kvm, linux-hyperv, virtualization,
	linux-kernel, xen-devel, Michael Kelley, Tom Lendacky,
	Nikunj A Dadhania, Thomas Gleixner
In-Reply-To: <ag9wz3RiJOtVZrK0@google.com>

[-- Attachment #1: Type: text/plain, Size: 1407 bytes --]

On Thu, 2026-05-21 at 13:53 -0700, Sean Christopherson wrote:
> 
> E.g. this is what I've got for the early flow.  Testing now. 
> 
>   void __init tsc_early_init(void)
>   {
> 	unsigned int known_cpu_khz = 0, known_tsc_khz = 0;
> 
> 	if (!boot_cpu_has(X86_FEATURE_TSC))
> 		return;
> 	/* Don't change UV TSC multi-chassis synchronization */
> 	if (is_early_uv_system())
> 		return;
> 
> 	if (x86_init.hyper.get_cpu_khz)
> 		known_cpu_khz = x86_init.hyper.get_cpu_khz();
> 
> 	if (tsc_early_khz)
> 		known_tsc_khz = tsc_early_khz;
> 	else if (cc_platform_has(CC_ATTR_GUEST_SNP_SECURE_TSC))
> 		known_tsc_khz = snp_secure_tsc_init();
> 	else if (boot_cpu_has(X86_FEATURE_TDX_GUEST))
> 		known_tsc_khz = tdx_tsc_init();
> 
> 	/*
> 	 * If the TSC frequency is still unknown, i.e. not provided by the user
> 	 * or by trusted firmware, try to get it from the hypervisor (which is
> 	 * untrusted when running as a CoCo guest).
> 	 */
> 	if (!known_tsc_khz && x86_init.hyper.get_tsc_khz)
> 		known_tsc_khz = x86_init.hyper.get_tsc_khz();
> 
> 	if (known_tsc_khz)
> 		setup_force_cpu_cap(X86_FEATURE_TSC_KNOWN_FREQ);
> 
> 	if (!determine_cpu_tsc_frequencies(true, known_cpu_khz, known_tsc_khz))
> 		return;
> 	tsc_enable_sched_clock();
>   }

That seems reasonable. Where does the call to native_calibrate_tsc()
happen; is that from determine_cpu_tsc_frequencies()? 

[-- Attachment #2: smime.p7s --]
[-- Type: application/pkcs7-signature, Size: 5069 bytes --]

^ permalink raw reply

* Re: [PATCH v3 02/41] x86/tsc: Add helper to register CPU and TSC freq calibration routines
From: Sean Christopherson @ 2026-05-21 21:17 UTC (permalink / raw)
  To: David Woodhouse
  Cc: Kiryl Shutsemau, Paolo Bonzini, K. Y. Srinivasan, Haiyang Zhang,
	Wei Liu, Dexuan Cui, Long Li, Ajay Kaher, Alexey Makhalov,
	Jan Kiszka, Dave Hansen, Andy Lutomirski, Peter Zijlstra,
	Juergen Gross, Daniel Lezcano, Thomas Gleixner, John Stultz,
	Rick Edgecombe, Vitaly Kuznetsov,
	Broadcom internal kernel review list, Boris Ostrovsky,
	Stephen Boyd, x86, linux-coco, kvm, linux-hyperv, virtualization,
	linux-kernel, xen-devel, Michael Kelley, Tom Lendacky,
	Nikunj A Dadhania, Thomas Gleixner
In-Reply-To: <342098f6bfe1e4c7b233433df8f79713b4220614.camel@infradead.org>

On Thu, May 21, 2026, David Woodhouse wrote:
> On Thu, 2026-05-21 at 13:53 -0700, Sean Christopherson wrote:
> > 
> > E.g. this is what I've got for the early flow.  Testing now. 
> > 
> >   void __init tsc_early_init(void)
> >   {
> > 	unsigned int known_cpu_khz = 0, known_tsc_khz = 0;
> > 
> > 	if (!boot_cpu_has(X86_FEATURE_TSC))
> > 		return;
> > 	/* Don't change UV TSC multi-chassis synchronization */
> > 	if (is_early_uv_system())
> > 		return;
> > 
> > 	if (x86_init.hyper.get_cpu_khz)
> > 		known_cpu_khz = x86_init.hyper.get_cpu_khz();
> > 
> > 	if (tsc_early_khz)
> > 		known_tsc_khz = tsc_early_khz;
> > 	else if (cc_platform_has(CC_ATTR_GUEST_SNP_SECURE_TSC))
> > 		known_tsc_khz = snp_secure_tsc_init();
> > 	else if (boot_cpu_has(X86_FEATURE_TDX_GUEST))
> > 		known_tsc_khz = tdx_tsc_init();
> > 
> > 	/*
> > 	 * If the TSC frequency is still unknown, i.e. not provided by the user
> > 	 * or by trusted firmware, try to get it from the hypervisor (which is
> > 	 * untrusted when running as a CoCo guest).
> > 	 */
> > 	if (!known_tsc_khz && x86_init.hyper.get_tsc_khz)
> > 		known_tsc_khz = x86_init.hyper.get_tsc_khz();
> > 
> > 	if (known_tsc_khz)
> > 		setup_force_cpu_cap(X86_FEATURE_TSC_KNOWN_FREQ);
> > 
> > 	if (!determine_cpu_tsc_frequencies(true, known_cpu_khz, known_tsc_khz))
> > 		return;
> > 	tsc_enable_sched_clock();
> >   }
> 
> That seems reasonable. Where does the call to native_calibrate_tsc()
> happen; is that from determine_cpu_tsc_frequencies()? 

Yep.

static bool __init determine_cpu_tsc_frequencies(bool early,
						 unsigned int known_cpu_khz,
						 unsigned int known_tsc_khz)
{
	/* Make sure that cpu and tsc are not already calibrated */
	WARN_ON(cpu_khz || tsc_khz);

	if (early) {
		/*
		 * Early CPU calibration can only use methods that are available
		 * early in boot (obviously).
		 */
		if (known_cpu_khz)
			cpu_khz = known_cpu_khz;
		else
			cpu_khz = native_calibrate_cpu_early();
		if (known_tsc_khz)
			tsc_khz = known_tsc_khz;
		else
			tsc_khz = native_calibrate_tsc();
	} else {
		cpu_khz = pit_hpet_ptimer_calibrate_cpu();
	}

	...

^ permalink raw reply

* Re: [PATCH v6 21/43] KVM: SEV: Make 'uaddr' parameter optional for KVM_SEV_SNP_LAUNCH_UPDATE
From: Ackerley Tng @ 2026-05-21 21:27 UTC (permalink / raw)
  To: Sean Christopherson, Fuad Tabba
  Cc: aik, andrew.jones, binbin.wu, brauner, chao.p.peng, david,
	ira.weiny, jmattson, jthoughton, michael.roth, oupton,
	pankaj.gupta, qperret, rick.p.edgecombe, rientjes, shivankg,
	steven.price, willy, wyihan, yan.y.zhao, forkloop, pratyush,
	suzuki.poulose, aneesh.kumar, liam, Paolo Bonzini,
	Thomas Gleixner, Ingo Molnar, Borislav Petkov, Dave Hansen, x86,
	H. Peter Anvin, Steven Rostedt, Masami Hiramatsu,
	Mathieu Desnoyers, Jonathan Corbet, Shuah Khan, Shuah Khan,
	Vishal Annapurve, Andrew Morton, Chris Li, Kairui Song,
	Kemeng Shi, Nhat Pham, Baoquan He, Barry Song, Axel Rasmussen,
	Yuanchu Xie, Wei Xu, Youngjun Park, Qi Zheng, Shakeel Butt,
	Kiryl Shutsemau, Jason Gunthorpe, Vlastimil Babka, kvm,
	linux-kernel, linux-trace-kernel, linux-doc, linux-kselftest,
	linux-mm, linux-coco
In-Reply-To: <ag8G7Wq5PbEdKloG@google.com>

Sean Christopherson <seanjc@google.com> writes:

> On Thu, May 21, 2026, Fuad Tabba wrote:
>> Hi,
>>
>> On Thu, 7 May 2026 at 21:22, Ackerley Tng via B4 Relay
>> <devnull+ackerleytng.google.com@kernel.org> wrote:
>> >
>> > From: Michael Roth <michael.roth@amd.com>
>> >
>> > For vm_memory_attributes=1, in-place conversion/population is not
>> > supported, so the initial contents necessarily must need to come
>> > from a separate src address, which is enforced by the current
>> > implementation. However, for vm_memory_attributes=0, it is possible for
>> > guest memory to be initialized directly from userspace by mmap()'ing the
>> > guest_memfd and writing to it while the corresponding GPA ranges are in
>> > a 'shared' state before converting them to the 'private' state expected
>> > by KVM_SEV_SNP_LAUNCH_UPDATE.
>> >
>> > Update the handling/documentation for KVM_SEV_SNP_LAUNCH_UPDATE to allow
>> > for 'uaddr' to be set to NULL when vm_memory_attributes=0, which
>> > SNP_LAUNCH_UPDATE will then use to determine when it should/shouldn't
>> > copy in data from a separate memory location. Continue to enforce
>> > non-NULL for the original vm_memory_attributes=1 case.
>> >
>> > Signed-off-by: Michael Roth <michael.roth@amd.com>
>> > [Added src_page check in error handling path when the firmware command fails]
>> > [Dropped ifdef CONFIG_KVM_VM_MEMORY_ATTRIBUTES]
>> > Signed-off-by: Ackerley Tng <ackerleytng@google.com>
>>
>> I'm not very familiar with the SEV-SNP populate flows, but it looks
>> like Sashiko is on to something:
>> https://sashiko.dev/#/patchset/20260507-gmem-inplace-conversion-v6-0-91ab5a8b19a4%40google.com?part=21
>>
>> - a potential read-only page overwrite, because src_page is acquired
>> via get_user_pages_fast() without the FOLL_WRITE flag, but is then
>> overwritten via memcpy
>
> Oof, yeah, that's bad.  Adding FOLL_WRITE to kvm_gmem_populate() feels wrong, and
> could break uABI, but doing gup() in SNP code would reintroduce the AB-BA issue
> with filemap_invalidate_lock().
>
> Aha!  Not if we use get_user_page_fast_only().  Ugh, but then we'd have to plumb
> the userspace address into the post-populated callback.
>
> Hrm.  Given that no one has yelled about overwriting their CPUID page, and given
> that the CPUID page is likely dynamically created and thus is unlikely to be a
> read-only mapping (e.g. versus the initial image), maybe this?
>

Overwriting the CPUID page is by design, I think. IIUC if the SNP
firmware doesn't like something about the CPUID page, it can update
src_page and then return an error to userspace.

Userspace should then check if it agrees with the updated CPUID contents
and then retry if it agrees.

> diff --git arch/x86/kvm/svm/sev.c arch/x86/kvm/svm/sev.c
> index 37d4cfa5d980..c73c028d72c1 100644
> --- arch/x86/kvm/svm/sev.c
> +++ arch/x86/kvm/svm/sev.c
> @@ -2456,6 +2456,7 @@ static int snp_launch_update(struct kvm *kvm, struct kvm_sev_cmd *argp)
>         sev_populate_args.type = params.type;
>
>         count = kvm_gmem_populate(kvm, params.gfn_start, src, npages,
> +                                 params.type == KVM_SEV_SNP_PAGE_TYPE_CPUID,

I think this makes sense given that writing to src_page can only happen
when params.type == KVM_SEV_SNP_PAGE_TYPE_CPUID (this is explicitly one
of the guards in sev_gmem_post_populate()):

	/*
	 * If the firmware command failed handle the reclaim and cleanup of that
	 * PFN before reporting an error.
	 *
	 * Additionally, when invalid CPUID function entries are detected,
	 * firmware writes the expected values into the page and leaves it
	 * unencrypted so it can be used for debugging and error-reporting.
	 *
	 * Copy this page back into the source buffer so userspace can use this
	 * information to provide information on which CPUID leaves/fields
	 * failed CPUID validation.
	 */
	if (ret && !snp_page_reclaim(kvm, pfn) &&
	    sev_populate_args->type == KVM_SEV_SNP_PAGE_TYPE_CPUID &&
	    sev_populate_args->fw_error == SEV_RET_INVALID_PARAM && src_page) {
		void *src_vaddr = kmap_local_page(src_page);
		void *dst_vaddr = kmap_local_pfn(pfn);

		memcpy(src_vaddr, dst_vaddr, PAGE_SIZE);

		kunmap_local(src_vaddr);
		kunmap_local(dst_vaddr);
	}

>                                   sev_gmem_post_populate, &sev_populate_args);
>         if (count < 0) {
>                 argp->error = sev_populate_args.fw_error;
> diff --git arch/x86/kvm/vmx/tdx.c arch/x86/kvm/vmx/tdx.c
> index f97bcf580e6d..33f35be4455b 100644
> --- arch/x86/kvm/vmx/tdx.c
> +++ arch/x86/kvm/vmx/tdx.c
> @@ -3188,7 +3188,7 @@ static int tdx_vcpu_init_mem_region(struct kvm_vcpu *vcpu, struct kvm_tdx_cmd *c
>                 };
>                 gmem_ret = kvm_gmem_populate(kvm, gpa_to_gfn(region.gpa),
>                                              u64_to_user_ptr(region.source_addr),
> -                                            1, tdx_gmem_post_populate, &arg);
> +                                            1, false, tdx_gmem_post_populate, &arg);

And TDX doesn't try to write src_page, so this is good too.

>                 if (gmem_ret < 0) {
>                         ret = gmem_ret;
>                         break;
> diff --git include/linux/kvm_host.h include/linux/kvm_host.h
> index 61a3430957f2..b83cda2870ba 100644
> --- include/linux/kvm_host.h
> +++ include/linux/kvm_host.h
> @@ -2596,7 +2596,8 @@ int kvm_arch_gmem_prepare(struct kvm *kvm, gfn_t gfn, kvm_pfn_t pfn, int max_ord
>  typedef int (*kvm_gmem_populate_cb)(struct kvm *kvm, gfn_t gfn, kvm_pfn_t pfn,
>                                     struct page *page, void *opaque);
>
> -long kvm_gmem_populate(struct kvm *kvm, gfn_t gfn, void __user *src, long npages,
> +long kvm_gmem_populate(struct kvm *kvm, gfn_t start_gfn, void __user *src,
> +                      long npages, bool writable,

What do you think of need_writable_src instead of just writable for the
variable name?

>                        kvm_gmem_populate_cb post_populate, void *opaque);
>  #endif
>
> diff --git virt/kvm/guest_memfd.c virt/kvm/guest_memfd.c
> index a35a55571a2d..6553d4e032ce 100644
> --- virt/kvm/guest_memfd.c
> +++ virt/kvm/guest_memfd.c
> @@ -858,7 +858,8 @@ static long __kvm_gmem_populate(struct kvm *kvm, struct kvm_memory_slot *slot,
>         return ret;
>  }
>
> -long kvm_gmem_populate(struct kvm *kvm, gfn_t start_gfn, void __user *src, long npages,
> +long kvm_gmem_populate(struct kvm *kvm, gfn_t start_gfn, void __user *src,
> +                      long npages, bool writable,
>                        kvm_gmem_populate_cb post_populate, void *opaque)
>  {
>         struct kvm_memory_slot *slot;
> @@ -892,8 +893,9 @@ long kvm_gmem_populate(struct kvm *kvm, gfn_t start_gfn, void __user *src, long
>
>                 if (src) {
>                         unsigned long uaddr = (unsigned long)src + i * PAGE_SIZE;
> +                       unsigned int flags = writable ? FOLL_WRITE : 0;

How about using FOLL_WRITE | FOLL_NOFAULT so if it weren't writable to
start with, don't CoW, just error out?

Like you said above the CPUID page provided as src_page would have been
written to before, so it should have been mapped as writable.

>
> -                       ret = get_user_pages_fast(uaddr, 1, 0, &src_page);
> +                       ret = get_user_pages_fast(uaddr, 1, flags, &src_page);

If we stick with FOLL_WRITE, this also solves the case where a read-only
mapping or global zero page are provided as src_page, since
get_user_pages_fast() will do a copy-on-write if those were the inputs,
making it writable before the write happens (on failure) in
sev_gmem_post_populate().

>                         if (ret < 0)
>                                 break;
>                         if (ret != 1) {
>
>> - an ordering violation with the kunmap_local() calls
>
> Yeesh, that's a new one for me.  Thankfully this is 64-bit only, so it's not an
> issue.
>
>> These predate this patch series and are just being touched by the
>> 'src_page' addition, but if Sashiko's right, these should probably be
>> fixed sooner rather than later.
>
> Yeah, ditto with the offset wrapping case.

^ permalink raw reply

* Re: [PATCH v3 02/41] x86/tsc: Add helper to register CPU and TSC freq calibration routines
From: David Woodhouse @ 2026-05-21 21:37 UTC (permalink / raw)
  To: Sean Christopherson
  Cc: Kiryl Shutsemau, Paolo Bonzini, K. Y. Srinivasan, Haiyang Zhang,
	Wei Liu, Dexuan Cui, Long Li, Ajay Kaher, Alexey Makhalov,
	Jan Kiszka, Dave Hansen, Andy Lutomirski, Peter Zijlstra,
	Juergen Gross, Daniel Lezcano, Thomas Gleixner, John Stultz,
	Rick Edgecombe, Vitaly Kuznetsov,
	Broadcom internal kernel review list, Boris Ostrovsky,
	Stephen Boyd, x86, linux-coco, kvm, linux-hyperv, virtualization,
	linux-kernel, xen-devel, Michael Kelley, Tom Lendacky,
	Nikunj A Dadhania, Thomas Gleixner
In-Reply-To: <ag92Ze_FADmL1llo@google.com>

[-- Attachment #1: Type: text/plain, Size: 1648 bytes --]

On Thu, 2026-05-21 at 14:17 -0700, Sean Christopherson wrote:
>  
> > That seems reasonable. Where does the call to
> > native_calibrate_tsc()
> > happen; is that from determine_cpu_tsc_frequencies()? 
> 
> Yep.

Great, thanks.

> static bool __init determine_cpu_tsc_frequencies(bool early,
> 						 unsigned int
> known_cpu_khz,
> 						 unsigned int
> known_tsc_khz)
> {
> 	/* Make sure that cpu and tsc are not already calibrated */
> 	WARN_ON(cpu_khz || tsc_khz);
> 
> 	if (early) {
> 		/*
> 		 * Early CPU calibration can only use methods that
> are available
> 		 * early in boot (obviously).
> 		 */
> 		if (known_cpu_khz)
> 			cpu_khz = known_cpu_khz;
> 		else
> 			cpu_khz = native_calibrate_cpu_early();
> 		if (known_tsc_khz)
> 			tsc_khz = known_tsc_khz;
> 		else
> 			tsc_khz = native_calibrate_tsc();
> 	} else {
> 		cpu_khz = pit_hpet_ptimer_calibrate_cpu();
> 	}


If, after all that, we still end up in the case where we *do* have to
calibrate it against a legacy timer (which sadly IIRC is the case even
on some fairly modern AMD generations), could we round the answer?

We currently have *far* more precision than accuracy, leading to values
like 2399997kHz which change every boot (and end up being what gets
*advertised* to guests on such a host... and then unless we're careful
to avoid it, we end up trying to *scale* a different host's TSC down
from 2399998 to 2399997 for a guest which is migrated from the first
host...)

We should just fix them (e.g. to 2400000kHz) and let NTP sort them out.

Something like "round to the nearest MHz if that's within ±10PPM"?



[-- Attachment #2: smime.p7s --]
[-- Type: application/pkcs7-signature, Size: 5069 bytes --]

^ permalink raw reply

page: next (older) | prev (newer) | latest
- recent:[subjects (threaded)|topics (new)|topics (active)]

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox