Linux Confidential Computing Development
* Re: [PATCH v14 06/44] arm64: RMI: Check for RMI support at init
From: Steven Price @ 2026-06-03 10:57 UTC
  To: Marc Zyngier
  Cc: kvm, kvmarm, Catalin Marinas, Will Deacon, James Morse,
	Oliver Upton, Suzuki K Poulose, Zenghui Yu, linux-arm-kernel,
	linux-kernel, Joey Gouly, Alexandru Elisei, Christoffer Dall,
	Fuad Tabba, linux-coco, Ganapatrao Kulkarni, Gavin Shan,
	Shanker Donthineni, Alper Gun, Aneesh Kumar K . V, Emi Kisanuki,
	Vishal Annapurve, WeiLin.Chang, Lorenzo.Pieralisi2
In-Reply-To: <86bje8x6dj.wl-maz@kernel.org>

On 21/05/2026 14:02, Marc Zyngier wrote:
> On Wed, 13 May 2026 14:17:14 +0100,
> Steven Price <steven.price@arm.com> wrote:
>>
>> Query the RMI version number and check if it is a compatible version.
>> The first two feature registers are read and exposed for future code to
>> use.
>>
>> Signed-off-by: Steven Price <steven.price@arm.com>
>> ---
>> v14:
>>  * This moves the basic RMI setup into the 'kernel' directory. This is
>>    because RMI will be used for some features outside of KVM so should
>>    be available even if KVM isn't compiled in.
>> ---
>>  arch/arm64/include/asm/rmi_cmds.h |  3 ++
>>  arch/arm64/kernel/Makefile        |  2 +-
>>  arch/arm64/kernel/cpufeature.c    |  1 +
>>  arch/arm64/kernel/rmi.c           | 65 +++++++++++++++++++++++++++++++
>>  4 files changed, 70 insertions(+), 1 deletion(-)
>>  create mode 100644 arch/arm64/kernel/rmi.c
>>
>> diff --git a/arch/arm64/include/asm/rmi_cmds.h b/arch/arm64/include/asm/rmi_cmds.h
>> index 04f7066894e9..9179934925c5 100644
>> --- a/arch/arm64/include/asm/rmi_cmds.h
>> +++ b/arch/arm64/include/asm/rmi_cmds.h
>> @@ -10,6 +10,9 @@
>>  
>>  #include <asm/rmi_smc.h>
>>  
>> +extern unsigned long rmm_feat_reg0;
>> +extern unsigned long rmm_feat_reg1;
>> +
>>  struct rtt_entry {
>>  	unsigned long walk_level;
>>  	unsigned long desc;
>> diff --git a/arch/arm64/kernel/Makefile b/arch/arm64/kernel/Makefile
>> index 74b76bb70452..d68f351aae75 100644
>> --- a/arch/arm64/kernel/Makefile
>> +++ b/arch/arm64/kernel/Makefile
>> @@ -34,7 +34,7 @@ obj-y			:= debug-monitors.o entry.o irq.o fpsimd.o		\
>>  			   cpufeature.o alternative.o cacheinfo.o		\
>>  			   smp.o smp_spin_table.o topology.o smccc-call.o	\
>>  			   syscall.o proton-pack.o idle.o patching.o pi/	\
>> -			   rsi.o jump_label.o
>> +			   rsi.o jump_label.o rmi.o
>>  
>>  obj-$(CONFIG_COMPAT)			+= sys32.o signal32.o			\
>>  					   sys_compat.o
>> diff --git a/arch/arm64/kernel/cpufeature.c b/arch/arm64/kernel/cpufeature.c
>> index 6d53bb15cf7b..8bdd95a8c2de 100644
>> --- a/arch/arm64/kernel/cpufeature.c
>> +++ b/arch/arm64/kernel/cpufeature.c
>> @@ -292,6 +292,7 @@ static const struct arm64_ftr_bits ftr_id_aa64isar3[] = {
>>  static const struct arm64_ftr_bits ftr_id_aa64pfr0[] = {
>>  	ARM64_FTR_BITS(FTR_HIDDEN, FTR_NONSTRICT, FTR_LOWER_SAFE, ID_AA64PFR0_EL1_CSV3_SHIFT, 4, 0),
>>  	ARM64_FTR_BITS(FTR_HIDDEN, FTR_NONSTRICT, FTR_LOWER_SAFE, ID_AA64PFR0_EL1_CSV2_SHIFT, 4, 0),
>> +	ARM64_FTR_BITS(FTR_HIDDEN, FTR_NONSTRICT, FTR_LOWER_SAFE, ID_AA64PFR0_EL1_RME_SHIFT, 4, 0),
>>  	ARM64_FTR_BITS(FTR_VISIBLE, FTR_STRICT, FTR_LOWER_SAFE, ID_AA64PFR0_EL1_DIT_SHIFT, 4, 0),
>>  	ARM64_FTR_BITS(FTR_HIDDEN, FTR_NONSTRICT, FTR_LOWER_SAFE, ID_AA64PFR0_EL1_AMU_SHIFT, 4, 0),
>>  	ARM64_FTR_BITS(FTR_HIDDEN, FTR_STRICT, FTR_LOWER_SAFE, ID_AA64PFR0_EL1_MPAM_SHIFT, 4, 0),
>> diff --git a/arch/arm64/kernel/rmi.c b/arch/arm64/kernel/rmi.c
>> new file mode 100644
>> index 000000000000..99c1ccc35c11
>> --- /dev/null
>> +++ b/arch/arm64/kernel/rmi.c
>> @@ -0,0 +1,65 @@
>> +// SPDX-License-Identifier: GPL-2.0
>> +/*
>> + * Copyright (C) 2023-2025 ARM Ltd.
>> + */
>> +
>> +#include <linux/memblock.h>
>> +
>> +#include <asm/rmi_cmds.h>
>> +
>> +unsigned long rmm_feat_reg0;
>> +unsigned long rmm_feat_reg1;
> 
> What is the requirement for making those globally accessible? Can't
> they be made static and use an accessor that returns them? Can the
> variables be made __ro_after_init?

Good point - there's no requirement. Also the name isn't quite right - 
these should be named rmi_ as there is a different set for RSI.
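
A minimal sketch of the shape being discussed, not the code from any
posted revision: the feature registers become file-local, are marked
__ro_after_init, and are read through an accessor. The names
rmi_feat_reg() and rmi_read_feat_regs() are hypothetical, and both
__ro_after_init and rmi_features() are stubbed so the fragment builds
in userspace.

```c
/* Stub so the sketch builds in userspace; the kernel provides the real one. */
#ifndef __ro_after_init
#define __ro_after_init
#endif

/* Hypothetical stand-in for the RMI_FEATURES SMC wrapper. */
static int rmi_features(unsigned long index, unsigned long *out)
{
	*out = 0x1000 + index;	/* made-up register contents */
	return 0;
}

/* File-local storage: written once during init, read-only afterwards. */
static unsigned long rmi_feat_regs[2] __ro_after_init;

/* Hypothetical helper, meant to be called once from the initcall. */
int rmi_read_feat_regs(void)
{
	for (unsigned long i = 0; i < 2; i++) {
		int ret = rmi_features(i, &rmi_feat_regs[i]);

		if (ret)
			return ret;
	}
	return 0;
}

/* Hypothetical accessor; returns 0 for an out-of-range index. */
unsigned long rmi_feat_reg(unsigned int index)
{
	return index < 2 ? rmi_feat_regs[index] : 0;
}
```

Callers outside rmi.c would then only see rmi_feat_reg(), so nothing
else can write the cached values.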

>> +
>> +static int rmi_check_version(void)
>> +{
>> +	struct arm_smccc_res res;
>> +	unsigned short version_major, version_minor;
>> +	unsigned long host_version = RMI_ABI_VERSION(RMI_ABI_MAJOR_VERSION,
>> +						     RMI_ABI_MINOR_VERSION);
>> +	unsigned long aa64pfr0 = read_sanitised_ftr_reg(SYS_ID_AA64PFR0_EL1);
>> +
>> +	/* If RME isn't supported, then RMI can't be */
>> +	if (cpuid_feature_extract_unsigned_field(aa64pfr0, ID_AA64PFR0_EL1_RME_SHIFT) == 0)
>> +		return -ENXIO;
>> +
>> +	arm_smccc_1_1_invoke(SMC_RMI_VERSION, host_version, &res);
>> +
>> +	if (res.a0 == SMCCC_RET_NOT_SUPPORTED)
>> +		return -ENXIO;
>> +
>> +	version_major = RMI_ABI_VERSION_GET_MAJOR(res.a1);
>> +	version_minor = RMI_ABI_VERSION_GET_MINOR(res.a1);
>> +
>> +	if (res.a0 != RMI_SUCCESS) {
>> +		unsigned short high_version_major, high_version_minor;
>> +
>> +		high_version_major = RMI_ABI_VERSION_GET_MAJOR(res.a2);
>> +		high_version_minor = RMI_ABI_VERSION_GET_MINOR(res.a2);
>> +
>> +		pr_err("Unsupported RMI ABI (v%d.%d - v%d.%d) we want v%d.%d\n",
>> +		       version_major, version_minor,
>> +		       high_version_major, high_version_minor,
>> +		       RMI_ABI_MAJOR_VERSION,
>> +		       RMI_ABI_MINOR_VERSION);
>> +		return -ENXIO;
>> +	}
>> +
>> +	pr_info("RMI ABI version %d.%d\n", version_major, version_minor);
>> +
>> +	return 0;
>> +}
>> +
>> +static int __init arm64_init_rmi(void)
>> +{
>> +	/* Continue without realm support if we can't agree on a version */
>> +	if (rmi_check_version())
>> +		return 0;
>> +
>> +	if (WARN_ON(rmi_features(0, &rmm_feat_reg0)))
>> +		return 0;
>> +	if (WARN_ON(rmi_features(1, &rmm_feat_reg1)))
>> +		return 0;
>> +
>> +	return 0;
>> +}
>> +subsys_initcall(arm64_init_rmi);
> 
> Is there any reliance on this being executed before or after KVM's own
> initialisation? If so, this should be captured.

Yes, I'm expecting this to be called before KVM's initialisation.
kvm_init_rmi() calls rmi_is_available() to check if CCA is supported and
only enables the KVM side of things if that check passes. So if the
initialisation was the other way round then Realm guests would be
unsupported. I'll add a comment:

/*
 * Note arm64_init_rmi() must be called before kvm_init_rmi() otherwise KVM
 * will not support realm guests. subsys_initcall() is called before
 * module_init() (used for KVM) so this is OK.
 */
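
The ordering that comment relies on can be modelled in userspace. This
is only a toy, under the assumption that both functions are plain
initcalls: in the kernel, kvm_init_rmi() is reached from KVM's own init
path rather than being registered directly. For built-in code
module_init() expands to device_initcall() (level 6), while
subsys_initcall() is level 4. The table and run_initcalls() below are
hypothetical stand-ins for the linker-generated initcall sections.

```c
#include <stdbool.h>
#include <stddef.h>

/* Levels as in include/linux/init.h: subsys is 4, device (module_init) is 6. */
enum { LEVEL_SUBSYS = 4, LEVEL_DEVICE = 6, LEVEL_MAX = 7 };

struct initcall {
	int level;
	int (*fn)(void);
};

static bool rmi_available;	/* set by the arch code */
static bool kvm_realms_enabled;	/* set by KVM only if RMI was already up */

static int arm64_init_rmi(void)
{
	rmi_available = true;
	return 0;
}

static int kvm_init_rmi(void)
{
	kvm_realms_enabled = rmi_available;
	return 0;
}

/* Deliberately listed in the "wrong" order; the level decides the run order. */
static const struct initcall initcalls[] = {
	{ LEVEL_DEVICE, kvm_init_rmi },
	{ LEVEL_SUBSYS, arm64_init_rmi },
};

/* Run every registered call, lowest level first. */
void run_initcalls(void)
{
	for (int level = 0; level <= LEVEL_MAX; level++)
		for (size_t i = 0; i < sizeof(initcalls) / sizeof(initcalls[0]); i++)
			if (initcalls[i].level == level)
				initcalls[i].fn();
}

bool realms_enabled(void)
{
	return kvm_realms_enabled;
}
```

Swapping the two levels in the table leaves realms_enabled() false,
which is the failure mode the comment warns about.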

Thanks,
Steve


* Re: [PATCH v14 06/44] arm64: RMI: Check for RMI support at init
From: Steven Price @ 2026-06-03 10:57 UTC
  To: Gavin Shan, kvm, kvmarm
  Cc: Catalin Marinas, Marc Zyngier, Will Deacon, James Morse,
	Oliver Upton, Suzuki K Poulose, Zenghui Yu, linux-arm-kernel,
	linux-kernel, Joey Gouly, Alexandru Elisei, Christoffer Dall,
	Fuad Tabba, linux-coco, Ganapatrao Kulkarni, Shanker Donthineni,
	Alper Gun, Aneesh Kumar K . V, Emi Kisanuki, Vishal Annapurve,
	WeiLin.Chang, Lorenzo.Pieralisi2
In-Reply-To: <3a0f6277-2b68-45db-a07f-16a177b0586d@redhat.com>

On 25/05/2026 07:58, Gavin Shan wrote:
> Hi Steve,
> 
> On 5/22/26 1:49 AM, Steven Price wrote:
>> On 21/05/2026 01:39, Gavin Shan wrote:
>>> On 5/13/26 11:17 PM, Steven Price wrote:
>>>> Query the RMI version number and check if it is a compatible version.
>>>> The first two feature registers are read and exposed for future code to
>>>> use.
>>>>
>>>> Signed-off-by: Steven Price <steven.price@arm.com>
>>>> ---
>>>> v14:
>>>>    * This moves the basic RMI setup into the 'kernel' directory. This is
>>>>      because RMI will be used for some features outside of KVM so should
>>>>      be available even if KVM isn't compiled in.
>>>> ---
>>>>    arch/arm64/include/asm/rmi_cmds.h |  3 ++
>>>>    arch/arm64/kernel/Makefile        |  2 +-
>>>>    arch/arm64/kernel/cpufeature.c    |  1 +
>>>>    arch/arm64/kernel/rmi.c           | 65 +++++++++++++++++++++++++++++++
>>>>    4 files changed, 70 insertions(+), 1 deletion(-)
>>>>    create mode 100644 arch/arm64/kernel/rmi.c
>>>>
>>>
>>> [...]
>>>
>>>> diff --git a/arch/arm64/kernel/rmi.c b/arch/arm64/kernel/rmi.c
>>>> new file mode 100644
>>>> index 000000000000..99c1ccc35c11
>>>> --- /dev/null
>>>> +++ b/arch/arm64/kernel/rmi.c
>>>> @@ -0,0 +1,65 @@
>>>> +// SPDX-License-Identifier: GPL-2.0
>>>> +/*
>>>> + * Copyright (C) 2023-2025 ARM Ltd.
>>>> + */
>>>> +
>>>> +#include <linux/memblock.h>
>>>> +
>>>> +#include <asm/rmi_cmds.h>
>>>> +
>>>> +unsigned long rmm_feat_reg0;
>>>> +unsigned long rmm_feat_reg1;
>>>> +
>>>> +static int rmi_check_version(void)
>>>> +{
>>>> +    struct arm_smccc_res res;
>>>> +    unsigned short version_major, version_minor;
>>>> +    unsigned long host_version = RMI_ABI_VERSION(RMI_ABI_MAJOR_VERSION,
>>>> +                             RMI_ABI_MINOR_VERSION);
>>>> +    unsigned long aa64pfr0 = read_sanitised_ftr_reg(SYS_ID_AA64PFR0_EL1);
>>>> +
>>>> +    /* If RME isn't supported, then RMI can't be */
>>>> +    if (cpuid_feature_extract_unsigned_field(aa64pfr0, ID_AA64PFR0_EL1_RME_SHIFT) == 0)
>>>> +        return -ENXIO;
>>>> +
>>>> +    arm_smccc_1_1_invoke(SMC_RMI_VERSION, host_version, &res);
>>>> +
>>>> +    if (res.a0 == SMCCC_RET_NOT_SUPPORTED)
>>>> +        return -ENXIO;
>>>> +
>>>> +    version_major = RMI_ABI_VERSION_GET_MAJOR(res.a1);
>>>> +    version_minor = RMI_ABI_VERSION_GET_MINOR(res.a1);
>>>> +
>>>> +    if (res.a0 != RMI_SUCCESS) {
>>>> +        unsigned short high_version_major, high_version_minor;
>>>> +
>>>> +        high_version_major = RMI_ABI_VERSION_GET_MAJOR(res.a2);
>>>> +        high_version_minor = RMI_ABI_VERSION_GET_MINOR(res.a2);
>>>> +
>>>> +        pr_err("Unsupported RMI ABI (v%d.%d - v%d.%d) we want v%d.%d\n",
>>>> +               version_major, version_minor,
>>>> +               high_version_major, high_version_minor,
>>>> +               RMI_ABI_MAJOR_VERSION,
>>>> +               RMI_ABI_MINOR_VERSION);
>>>> +        return -ENXIO;
>>>> +    }
>>>> +
>>>> +    pr_info("RMI ABI version %d.%d\n", version_major, version_minor);
>>>> +
>>>> +    return 0;
>>>> +}
>>>> +
>>>> +static int __init arm64_init_rmi(void)
>>>> +{
>>>> +    /* Continue without realm support if we can't agree on a version */
>>>> +    if (rmi_check_version())
>>>> +        return 0;
>>>
>>> Is this still a valid point that we have to return zero on errors returned
>>> from rmi_check_version() or other function calls like rmi_features()?
>>> arm64_init_rmi() is triggered by subsys_initcall() where the return value
>>> needs to indicate success or failure. It's fine to return an error code
>>> from arm64_init_rmi() in the path.
>>
>> Hmm, I guess now this is moved to arm64 code this indeed doesn't need
>> to. Within a module I believe an error return can fail the module
>> loading.
>>
>> I'm not sure it really makes much difference though - if this
>> initialisation fails then it's not really an error - it just means the
>> feature is unavailable.
>>
> 
> I think the return value should be consistent with the value of
> 'arm64_rmi_is_available'. 'arm64_rmi_is_available' is true when zero is
> returned, otherwise 'arm64_rmi_is_available' is false.
> 
> With the consistency between the return value and 'arm64_rmi_is_available',
> users are able to know the value of 'arm64_rmi_is_available' through the
> kernel parameter 'initcall_debug'. With that kernel parameter, the initcalls
> including arm64_init_rmi() are traced and the return value is output in the
> traced messages, see do_trace_initcall_start().

Fair enough, and actually refactoring this function to pass error codes
up the call stack I think does improve the look.
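
A sketch of what such a refactor could look like, assuming the helpers
return a negative errno on failure (the real rmi_features() may return
an RMI status code that needs translating first, and the WARN_ON()s are
left out here). The stub_* knobs and the stubbed helpers are
hypothetical and exist only so the fragment builds in userspace.

```c
#include <errno.h>
#include <stdbool.h>

/* Hypothetical knobs controlling what the stubbed helpers return. */
static int stub_version_ret;
static int stub_features_ret;

static int rmi_check_version(void)
{
	return stub_version_ret;
}

static int rmi_features(unsigned long index, unsigned long *out)
{
	if (stub_features_ret)
		return stub_features_ret;
	*out = index;
	return 0;
}

static unsigned long rmi_feat_reg0, rmi_feat_reg1;
static bool arm64_rmi_is_available;

/* Propagate the first failure so the initcall return value says why RMI is off. */
int arm64_init_rmi(void)
{
	int ret;

	ret = rmi_check_version();
	if (ret)
		return ret;

	ret = rmi_features(0, &rmi_feat_reg0);
	if (ret)
		return ret;

	ret = rmi_features(1, &rmi_feat_reg1);
	if (ret)
		return ret;

	arm64_rmi_is_available = true;
	return 0;
}
```

With this shape a zero return and arm64_rmi_is_available being true
always coincide, which is the consistency Gavin asked for.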

Thanks,
Steve

>> Thanks,
>> Steve
>>
>>>> +
>>>> +    if (WARN_ON(rmi_features(0, &rmm_feat_reg0)))
>>>> +        return 0;
>>>> +    if (WARN_ON(rmi_features(1, &rmm_feat_reg1)))
>>>> +        return 0;
>>>> +
>>>> +    return 0;
>>>> +}
>>>> +subsys_initcall(arm64_init_rmi);
>>>
> 
> Thanks,
> Gavin
> 



* Re: [PATCH v14 04/44] arm64: RMI: Add SMC definitions for calling the RMM
From: Steven Price @ 2026-06-03 10:15 UTC
  To: Marc Zyngier
  Cc: kvm, kvmarm, Catalin Marinas, Will Deacon, James Morse,
	Oliver Upton, Suzuki K Poulose, Zenghui Yu, linux-arm-kernel,
	linux-kernel, Joey Gouly, Alexandru Elisei, Christoffer Dall,
	Fuad Tabba, linux-coco, Ganapatrao Kulkarni, Gavin Shan,
	Shanker Donthineni, Alper Gun, Aneesh Kumar K . V, Emi Kisanuki,
	Vishal Annapurve, WeiLin.Chang, Lorenzo.Pieralisi2
In-Reply-To: <87jysvahpb.wl-maz@kernel.org>

On 22/05/2026 10:58, Marc Zyngier wrote:
> On Thu, 21 May 2026 16:33:09 +0100,
> Steven Price <steven.price@arm.com> wrote:
>>
>> On 21/05/2026 13:40, Marc Zyngier wrote:
>>> On Wed, 13 May 2026 14:17:12 +0100,
>>> Steven Price <steven.price@arm.com> wrote:
>>>>
>>>> The RMM (Realm Management Monitor) provides functionality that can be
>>>> accessed by SMC calls from the host.
>>>>
>>>> The SMC definitions are based on DEN0137[1] version 2.0-bet1
>>>>
>>>> [1] https://developer.arm.com/documentation/den0137/2-0bet1/
>>>>
>>>> Signed-off-by: Steven Price <steven.price@arm.com>
>>>> ---
>>>> Changes since v13:
>>>>  * Updated to RMM spec v2.0-bet1
>>>> Changes since v12:
>>>>  * Updated to RMM spec v2.0-bet0
>>>> Changes since v9:
>>>>  * Corrected size of 'ripas_value' in struct rec_exit. The spec states
>>>>    this is an 8-bit type with padding afterwards (rather than a u64).
>>>> Changes since v8:
>>>>  * Added RMI_PERMITTED_GICV3_HCR_BITS to define which bits the RMM
>>>>    permits to be modified.
>>>> Changes since v6:
>>>>  * Renamed REC_ENTER_xxx defines to include 'FLAG' to make it obvious
>>>>    these are flag values.
>>>> Changes since v5:
>>>>  * Sorted the SMC #defines by value.
>>>>  * Renamed SMI_RxI_CALL to SMI_RMI_CALL since the macro is only used for
>>>>    RMI calls.
>>>>  * Renamed REC_GIC_NUM_LRS to REC_MAX_GIC_NUM_LRS since the actual
>>>>    number of available list registers could be lower.
>>>>  * Provided a define for the reserved fields of FeatureRegister0.
>>>>  * Fix inconsistent names for padding fields.
>>>> Changes since v4:
>>>>  * Update to point to final released RMM spec.
>>>>  * Minor rearrangements.
>>>> Changes since v3:
>>>>  * Update to match RMM spec v1.0-rel0-rc1.
>>>> Changes since v2:
>>>>  * Fix specification link.
>>>>  * Rename rec_entry->rec_enter to match spec.
>>>>  * Fix size of pmu_ovf_status to match spec.
>>>> ---
>>>>  arch/arm64/include/asm/rmi_smc.h | 448 +++++++++++++++++++++++++++++++
>>>>  1 file changed, 448 insertions(+)
>>>>  create mode 100644 arch/arm64/include/asm/rmi_smc.h
>>>>
>>>> diff --git a/arch/arm64/include/asm/rmi_smc.h b/arch/arm64/include/asm/rmi_smc.h
>>>> new file mode 100644
>>>> index 000000000000..a09b7a631fef
>>>> --- /dev/null
>>>> +++ b/arch/arm64/include/asm/rmi_smc.h
>>>> @@ -0,0 +1,448 @@
>>>> +/* SPDX-License-Identifier: GPL-2.0 */
>>>> +/*
>>>> + * Copyright (C) 2023-2026 ARM Ltd.
>>>> + *
>>>> + * The values and structures in this file are from the Realm Management Monitor
>>>> + * specification (DEN0137) version 2.0-bet1:
>>>> + * https://developer.arm.com/documentation/den0137/2-0bet1/
>>>
>>> How long is this spec going to be available on the ARM web site, which
>>> has a tendency of being reorganised every other week? And there is
>>> already a beta2.
>>
>> Obviously I can't predict the next reorganisation - but at least it's a
>> link that could be fed into archive.org or similar.
> 
> I found that the PDF spec was less susceptible to creative nonsense,
> and people can download it for future reference, whereas ARM has
> happily *deleted* specs from the website over time (try to find PSCI
> 0.1, for example...).

Sadly the nearest I found to a link directly to the PDF is:

https://documentation-service.arm.com/static/69cb945ac1586b7c59b1c00c

But I have 0 confidence that that link will work for long (if indeed it
even works for others now!). If you know of any way of getting a better
link out of the Arm website then I'm all ears!

> [...]
> 
>>>> +struct realm_params {
>>>> +	union { /* 0x0 */
>>>> +		struct {
>>>> +			u64 flags;
>>>> +			u64 s2sz;
>>>> +			u64 sve_vl;
>>>> +			u64 num_bps;
>>>> +			u64 num_wps;
>>>> +			u64 pmu_num_ctrs;
>>>> +			u64 hash_algo;
>>>> +			u64 num_aux_planes;
>>>> +		};
>>>> +		u8 padding0[0x400];
>>>
>>> SZ_1K? And similarly all over the shop?
>>
>> I'm a bit less sure that makes the code more readable - these structures
>> are a bit of a pain because they are somewhat sparse. I've left a
>> comment where the beginning of each union is, and personally I find it
>> easier to see 0x0 + 0x400 == 0x400 rather than trying to work out what
>> SZ_1K is in hex. This is particularly the case in terms of:
>>
>>> struct rec_params {
>>> 	union { /* 0x0 */
>>> 		u64 flags;
>>> 		u8 padding0[0x100];
>>> 	};
>>> 	union { /* 0x100 */
>>> 		u64 mpidr;
>>> 		u8 padding1[0x100];
>>> 	};
>>> 	union { /* 0x200 */
>>> 		u64 pc;
>>> 		u8 padding2[0x100];
>>> 	};
>>> 	union { /* 0x300 */
>>> 		u64 gprs[REC_CREATE_NR_GPRS];
>>> 		u8 padding3[0xd00];
>>> 	};
>>> };
>>
>> Where 0xd00 doesn't even have a corresponding SZ_ define.
> 
> Indeed, but it is (SZ_4K - SZ_256 * 3).

Do you really think

 		u8 padding3[SZ_4K - SZ_256 * 3];

is better? I certainly don't. I'll grant you that (SZ_4K - 0x300) is tempting,
although it then makes the BUILD_BUG_ON idea below somewhat pointless.

> And a lot of these structures seem to be designed to form a 4kB blob.
> I'm sure we can make use of that information (BUILD_BUG_ON?).

BUILD_BUG_ON requires being in a function. But static_assert() can be
used in the header by the struct definitions - I'll add that, thanks for
the suggestion.
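
A minimal userspace sketch of that idea, using the rec_params layout
quoted earlier in this thread. The value 8 for REC_CREATE_NR_GPRS is an
assumption made for the sketch, and C11's static_assert (which needs a
message string) stands in for the kernel macro of the same name.

```c
#include <assert.h>	/* static_assert (C11) */
#include <stddef.h>	/* offsetof */
#include <stdint.h>

#define REC_CREATE_NR_GPRS 8	/* assumed value for this sketch */

struct rec_params {
	union { /* 0x0 */
		uint64_t flags;
		uint8_t padding0[0x100];
	};
	union { /* 0x100 */
		uint64_t mpidr;
		uint8_t padding1[0x100];
	};
	union { /* 0x200 */
		uint64_t pc;
		uint8_t padding2[0x100];
	};
	union { /* 0x300 */
		uint64_t gprs[REC_CREATE_NR_GPRS];
		uint8_t padding3[0xd00];
	};
};

/* Checked at compile time, at file scope, right next to the definition. */
static_assert(sizeof(struct rec_params) == 0x1000, "rec_params must be 4kB");
static_assert(offsetof(struct rec_params, mpidr) == 0x100, "mpidr offset");
static_assert(offsetof(struct rec_params, pc) == 0x200, "pc offset");
static_assert(offsetof(struct rec_params, gprs) == 0x300, "gprs offset");
```

A wrong padding size then fails the build instead of silently shifting
every later field.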

>>
>> The RMM deals with this with macro magic:
>>
>>> struct rmi_rec_params {
>>>         /* Flags */
>>>         SET_MEMBER_RMI(unsigned long flags, 0, 0x100);  /* Offset 0 */
>>>         /* MPIDR of the REC */
>>>         SET_MEMBER_RMI(unsigned long mpidr, 0x100, 0x200);      /* 0x100 */
>>>         /* Program counter */
>>>         SET_MEMBER_RMI(unsigned long pc, 0x200, 0x300); /* 0x200 */
>>>         /* General-purpose registers */
>>>         SET_MEMBER_RMI(unsigned long gprs[REC_CREATE_NR_GPRS], 0x300, 0x1000); /* 0x300 */
>>> };
>>
>> where the offsets are just directly encoded in the macro - but it's not
>> an especially robust macro and I'm not convinced it's more readable.
> 
> I think this is just as horrible, but at least it seems to take the
> boundaries of the structure into account.
> 
>>
>> I'm happy to hear other suggestions on how to encode this neatly.
> 
> Honestly, I wouldn't mind having the structures described in a more
> abstract way and then pre-processed to generate the include files. If
> the architectural MRS wasn't so huge, I would have added it to the
> kernel and used that directly for KVM.
> 
>>
>>> I haven't checked the details of the encodings (life is too short),
>>> but I wonder how much of this exists as an MRS and could be
>>> automatically generated?
>>
>> Automatically generating this would be good - I'm not sure whether we
>> have a (public) source available to generate from at the moment. I have
>> tried to methodically work through the spec when updating this file, but
>> as Gavin has already pointed out there was at least one mistake (in
>> currently unused definitions) this time.
> 
> I'm slightly baffled that even the RMM is written this way. Given the
> formalism used in the RMM spec, I was expecting that you'd have a
> bunch of JSON at hand and able to generate any output from that. Doing
> this stuff by hand is both incredibly dull work *and* extremely error
> prone.

I'll look into the possibility of generating the headers. While dull and
error prone I have found it is sometimes useful for forcing a review of
the spec itself. There have been a number of bugs I've found (and have
been corrected) in the spec while writing the header files - it's very
easy to skim read those parts of the document otherwise.

Writing the structures out in a "more abstract way" might be a good
idea, but I'm just a little wary of writing another tool which is only
used in this one spot. The RMM structures are somewhat unusual in being
so sparse.

Thanks,
Steve

> Thanks,
> 
> 	M.
> 



* Re: [PATCH v4 07/47] x86/tdx: Force TSC frequency with CPUID-based info provided by the TDX-Module
From: Kiryl Shutsemau @ 2026-06-03 10:02 UTC
  To: Sean Christopherson
  Cc: Paolo Bonzini, Thomas Gleixner, Ingo Molnar, Borislav Petkov,
	Dave Hansen, x86, K. Y. Srinivasan, Haiyang Zhang, Wei Liu,
	Dexuan Cui, Long Li, Ajay Kaher, Alexey Makhalov, Jan Kiszka,
	Andy Lutomirski, Peter Zijlstra, Juergen Gross, Daniel Lezcano,
	John Stultz, H. Peter Anvin, Rick Edgecombe, Vitaly Kuznetsov,
	Broadcom internal kernel review list, Boris Ostrovsky,
	Stephen Boyd, kvm, linux-kernel, linux-coco, linux-hyperv,
	virtualization, xen-devel, David Woodhouse, Tom Lendacky,
	Nikunj A Dadhania, David Woodhouse, Michael Kelley,
	Thomas Gleixner
In-Reply-To: <20260529144435.704127-8-seanjc@google.com>

On Fri, May 29, 2026 at 07:43:54AM -0700, Sean Christopherson wrote:
> When running as a TDX guest, explicitly set the TSC frequency to a known
> value, using CPUID-based information, instead of potentially relying on a
> hypervisor-controlled PV routine.  For TDX guests, CPUID.0x15 is always
> emulated by the TDX-Module, i.e. the information from CPUID is more
> trustworthy than the information provided by the hypervisor.

Right. EBX is configurable by TD_PARAMS.TSC_FREQUENCY at TD build. The
rest is fixed.
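
For reference, CPUID.0x15 reports the TSC/crystal ratio as EBX/EAX and
the core crystal clock in Hz in ECX, so the TSC frequency is
ECX * EBX / EAX. A small sketch of that arithmetic follows. The function
name is hypothetical and returning 0 for a non-enumerated leaf is a
simplification for the sketch, not what native_calibrate_tsc() does.

```c
#include <stdint.h>

/*
 * CPUID.0x15: EAX = denominator and EBX = numerator of the TSC/crystal
 * ratio, ECX = core crystal clock frequency in Hz.
 * Returns 0 if any of the three fields is not enumerated.
 */
uint64_t tsc_hz_from_cpuid15(uint32_t eax, uint32_t ebx, uint32_t ecx)
{
	if (eax == 0 || ebx == 0 || ecx == 0)
		return 0;

	/* Widen before multiplying so the product cannot overflow 32 bits. */
	return (uint64_t)ecx * ebx / eax;
}
```

With made-up values EAX=2, EBX=200 and ECX=24000000 this gives
2400000000 Hz, i.e. a 2.4 GHz TSC.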

> To maintain backwards compatibility with TDX guest kernels that use native
> calibration, and because it's the least awful option, retain
> native_calibrate_tsc()'s stuffing of the local APIC bus period using the
> core crystal frequency.  While it's entirely possible for the hypervisor
> to emulate the APIC timer at a different frequency than the core crystal
> frequency, the commonly accepted interpretation of Intel's SDM is that APIC
> timer runs at the core crystal frequency when that latter is enumerated via
> CPUID:
> 
>   The APIC timer frequency will be the processor’s bus clock or core
>   crystal clock frequency (when TSC/core crystal clock ratio is enumerated
>   in CPUID leaf 0x15).
> 
> If the hypervisor is malicious and deliberately runs the APIC timer at the
> wrong frequency, nothing would stop the hypervisor from modifying the
> frequency at any time, i.e. attempting to manually calibrate the frequency
> out of paranoia would be futile.

Agreed.

> Deliberately leave CPU frequency calibration as is, since the TDX-Module
> doesn't provide any guarantees with respect to CPUID.0x16.

It is fixed to zeros. Sounds like a guarantee to me :P

> Signed-off-by: Sean Christopherson <seanjc@google.com>

Looks sane to me. Including your reasoning about tsc_early_khz= in reply
to Sashiko.

Reviewed-by: Kiryl Shutsemau (Meta) <kas@kernel.org>

-- 
  Kiryl Shutsemau / Kirill A. Shutemov


* Re: [PATCH v7 07/42] KVM: guest_memfd: Only prepare folios for private pages
From: Suzuki K Poulose @ 2026-06-03  8:58 UTC
  To: Ackerley Tng, aik, andrew.jones, binbin.wu, brauner, chao.p.peng,
	david, ira.weiny, jmattson, jthoughton, michael.roth, oupton,
	pankaj.gupta, qperret, rick.p.edgecombe, rientjes, shivankg,
	steven.price, tabba, willy, wyihan, yan.y.zhao, forkloop,
	pratyush, aneesh.kumar, liam, Paolo Bonzini, Sean Christopherson,
	Thomas Gleixner, Ingo Molnar, Borislav Petkov, Dave Hansen, x86,
	H. Peter Anvin, Steven Rostedt, Masami Hiramatsu,
	Mathieu Desnoyers, Jonathan Corbet, Shuah Khan, Shuah Khan,
	Vishal Annapurve, Andrew Morton, Chris Li, Kairui Song,
	Kemeng Shi, Nhat Pham, Baoquan He, Barry Song, Axel Rasmussen,
	Yuanchu Xie, Wei Xu, Youngjun Park, Qi Zheng, Shakeel Butt,
	Kiryl Shutsemau, Jason Gunthorpe, Vlastimil Babka
  Cc: kvm, linux-kernel, linux-trace-kernel, linux-doc, linux-kselftest,
	linux-mm, linux-coco
In-Reply-To: <CAEvNRgE1dCVAxJWd_hyFa8N=m9JLfn97ip9tAmvHxspWJ50oGg@mail.gmail.com>

On 02/06/2026 23:41, Ackerley Tng wrote:
> Suzuki K Poulose <suzuki.poulose@arm.com> writes:
> 
>>
>> [...snip...]
>>
>>>> @@ -914,7 +916,8 @@ int kvm_gmem_get_pfn(struct kvm *kvm, struct kvm_memory_slot *slot,
>>>>            folio_mark_uptodate(folio);
>>>>        }
>>>> -    r = kvm_gmem_prepare_folio(kvm, slot, gfn, folio);
>>>> +    if (kvm_gmem_is_private_mem(inode, index))
>>>
>>> Don't we need to make sure the entire folio is private ? Not just the
>>> page at the index ?
>>>       if (kvm_gmem_range_is_private(, index, folio_nr_pages(folio)) ?
> 
> I was thinking to fix this when I do huge pages, for now guest_memfd is
> always just PAGE_SIZE, so just looking up index is fine.
> 
> Is that okay?

That's fine, but it would be good to enforce that here, so that we don't
miss it when we add support for multi-page folios.

> 
>>
>> Or rather, we should go through the individual pages and apply the
>> prepare for ones that are private ?
>>
>> Suzuki
>>
> 
> IIRC the plan was to make kvm_gmem_prepare_folio() idempotent, as in, if
> a page is already private, just skip. Currently sev_gmem_prepare() does
> a pr_debug(), which I guess is technically still idempotent.
> 
> I'm thinking that the information that needs tracking to make
> .gmem_prepare() idempotent should be tracked by arch code.
> 
> Does this work for ARM CCA?

We don't hook into the prepare yet, but have plans to do that. We should
be able to handle the pages that are already private. (For CCA context,
RMI_GRANULE_DELEGATE_RANGE can skip over already REALM pages). So this
should be fine.

My point is, in a given folio, there may be pages that are shared.
Like you said, this could be dealt with when we support hugepages.
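
As a toy model of the whole-folio check (not guest_memfd code), the
fragment below treats the attributes as a flat per-page array and
answers "is every page in this range private?". struct gmem_attrs and
gmem_range_is_private() are hypothetical names used only here. With
nr == 1 it reduces to the single-index lookup the patch does today.

```c
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical per-page attribute table: true means the page is private. */
struct gmem_attrs {
	const bool *is_private;
	size_t nr_pages;
};

/* True only if every page in [index, index + nr) exists and is private. */
bool gmem_range_is_private(const struct gmem_attrs *a, size_t index, size_t nr)
{
	if (nr == 0 || index >= a->nr_pages || nr > a->nr_pages - index)
		return false;

	for (size_t i = index; i < index + nr; i++)
		if (!a->is_private[i])
			return false;

	return true;
}
```

A folio with one shared page in the middle fails the range check even
though the page at the faulting index may itself be private.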

Suzuki


> 
>>>
>>> [...snip...]
>>>



* Re: [PATCH v5 05/20] dma-pool: track decrypted atomic pools and select them via attrs
From: Jason Gunthorpe @ 2026-06-03  0:54 UTC
  To: Michael Kelley
  Cc: Aneesh Kumar K.V, iommu@lists.linux.dev,
	linux-arm-kernel@lists.infradead.org,
	linux-kernel@vger.kernel.org, linux-coco@lists.linux.dev,
	Robin Murphy, Marek Szyprowski, Will Deacon, Marc Zyngier,
	Steven Price, Suzuki K Poulose, Catalin Marinas, Jiri Pirko,
	Mostafa Saleh, Petr Tesarik, Alexey Kardashevskiy, Dan Williams,
	Xu Yilun, linuxppc-dev@lists.ozlabs.org,
	linux-s390@vger.kernel.org, Madhavan Srinivasan, Michael Ellerman,
	Nicholas Piggin, Christophe Leroy (CS GROUP), Alexander Gordeev,
	Gerald Schaefer, Heiko Carstens, Vasily Gorbik,
	Christian Borntraeger, Sven Schnelle, x86@kernel.org, Jiri Pirko
In-Reply-To: <SN6PR02MB4157D9955A93244014AB7978D4122@SN6PR02MB4157.namprd02.prod.outlook.com>

On Tue, Jun 02, 2026 at 02:24:40PM +0000, Michael Kelley wrote:

> Except that in a normal VM, the "unencrypted" pool attribute does *not*
> describe the state of the memory itself.  In a normal VM, the memory is
> unencrypted, but the "unencrypted" pool attribute is false. That
> contradiction is the essence of my concern.

I would argue no..

When CC is enabled the default state of memory in a Linux environment
is "encrypted". You have to take a special action to "decrypt" it.
 
Thus the default state of memory in a non-CC environment is also
paradoxically "encrypted" too. "decryption" is impossible.

Therefore the "unencrypted" state is a special state that only memory
inside a CC VM can have. A normal VM can never have "unencrypted"
memory at all, so having it be false in the pool is accurate as far as
the APIs go.

un-encrypted = true means "the memory in this pool was transformed with
set_memory_decrypted()" - which is impossible on a normal VM.

Jason


* Re: [PATCH v7 07/42] KVM: guest_memfd: Only prepare folios for private pages
From: Ackerley Tng @ 2026-06-02 22:41 UTC
  To: Suzuki K Poulose, aik, andrew.jones, binbin.wu, brauner,
	chao.p.peng, david, ira.weiny, jmattson, jthoughton, michael.roth,
	oupton, pankaj.gupta, qperret, rick.p.edgecombe, rientjes,
	shivankg, steven.price, tabba, willy, wyihan, yan.y.zhao,
	forkloop, pratyush, aneesh.kumar, liam, Paolo Bonzini,
	Sean Christopherson, Thomas Gleixner, Ingo Molnar,
	Borislav Petkov, Dave Hansen, x86, H. Peter Anvin, Steven Rostedt,
	Masami Hiramatsu, Mathieu Desnoyers, Jonathan Corbet, Shuah Khan,
	Shuah Khan, Vishal Annapurve, Andrew Morton, Chris Li,
	Kairui Song, Kemeng Shi, Nhat Pham, Baoquan He, Barry Song,
	Axel Rasmussen, Yuanchu Xie, Wei Xu, Youngjun Park, Qi Zheng,
	Shakeel Butt, Kiryl Shutsemau, Jason Gunthorpe, Vlastimil Babka
  Cc: kvm, linux-kernel, linux-trace-kernel, linux-doc, linux-kselftest,
	linux-mm, linux-coco
In-Reply-To: <144bbb9f-39a2-4c90-8903-51521e022da0@arm.com>

Suzuki K Poulose <suzuki.poulose@arm.com> writes:

>
> [...snip...]
>
>>> @@ -914,7 +916,8 @@ int kvm_gmem_get_pfn(struct kvm *kvm, struct kvm_memory_slot *slot,
>>>           folio_mark_uptodate(folio);
>>>       }
>>> -    r = kvm_gmem_prepare_folio(kvm, slot, gfn, folio);
>>> +    if (kvm_gmem_is_private_mem(inode, index))
>>
>> Don't we need to make sure the entire folio is private ? Not just the
>> page at the index ?
>>      if (kvm_gmem_range_is_private(, index, folio_nr_pages(folio)) ?

I was thinking to fix this when I do huge pages, for now guest_memfd is
always just PAGE_SIZE, so just looking up index is fine.

Is that okay?

>
> Or rather, we should go through the individual pages and apply the
> prepare for ones that are private ?
>
> Suzuki
>

IIRC the plan was to make kvm_gmem_prepare_folio() idempotent, as in, if
a page is already private, just skip. Currently sev_gmem_prepare() does
a pr_debug(), which I guess is technically still idempotent.

I'm thinking that the information that needs tracking to make
.gmem_prepare() idempotent should be tracked by arch code.

Does this work for ARM CCA?

>>
>> [...snip...]
>>


* Re: [PATCH v7 34/42] KVM: selftests: Test conversion with elevated page refcount
From: Askar Safin @ 2026-06-02 21:26 UTC
  To: devnull+ackerleytng.google.com
  Cc: ackerleytng, aik, akpm, andrew.jones, aneesh.kumar, axelrasmussen,
	baohua, bhe, binbin.wu, bp, brauner, chao.p.peng, chrisl, corbet,
	dave.hansen, david, forkloop, hpa, ira.weiny, jgg, jmattson,
	jthoughton, kas, kasong, kvm, liam, linux-coco, linux-doc,
	linux-kernel, linux-kselftest, linux-mm, linux-trace-kernel,
	mathieu.desnoyers, mhiramat, michael.roth, mingo, nphamcs, oupton,
	pankaj.gupta, pbonzini, pratyush, qi.zheng, qperret,
	rick.p.edgecombe, rientjes, rostedt, seanjc, shakeel.butt,
	shikemeng, shivankg, shuah, skhan, steven.price, suzuki.poulose,
	tabba, tglx, vannapurve, vbabka, weixugc, willy, wyihan, x86,
	yan.y.zhao, youngjun.park, yuanchu
In-Reply-To: <20260522-gmem-inplace-conversion-v7-34-2f0fae496530@google.com>

Ackerley Tng via B4 Relay <devnull+ackerleytng.google.com@kernel.org>:
> This test uses vmsplice to increment the refcount of a specific page

I recently submitted a patch that makes vmsplice equivalent to
preadv2/pwritev2, and it was accepted into next.

For now it is just an experiment; it is possible it will be reverted.

https://lore.kernel.org/all/20260601-aufweichen-dissens-ausrechnen-0d9b84728113@brauner/

-- 
Askar Safin


* Re: [PATCH v7 07/42] KVM: guest_memfd: Only prepare folios for private pages
From: Ackerley Tng @ 2026-06-02 20:46 UTC
  To: Suzuki K Poulose, aik, andrew.jones, binbin.wu, brauner,
	chao.p.peng, david, ira.weiny, jmattson, jthoughton, michael.roth,
	oupton, pankaj.gupta, qperret, rick.p.edgecombe, rientjes,
	shivankg, steven.price, tabba, willy, wyihan, yan.y.zhao,
	forkloop, pratyush, aneesh.kumar, liam, Paolo Bonzini,
	Sean Christopherson, Thomas Gleixner, Ingo Molnar,
	Borislav Petkov, Dave Hansen, x86, H. Peter Anvin, Steven Rostedt,
	Masami Hiramatsu, Mathieu Desnoyers, Jonathan Corbet, Shuah Khan,
	Shuah Khan, Vishal Annapurve, Andrew Morton, Chris Li,
	Kairui Song, Kemeng Shi, Nhat Pham, Baoquan He, Barry Song,
	Axel Rasmussen, Yuanchu Xie, Wei Xu, Youngjun Park, Qi Zheng,
	Shakeel Butt, Kiryl Shutsemau, Jason Gunthorpe, Vlastimil Babka
  Cc: kvm, linux-kernel, linux-trace-kernel, linux-doc, linux-kselftest,
	linux-mm, linux-coco
In-Reply-To: <d01cf1ec-b85d-4af6-9810-8107c0e2a4ec@arm.com>

Suzuki K Poulose <suzuki.poulose@arm.com> writes:

> On 23/05/2026 01:17, Ackerley Tng via B4 Relay wrote:
>> From: Ackerley Tng <ackerleytng@google.com>
>>
>> All-shared guest_memfd used to be only supported for non-CoCo VMs where
>> preparation doesn't apply. INIT_SHARED is about to be supported for
>> non-CoCo VMs in a later patch in this series.
>
> nit: s/non-CoCo/CoCo ?
>

Yes, thanks!

>>
>> In addition, KVM_SET_MEMORY_ATTRIBUTES2 is about to be supported in
>> guest_memfd in a later patch in this series.
>>
>> This means that the kvm fault handler may now call kvm_gmem_get_pfn() on a
>> shared folio for a CoCo VM where preparation applies.
>>
>> Add a check to make sure that preparation is only performed for private
>> folios.
>>
>> Preparation will be undone on freeing (see kvm_gmem_free_folio()) and on
>> conversion to shared.
>>
>> Signed-off-by: Michael Roth <michael.roth@amd.com>
>
> nit: Missing Co-Developed-by: ?
>

IIRC this should have been

Suggested-by: Michael Roth <michael.roth@amd.com>

IIRC Michael suggested this on one of the guest_memfd calls. Michael,
please let me know if you remember otherwise!

>>
>> [...snip...]
>>

^ permalink raw reply

* [PATCH v6 6/6] x86/sev: Add debugfs support for RMPOPT
From: Ashish Kalra @ 2026-06-02 20:02 UTC (permalink / raw)
  To: tglx, mingo, bp, dave.hansen, x86, hpa, seanjc, peterz,
	thomas.lendacky, herbert, davem, ardb
  Cc: pbonzini, aik, Michael.Roth, KPrateek.Nayak, Tycho.Andersen,
	Nathan.Fontenot, ackerleytng, jackyli, pgonda, rientjes, jacobhxu,
	xin, pawan.kumar.gupta, babu.moger, dyoung, nikunj, john.allen,
	darwi, linux-kernel, linux-crypto, kvm, linux-coco
In-Reply-To: <cover.1780427587.git.ashish.kalra@amd.com>

From: Ashish Kalra <ashish.kalra@amd.com>

Add a debugfs interface to report per-CPU RMPOPT status across all
system RAM.

To dump the per-CPU RMPOPT status for all system RAM:

/sys/kernel/debug/rmpopt# cat rmpopt-table

Memory @  0GB: CPU(s): none
Memory @  1GB: CPU(s): none
Memory @  2GB: CPU(s): 0-1023
Memory @  3GB: CPU(s): 0-1023
Memory @  4GB: CPU(s): none
Memory @  5GB: CPU(s): 0-1023
Memory @  6GB: CPU(s): 0-1023
Memory @  7GB: CPU(s): 0-1023
...
Memory @1025GB: CPU(s): 0-1023
Memory @1026GB: CPU(s): 0-1023
Memory @1027GB: CPU(s): 0-1023
Memory @1028GB: CPU(s): 0-1023
Memory @1029GB: CPU(s): 0-1023
Memory @1030GB: CPU(s): 0-1023
Memory @1031GB: CPU(s): 0-1023
Memory @1032GB: CPU(s): 0-1023
Memory @1033GB: CPU(s): 0-1023
Memory @1034GB: CPU(s): 0-1023
Memory @1035GB: CPU(s): 0-1023
Memory @1036GB: CPU(s): 0-1023
Memory @1037GB: CPU(s): 0-1023
Memory @1038GB: CPU(s): none
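
For tooling that consumes this output, a minimal userspace sketch of
parsing one line follows. It assumes the "Memory @<N>GB: CPU(s): <list>"
format shown above stays as-is; parse_rmpopt_line() is a hypothetical
helper name and is not part of this patch.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/*
 * Parse one line of the rmpopt-table output, e.g.
 * "Memory @  2GB: CPU(s): 0-1023".
 *
 * Returns 1 if at least one CPU reports the region as optimized,
 * 0 if the CPU list is "none", and -1 if the line does not match.
 * Hypothetical helper for illustration only.
 */
int parse_rmpopt_line(const char *line, unsigned long long *gb,
		      char *cpus, size_t cpus_len)
{
	char buf[64];

	assert(cpus_len > 0);

	/* %llu skips the padding spaces produced by "%3llu" in the kernel. */
	if (sscanf(line, "Memory @%lluGB: CPU(s): %63s", gb, buf) != 2)
		return -1;

	/* Copy the CPU list (or "none") out, truncating if needed. */
	snprintf(cpus, cpus_len, "%s", buf);
	return strcmp(buf, "none") != 0;
}
```

Lines that do not match the format, such as the "..." elision in the
sample above, are reported as -1 so a caller can skip them.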

Suggested-by: Thomas Lendacky <thomas.lendacky@amd.com>
Signed-off-by: Ashish Kalra <ashish.kalra@amd.com>
---
 arch/x86/virt/svm/sev.c | 128 ++++++++++++++++++++++++++++++++++++++++
 1 file changed, 128 insertions(+)

diff --git a/arch/x86/virt/svm/sev.c b/arch/x86/virt/svm/sev.c
index 4442ecae3d18..29695bb18991 100644
--- a/arch/x86/virt/svm/sev.c
+++ b/arch/x86/virt/svm/sev.c
@@ -20,6 +20,8 @@
 #include <linux/amd-iommu.h>
 #include <linux/nospec.h>
 #include <linux/workqueue.h>
+#include <linux/debugfs.h>
+#include <linux/seq_file.h>
 
 #include <asm/sev.h>
 #include <asm/processor.h>
@@ -144,6 +146,15 @@ static DEFINE_SPINLOCK(snp_leaked_pages_list_lock);
 
 static unsigned long snp_nr_leaked_pages;
 
+/* All users of rmpopt_report_cpumask must hold rmpopt_show_mutex. */
+static cpumask_t rmpopt_report_cpumask;
+static struct dentry *rmpopt_debugfs;
+static DEFINE_MUTEX(rmpopt_show_mutex);
+
+struct seq_paddr {
+	phys_addr_t next_seq_paddr;
+};
+
 #undef pr_fmt
 #define pr_fmt(fmt)	"SEV-SNP: " fmt
 
@@ -585,6 +596,8 @@ static void rmpopt_cleanup(void)
 
 	cancel_delayed_work_sync(&rmpopt_delayed_work);
 	destroy_workqueue(rmpopt_wq);
+	debugfs_remove_recursive(rmpopt_debugfs);
+	rmpopt_debugfs = NULL;
 
 	cpus_read_lock();
 
@@ -622,6 +635,10 @@ static inline bool __rmpopt(u64 pa_start, u64 op_type)
 		     : "a" (pa_start), "c" (op_type)
 		     : "memory", "cc");
 
+	if (op_type == RMPOPT_FUNC_REPORT_STATUS)
+		assign_cpu(smp_processor_id(), &rmpopt_report_cpumask,
+			   optimized);
+
 	return optimized;
 }
 
@@ -641,6 +658,115 @@ static void rmpopt_smp(void *val)
 	rmpopt((u64)val);
 }
 
+/*
+ * 'val' is a system physical address.
+ */
+static void rmpopt_report_status(void *val)
+{
+	u64 pa_start = ALIGN_DOWN((u64)val, SZ_1G);
+	u64 op_type = RMPOPT_FUNC_REPORT_STATUS;
+
+	__rmpopt(pa_start, op_type);
+}
+
+/*
+ * start() can be called multiple times if allocated buffer has overflowed
+ * and bigger buffer is allocated.
+ */
+static void *rmpopt_table_seq_start(struct seq_file *seq, loff_t *pos)
+{
+	phys_addr_t end_paddr = rmpopt_pa_end;
+	struct seq_paddr *p = seq->private;
+
+	if (*pos == 0) {
+		p->next_seq_paddr = rmpopt_pa_start;
+		if (p->next_seq_paddr >= end_paddr)
+			return NULL;
+		return &p->next_seq_paddr;
+	}
+
+	if (p->next_seq_paddr >= end_paddr)
+		return NULL;
+
+	return &p->next_seq_paddr;
+}
+
+static void *rmpopt_table_seq_next(struct seq_file *seq, void *v, loff_t *pos)
+{
+	phys_addr_t end_paddr = rmpopt_pa_end;
+	phys_addr_t *curr_paddr = v;
+
+	(*pos)++;
+	*curr_paddr += SZ_1G;
+	if (*curr_paddr >= end_paddr)
+		return NULL;
+
+	return curr_paddr;
+}
+
+static void rmpopt_table_seq_stop(struct seq_file *seq, void *v)
+{
+}
+
+static int rmpopt_table_seq_show(struct seq_file *seq, void *v)
+{
+	phys_addr_t *curr_paddr = v;
+
+	guard(mutex)(&rmpopt_show_mutex);
+
+	seq_printf(seq, "Memory @%3lluGB: ",
+		   *curr_paddr >> (get_order(SZ_1G) + PAGE_SHIFT));
+
+	/*
+	 * Query all online CPUs rather than just rmpopt_cpumask (primary
+	 * threads only). The RMPOPT instruction only needs to run on one
+	 * thread per core for the optimization to take effect, but debugfs
+	 * reporting requires the RMPOPT status across all CPUs.
+	 * Performance is not a concern for this diagnostic interface.
+	 *
+	 * This is safe because RMPOPT_BASE MSR is per-core and
+	 * snp_prepare() ensures all CPUs are online when the MSR is
+	 * programmed during snp_setup_rmpopt().
+	 */
+	cpumask_clear(&rmpopt_report_cpumask);
+	on_each_cpu_mask(cpu_online_mask, rmpopt_report_status,
+			 (void *)*curr_paddr, true);
+
+	if (cpumask_empty(&rmpopt_report_cpumask))
+		seq_puts(seq, "CPU(s): none\n");
+	else
+		seq_printf(seq, "CPU(s): %*pbl\n", cpumask_pr_args(&rmpopt_report_cpumask));
+
+	return 0;
+}
+
+static const struct seq_operations rmpopt_table_seq_ops = {
+	.start = rmpopt_table_seq_start,
+	.next = rmpopt_table_seq_next,
+	.stop = rmpopt_table_seq_stop,
+	.show = rmpopt_table_seq_show
+};
+
+static int rmpopt_table_open(struct inode *inode, struct file *file)
+{
+	return seq_open_private(file, &rmpopt_table_seq_ops, sizeof(struct seq_paddr));
+}
+
+static const struct file_operations rmpopt_table_fops = {
+	.open = rmpopt_table_open,
+	.read = seq_read,
+	.llseek = seq_lseek,
+	.release = seq_release_private,
+};
+
+static void rmpopt_debugfs_setup(void)
+{
+	rmpopt_debugfs = debugfs_create_dir("rmpopt", arch_debugfs_dir);
+
+	debugfs_create_file("rmpopt-table", 0400, rmpopt_debugfs,
+			    NULL, &rmpopt_table_fops);
+}
+
 /*
  * RMPOPT optimizations skip RMP checks at 1GB granularity if this
  * range of memory does not contain any SNP guest memory.
@@ -833,6 +959,8 @@ void snp_setup_rmpopt(void)
 	 * optimizations on all physical memory.
 	 */
 	queue_delayed_work(rmpopt_wq, &rmpopt_delayed_work, 0);
+
+	rmpopt_debugfs_setup();
 }
 EXPORT_SYMBOL_FOR_MODULES(snp_setup_rmpopt, "ccp");
 
-- 
2.43.0


^ permalink raw reply related

* [PATCH v6 5/6] KVM: SEV: Perform RMP optimizations on SNP guest shutdown
From: Ashish Kalra @ 2026-06-02 20:02 UTC (permalink / raw)
  To: tglx, mingo, bp, dave.hansen, x86, hpa, seanjc, peterz,
	thomas.lendacky, herbert, davem, ardb
  Cc: pbonzini, aik, Michael.Roth, KPrateek.Nayak, Tycho.Andersen,
	Nathan.Fontenot, ackerleytng, jackyli, pgonda, rientjes, jacobhxu,
	xin, pawan.kumar.gupta, babu.moger, dyoung, nikunj, john.allen,
	darwi, linux-kernel, linux-crypto, kvm, linux-coco
In-Reply-To: <cover.1780427587.git.ashish.kalra@amd.com>

From: Ashish Kalra <ashish.kalra@amd.com>

Pages are converted from shared to private as SNP guests are launched.
This destroys existing RMPOPT optimizations in the regions where
pages are converted.

Conversely, guest pages are converted back to shared during SNP guest
termination and their region may become eligible for RMPOPT
optimization.

To take advantage of this, perform RMPOPT after guest termination.
Do it after a delay so that a single RMPOPT pass can be done if
multiple guests terminate in a short period of time.
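
The batching effect of the delay can be illustrated with a small
userspace model. This is only a sketch: struct delayed_pass,
request_pass() and tick() are hypothetical names, and the model assumes
that a request made while a pass is already pending is absorbed without
moving the deadline, which is how queue_delayed_work() treats an
already-queued work item.

```c
#include <assert.h>
#include <stdbool.h>

/* Mirrors RMPOPT_WORK_TIMEOUT (in milliseconds) from this series. */
#define PASS_DELAY_MS 10000UL

/* Hypothetical userspace model of a single delayed work item. */
struct delayed_pass {
	bool pending;              /* a pass is already queued */
	unsigned long expires_ms;  /* when the queued pass will run */
	unsigned int passes_run;   /* number of passes executed so far */
};

/*
 * Request a pass at time now_ms. Returns true if a new pass was queued,
 * false if an already-pending pass absorbs the request.
 */
bool request_pass(struct delayed_pass *dp, unsigned long now_ms)
{
	assert(dp);

	if (dp->pending)
		return false;

	dp->pending = true;
	dp->expires_ms = now_ms + PASS_DELAY_MS;
	return true;
}

/* Advance the clock; run the pending pass once its deadline is reached. */
void tick(struct delayed_pass *dp, unsigned long now_ms)
{
	assert(dp);

	if (dp->pending && now_ms >= dp->expires_ms) {
		dp->pending = false;
		dp->passes_run++;
	}
}
```

In this model, three guest terminations at 0, 3 and 9 seconds result in
one pass at the 10 second mark rather than three separate passes.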

Acked-by: Dave Hansen <dave.hansen@linux.intel.com>
Reviewed-by: Ackerley Tng <ackerleytng@google.com>
Signed-off-by: Ashish Kalra <ashish.kalra@amd.com>
---
 arch/x86/kvm/svm/sev.c | 2 ++
 1 file changed, 2 insertions(+)

diff --git a/arch/x86/kvm/svm/sev.c b/arch/x86/kvm/svm/sev.c
index e107f368ed2d..29af6f6e603c 100644
--- a/arch/x86/kvm/svm/sev.c
+++ b/arch/x86/kvm/svm/sev.c
@@ -3005,6 +3005,8 @@ void sev_vm_destroy(struct kvm *kvm)
 		 */
 		if (snp_decommission_context(kvm))
 			return;
+
+		snp_rmpopt_all_physmem();
 	} else {
 		sev_unbind_asid(kvm, sev->handle);
 	}
-- 
2.43.0


^ permalink raw reply related

* [PATCH v6 4/6] x86/sev: Add interface to re-enable RMP optimizations.
From: Ashish Kalra @ 2026-06-02 20:01 UTC (permalink / raw)
  To: tglx, mingo, bp, dave.hansen, x86, hpa, seanjc, peterz,
	thomas.lendacky, herbert, davem, ardb
  Cc: pbonzini, aik, Michael.Roth, KPrateek.Nayak, Tycho.Andersen,
	Nathan.Fontenot, ackerleytng, jackyli, pgonda, rientjes, jacobhxu,
	xin, pawan.kumar.gupta, babu.moger, dyoung, nikunj, john.allen,
	darwi, linux-kernel, linux-crypto, kvm, linux-coco
In-Reply-To: <cover.1780427587.git.ashish.kalra@amd.com>

From: Ashish Kalra <ashish.kalra@amd.com>

The RMPOPT table is a per-CPU table that indicates whether 1GB regions
of physical memory are entirely hypervisor-owned.

When performing host memory accesses in hypervisor mode as well as
non-SNP guest mode, the processor may consult the RMPOPT table to
potentially skip an RMP access and improve performance.

Events such as RMPUPDATE can clear RMP optimizations. Add an interface
to re-enable those optimizations.

Reviewed-by: Ackerley Tng <ackerleytng@google.com>
Signed-off-by: Ashish Kalra <ashish.kalra@amd.com>
---
 arch/x86/include/asm/sev.h |  2 ++
 arch/x86/virt/svm/sev.c    | 15 +++++++++++++++
 2 files changed, 17 insertions(+)

diff --git a/arch/x86/include/asm/sev.h b/arch/x86/include/asm/sev.h
index 6fd72a44a51e..09b1c5d33790 100644
--- a/arch/x86/include/asm/sev.h
+++ b/arch/x86/include/asm/sev.h
@@ -662,6 +662,7 @@ static inline void snp_leak_pages(u64 pfn, unsigned int pages)
 	__snp_leak_pages(pfn, pages, true);
 }
 int snp_prepare(void);
+void snp_rmpopt_all_physmem(void);
 void snp_setup_rmpopt(void);
 void snp_shutdown(void);
 #else
@@ -681,6 +682,7 @@ static inline void snp_leak_pages(u64 pfn, unsigned int npages) {}
 static inline void kdump_sev_callback(void) { }
 static inline void snp_fixup_e820_tables(void) {}
 static inline int snp_prepare(void) { return -ENODEV; }
+static inline void snp_rmpopt_all_physmem(void) {}
 static inline void snp_setup_rmpopt(void) {}
 static inline void snp_shutdown(void) {}
 #endif
diff --git a/arch/x86/virt/svm/sev.c b/arch/x86/virt/svm/sev.c
index d7e40a5fe5ca..4442ecae3d18 100644
--- a/arch/x86/virt/svm/sev.c
+++ b/arch/x86/virt/svm/sev.c
@@ -741,6 +741,21 @@ static void rmpopt_work_handler(struct work_struct *work)
 	free_cpumask_var(follower_mask);
 }
 
+void snp_rmpopt_all_physmem(void)
+{
+	if (!cpu_feature_enabled(X86_FEATURE_RMPOPT))
+		return;
+
+	guard(mutex)(&rmpopt_wq_mutex);
+
+	if (!rmpopt_wq)
+		return;
+
+	queue_delayed_work(rmpopt_wq, &rmpopt_delayed_work,
+			   msecs_to_jiffies(RMPOPT_WORK_TIMEOUT));
+}
+EXPORT_SYMBOL_GPL(snp_rmpopt_all_physmem);
+
 void snp_setup_rmpopt(void)
 {
 	u64 rmpopt_base;
-- 
2.43.0


^ permalink raw reply related

* [PATCH v6 3/6] x86/sev: Add support to perform RMP optimizations asynchronously
From: Ashish Kalra @ 2026-06-02 20:01 UTC (permalink / raw)
  To: tglx, mingo, bp, dave.hansen, x86, hpa, seanjc, peterz,
	thomas.lendacky, herbert, davem, ardb
  Cc: pbonzini, aik, Michael.Roth, KPrateek.Nayak, Tycho.Andersen,
	Nathan.Fontenot, ackerleytng, jackyli, pgonda, rientjes, jacobhxu,
	xin, pawan.kumar.gupta, babu.moger, dyoung, nikunj, john.allen,
	darwi, linux-kernel, linux-crypto, kvm, linux-coco
In-Reply-To: <cover.1780427587.git.ashish.kalra@amd.com>

From: Ashish Kalra <ashish.kalra@amd.com>

When SEV-SNP is enabled, all writes to memory are checked to ensure
integrity of SNP guest memory. This imposes performance overhead on the
whole system.

RMPOPT is a new instruction that minimizes the performance overhead of
RMP checks on the hypervisor and on non-SNP guests by allowing RMP
checks to be skipped for 1GB regions of memory that are known not to
contain any SEV-SNP guest memory.

Add support for performing RMP optimizations asynchronously using a
dedicated workqueue.

Enable RMPOPT optimizations for up to 2TB of system RAM starting from
the lowest physical memory address aligned down to a 1GB boundary at
RMP initialization time. RMP checks can initially be skipped for 1GB
memory ranges that do not contain SEV-SNP guest memory (excluding
preassigned pages such as the RMP table and firmware pages). As SNP
guests are launched, RMPUPDATE will disable the corresponding RMPOPT
optimizations.
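
The range selection described above can be sketched in userspace as
follows. This is an illustration under stated assumptions, not the
kernel code: rmpopt_range_for() and struct rmpopt_range are hypothetical
names, and the lowest and highest physical addresses are passed in
directly instead of being derived from min_low_pfn and max_pfn.

```c
#include <assert.h>
#include <stdint.h>

#define SZ_1G (1ULL << 30)
#define SZ_2T (1ULL << 41)

/* Hypothetical container for the [start, end) range to optimize. */
struct rmpopt_range {
	uint64_t start;
	uint64_t end;
};

static uint64_t align_down(uint64_t x, uint64_t a)
{
	assert(a && !(a & (a - 1)));	/* alignment must be a power of two */
	return x & ~(a - 1);
}

static uint64_t align_up(uint64_t x, uint64_t a)
{
	return align_down(x + a - 1, a);
}

/*
 * Start at the lowest address rounded down to 1GB, end at the highest
 * address rounded up to 1GB, and cap the span at 2TB.
 */
struct rmpopt_range rmpopt_range_for(uint64_t lowest_pa, uint64_t highest_pa)
{
	struct rmpopt_range r;

	r.start = align_down(lowest_pa, SZ_1G);
	r.end = align_up(highest_pa, SZ_1G);

	if (r.end - r.start > SZ_2T)
		r.end = r.start + SZ_2T;

	return r;
}
```

The number of 1GB regions visited by a pass over such a range is
(end - start) / SZ_1G, which is at most 2048.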

Suggested-by: Thomas Lendacky <thomas.lendacky@amd.com>
Suggested-by: Dave Hansen <dave.hansen@linux.intel.com>
Reviewed-by: Ackerley Tng <ackerleytng@google.com>
Signed-off-by: Ashish Kalra <ashish.kalra@amd.com>
---
 arch/x86/virt/svm/sev.c | 196 +++++++++++++++++++++++++++++++++++++++-
 1 file changed, 193 insertions(+), 3 deletions(-)

diff --git a/arch/x86/virt/svm/sev.c b/arch/x86/virt/svm/sev.c
index 089c9a14edc7..d7e40a5fe5ca 100644
--- a/arch/x86/virt/svm/sev.c
+++ b/arch/x86/virt/svm/sev.c
@@ -19,6 +19,7 @@
 #include <linux/iommu.h>
 #include <linux/amd-iommu.h>
 #include <linux/nospec.h>
+#include <linux/workqueue.h>
 
 #include <asm/sev.h>
 #include <asm/processor.h>
@@ -125,7 +126,18 @@ static void *rmp_bookkeeping __ro_after_init;
 static u64 probed_rmp_base, probed_rmp_size;
 
 static cpumask_t rmpopt_cpumask;
-static phys_addr_t rmpopt_pa_start;
+static phys_addr_t rmpopt_pa_start, rmpopt_pa_end;
+
+enum rmpopt_function {
+	RMPOPT_FUNC_VERIFY_AND_REPORT_STATUS,
+	RMPOPT_FUNC_REPORT_STATUS
+};
+
+#define RMPOPT_WORK_TIMEOUT	10000
+
+static struct workqueue_struct *rmpopt_wq;
+static struct delayed_work rmpopt_delayed_work;
+static DEFINE_MUTEX(rmpopt_wq_mutex);
 
 static LIST_HEAD(snp_leaked_pages_list);
 static DEFINE_SPINLOCK(snp_leaked_pages_list_lock);
@@ -566,6 +578,14 @@ static void rmpopt_cleanup(void)
 {
 	int cpu;
 
+	guard(mutex)(&rmpopt_wq_mutex);
+
+	if (!rmpopt_wq)
+		return;
+
+	cancel_delayed_work_sync(&rmpopt_delayed_work);
+	destroy_workqueue(rmpopt_wq);
+
 	cpus_read_lock();
 
 	for_each_cpu(cpu, &rmpopt_cpumask)
@@ -574,7 +594,8 @@ static void rmpopt_cleanup(void)
 	cpus_read_unlock();
 
 	cpumask_clear(&rmpopt_cpumask);
-	rmpopt_pa_start = 0;
+	rmpopt_pa_start = rmpopt_pa_end = 0;
+	rmpopt_wq = NULL;
 }
 
 void snp_shutdown(void)
@@ -592,6 +613,134 @@ void snp_shutdown(void)
 }
 EXPORT_SYMBOL_FOR_MODULES(snp_shutdown, "ccp");
 
+static inline bool __rmpopt(u64 pa_start, u64 op_type)
+{
+	bool optimized;
+
+	asm volatile(".byte 0xf2, 0x0f, 0x01, 0xfc"
+		     : "=@ccc" (optimized)
+		     : "a" (pa_start), "c" (op_type)
+		     : "memory", "cc");
+
+	return optimized;
+}
+
+static void rmpopt(u64 pa)
+{
+	u64 pa_start = ALIGN_DOWN(pa, SZ_1G);
+	u64 op_type = RMPOPT_FUNC_VERIFY_AND_REPORT_STATUS;
+
+	__rmpopt(pa_start, op_type);
+}
+
+/*
+ * 'val' is a system physical address.
+ */
+static void rmpopt_smp(void *val)
+{
+	rmpopt((u64)val);
+}
+
+/*
+ * RMPOPT optimizations skip RMP checks at 1GB granularity if this
+ * range of memory does not contain any SNP guest memory.
+ */
+static void rmpopt_work_handler(struct work_struct *work)
+{
+	cpumask_var_t follower_mask;
+	phys_addr_t pa;
+	int this_cpu;
+
+	pr_info("Attempt RMP optimizations on physical address range @1GB alignment [0x%016llx - 0x%016llx]\n",
+		rmpopt_pa_start, rmpopt_pa_end);
+
+	if (!alloc_cpumask_var(&follower_mask, GFP_KERNEL))
+		return;
+
+	/*
+	 * RMPOPT scans the RMP table, stores the result of the scan in the
+	 * reserved processor memory. The RMP scan is the most expensive
+	 * part. If a second RMPOPT occurs, it can skip the expensive scan
+	 * if they can see a cached result in the reserved processor memory.
+	 *
+	 * Do RMPOPT on one CPU alone. Then, follow that up with RMPOPT
+	 * on every other primary thread. Followers are "designed to"
+	 * skip the scan if they see the "cached" scan results.
+	 */
+	cpumask_copy(follower_mask, &rmpopt_cpumask);
+
+	/*
+	 * Pin the worker to the current CPU for the leader loop so that
+	 * this_cpu remains valid and the RMPOPT instruction executes on
+	 * the correct CPU.
+	 *
+	 * Use migrate_disable() rather than get_cpu() to prevent
+	 * migration while still allowing preemption.
+	 */
+	migrate_disable();
+	this_cpu = smp_processor_id();
+
+	if (cpumask_test_cpu(this_cpu, follower_mask)) {
+		/*
+		 * Current CPU is a primary thread in rmpopt_cpumask.
+		 * Run leader locally and remove from follower mask.
+		 */
+		cpumask_clear_cpu(this_cpu, follower_mask);
+
+		for (pa = rmpopt_pa_start; pa < rmpopt_pa_end; pa += SZ_1G)
+			rmpopt(pa);
+	} else if (cpumask_intersects(topology_sibling_cpumask(this_cpu),
+				      follower_mask)) {
+		/*
+		 * Current CPU is a sibling thread whose primary is in
+		 * rmpopt_cpumask.  RMPOPT_BASE MSR is per-core, so it
+		 * is safe to run the leader locally.  Remove the sibling's
+		 * primary from the follower mask as this core is already
+		 * covered by the leader.
+		 */
+		cpumask_andnot(follower_mask, follower_mask,
+			       topology_sibling_cpumask(this_cpu));
+
+		for (pa = rmpopt_pa_start; pa < rmpopt_pa_end; pa += SZ_1G)
+			rmpopt(pa);
+	} else {
+		/*
+		 * Current CPU does not have RMPOPT_BASE MSR programmed.
+		 * Pick an explicit leader from the cpumask to avoid #UD.
+		 */
+		int leader_cpu = cpumask_first(follower_mask);
+
+		if (WARN_ON_ONCE(leader_cpu >= nr_cpu_ids)) {
+			migrate_enable();
+			goto out;
+		}
+
+		cpumask_clear_cpu(leader_cpu, follower_mask);
+
+		cpus_read_lock();
+		for (pa = rmpopt_pa_start; pa < rmpopt_pa_end; pa += SZ_1G)
+			smp_call_function_single(leader_cpu, rmpopt_smp,
+						 (void *)pa, true);
+		cpus_read_unlock();
+	}
+
+	migrate_enable();
+
+	/* Followers: run RMPOPT on remaining cores */
+	cpus_read_lock();
+	for (pa = rmpopt_pa_start; pa < rmpopt_pa_end; pa += SZ_1G) {
+		on_each_cpu_mask(follower_mask, rmpopt_smp,
+				 (void *)pa, true);
+
+		 /* Give a chance for other threads to run */
+		cond_resched();
+	}
+	cpus_read_unlock();
+
+out:
+	free_cpumask_var(follower_mask);
+}
+
 void snp_setup_rmpopt(void)
 {
 	u64 rmpopt_base;
@@ -600,11 +749,35 @@ void snp_setup_rmpopt(void)
 	if (!cpu_feature_enabled(X86_FEATURE_RMPOPT))
 		return;
 
+	guard(mutex)(&rmpopt_wq_mutex);
+
+	/*
+	 * Guard against re-initialization.  When SNP_SHUTDOWN_EX is issued
+	 * with x86_snp_shutdown=0, snp_shutdown() is not called and
+	 * rmpopt_cleanup() is skipped, but snp_initialized is still cleared.
+	 * A subsequent __sev_snp_init_locked() would call snp_setup_rmpopt()
+	 * again, leaking the existing workqueue, delayed work, debugfs
+	 * entries, and cpumask state.
+	 */
+	if (rmpopt_wq)
+		return;
+
+	/*
+	 * Create an RMPOPT-specific workqueue to avoid scheduling
+	 * RMPOPT workitem on the global system workqueue.
+	 */
+	rmpopt_wq = alloc_workqueue("rmpopt_wq", WQ_UNBOUND, 1);
+	if (!rmpopt_wq) {
+		pr_err("Failed to allocate RMPOPT workqueue\n");
+		return;
+	}
+
 	cpus_read_lock();
 
 	/*
 	 * The RMPOPT_BASE MSR is per-core, so only one thread per core needs
-	 * to set up the RMPOPT_BASE MSR.
+	 * to set up the RMPOPT_BASE MSR. Likewise, only one thread per core
+	 * needs to issue the RMPOPT instruction.
 	 *
 	 * Note: only online primary threads are included.  If a core's
 	 * primary thread is offline, that core is not covered.  CPU hotplug
@@ -628,6 +801,23 @@ void snp_setup_rmpopt(void)
 		wrmsrq_on_cpu(cpu, MSR_AMD64_RMPOPT_BASE, rmpopt_base);
 
 	cpus_read_unlock();
+
+	INIT_DELAYED_WORK(&rmpopt_delayed_work, rmpopt_work_handler);
+
+	rmpopt_pa_end = ALIGN(PFN_PHYS(max_pfn), SZ_1G);
+
+	/* Limit memory scanning to 2TB of RAM */
+	if ((rmpopt_pa_end - rmpopt_pa_start) > SZ_2T) {
+		pr_info("RMPOPT coverage limited to 2TB; memory above 0x%llx not optimized\n",
+			rmpopt_pa_start + SZ_2T);
+		rmpopt_pa_end = rmpopt_pa_start + SZ_2T;
+	}
+
+	/*
+	 * Once all per-CPU RMPOPT tables have been configured, enable RMPOPT
+	 * optimizations on all physical memory.
+	 */
+	queue_delayed_work(rmpopt_wq, &rmpopt_delayed_work, 0);
 }
 EXPORT_SYMBOL_FOR_MODULES(snp_setup_rmpopt, "ccp");
 
-- 
2.43.0


^ permalink raw reply related

* [PATCH v6 2/6] x86/sev: Initialize RMPOPT configuration MSRs
From: Ashish Kalra @ 2026-06-02 20:01 UTC (permalink / raw)
  To: tglx, mingo, bp, dave.hansen, x86, hpa, seanjc, peterz,
	thomas.lendacky, herbert, davem, ardb
  Cc: pbonzini, aik, Michael.Roth, KPrateek.Nayak, Tycho.Andersen,
	Nathan.Fontenot, ackerleytng, jackyli, pgonda, rientjes, jacobhxu,
	xin, pawan.kumar.gupta, babu.moger, dyoung, nikunj, john.allen,
	darwi, linux-kernel, linux-crypto, kvm, linux-coco
In-Reply-To: <cover.1780427587.git.ashish.kalra@amd.com>

From: Ashish Kalra <ashish.kalra@amd.com>

The new RMPOPT instruction helps manage per-CPU RMP optimization
structures inside the CPU. It takes a 1GB-aligned physical address
and either returns the status of the optimizations or tries to enable
the optimizations.

Per-CPU RMPOPT tables support at most 2 TB of addressable memory for
RMP optimizations.

Initialize the per-CPU RMPOPT table base to the starting physical
address. This enables RMP optimization for up to 2 TB of system RAM on
all CPUs.
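
As a rough illustration of the base programming, the sketch below
composes the value written to the base MSR and checks whether an address
falls inside the resulting 2 TB window. It assumes the layout used by
this patch (1GB-aligned address with the enable flag in bit 0);
rmpopt_base_value() and rmpopt_window_covers() are hypothetical helpers,
and treating the bits below 1GB as flag bits is an assumption made for
the example.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define SZ_1G          (1ULL << 30)
#define SZ_2T          (1ULL << 41)
#define RMPOPT_ENABLE  (1ULL << 0)	/* MSR_AMD64_RMPOPT_ENABLE in the patch */

/* Compose the MSR value from the lowest physical address of RAM. */
uint64_t rmpopt_base_value(uint64_t lowest_pa)
{
	uint64_t pa_start = lowest_pa & ~(SZ_1G - 1);

	/* 1GB alignment leaves bit 0 free for the enable flag. */
	assert((pa_start & RMPOPT_ENABLE) == 0);
	return pa_start | RMPOPT_ENABLE;
}

/* True if pa lies in the 2 TB window described by an enabled base value. */
bool rmpopt_window_covers(uint64_t base_value, uint64_t pa)
{
	uint64_t start = base_value & ~(SZ_1G - 1);

	if (!(base_value & RMPOPT_ENABLE))
		return false;

	return pa >= start && pa - start < SZ_2T;
}
```

A base value of 0, as written by the cleanup path, has the enable flag
clear and therefore covers nothing in this model.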

Additionally, add support to set up and enable RMPOPT once SNP is
enabled and initialized.

Suggested-by: Thomas Lendacky <thomas.lendacky@amd.com>
Suggested-by: Dave Hansen <dave.hansen@linux.intel.com>
Reviewed-by: Dave Hansen <dave.hansen@linux.intel.com>
Signed-off-by: Ashish Kalra <ashish.kalra@amd.com>
---
 arch/x86/coco/core.c             |  1 +
 arch/x86/include/asm/msr-index.h |  3 ++
 arch/x86/include/asm/sev.h       |  2 +
 arch/x86/virt/svm/sev.c          | 65 +++++++++++++++++++++++++++++++-
 drivers/crypto/ccp/sev-dev.c     |  3 ++
 5 files changed, 73 insertions(+), 1 deletion(-)

diff --git a/arch/x86/coco/core.c b/arch/x86/coco/core.c
index 989ca9f72ba3..7fdef00ca8f2 100644
--- a/arch/x86/coco/core.c
+++ b/arch/x86/coco/core.c
@@ -172,6 +172,7 @@ static void amd_cc_platform_clear(enum cc_attr attr)
 	switch (attr) {
 	case CC_ATTR_HOST_SEV_SNP:
 		cc_flags.host_sev_snp = 0;
+		setup_clear_cpu_cap(X86_FEATURE_RMPOPT);
 		break;
 	default:
 		break;
diff --git a/arch/x86/include/asm/msr-index.h b/arch/x86/include/asm/msr-index.h
index 86554de9a3f5..28540744f1eb 100644
--- a/arch/x86/include/asm/msr-index.h
+++ b/arch/x86/include/asm/msr-index.h
@@ -761,6 +761,9 @@
 #define MSR_AMD64_SEG_RMP_ENABLED_BIT	0
 #define MSR_AMD64_SEG_RMP_ENABLED	BIT_ULL(MSR_AMD64_SEG_RMP_ENABLED_BIT)
 #define MSR_AMD64_RMP_SEGMENT_SHIFT(x)	(((x) & GENMASK_ULL(13, 8)) >> 8)
+#define MSR_AMD64_RMPOPT_BASE		0xc0010139
+#define MSR_AMD64_RMPOPT_ENABLE_BIT	0
+#define MSR_AMD64_RMPOPT_ENABLE		BIT_ULL(MSR_AMD64_RMPOPT_ENABLE_BIT)
 
 #define MSR_SVSM_CAA			0xc001f000
 
diff --git a/arch/x86/include/asm/sev.h b/arch/x86/include/asm/sev.h
index 594cfa19cbd4..6fd72a44a51e 100644
--- a/arch/x86/include/asm/sev.h
+++ b/arch/x86/include/asm/sev.h
@@ -662,6 +662,7 @@ static inline void snp_leak_pages(u64 pfn, unsigned int pages)
 	__snp_leak_pages(pfn, pages, true);
 }
 int snp_prepare(void);
+void snp_setup_rmpopt(void);
 void snp_shutdown(void);
 #else
 static inline bool snp_probe_rmptable_info(void) { return false; }
@@ -680,6 +681,7 @@ static inline void snp_leak_pages(u64 pfn, unsigned int npages) {}
 static inline void kdump_sev_callback(void) { }
 static inline void snp_fixup_e820_tables(void) {}
 static inline int snp_prepare(void) { return -ENODEV; }
+static inline void snp_setup_rmpopt(void) {}
 static inline void snp_shutdown(void) {}
 #endif
 
diff --git a/arch/x86/virt/svm/sev.c b/arch/x86/virt/svm/sev.c
index 8bcdce98f6dc..089c9a14edc7 100644
--- a/arch/x86/virt/svm/sev.c
+++ b/arch/x86/virt/svm/sev.c
@@ -124,6 +124,9 @@ static void *rmp_bookkeeping __ro_after_init;
 
 static u64 probed_rmp_base, probed_rmp_size;
 
+static cpumask_t rmpopt_cpumask;
+static phys_addr_t rmpopt_pa_start;
+
 static LIST_HEAD(snp_leaked_pages_list);
 static DEFINE_SPINLOCK(snp_leaked_pages_list_lock);
 
@@ -488,9 +491,13 @@ static bool __init setup_segmented_rmptable(void)
 static bool __init setup_rmptable(void)
 {
 	if (rmp_cfg & MSR_AMD64_SEG_RMP_ENABLED) {
-		if (!setup_segmented_rmptable())
+		if (!setup_segmented_rmptable()) {
+			setup_clear_cpu_cap(X86_FEATURE_RMPOPT);
 			return false;
+		}
 	} else {
+		/* Note that Segmented RMP must be enabled to enable RMPOPT. */
+		setup_clear_cpu_cap(X86_FEATURE_RMPOPT);
 		if (!setup_contiguous_rmptable())
 			return false;
 	}
@@ -555,6 +562,21 @@ int snp_prepare(void)
 }
 EXPORT_SYMBOL_FOR_MODULES(snp_prepare, "ccp");
 
+static void rmpopt_cleanup(void)
+{
+	int cpu;
+
+	cpus_read_lock();
+
+	for_each_cpu(cpu, &rmpopt_cpumask)
+		wrmsrq_on_cpu(cpu, MSR_AMD64_RMPOPT_BASE, 0);
+
+	cpus_read_unlock();
+
+	cpumask_clear(&rmpopt_cpumask);
+	rmpopt_pa_start = 0;
+}
+
 void snp_shutdown(void)
 {
 	u64 syscfg;
@@ -563,11 +585,52 @@ void snp_shutdown(void)
 	if (syscfg & MSR_AMD64_SYSCFG_SNP_EN)
 		return;
 
+	rmpopt_cleanup();
+
 	clear_rmp();
 	on_each_cpu(mfd_reconfigure, NULL, 1);
 }
 EXPORT_SYMBOL_FOR_MODULES(snp_shutdown, "ccp");
 
+void snp_setup_rmpopt(void)
+{
+	u64 rmpopt_base;
+	int cpu;
+
+	if (!cpu_feature_enabled(X86_FEATURE_RMPOPT))
+		return;
+
+	cpus_read_lock();
+
+	/*
+	 * The RMPOPT_BASE MSR is per-core, so only one thread per core needs
+	 * to set up the RMPOPT_BASE MSR.
+	 *
+	 * Note: only online primary threads are included.  If a core's
+	 * primary thread is offline, that core is not covered.  CPU hotplug
+	 * is not currently supported with SNP enabled.
+	 */
+
+	for_each_online_cpu(cpu)
+		if (topology_is_primary_thread(cpu))
+			cpumask_set_cpu(cpu, &rmpopt_cpumask);
+
+	rmpopt_pa_start = ALIGN_DOWN(PFN_PHYS(min_low_pfn), SZ_1G);
+	rmpopt_base = rmpopt_pa_start | MSR_AMD64_RMPOPT_ENABLE;
+
+	/*
+	 * Per-CPU RMPOPT tables support at most 2 TB of addressable memory
+	 * for RMP optimizations. Initialize the per-CPU RMPOPT table base
+	 * to the starting physical address to enable RMP optimizations for
+	 * up to 2 TB of system RAM on all CPUs.
+	 */
+	for_each_cpu(cpu, &rmpopt_cpumask)
+		wrmsrq_on_cpu(cpu, MSR_AMD64_RMPOPT_BASE, rmpopt_base);
+
+	cpus_read_unlock();
+}
+EXPORT_SYMBOL_FOR_MODULES(snp_setup_rmpopt, "ccp");
+
 /*
  * Do the necessary preparations which are verified by the firmware as
  * described in the SNP_INIT_EX firmware command description in the SNP
diff --git a/drivers/crypto/ccp/sev-dev.c b/drivers/crypto/ccp/sev-dev.c
index 78f98aee7a66..217b6b19802e 100644
--- a/drivers/crypto/ccp/sev-dev.c
+++ b/drivers/crypto/ccp/sev-dev.c
@@ -1478,6 +1478,9 @@ static int __sev_snp_init_locked(int *error, unsigned int max_snp_asid)
 	}
 
 	snp_hv_fixed_pages_state_update(sev, HV_FIXED);
+
+	snp_setup_rmpopt();
+
 	sev->snp_initialized = true;
 	dev_dbg(sev->dev, "SEV-SNP firmware initialized, SEV-TIO is %s\n",
 		data.tio_en ? "enabled" : "disabled");
-- 
2.43.0


^ permalink raw reply related

* [PATCH v6 1/6] x86/cpufeatures: Add X86_FEATURE_AMD_RMPOPT feature flag
From: Ashish Kalra @ 2026-06-02 20:00 UTC (permalink / raw)
  To: tglx, mingo, bp, dave.hansen, x86, hpa, seanjc, peterz,
	thomas.lendacky, herbert, davem, ardb
  Cc: pbonzini, aik, Michael.Roth, KPrateek.Nayak, Tycho.Andersen,
	Nathan.Fontenot, ackerleytng, jackyli, pgonda, rientjes, jacobhxu,
	xin, pawan.kumar.gupta, babu.moger, dyoung, nikunj, john.allen,
	darwi, linux-kernel, linux-crypto, kvm, linux-coco
In-Reply-To: <cover.1780427587.git.ashish.kalra@amd.com>

From: Ashish Kalra <ashish.kalra@amd.com>

Add a flag indicating whether the RMPOPT instruction is supported.

RMPOPT is a new instruction that reduces the performance overhead of
RMP checks for the hypervisor and non-SNP guests by allowing those
checks to be skipped when 1-GB memory regions are known to contain no
SEV-SNP guest memory.

For more information on the RMPOPT instruction, see the AMD64 RMPOPT
technical documentation.

Suggested-by: Borislav Petkov (AMD) <bp@alien8.de>
Reviewed-by: Dave Hansen <dave.hansen@linux.intel.com>
Reviewed-by: Ackerley Tng <ackerleytng@google.com>
Signed-off-by: Ashish Kalra <ashish.kalra@amd.com>
---
 arch/x86/include/asm/cpufeatures.h | 2 +-
 arch/x86/kernel/cpu/scattered.c    | 1 +
 2 files changed, 2 insertions(+), 1 deletion(-)

diff --git a/arch/x86/include/asm/cpufeatures.h b/arch/x86/include/asm/cpufeatures.h
index 1d506e5d6f46..794cc96b8493 100644
--- a/arch/x86/include/asm/cpufeatures.h
+++ b/arch/x86/include/asm/cpufeatures.h
@@ -76,7 +76,7 @@
 #define X86_FEATURE_K8			( 3*32+ 4) /* Opteron, Athlon64 */
 #define X86_FEATURE_ZEN5		( 3*32+ 5) /* CPU based on Zen5 microarchitecture */
 #define X86_FEATURE_ZEN6		( 3*32+ 6) /* CPU based on Zen6 microarchitecture */
-/* Free                                 ( 3*32+ 7) */
+#define X86_FEATURE_RMPOPT		( 3*32+ 7) /* Support for AMD RMPOPT instruction */
 #define X86_FEATURE_CONSTANT_TSC	( 3*32+ 8) /* "constant_tsc" TSC ticks at a constant rate */
 #define X86_FEATURE_UP			( 3*32+ 9) /* "up" SMP kernel running on UP */
 #define X86_FEATURE_ART			( 3*32+10) /* "art" Always running timer (ART) */
diff --git a/arch/x86/kernel/cpu/scattered.c b/arch/x86/kernel/cpu/scattered.c
index 937129ce6a96..021c0bf22de2 100644
--- a/arch/x86/kernel/cpu/scattered.c
+++ b/arch/x86/kernel/cpu/scattered.c
@@ -67,6 +67,7 @@ static const struct cpuid_bit cpuid_bits[] = {
 	{ X86_FEATURE_PERFMON_V2,		CPUID_EAX,  0, 0x80000022, 0 },
 	{ X86_FEATURE_AMD_LBR_V2,		CPUID_EAX,  1, 0x80000022, 0 },
 	{ X86_FEATURE_AMD_LBR_PMC_FREEZE,	CPUID_EAX,  2, 0x80000022, 0 },
+	{ X86_FEATURE_RMPOPT,			CPUID_EDX,  0, 0x80000025, 0 },
 	{ X86_FEATURE_AMD_HTR_CORES,		CPUID_EAX, 30, 0x80000026, 0 },
 	{ 0, 0, 0, 0, 0 }
 };
-- 
2.43.0


^ permalink raw reply related

* [PATCH v6 0/6] Add RMPOPT support.
From: Ashish Kalra @ 2026-06-02 20:00 UTC (permalink / raw)
  To: tglx, mingo, bp, dave.hansen, x86, hpa, seanjc, peterz,
	thomas.lendacky, herbert, davem, ardb
  Cc: pbonzini, aik, Michael.Roth, KPrateek.Nayak, Tycho.Andersen,
	Nathan.Fontenot, ackerleytng, jackyli, pgonda, rientjes, jacobhxu,
	xin, pawan.kumar.gupta, babu.moger, dyoung, nikunj, john.allen,
	darwi, linux-kernel, linux-crypto, kvm, linux-coco

From: Ashish Kalra <ashish.kalra@amd.com>

In the SEV-SNP architecture, hypervisor and non-SNP guests are subject
to RMP checks on writes to provide integrity of SEV-SNP guest memory.

The RMPOPT architecture enables optimizations whereby the RMP checks
can be skipped if 1GB regions of memory are known to not contain any
SNP guest memory.

RMPOPT is a new instruction designed to minimize the performance
overhead of RMP checks for the hypervisor and non-SNP guests.

The RMPOPT instruction currently supports two functions. For the
verify and report status function, the CPU reads the RMP contents and
verifies that the entire 1GB region starting at the provided SPA is
HV-owned, i.e., that no RMP entry in the region is in the assigned
state. It then updates the RMPOPT table to indicate whether the
optimization has been enabled and reports to software whether the
optimization was successful.

For the report status function, the CPU returns the optimization
status for the 1GB region.

The RMPOPT table is managed by a combination of software and hardware.
Software uses the RMPOPT instruction to set bits in the table,
indicating that regions of memory are entirely HV-owned.  Hardware
automatically clears bits in the RMPOPT table when RMP contents are
changed by the RMPUPDATE instruction.
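
As a rough illustration of this software/hardware split, here is a
minimal user-space model. It is not kernel code: all names
(rmpopt_model, model_rmpopt_verify(), model_rmpupdate(), ...) and the
tiny region and page counts are hypothetical, and clearing the bit when
verification fails is an assumption of the model rather than something
stated in the RMPOPT documentation.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define SZ_1G			(1ULL << 30)
#define MODEL_REGIONS		4	/* hypothetical: model only 4GB of RAM */
#define MODEL_PAGES_PER_REGION	8	/* hypothetical: coarse stand-in for RMP entries */

struct rmpopt_model {
	bool assigned[MODEL_REGIONS][MODEL_PAGES_PER_REGION];	/* models RMP entries */
	bool optimized[MODEL_REGIONS];				/* models RMPOPT table bits */
};

/* Map a system physical address to its 1GB region index. */
static unsigned int model_region(uint64_t spa)
{
	uint64_t region = spa / SZ_1G;

	assert(region < MODEL_REGIONS);
	return (unsigned int)region;
}

/* "Verify and report status": set the bit only if nothing is assigned. */
static bool model_rmpopt_verify(struct rmpopt_model *m, uint64_t spa)
{
	unsigned int region = model_region(spa);
	bool hv_owned = true;
	unsigned int i;

	for (i = 0; i < MODEL_PAGES_PER_REGION; i++) {
		if (m->assigned[region][i])
			hv_owned = false;
	}

	m->optimized[region] = hv_owned;
	return hv_owned;
}

/* "Report status": return the current bit without re-scanning. */
static bool model_rmpopt_report(const struct rmpopt_model *m, uint64_t spa)
{
	return m->optimized[model_region(spa)];
}

/* RMPUPDATE changes an entry; "hardware" clears the region's bit. */
static void model_rmpupdate(struct rmpopt_model *m, uint64_t spa,
			    unsigned int page, bool assigned)
{
	unsigned int region = model_region(spa);

	assert(page < MODEL_PAGES_PER_REGION);
	m->assigned[region][page] = assigned;
	m->optimized[region] = false;
}
```

In this model, a region that lost its bit through model_rmpupdate() only
regains it after a later model_rmpopt_verify() call, which mirrors why
the series re-runs RMPOPT after guest pages are converted back to shared.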

For more information on the RMPOPT instruction, see the AMD64 RMPOPT
technical documentation.

As SNP is enabled by default, the hypervisor and non-SNP guests are
subject to RMP write checks to provide integrity of SNP guest memory.

This patch series adds support to enable RMP optimizations for up to
2TB of system RAM across the system and allows RMPUPDATE to disable
those optimizations as SNP guests are launched.

Support for RAM larger than 2TB will be added in a follow-on series.

This series also introduces support to re-enable RMP optimizations
during SNP guest termination, after guest pages have been converted
back to shared.

RMP optimizations are performed asynchronously by queuing work on a
dedicated workqueue after a 10 second delay.

Delaying work allows batching of multiple SNP guest terminations.

Once 1GB hugetlb guest_memfd support is merged, support for
re-enabling RMPOPT optimizations during 1GB page cleanup will be added
in a follow-on series.

Additionally, add a debugfs interface to report per-CPU RMPOPT status
across all system RAM.

v6:
- Drop wrmsrq_on_cpus() helper; use for_each_cpu() with wrmsrq_on_cpu()
  instead, as RMPOPT_BASE MSR programming is not performance-critical.
- Rewrite rmpopt_work_handler() leader selection to use a local
  follower_mask copy instead of modifying the global rmpopt_cpumask.
  This eliminates the current_cpu_cleared tracking and the restore at
  the end, and removes the need for synchronization comments about
  transient cpumask inconsistency.
- Add three-way leader selection in rmpopt_work_handler():
  1. Current CPU is a primary thread in cpumask: run leader locally.
  2. Current CPU is a sibling thread whose primary is in cpumask:
     run leader locally (RMPOPT_BASE MSR is per-core), remove the
     primary from followers via cpumask_andnot(topology_sibling_cpumask).
  3. Current CPU's core has no RMPOPT_BASE MSR programmed: pick an
     explicit leader via cpumask_first() + smp_call_function_single()
     to avoid #UD, with cpus_read_lock() around the IPI loop.
- Add WARN_ON_ONCE guard for empty cpumask in the explicit leader
  fallback path, with migrate_enable() before goto out.
- Add .llseek = seq_lseek to rmpopt_table_fops for consistency with
  other seq_file-based debugfs files and to support tools like "less".
- Change debugfs file permissions from 0444 to 0400 to restrict access
  to root only.
- Add comment in rmpopt_table_seq_show() explaining why cpu_online_mask
  is safe: RMPOPT_BASE MSR is per-core and snp_prepare() ensures all
  CPUs are online when the MSR is programmed.

  Sashiko AI code review identified several of the above issues.

v5:
- Introduce rmpopt_cleanup() to tear down workqueue, debugfs, cpumask,
  and MSR state, called from snp_shutdown().
- Introduce rmpopt_wq_mutex to serialize snp_setup_rmpopt(),
  snp_rmpopt_all_physmem(), and rmpopt_cleanup().
- Introduce rmpopt_show_mutex to serialize debugfs reporting of
  rmpopt_report_cpumask.
- Move snp_rmpopt_all_physmem() call after SNP DECOMMISSION during
  guest shutdown.
- Use migrate_disable()/migrate_enable() for CPU pinning in the
  rmpopt_work_handler() leader loop to maintain CPU affinity without
  disabling preemption for the entire RMPOPT scan.
- Add cpus_read_lock()/cpus_read_unlock() around the follower
  on_each_cpu_mask() loop in rmpopt_work_handler().
- Guard snp_setup_rmpopt() against re-initialization when
  SNP_SHUTDOWN_EX with x86_snp_shutdown=0 skips rmpopt_cleanup()
  but clears snp_initialized, preventing workqueue and resource
  leaks on repeated init/shutdown cycles.
- Replace setup_clear_cpu_cap() with pr_err() on alloc_workqueue()
  failure in snp_setup_rmpopt(), as setup_clear_cpu_cap() cannot be
  used after alternatives are patched; callers check rmpopt_wq != NULL
  as the runtime guard instead.
- Add pr_info() when RMPOPT coverage is capped at 2TB.
- Add comments noting CPU hotplug is not supported with SNP enabled
  and only online primary threads are covered by rmpopt_cpumask.
- Add comment in setup_rmptable() noting Segmented RMP must be
  enabled to enable RMPOPT.
- Simplify cpumask setup loop to set if primary thread rather than
  skip if not primary.
- Improve grammar and clarity in snp_setup_rmpopt() comments.
- Added Reviewed-by's.

  Sashiko AI code review identified several of the above issues.

v4:
- Add new wrmsrq_on_cpus() helper to write same u64 value to a
  per-CPU MSR across a cpumask without per-cpu struct allocation
  overhead. 
- Rename configure_and_enable_rmpopt() to snp_setup_rmpopt().
- Use wrmsrq_on_cpus() instead of wrmsrq_on_cpu() loop for
  programming RMPOPT_BASE MSRs.
- Add setup_clear_cpu_cap(X86_FEATURE_RMPOPT) if segmented RMP
  setup fails or workqueue allocation fails.
- Add X86_FEATURE_RMPOPT feature clear logic in amd_cc_platform_clear()
  for CC_ATTR_HOST_SEV_SNP.
- All of the above allow checking for only X86_FEATURE_RMPOPT for both
  RMPOPT setup/enable and RMP re-optimizations.
- Rename snp_perform_rmp_optimization() to snp_rmpopt_all_physmem().
- Split rmpopt() into rmpopt() and rmpopt_smp() for SMP callback use.
- Introduce separate rmpopt_report_cpumask for debugfs reporting,
  distinct from rmpopt_cpumask used for primary thread tracking.
- Remove snp_perform_rmp_optimization() call from __sev_snp_init_locked() 
  and instead setup and enable RMPOPT after SNP is enabled and 
  initialized.

v3:
- Drop all RMPOPT kthread support and introduce a custom, dedicated
  workqueue to schedule delayed and asynchronous RMPOPT work.
- Drop the guest_memfd inode cleanup interface and add support to
  re-enable RMP optimizations during guest shutdown using the
  asynchronous and delayed workqueue interface.
- Introduce new __rmpopt() helper and rmpopt() and
  rmpopt_report_status() wrappers on top which use rax and rcx
  parameters to closely match RMPOPT specs.
- Use a new optimized RMPOPT loop to issue RMPOPT instructions on all
  system RAM up to 2TB and on all CPUs: optimize each range on one CPU
  first, then let the other CPUs execute RMPOPT in parallel so they can
  skip most of the work, as the range has already been optimized.
- Also add support for running the optimized RMPOPT loop only on
  one thread per core.
- Replace all PUD_SIZE references with SZ_1G to conform to 1GB regions
  as specified by RMPOPT specifications and not be dependent on PUD_SIZE
  which makes the RMPOPT patch-set independent of x86 page table sizes.
- Use wrmsrq_on_cpu() to program the RMPOPT_BASE MSR registers on
  all CPUs, which removes the ugly casting needed to use
  on_each_cpu_mask().
- Fix inline commits and patch commit messages


v2:
- Drop all NUMA and Socket configuration and enablement support and
  enable RMPOPT support for up to 2TB of system RAM.
- Drop get_cpumask_of_primary_threads() and enable per-core RMPOPT
  base MSRs and issue RMPOPT instruction on all CPUs.
- Drop the configfs interface to manually re-enable RMP optimizations.
- Add new guest_memfd cleanup interface to automatically re-enable
  RMP optimizations during guest shutdown.
- Include references to the public RMPOPT documentation.
- Move debugfs directory for RMPOPT under architecture specific
  parent directory.

Ashish Kalra (6):
  x86/cpufeatures: Add X86_FEATURE_AMD_RMPOPT feature flag
  x86/sev: Initialize RMPOPT configuration MSRs
  x86/sev: Add support to perform RMP optimizations asynchronously
  x86/sev: Add interface to re-enable RMP optimizations.
  KVM: SEV: Perform RMP optimizations on SNP guest shutdown
  x86/sev: Add debugfs support for RMPOPT

 arch/x86/coco/core.c               |   1 +
 arch/x86/include/asm/cpufeatures.h |   2 +-
 arch/x86/include/asm/msr-index.h   |   3 +
 arch/x86/include/asm/sev.h         |   4 +
 arch/x86/kernel/cpu/scattered.c    |   1 +
 arch/x86/kvm/svm/sev.c             |   2 +
 arch/x86/virt/svm/sev.c            | 398 ++++++++++++++++++++++++++++-
 drivers/crypto/ccp/sev-dev.c       |   3 +
 8 files changed, 412 insertions(+), 2 deletions(-)

-- 
2.43.0



* SVSM Development Call June 3rd, 2026
From: Jörg Rödel @ 2026-06-02 16:07 UTC (permalink / raw)
  To: coconut-svsm, linux-coco

Hi,

Here is the call for agenda items for this week's SVSM development call.  Please
send any agenda items you have in mind as a reply to this email or raise them
in the meeting.

We will use the LF Zoom instance. Details of the meeting can be found in our
governance repository at:

	https://github.com/coconut-svsm/governance

The link to the COCONUT-SVSM calendar is:

	https://zoom-lfx.platform.linuxfoundation.org/meetings/coconut-svsm?view=week

The meeting will be recorded and the recording eventually published.

Regards,

	Jörg


* Re: [PATCH v14 14/44] arm64: RMI: Basic infrastructure for creating a realm.
From: Suzuki K Poulose @ 2026-06-02 14:49 UTC (permalink / raw)
  To: Marc Zyngier, Steven Price
  Cc: kvm, kvmarm, Catalin Marinas, Will Deacon, James Morse,
	Oliver Upton, Zenghui Yu, linux-arm-kernel, linux-kernel,
	Joey Gouly, Alexandru Elisei, Christoffer Dall, Fuad Tabba,
	linux-coco, Ganapatrao Kulkarni, Gavin Shan, Shanker Donthineni,
	Alper Gun, Aneesh Kumar K . V, Emi Kisanuki, Vishal Annapurve,
	WeiLin.Chang, Lorenzo.Pieralisi2
In-Reply-To: <86ik88ui0g.wl-maz@kernel.org>

Hi Marc

On 28/05/2026 08:10, Marc Zyngier wrote:
> On Wed, 13 May 2026 14:17:22 +0100,
> Steven Price <steven.price@arm.com> wrote:
>>
>> Introduce the skeleton functions for creating and destroying a realm.
>> The IPA size requested is checked against what the RMM supports.
>>
>> The actual work of constructing the realm will be added in future
>> patches.
> 
> Again, $SUBJECT doesn't reflect that this is purely a KVM patch.
> 
>>
>> Signed-off-by: Steven Price <steven.price@arm.com>
>> ---
>> Changes since v13:
>>   * Rebased and updated to RMM-v2.0-bet1.
>>   * Auxiliary granules have been removed in RMM-v2.0-bet1
>> Changes since v12:
>>   * Drop the RMM_PAGE_{SHIFT,SIZE} defines - the RMM is now configured to
>>     be the same as the host's page size.
>>   * Rework delegate/undelegate functions to use the new RMI range based
>>     operations.
>> Changes since v11:
>>   * Major rework to drop the realm configuration and make the
>>     construction of realms implicit rather than driven by the VMM
>>     directly.
>>   * The code to create RDs, handle VMIDs etc is moved to later patches.
>> Changes since v10:
>>   * Rename from RME to RMI.
>>   * Move the stage2 cleanup to a later patch.
>> Changes since v9:
>>   * Avoid walking the stage 2 page tables when destroying the realm -
>>     the real ones are not accessible to the non-secure world, and the RMM
>>     may leave junk in the physical pages when returning them.
>>   * Fix an error path in realm_create_rd() to actually return an error value.
>> Changes since v8:
>>   * Fix free_delegated_granule() to not call kvm_account_pgtable_pages();
>>     a separate wrapper will be introduced in a later patch to deal with
>>     RTTs.
>>   * Minor code cleanups following review.
>> Changes since v7:
>>   * Minor code cleanup following Gavin's review.
>> Changes since v6:
>>   * Separate RMM RTT calculations from host PAGE_SIZE. This allows the
>>     host page size to be larger than 4k while still communicating with an
>>     RMM which uses 4k granules.
>> Changes since v5:
>>   * Introduce free_delegated_granule() to replace many
>>     undelegate/free_page() instances and centralise the comment on
>>     leaking when the undelegate fails.
>>   * Several other minor improvements suggested by reviews - thanks for
>>     the feedback!
>> Changes since v2:
>>   * Improved commit description.
>>   * Improved return failures for rmi_check_version().
>>   * Clear contents of PGD after it has been undelegated in case the RMM
>>     left stale data.
>>   * Minor changes to reflect changes in previous patches.
>> ---
>>   arch/arm64/include/asm/kvm_emulate.h | 29 ++++++++++++++
>>   arch/arm64/include/asm/kvm_rmi.h     | 51 +++++++++++++++++++++++++
>>   arch/arm64/kvm/arm.c                 | 12 ++++++
>>   arch/arm64/kvm/mmu.c                 | 12 +++++-
>>   arch/arm64/kvm/rmi.c                 | 57 ++++++++++++++++++++++++++++
>>   5 files changed, 159 insertions(+), 2 deletions(-)
>>
>> diff --git a/arch/arm64/include/asm/kvm_emulate.h b/arch/arm64/include/asm/kvm_emulate.h
>> index 5bf3d7e1d92c..82fd777bd9bb 100644
>> --- a/arch/arm64/include/asm/kvm_emulate.h
>> +++ b/arch/arm64/include/asm/kvm_emulate.h
>> @@ -688,4 +688,33 @@ static inline void vcpu_set_hcrx(struct kvm_vcpu *vcpu)
>>   			vcpu->arch.hcrx_el2 |= HCRX_EL2_EnASR;
>>   	}
>>   }
>> +
>> +static inline bool kvm_is_realm(struct kvm *kvm)
>> +{
>> +	if (static_branch_unlikely(&kvm_rmi_is_available))
>> +		return kvm->arch.is_realm;
>> +	return false;
>> +}
>> +
>> +static inline enum realm_state kvm_realm_state(struct kvm *kvm)
>> +{
>> +	return READ_ONCE(kvm->arch.realm.state);
>> +}
>> +
>> +static inline void kvm_set_realm_state(struct kvm *kvm,
>> +				       enum realm_state new_state)
>> +{
>> +	WRITE_ONCE(kvm->arch.realm.state, new_state);
>> +}
>> +
>> +static inline bool kvm_realm_is_created(struct kvm *kvm)
>> +{
>> +	return kvm_is_realm(kvm) && kvm_realm_state(kvm) != REALM_STATE_NONE;
>> +}
>> +
>> +static inline bool vcpu_is_rec(const struct kvm_vcpu *vcpu)
>> +{
>> +	return false;
>> +}
>> +
>>   #endif /* __ARM64_KVM_EMULATE_H__ */
>> diff --git a/arch/arm64/include/asm/kvm_rmi.h b/arch/arm64/include/asm/kvm_rmi.h
>> index 4936007947fd..9de34983ee52 100644
>> --- a/arch/arm64/include/asm/kvm_rmi.h
>> +++ b/arch/arm64/include/asm/kvm_rmi.h
>> @@ -6,12 +6,63 @@
>>   #ifndef __ASM_KVM_RMI_H
>>   #define __ASM_KVM_RMI_H
>>   
>> +#include <asm/rmi_smc.h>
>> +
>> +/**
>> + * enum realm_state - State of a Realm
>> + */
>> +enum realm_state {
>> +	/**
>> +	 * @REALM_STATE_NONE:
>> +	 *      Realm has not yet been created. rmi_realm_create() has not
>> +	 *      yet been called.
>> +	 */
>> +	REALM_STATE_NONE,
>> +	/**
>> +	 * @REALM_STATE_NEW:
>> +	 *      Realm is under construction, rmi_realm_create() has been
>> +	 *      called, but it is not yet activated. Pages may be populated.
>> +	 */
>> +	REALM_STATE_NEW,
>> +	/**
>> +	 * @REALM_STATE_ACTIVE:
>> +	 *      Realm has been created and is eligible for execution with
>> +	 *      rmi_rec_enter(). Pages may no longer be populated with
>> +	 *      rmi_data_create().
>> +	 */
>> +	REALM_STATE_ACTIVE,
>> +	/**
>> +	 * @REALM_STATE_DYING:
>> +	 *      Realm is in the process of being destroyed or has already been
>> +	 *      destroyed.
>> +	 */
>> +	REALM_STATE_DYING,
>> +	/**
>> +	 * @REALM_STATE_DEAD:
>> +	 *      Realm has been destroyed.
>> +	 */
>> +	REALM_STATE_DEAD
>> +};
> 
> What is the ABI status of this state? Is it purely internal to KVM? Or
> is it something that the RMM actively tracks?

The states are in line with what the RMM maintains for the Realm state
(Section A2.2.5 Realm Lifecycle), except for:

1. REALM_STATE_DYING is really a KVM internal state indicating that we
are in the process of destroying the Realm and that no further requests
need to be serviced.

2. We don't track the REALM_SYSTEM_OFF and REALM_ZOMBIE states
separately, as:
  a) We always TERMINATE the Realm just before the DESTROY.
  b) SYSTEM_OFF naturally triggers the teardown path, leading to DYING.




> 
>> +
>>   /**
>>    * struct realm - Additional per VM data for a Realm
>> + *
>> + * @state: The lifetime state machine for the realm
>> + * @rd: Kernel mapping of the Realm Descriptor (RD)
>> + * @params: Parameters for the RMI_REALM_CREATE command
>> + * @ia_bits: Number of valid Input Address bits in the IPA
>>    */
>>   struct realm {
>> +	enum realm_state state;
>> +	void *rd;
> 
> Why is this void? Doesn't it have a proper type?

Not really. This is an object that the RMM manages (the Realm
Descriptor) in the Realm world. We use it as a parameter to address
the Realm.


> 
>> +	struct realm_params *params;
>> +	unsigned int ia_bits;
> 
> Consider reordering this structure to avoid holes.
> 
>>   };
>>   
>>   void kvm_init_rmi(void);
>> +u32 kvm_realm_ipa_limit(void);
> 
> The use of 'realm' is confusing. This is not a per-realm property, but
> something global. I'd rather reserve the term 'realm' for CCA VMs (cue
> the two prototypes below).

Agreed. Perhaps kvm_rmm_ipa_limit()?


> 
>> +
>> +int kvm_init_realm(struct kvm *kvm);
>> +void kvm_destroy_realm(struct kvm *kvm);
>>   
>>   #endif /* __ASM_KVM_RMI_H */
>> diff --git a/arch/arm64/kvm/arm.c b/arch/arm64/kvm/arm.c
>> index 247e03b33035..18251e561524 100644
>> --- a/arch/arm64/kvm/arm.c
>> +++ b/arch/arm64/kvm/arm.c
>> @@ -264,6 +264,13 @@ int kvm_arch_init_vm(struct kvm *kvm, unsigned long type)
>>   
>>   	bitmap_zero(kvm->arch.vcpu_features, KVM_VCPU_MAX_FEATURES);
>>   
>> +	/* Initialise the realm bits after the generic bits are enabled */
>> +	if (kvm_is_realm(kvm)) {
>> +		ret = kvm_init_realm(kvm);
>> +		if (ret)
>> +			goto err_uninit_mmu;
>> +	}
>> +
>>   	return 0;
>>   
>>   err_uninit_mmu:
>> @@ -326,6 +333,8 @@ void kvm_arch_destroy_vm(struct kvm *kvm)
>>   	kvm_unshare_hyp(kvm, kvm + 1);
>>   
>>   	kvm_arm_teardown_hypercalls(kvm);
>> +	if (kvm_is_realm(kvm))
>> +		kvm_destroy_realm(kvm);
>>   }
>>   
>>   static bool kvm_has_full_ptr_auth(void)
>> @@ -486,6 +495,9 @@ int kvm_vm_ioctl_check_extension(struct kvm *kvm, long ext)
>>   		else
>>   			r = kvm_supports_cacheable_pfnmap();
>>   		break;
>> +	case KVM_CAP_ARM_RMI:
>> +		r = static_key_enabled(&kvm_rmi_is_available);
>> +		break;
>>   
>>   	default:
>>   		r = 0;
>> diff --git a/arch/arm64/kvm/mmu.c b/arch/arm64/kvm/mmu.c
>> index d089c107d9b7..ba8286472286 100644
>> --- a/arch/arm64/kvm/mmu.c
>> +++ b/arch/arm64/kvm/mmu.c
>> @@ -877,10 +877,14 @@ static struct kvm_pgtable_mm_ops kvm_s2_mm_ops = {
>>   
>>   static int kvm_init_ipa_range(struct kvm_s2_mmu *mmu, unsigned long type)
>>   {
>> +	struct kvm *kvm = kvm_s2_mmu_to_kvm(mmu);
>>   	u32 kvm_ipa_limit = get_kvm_ipa_limit();
>>   	u64 mmfr0, mmfr1;
>>   	u32 phys_shift;
>>   
>> +	if (kvm_is_realm(kvm))
>> +		kvm_ipa_limit = kvm_realm_ipa_limit();
>> +
>>   	phys_shift = KVM_VM_TYPE_ARM_IPA_SIZE(type);
>>   	if (is_protected_kvm_enabled()) {
>>   		phys_shift = kvm_ipa_limit;
>> @@ -974,6 +978,8 @@ int kvm_init_stage2_mmu(struct kvm *kvm, struct kvm_s2_mmu *mmu, unsigned long t
>>   		return -EINVAL;
>>   	}
>>   
>> +	mmu->arch = &kvm->arch;
>> +
>>   	err = kvm_init_ipa_range(mmu, type);
>>   	if (err)
>>   		return err;
>> @@ -982,7 +988,6 @@ int kvm_init_stage2_mmu(struct kvm *kvm, struct kvm_s2_mmu *mmu, unsigned long t
>>   	if (!pgt)
>>   		return -ENOMEM;
>>   
>> -	mmu->arch = &kvm->arch;
> 
> Why moving this init?

Because kvm_init_ipa_range() needs to know the "kvm" instance to
detect the limit that applies to Realms.
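
To make the ordering dependency concrete, here is a small user-space
sketch. It assumes kvm_s2_mmu_to_kvm() derives the struct kvm from
mmu->arch via container_of(); the mock_* types, the simplified local
container_of() macro, and the limits 48 and 40 are all hypothetical.

```c
#include <assert.h>
#include <stddef.h>

/* Simplified local version of the kernel macro, for illustration only. */
#define container_of(ptr, type, member) \
	((type *)((char *)(ptr) - offsetof(type, member)))

struct mock_kvm_arch {
	int is_realm;
};

struct mock_kvm {
	int id;
	struct mock_kvm_arch arch;
};

struct mock_s2_mmu {
	struct mock_kvm_arch *arch;	/* must be set before any lookup */
};

/* Stand-in for kvm_s2_mmu_to_kvm(): only valid once mmu->arch is set. */
static struct mock_kvm *mock_s2_mmu_to_kvm(struct mock_s2_mmu *mmu)
{
	assert(mmu->arch != NULL);
	return container_of(mmu->arch, struct mock_kvm, arch);
}

/* Stand-in for kvm_init_ipa_range(): picks a limit based on the VM type. */
static unsigned int mock_init_ipa_range(struct mock_s2_mmu *mmu)
{
	struct mock_kvm *kvm = mock_s2_mmu_to_kvm(mmu);

	return kvm->arch.is_realm ? 40 : 48;	/* made-up limits */
}
```

If mmu->arch were still NULL when mock_init_ipa_range() runs, the lookup
would trip the assert, which is the situation the reordering avoids.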

> 
>>   	err = KVM_PGT_FN(kvm_pgtable_stage2_init)(pgt, mmu, &kvm_s2_mm_ops);
>>   	if (err)
>>   		goto out_free_pgtable;
>> @@ -1114,7 +1119,10 @@ void kvm_free_stage2_pgd(struct kvm_s2_mmu *mmu)
>>   	write_unlock(&kvm->mmu_lock);
>>   
>>   	if (pgt) {
>> -		kvm_stage2_destroy(pgt);
>> +		if (!kvm_is_realm(kvm))
>> +			kvm_stage2_destroy(pgt);
>> +		else
>> +			kvm_pgtable_stage2_destroy_pgd(pgt);
> 
> Why can't you make kvm_stage2_destroy() do the right thing? Surely the
> PTs have to be reclaimed one way or another.

Actually yes, we could make it work. We need to skip walking the page
table for Realms. We may be able to do the check via
pgt->mmu->arch->kvm and skip the walk for Realms. (The S2 is unmapped
and torn down before the RD is destroyed in kvm_destroy_realm(). We
can't rely on the contents of the PGDs being zero, e.g., with MEC.)



> 
>>   		kfree(pgt);
>>   	}
>>   }
>> diff --git a/arch/arm64/kvm/rmi.c b/arch/arm64/kvm/rmi.c
>> index 6e28b669ded2..f51ec667445e 100644
>> --- a/arch/arm64/kvm/rmi.c
>> +++ b/arch/arm64/kvm/rmi.c
>> @@ -5,6 +5,8 @@
>>   
>>   #include <linux/kvm_host.h>
>>   
>> +#include <asm/kvm_emulate.h>
>> +#include <asm/kvm_mmu.h>
>>   #include <asm/kvm_pgtable.h>
>>   #include <asm/rmi_cmds.h>
>>   #include <asm/virt.h>
>> @@ -14,6 +16,61 @@ static bool rmi_has_feature(unsigned long feature)
>>   	return !!u64_get_bits(rmm_feat_reg0, feature);
>>   }
>>   
>> +u32 kvm_realm_ipa_limit(void)
>> +{
>> +	return u64_get_bits(rmm_feat_reg0, RMI_FEATURE_REGISTER_0_S2SZ);
>> +}
>> +
>> +void kvm_destroy_realm(struct kvm *kvm)
>> +{
>> +	struct realm *realm = &kvm->arch.realm;
>> +	size_t pgd_size = kvm_pgtable_stage2_pgd_size(kvm->arch.mmu.vtcr);
>> +
>> +	if (realm->params) {
>> +		free_page((unsigned long)realm->params);
>> +		realm->params = NULL;
>> +	}
>> +
>> +	if (!kvm_realm_is_created(kvm))
>> +		return;
>> +
>> +	kvm_set_realm_state(kvm, REALM_STATE_DYING);
>> +
>> +	write_lock(&kvm->mmu_lock);
>> +	kvm_stage2_unmap_range(&kvm->arch.mmu, 0,
>> +			       BIT(realm->ia_bits - 1), true);
>> +	write_unlock(&kvm->mmu_lock);
>> +
>> +	if (realm->rd) {
>> +		phys_addr_t rd_phys = virt_to_phys(realm->rd);
>> +
>> +		if (WARN_ON(rmi_realm_terminate(rd_phys)))
>> +			return;
>> +
>> +		if (WARN_ON(rmi_realm_destroy(rd_phys)))
>> +			return;
>> +		free_delegated_page(rd_phys);
>> +		realm->rd = NULL;
>> +	}
>> +
>> +	if (WARN_ON(rmi_undelegate_range(kvm->arch.mmu.pgd_phys, pgd_size)))
>> +		return;
>> +
>> +	kvm_set_realm_state(kvm, REALM_STATE_DEAD);
>> +
>> +	/* Now that the Realm is destroyed, free the entry level RTTs */
>> +	kvm_free_stage2_pgd(&kvm->arch.mmu);
>> +}
> 
> This really needs documentation: what happens at each stage? What
> memory is reclaimed when?

Agreed.

> 
> But even more importantly, why is this built in a completely parallel
> way, potentially deviating from the existing KVM S2 management?


The RMM requires that a Realm is not live at the time of REALM_DESTROY
(see section A2.2.4 Realm Liveness), i.e., all RECs are destroyed and
the root RTTs are wiped clean (no live mappings) before the RD is
destroyed. So we need to make sure all of this is done at Realm
destroy. Hence we delay the kvm_free_stage2_pgd() until we destroy the
RD.
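
A minimal user-space sketch of that ordering constraint follows. The
mock_realm structure and the mock_* helpers are hypothetical stand-ins,
not the RMM or KVM interfaces; the only behaviour modelled is that the
destroy step refuses to run while the realm is still live, and that the
entry-level RTTs are freed only afterwards.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical stand-in for the RMM's view of a single realm. */
struct mock_realm {
	unsigned int live_recs;		/* RECs not yet destroyed */
	unsigned int live_mappings;	/* live entries in the root RTTs */
	bool rd_destroyed;
	bool root_rtt_freed;
};

/* Stand-in for REALM_DESTROY: fails while the realm is still live. */
static int mock_realm_destroy(struct mock_realm *r)
{
	if (r->live_recs || r->live_mappings)
		return -1;

	r->rd_destroyed = true;
	return 0;
}

/* Teardown in the order described above. */
static int mock_destroy_realm(struct mock_realm *r)
{
	r->live_recs = 0;	/* all RECs are destroyed first */
	r->live_mappings = 0;	/* stage 2 unmapped, root RTTs wiped */

	if (mock_realm_destroy(r))
		return -1;

	/* Only after the RD is gone are the entry-level RTTs freed. */
	assert(r->rd_destroyed);
	r->root_rtt_freed = true;
	return 0;
}
```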

Does that help? Maybe we could improve the comments around it.


Suzuki



> Thanks,
> 
> 	M.
> 



* RE: [PATCH v5 05/20] dma-pool: track decrypted atomic pools and select them via attrs
From: Michael Kelley @ 2026-06-02 14:24 UTC (permalink / raw)
  To: Aneesh Kumar K.V, iommu@lists.linux.dev,
	linux-arm-kernel@lists.infradead.org,
	linux-kernel@vger.kernel.org, linux-coco@lists.linux.dev
  Cc: Robin Murphy, Marek Szyprowski, Will Deacon, Marc Zyngier,
	Steven Price, Suzuki K Poulose, Catalin Marinas, Jiri Pirko,
	Jason Gunthorpe, Mostafa Saleh, Petr Tesarik,
	Alexey Kardashevskiy, Dan Williams, Xu Yilun,
	linuxppc-dev@lists.ozlabs.org, linux-s390@vger.kernel.org,
	Madhavan Srinivasan, Michael Ellerman, Nicholas Piggin,
	Christophe Leroy (CS GROUP), Alexander Gordeev, Gerald Schaefer,
	Heiko Carstens, Vasily Gorbik, Christian Borntraeger,
	Sven Schnelle, x86@kernel.org, Jiri Pirko
In-Reply-To: <yq5afr35sciu.fsf@kernel.org>

From: Aneesh Kumar K.V <aneesh.kumar@kernel.org> Sent: Monday, June 1, 2026 11:05 PM
> 
> Michael Kelley <mhklinux@outlook.com> writes:
> 
> > From: Aneesh Kumar K.V (Arm) <aneesh.kumar@kernel.org> Sent: Thursday, May 21, 2026 9:28 PM
> >>
> >> Teach the atomic DMA pool code to distinguish between encrypted and
> >> unencrypted pools, and make pool allocation select the matching pool based
> >> on DMA attributes.
> >>
> >> Introduce a dma_gen_pool wrapper that records whether a pool is
> >> unencrypted, initialize that state when the atomic pools are created, and
> >> use it when expanding and resizing the pools. Update dma_alloc_from_pool()
> >> to take attrs and skip pools whose encrypted state does not match
> >> DMA_ATTR_CC_SHARED. Update dma_free_from_pool() accordingly.
> >>
> >> Also pass DMA_ATTR_CC_SHARED from the swiotlb atomic allocation path so
> >> decrypted swiotlb allocations are taken from the correct atomic pool.
> >>
> >> Tested-by: Jiri Pirko <jiri@nvidia.com>
> >> Reviewed-by: Mostafa Saleh <smostafa@google.com>
> >> Signed-off-by: Aneesh Kumar K.V (Arm) <aneesh.kumar@kernel.org>
> >> ---
> >>  drivers/iommu/dma-iommu.c   |   2 +-
> >>  include/linux/dma-map-ops.h |   2 +-
> >>  kernel/dma/direct.c         |  11 ++-
> >>  kernel/dma/pool.c           | 167 +++++++++++++++++++++++-------------
> >>  kernel/dma/swiotlb.c        |   7 +-
> >>  5 files changed, 123 insertions(+), 66 deletions(-)
> >>
> >
> > [snip]
> >
> >> +static __init struct dma_gen_pool *__dma_atomic_pool_init(struct dma_gen_pool *dma_pool,
> >> +		size_t pool_size, gfp_t gfp)
> >>  {
> >> -	struct gen_pool *pool;
> >>  	int ret;
> >>
> >> -	pool = gen_pool_create(PAGE_SHIFT, NUMA_NO_NODE);
> >> -	if (!pool)
> >> +	dma_pool->pool = gen_pool_create(PAGE_SHIFT, NUMA_NO_NODE);
> >> +	if (!dma_pool->pool)
> >>  		return NULL;
> >>
> >> -	gen_pool_set_algo(pool, gen_pool_first_fit_order_align, NULL);
> >> +	gen_pool_set_algo(dma_pool->pool, gen_pool_first_fit_order_align, NULL);
> >> +
> >> +	/* if platform is using memory encryption atomic pools are by default decrypted. */
> >> +	if (cc_platform_has(CC_ATTR_MEM_ENCRYPT))
> >> +		dma_pool->unencrypted = true;
> >> +	else
> >> +		dma_pool->unencrypted = false;
> >
> > I'm curious about the name of the "unencrypted" field in struct dma_gen_pool,
> > and similarly in Patch 7 of the series for the swiotlb struct io_tlb_pool and
> > struct io_tlb_mem. Up through v3 of this series, you used "decrypted", but
> > starting in v4 switched to "unencrypted".
> >
> > To me, the above "if" statement has some cognitive dissonance in that if
> > CC_ATTR_MEM_ENCRYPT is false (i.e., a normal VM), "unencrypted" is set
> > to false. But I think of memory in a normal VM as "unencrypted" since it
> > was never encrypted. A similar "if" statement occurs in your swiotlb changes.
> >
> > Two related concepts are captured by the field:
> > 1) Is some action needed to put the memory into the unencrypted state,
> > and to remove it from that state? This applies when assigning memory to the
> > pool, or freeing the memory in the pool.
> > 2) Is the memory currently in the unencrypted state? This applies when
> > allocating memory from the pool to a caller.
> >
> > It's hard to capture all that in a short field name. But I think I prefer "decrypted"
> > over "unencrypted".  The former implies that some action was taken. It's a
> > little easier to think of a normal VM as *not* having decrypted memory. The
> > memory was never encrypted in the first place, so no decryption action was taken.
> >
> > Throughout the kernel, "decrypted" occurs much more frequently than
> > "unencrypted".  We have set_memory_encrypted() and set_memory_decrypted()
> > that are "take action" names.  But we also have force_dma_unencrypted(),
> > phys_to_dma_unencrypted(), and dma_addr_unencrypted(). So it's a bit
> > of a mess.
> >
> >
> > But maybe there's more background here that led to the change
> > between your v3 and v4.
> >
> > Michael
> 
> The current APIs, phys_to_dma_unencrypted() and dma_addr_unencrypted(),
> are the reason I changed the pool attribute name from decrypted to
> unencrypted. The rationale was that nobody actually decrypted the
> memory; the memory was already in an unencrypted state.
> 
> In other words, the DMA pool did not contain encrypted content that was
> later decrypted. Rather, the DMA pool itself was in an unencrypted
> state.
> 
> IMHO, set_memory_decrypted()/set_memory_encrypted() is the right naming
> because those APIs describe an operation that transitions memory between
> states. In contrast, the pool attribute describes the state of the
> memory itself, which is why I used unencrypted rather than decrypted.
>

Except that in a normal VM, the "unencrypted" pool attribute does *not*
describe the state of the memory itself.  In a normal VM, the memory is
unencrypted, but the "unencrypted" pool attribute is false. That
contradiction is the essence of my concern.
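
The mismatch can be written down as a tiny truth table. This is only a
sketch: mock_pool_state and the helpers are hypothetical, the attribute
assignment mirrors the quoted hunk, and the "memory is unencrypted in
both cases" line encodes the assumption made in this discussion (never
encrypted in a normal VM, decrypted by default in a CoCo VM).

```c
#include <assert.h>
#include <stdbool.h>

struct mock_pool_state {
	bool attr_unencrypted;	/* the pool attribute from the quoted hunk */
	bool mem_unencrypted;	/* what the pool memory actually is */
};

static struct mock_pool_state mock_atomic_pool_state(bool cc_mem_encrypt)
{
	struct mock_pool_state s;

	/* Mirrors the quoted hunk: attribute follows CC_ATTR_MEM_ENCRYPT. */
	s.attr_unencrypted = cc_mem_encrypt;

	/*
	 * Assumption from the discussion: atomic pool memory is
	 * unencrypted either way (never encrypted in a normal VM,
	 * decrypted by default in a CoCo VM).
	 */
	s.mem_unencrypted = true;

	return s;
}

/* True when the attribute describes the actual state of the memory. */
static bool mock_attr_matches_memory(bool cc_mem_encrypt)
{
	struct mock_pool_state s = mock_atomic_pool_state(cc_mem_encrypt);

	return s.attr_unencrypted == s.mem_unencrypted;
}

/* The attribute only matches the memory state in the CoCo VM case. */
static void mock_show_mismatch(void)
{
	assert(mock_attr_matches_memory(true));		/* CoCo VM */
	assert(!mock_attr_matches_memory(false));	/* normal VM */
}
```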

Michael


* Re: [PATCH v14 13/44] arm64: RMI: Define the user ABI
From: Suzuki K Poulose @ 2026-06-02 11:15 UTC (permalink / raw)
  To: Marc Zyngier, Steven Price
  Cc: kvm, kvmarm, Catalin Marinas, Will Deacon, James Morse,
	Oliver Upton, Zenghui Yu, linux-arm-kernel, linux-kernel,
	Joey Gouly, Alexandru Elisei, Christoffer Dall, Fuad Tabba,
	linux-coco, Ganapatrao Kulkarni, Gavin Shan, Shanker Donthineni,
	Alper Gun, Aneesh Kumar K . V, Emi Kisanuki, Vishal Annapurve,
	WeiLin.Chang, Lorenzo.Pieralisi2
In-Reply-To: <86jysovpxf.wl-maz@kernel.org>

Hi Marc

On 27/05/2026 16:21, Marc Zyngier wrote:
> On Wed, 13 May 2026 14:17:21 +0100,
> Steven Price <steven.price@arm.com> wrote:
>>
>> There is one CAP which identifies the presence of CCA, and one ioctl.
>> The ioctl is used to populate memory during creation of the realm as
>> this requires the RMM to copy data from an unprotected address to the
>> protected memory - CCA does not support memory conversion where the
>> memory contents is preserved as this is incompatible with memory
>> encryption.
>>
>> Signed-off-by: Steven Price <steven.price@arm.com>
>> ---
>> Changes since v13:
>>   * KVM_ARM_VCPU_RMI_PSCI_COMPLETE removed.
>>   * KVM_ARM_RMI_POPULATE documentation updated to reflect that the
>>     structure is written by the kernel.
>>   * CAP number bumped.
>> Changes since v12:
>>   * Change KVM_ARM_RMI_POPULATE to update the structure with the amount
>>     that has been progressed rather than return the number of bytes
>>     populated.
>>   * Describe the flag KVM_ARM_RMI_POPULATE_FLAGS_MEASURE.
>>   * CAP number is bumped.
>>   * NOTE: The PSCI ioctl may be removed in a future spec release.
>> Changes since v11:
>>   * Completely reworked to be more implicit. Rather than having explicit
>>     CAP operations to progress the realm construction these operations
>>     are done when needed (on populating and on first vCPU run).
>>   * Populate and PSCI complete are promoted to proper ioctls.
>> Changes since v10:
>>   * Rename symbols from RME to RMI.
>> Changes since v9:
>>   * Improvements to documentation.
>>   * Bump the magic number for KVM_CAP_ARM_RME to avoid conflicts.
>> Changes since v8:
>>   * Minor improvements to documentation following review.
>>   * Bump the magic numbers to avoid conflicts.
>> Changes since v7:
>>   * Add documentation of new ioctls
>>   * Bump the magic numbers to avoid conflicts
>> Changes since v6:
>>   * Rename some of the symbols to make their usage clearer and avoid
>>     repetition.
>> Changes from v5:
>>   * Actually expose the new VCPU capability (KVM_ARM_VCPU_REC) by bumping
>>     KVM_VCPU_MAX_FEATURES - note this also exposes KVM_ARM_VCPU_HAS_EL2!
>> ---
>>   Documentation/virt/kvm/api.rst | 40 ++++++++++++++++++++++++++++++++++
>>   include/uapi/linux/kvm.h       | 13 +++++++++++
>>   2 files changed, 53 insertions(+)
> 
> $SUBJECT looks wrong. This is a KVM change, not an RMI change.
> 
>>
>> diff --git a/Documentation/virt/kvm/api.rst b/Documentation/virt/kvm/api.rst
>> index 52bbbb553ce1..ca68aae7faa2 100644
>> --- a/Documentation/virt/kvm/api.rst
>> +++ b/Documentation/virt/kvm/api.rst
>> @@ -6553,6 +6553,37 @@ KVM_S390_KEYOP_SSKE
>>     Sets the storage key for the guest address ``guest_addr`` to the key
>>     specified in ``key``, returning the previous value in ``key``.
>>   
>> +4.145 KVM_ARM_RMI_POPULATE
>> +--------------------------
>> +
>> +:Capability: KVM_CAP_ARM_RMI
>> +:Architectures: arm64
>> +:Type: vm ioctl
>> +:Parameters: struct kvm_arm_rmi_populate (in/out)
>> +:Returns: 0 on success, < 0 on error
>> +
>> +::
>> +
>> +  struct kvm_arm_rmi_populate {
>> +	__u64 base;
>> +	__u64 size;
>> +	__u64 source_uaddr;
>> +	__u32 flags;
>> +	__u32 reserved;
>> +  };
>> +
>> +Populate a region of protected address space by copying the data from the
>> +(non-protected) user space pointer provided into a protected region (backed by
>> +guestmem_fd). It implicitly sets the destination region to RIPAS RAM. This is
>> +only valid before any VCPUs have been run. The ioctl might not populate the
>> +entire region and in this case the kernel updates the fields `base`, `size` and
>> +`source_uaddr`. User space may have to repeatedly call it until `size` is 0 to
>> +populate the entire region.
>> +
>> +`flags` can be set to `KVM_ARM_RMI_POPULATE_FLAGS_MEASURE` to request that the
>> +populated data is hashed and added to the guest's Realm Initial Measurement
>> +(RIM).
> 
> Where is that measurement stored? And retrieved? At least a pointer to
> that would help.

The measurement is stored by the RMM and is made available to the guest
via the RSI interface (RSI_ATTEST_TOKEN_{INIT,CONTINUE}) as part of the
attestation report, along with the platform attestation. On a Linux guest,
it can be fetched using the TSM report infrastructure. This could be
added to the doc.
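
As an aside on the retry semantics in the quoted documentation ("repeatedly
call it until `size` is 0"), a user-space loop could look roughly like the
sketch below. It rests on several assumptions: the struct and flag value are
copied from the quoted patch rather than from a released uapi header, and
`populate_fn` is a hypothetical stand-in for
ioctl(vm_fd, KVM_ARM_RMI_POPULATE, &args) that returns a negative errno value
on failure, so that the loop can be exercised without a realm-capable host.

```c
#include <assert.h>
#include <errno.h>
#include <stdint.h>

/* Mirrors the struct in the quoted patch; not taken from a released header. */
struct kvm_arm_rmi_populate {
	uint64_t base;
	uint64_t size;
	uint64_t source_uaddr;
	uint32_t flags;
	uint32_t reserved;
};

#define KVM_ARM_RMI_POPULATE_FLAGS_MEASURE (1u << 0)

/* Hypothetical stand-in for ioctl(vm_fd, KVM_ARM_RMI_POPULATE, args). */
typedef int (*populate_fn)(struct kvm_arm_rmi_populate *args);

/*
 * Call the populate operation until the kernel reports size == 0.
 * Returns 0 on success or a negative errno value on failure.
 */
static int rmi_populate_all(populate_fn do_populate, uint64_t ipa,
			    uint64_t size, uint64_t src, int measure)
{
	struct kvm_arm_rmi_populate args = {
		.base = ipa,
		.size = size,
		.source_uaddr = src,
		.flags = measure ? KVM_ARM_RMI_POPULATE_FLAGS_MEASURE : 0,
	};

	assert(do_populate);

	while (args.size) {
		uint64_t before = args.size;
		int ret = do_populate(&args);

		if (ret < 0)
			return ret;
		/* Assumption: a successful call always makes progress. */
		if (args.size >= before)
			return -EIO;
	}
	return 0;
}

/* Mock that populates at most 4 KiB per call, to show partial progress. */
static int mock_calls;
static int mock_populate(struct kvm_arm_rmi_populate *args)
{
	uint64_t step = args->size < 4096 ? args->size : 4096;

	mock_calls++;
	args->base += step;
	args->source_uaddr += step;
	args->size -= step;
	return 0;
}
```

The no-progress check is a defensive choice of this sketch, not something the
quoted documentation promises; without it a misbehaving backend could spin
forever.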


Suzuki



> 
>> +
>>   .. _kvm_run:
>>   
>>   5. The kvm_run structure
>> @@ -8904,6 +8935,15 @@ helpful if user space wants to emulate instructions which are not
>>   This capability can be enabled dynamically even if VCPUs were already
>>   created and are running.
>>   
>> +7.47 KVM_CAP_ARM_RMI
>> +--------------------
>> +
>> +:Architectures: arm64
>> +:Target: VM
>> +:Parameters: None
>> +
>> +This capability indicates that support for CCA realms is available.
>> +
>>   8. Other capabilities.
>>   ======================
>>   
>> diff --git a/include/uapi/linux/kvm.h b/include/uapi/linux/kvm.h
>> index 6c8afa2047bf..b8cff0938041 100644
>> --- a/include/uapi/linux/kvm.h
>> +++ b/include/uapi/linux/kvm.h
>> @@ -996,6 +996,7 @@ struct kvm_enable_cap {
>>   #define KVM_CAP_S390_USER_OPEREXEC 246
>>   #define KVM_CAP_S390_KEYOP 247
>>   #define KVM_CAP_S390_VSIE_ESAMODE 248
>> +#define KVM_CAP_ARM_RMI 249
>>   
>>   struct kvm_irq_routing_irqchip {
>>   	__u32 irqchip;
>> @@ -1669,4 +1670,16 @@ struct kvm_pre_fault_memory {
>>   	__u64 padding[5];
>>   };
>>   
>> +/* Available with KVM_CAP_ARM_RMI, only for VMs with KVM_VM_TYPE_ARM_REALM */
>> +#define KVM_ARM_RMI_POPULATE	_IOWR(KVMIO, 0xd7, struct kvm_arm_rmi_populate)
>> +#define KVM_ARM_RMI_POPULATE_FLAGS_MEASURE	(1 << 0)
>> +
>> +struct kvm_arm_rmi_populate {
>> +	__u64 base;
>> +	__u64 size;
>> +	__u64 source_uaddr;
>> +	__u32 flags;
>> +	__u32 reserved;
>> +};
>> +
>>   #endif /* __LINUX_KVM_H */
> 
> Thanks,
> 
> 	M.
> 


^ permalink raw reply

* Re: [PATCH v7 07/42] KVM: guest_memfd: Only prepare folios for private pages
From: Suzuki K Poulose @ 2026-06-02  9:10 UTC (permalink / raw)
  To: ackerleytng, aik, andrew.jones, binbin.wu, brauner, chao.p.peng,
	david, ira.weiny, jmattson, jthoughton, michael.roth, oupton,
	pankaj.gupta, qperret, rick.p.edgecombe, rientjes, shivankg,
	steven.price, tabba, willy, wyihan, yan.y.zhao, forkloop,
	pratyush, aneesh.kumar, liam, Paolo Bonzini, Sean Christopherson,
	Thomas Gleixner, Ingo Molnar, Borislav Petkov, Dave Hansen, x86,
	H. Peter Anvin, Steven Rostedt, Masami Hiramatsu,
	Mathieu Desnoyers, Jonathan Corbet, Shuah Khan, Shuah Khan,
	Vishal Annapurve, Andrew Morton, Chris Li, Kairui Song,
	Kemeng Shi, Nhat Pham, Baoquan He, Barry Song, Axel Rasmussen,
	Yuanchu Xie, Wei Xu, Youngjun Park, Qi Zheng, Shakeel Butt,
	Kiryl Shutsemau, Jason Gunthorpe, Vlastimil Babka
  Cc: kvm, linux-kernel, linux-trace-kernel, linux-doc, linux-kselftest,
	linux-mm, linux-coco
In-Reply-To: <d01cf1ec-b85d-4af6-9810-8107c0e2a4ec@arm.com>



On 02/06/2026 09:55, Suzuki K Poulose wrote:
> On 23/05/2026 01:17, Ackerley Tng via B4 Relay wrote:
>> From: Ackerley Tng <ackerleytng@google.com>
>>
>> All-shared guest_memfd used to be only supported for non-CoCo VMs where
>> preparation doesn't apply. INIT_SHARED is about to be supported for
>> non-CoCo VMs in a later patch in this series.
> 
> nit: s/non-CoCo/CoCo ?
> 
>>
>> In addition, KVM_SET_MEMORY_ATTRIBUTES2 is about to be supported in
>> guest_memfd in a later patch in this series.
>>
>> This means that the kvm fault handler may now call kvm_gmem_get_pfn() on a
>> shared folio for a CoCo VM where preparation applies.
>>
>> Add a check to make sure that preparation is only performed for private
>> folios.
>>
>> Preparation will be undone on freeing (see kvm_gmem_free_folio()) and on
>> conversion to shared.
>>
>> Signed-off-by: Michael Roth <michael.roth@amd.com>
> 
> nit: Missing Co-Developed-by: ?
> 
>> Reviewed-by: Fuad Tabba <tabba@google.com>
>> Signed-off-by: Ackerley Tng <ackerleytng@google.com>
>> ---
>>   virt/kvm/guest_memfd.c | 9 ++++++---
>>   1 file changed, 6 insertions(+), 3 deletions(-)
>>
>> diff --git a/virt/kvm/guest_memfd.c b/virt/kvm/guest_memfd.c
>> index 78e5435967341..adf57a3a1f5dd 100644
>> --- a/virt/kvm/guest_memfd.c
>> +++ b/virt/kvm/guest_memfd.c
>> @@ -894,6 +894,7 @@ int kvm_gmem_get_pfn(struct kvm *kvm, struct kvm_memory_slot *slot,
>>                int *max_order)
>>   {
>>       pgoff_t index = kvm_gmem_get_index(slot, gfn);
>> +    struct inode *inode;
>>       struct folio *folio;
>>       int r = 0;
>> @@ -901,7 +902,8 @@ int kvm_gmem_get_pfn(struct kvm *kvm, struct kvm_memory_slot *slot,
>>       if (!file)
>>           return -EFAULT;
>> -    filemap_invalidate_lock_shared(file_inode(file)->i_mapping);
>> +    inode = file_inode(file);
>> +    filemap_invalidate_lock_shared(inode->i_mapping);
>>       folio = __kvm_gmem_get_pfn(file, slot, index, pfn, max_order);
>>       if (IS_ERR(folio)) {
>> @@ -914,7 +916,8 @@ int kvm_gmem_get_pfn(struct kvm *kvm, struct kvm_memory_slot *slot,
>>           folio_mark_uptodate(folio);
>>       }
>> -    r = kvm_gmem_prepare_folio(kvm, slot, gfn, folio);
>> +    if (kvm_gmem_is_private_mem(inode, index))
> 
> Don't we need to make sure the entire folio is private ? Not just the 
> page at the index ?
>      if (kvm_gmem_range_is_private(, index, folio_nr_pages(folio)) ?

Or rather, should we go through the individual pages and apply the
preparation only to the ones that are private?
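
To make the two options concrete, here is a small user-space model. It is
only a sketch under assumptions: the folio is represented as a run of page
indices, privacy is a plain flag array, and every name in it
(`gmem_model`, `model_range_is_private`, `model_prepare_private_pages`,
`model_prepare_page`) is hypothetical and not part of guest_memfd.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Toy model: one flag per page index; true means the page is private. */
struct gmem_model {
	const bool *is_private;
	size_t nr_pages;
};

/* Option 1: true only if every page in [start, start + nr) is private. */
static bool model_range_is_private(const struct gmem_model *m,
				   size_t start, size_t nr)
{
	assert(m && m->is_private);

	if (nr == 0 || start >= m->nr_pages || nr > m->nr_pages - start)
		return false;
	for (size_t i = start; i < start + nr; i++)
		if (!m->is_private[i])
			return false;
	return true;
}

/* Hypothetical per-page prepare hook; here it only counts its calls. */
static size_t prepared_pages;
static int model_prepare_page(size_t index)
{
	(void)index;
	prepared_pages++;
	return 0;
}

/* Option 2: walk the folio and prepare only the private pages. */
static int model_prepare_private_pages(const struct gmem_model *m,
				       size_t start, size_t nr)
{
	assert(m && m->is_private);

	if (start >= m->nr_pages || nr > m->nr_pages - start)
		return -1;
	for (size_t i = start; i < start + nr; i++) {
		int ret;

		if (!m->is_private[i])
			continue;
		ret = model_prepare_page(i);
		if (ret)
			return ret;
	}
	return 0;
}
```

Option 1 skips preparation for a mixed folio entirely, while option 2
prepares a subset; which of these the real prepare hook can support is the
open question here.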

Suzuki

> 
> Suzuki
> 
>> +        r = kvm_gmem_prepare_folio(kvm, slot, gfn, folio);
>>       folio_unlock(folio);
>> @@ -924,7 +927,7 @@ int kvm_gmem_get_pfn(struct kvm *kvm, struct kvm_memory_slot *slot,
>>           folio_put(folio);
>>   out:
>> -    filemap_invalidate_unlock_shared(file_inode(file)->i_mapping);
>> +    filemap_invalidate_unlock_shared(inode->i_mapping);
>>       return r;
>>   }
>>   EXPORT_SYMBOL_FOR_KVM_INTERNAL(kvm_gmem_get_pfn);
>>
> 


^ permalink raw reply

* Re: [PATCH v7 07/42] KVM: guest_memfd: Only prepare folios for private pages
From: Suzuki K Poulose @ 2026-06-02  8:55 UTC (permalink / raw)
  To: ackerleytng, aik, andrew.jones, binbin.wu, brauner, chao.p.peng,
	david, ira.weiny, jmattson, jthoughton, michael.roth, oupton,
	pankaj.gupta, qperret, rick.p.edgecombe, rientjes, shivankg,
	steven.price, tabba, willy, wyihan, yan.y.zhao, forkloop,
	pratyush, aneesh.kumar, liam, Paolo Bonzini, Sean Christopherson,
	Thomas Gleixner, Ingo Molnar, Borislav Petkov, Dave Hansen, x86,
	H. Peter Anvin, Steven Rostedt, Masami Hiramatsu,
	Mathieu Desnoyers, Jonathan Corbet, Shuah Khan, Shuah Khan,
	Vishal Annapurve, Andrew Morton, Chris Li, Kairui Song,
	Kemeng Shi, Nhat Pham, Baoquan He, Barry Song, Axel Rasmussen,
	Yuanchu Xie, Wei Xu, Youngjun Park, Qi Zheng, Shakeel Butt,
	Kiryl Shutsemau, Jason Gunthorpe, Vlastimil Babka
  Cc: kvm, linux-kernel, linux-trace-kernel, linux-doc, linux-kselftest,
	linux-mm, linux-coco
In-Reply-To: <20260522-gmem-inplace-conversion-v7-7-2f0fae496530@google.com>

On 23/05/2026 01:17, Ackerley Tng via B4 Relay wrote:
> From: Ackerley Tng <ackerleytng@google.com>
> 
> All-shared guest_memfd used to be only supported for non-CoCo VMs where
> preparation doesn't apply. INIT_SHARED is about to be supported for
> non-CoCo VMs in a later patch in this series.

nit: s/non-CoCo/CoCo ?

> 
> In addition, KVM_SET_MEMORY_ATTRIBUTES2 is about to be supported in
> guest_memfd in a later patch in this series.
> 
> This means that the kvm fault handler may now call kvm_gmem_get_pfn() on a
> shared folio for a CoCo VM where preparation applies.
> 
> Add a check to make sure that preparation is only performed for private
> folios.
> 
> Preparation will be undone on freeing (see kvm_gmem_free_folio()) and on
> conversion to shared.
> 
> Signed-off-by: Michael Roth <michael.roth@amd.com>

nit: Missing Co-Developed-by: ?

> Reviewed-by: Fuad Tabba <tabba@google.com>
> Signed-off-by: Ackerley Tng <ackerleytng@google.com>
> ---
>   virt/kvm/guest_memfd.c | 9 ++++++---
>   1 file changed, 6 insertions(+), 3 deletions(-)
> 
> diff --git a/virt/kvm/guest_memfd.c b/virt/kvm/guest_memfd.c
> index 78e5435967341..adf57a3a1f5dd 100644
> --- a/virt/kvm/guest_memfd.c
> +++ b/virt/kvm/guest_memfd.c
> @@ -894,6 +894,7 @@ int kvm_gmem_get_pfn(struct kvm *kvm, struct kvm_memory_slot *slot,
>   		     int *max_order)
>   {
>   	pgoff_t index = kvm_gmem_get_index(slot, gfn);
> +	struct inode *inode;
>   	struct folio *folio;
>   	int r = 0;
>   
> @@ -901,7 +902,8 @@ int kvm_gmem_get_pfn(struct kvm *kvm, struct kvm_memory_slot *slot,
>   	if (!file)
>   		return -EFAULT;
>   
> -	filemap_invalidate_lock_shared(file_inode(file)->i_mapping);
> +	inode = file_inode(file);
> +	filemap_invalidate_lock_shared(inode->i_mapping);
>   
>   	folio = __kvm_gmem_get_pfn(file, slot, index, pfn, max_order);
>   	if (IS_ERR(folio)) {
> @@ -914,7 +916,8 @@ int kvm_gmem_get_pfn(struct kvm *kvm, struct kvm_memory_slot *slot,
>   		folio_mark_uptodate(folio);
>   	}
>   
> -	r = kvm_gmem_prepare_folio(kvm, slot, gfn, folio);
> +	if (kvm_gmem_is_private_mem(inode, index))

Don't we need to make sure the entire folio is private ? Not just the 
page at the index ?
	if (kvm_gmem_range_is_private(, index, folio_nr_pages(folio)) ?

Suzuki

> +		r = kvm_gmem_prepare_folio(kvm, slot, gfn, folio);
>   
>   	folio_unlock(folio);
>   
> @@ -924,7 +927,7 @@ int kvm_gmem_get_pfn(struct kvm *kvm, struct kvm_memory_slot *slot,
>   		folio_put(folio);
>   
>   out:
> -	filemap_invalidate_unlock_shared(file_inode(file)->i_mapping);
> +	filemap_invalidate_unlock_shared(inode->i_mapping);
>   	return r;
>   }
>   EXPORT_SYMBOL_FOR_KVM_INTERNAL(kvm_gmem_get_pfn);
> 


^ permalink raw reply

* Re: [RFC PATCH v4 01/14] coco: host: arm64: Add host TSM callback and IDE stream allocation support
From: Aneesh Kumar K.V @ 2026-06-02  8:42 UTC (permalink / raw)
  To: Dan Williams (nvidia), linux-coco, kvmarm, linux-arm-kernel,
	linux-kernel
  Cc: Alexey Kardashevskiy, Catalin Marinas, Dan Williams,
	Jason Gunthorpe, Jonathan Cameron, Marc Zyngier, Samuel Ortiz,
	Steven Price, Suzuki K Poulose, Will Deacon, Xu Yilun
In-Reply-To: <6a17d6f1d6371_2b1fb710057@djbw-dev.notmuch>

"Dan Williams (nvidia)" <djbw@kernel.org> writes:

> Aneesh Kumar K.V (Arm) wrote:
>> Register the TSM callback when the DA feature is supported by KVM.
>> 
>> This driver handles IDE stream setup for both the root port and PCIe
>> endpoints. Root port IDE stream enablement itself is managed by RMM.
>> 
>> In addition, the driver registers pci_tsm_ops with the TSM subsystem.
>
> Do you want to call out that this is an infrastructure / scaffolding
> patch that only handles the PCI-TSM skeleton. The CCA meat comes later,
> in particular IDE key management. Tell a bit more of the story 
>
> Otherwise, mostly looks good.
>

Sure, I’ll update the commit message.

>
> Minor comments below...
>
>> Signed-off-by: Aneesh Kumar K.V (Arm) <aneesh.kumar@kernel.org>
>> ---
>>  arch/arm64/include/asm/rmi_smc.h         |   2 +
>>  drivers/firmware/smccc/rmm.c             |  12 ++
>>  drivers/firmware/smccc/rmm.h             |   8 +
>>  drivers/firmware/smccc/smccc.c           |   1 +
>>  drivers/virt/coco/Kconfig                |   2 +
>>  drivers/virt/coco/Makefile               |   1 +
>>  drivers/virt/coco/arm-cca-host/Kconfig   |  19 ++
>>  drivers/virt/coco/arm-cca-host/Makefile  |   5 +
>>  drivers/virt/coco/arm-cca-host/arm-cca.c | 225 +++++++++++++++++++++++
>>  drivers/virt/coco/arm-cca-host/rmi-da.h  |  46 +++++
>>  10 files changed, 321 insertions(+)
>>  create mode 100644 drivers/virt/coco/arm-cca-host/Kconfig
>>  create mode 100644 drivers/virt/coco/arm-cca-host/Makefile
>>  create mode 100644 drivers/virt/coco/arm-cca-host/arm-cca.c
>>  create mode 100644 drivers/virt/coco/arm-cca-host/rmi-da.h
>> 
>> diff --git a/arch/arm64/include/asm/rmi_smc.h b/arch/arm64/include/asm/rmi_smc.h
>> index fa23818e1b4c..109d6cc6ef37 100644
>> --- a/arch/arm64/include/asm/rmi_smc.h
>> +++ b/arch/arm64/include/asm/rmi_smc.h
> [..]
>> diff --git a/drivers/firmware/smccc/rmm.c b/drivers/firmware/smccc/rmm.c
>> index 2a6187df3285..7444cc3a588c 100644
>> --- a/drivers/firmware/smccc/rmm.c
>> +++ b/drivers/firmware/smccc/rmm.c
> [..]
>> diff --git a/drivers/firmware/smccc/rmm.h b/drivers/firmware/smccc/rmm.h
>> index a47a650d4f51..37d0d95a099e 100644
>> --- a/drivers/firmware/smccc/rmm.h
>> +++ b/drivers/firmware/smccc/rmm.h
> [..]
>> diff --git a/drivers/firmware/smccc/smccc.c b/drivers/firmware/smccc/smccc.c
>> index fc9b44b7c687..2bf2d59e686d 100644
>> --- a/drivers/firmware/smccc/smccc.c
>> +++ b/drivers/firmware/smccc/smccc.c
>> @@ -97,6 +97,7 @@ static int __init smccc_devices_init(void)
>>  		 * the required SMCCC function IDs at a supported revision.
>>  		 */
>>  		register_rsi_device(pdev);
>> +		register_rmi_device(pdev);
>>  	}
>
> Would splitting the above three hunks make this series stand on its own
> relative to the base CCA series? I assume likely not as soon as we get
> to patch2.
>
> Otherwise, just curious what your intended merge strategy is for this,
> tsm.git or arm.git, and what help this needs?
>
> [..]
> snip code that looks good.
>

Yes, I’ll split this into a separate patch.

>
>> diff --git a/drivers/virt/coco/arm-cca-host/Makefile b/drivers/virt/coco/arm-cca-host/Makefile
>> new file mode 100644
>> index 000000000000..c236827f002c
>> --- /dev/null
>> +++ b/drivers/virt/coco/arm-cca-host/Makefile
>> @@ -0,0 +1,5 @@
>> +# SPDX-License-Identifier: GPL-2.0-only
>> +#
>> +obj-$(CONFIG_ARM_CCA_HOST) += arm-cca-host.o
>> +
>> +arm-cca-host-y	+=  arm-cca.o
>> diff --git a/drivers/virt/coco/arm-cca-host/arm-cca.c b/drivers/virt/coco/arm-cca-host/arm-cca.c
>> new file mode 100644
>> index 000000000000..67f7e80106e8
>> --- /dev/null
>> +++ b/drivers/virt/coco/arm-cca-host/arm-cca.c
>> @@ -0,0 +1,225 @@
>> +// SPDX-License-Identifier: GPL-2.0-only
>> +/*
>> + * Copyright (C) 2026 ARM Ltd.
>> + */
>> +
>> +#include <linux/auxiliary_bus.h>
>> +#include <linux/pci-tsm.h>
>> +#include <linux/pci-ide.h>
>> +#include <linux/module.h>
>> +#include <linux/pci.h>
>> +#include <linux/tsm.h>
>> +#include <linux/vmalloc.h>
>> +#include <linux/cleanup.h>
>> +
>> +#include "rmi-da.h"
>> +
>> +/* Total number of stream id supported at root port level */
>> +#define MAX_STREAM_ID	256
>> +
>> +static struct pci_tsm *cca_tsm_pci_probe(struct tsm_dev *tsm_dev, struct pci_dev *pdev)
>> +{
>> +	int ret;
>> +
>> +	if (!is_pci_tsm_pf0(pdev)) {
>> +		struct cca_host_fn_dsc *fn_dsc __free(kfree) =
>> +			kzalloc(sizeof(*fn_dsc), GFP_KERNEL);
>
> kzalloc_obj(*fn_dsc)
>
>> +
>> +		if (!fn_dsc)
>> +			return NULL;
>> +
>> +		ret = pci_tsm_link_constructor(pdev, &fn_dsc->pci, tsm_dev);
>> +		if (ret)
>> +			return NULL;
>> +
>> +		return &no_free_ptr(fn_dsc)->pci;
>> +	}
>> +
>> +	if (!pdev->ide_cap)
>> +		return NULL;
>
> Bailing early?
>
> Maybe the RMM knows something about this device not needing IDE? I have
> a similar question in patch2 around trusted sources for whether a device
> is internal or not.
>

Yes. This gets updated later in
https://lore.kernel.org/all/20260427065121.916615-14-aneesh.kumar@kernel.org

>
>> +
>> +	struct cca_host_pf0_ep_dsc *pf0_ep_dsc __free(kfree) =
>> +		kzalloc(sizeof(*pf0_ep_dsc), GFP_KERNEL);
>> +	if (!pf0_ep_dsc)
>> +		return NULL;
>> +
>> +	ret = pci_tsm_pf0_constructor(pdev, &pf0_ep_dsc->pci, tsm_dev);
>> +	if (ret)
>> +		return NULL;
>> +
>> +	pci_dbg(pdev, "tsm enabled\n");
>> +	return &no_free_ptr(pf0_ep_dsc)->pci.base_tsm;
>> +}
>> +
>> +static void cca_tsm_pci_remove(struct pci_tsm *tsm)
>> +{
>> +	struct pci_dev *pdev = tsm->pdev;
>> +
>> +	if (is_pci_tsm_pf0(pdev)) {
>> +		struct cca_host_pf0_ep_dsc *pf0_ep_dsc = to_cca_pf0_ep_dsc(pdev);
>> +
>> +		pci_tsm_pf0_destructor(&pf0_ep_dsc->pci);
>> +		kfree(pf0_ep_dsc);
>> +	} else {
>> +		kfree(to_cca_fn_dsc(pdev));
>> +	}
>> +}
>> +
>> +/* For now global for simplicity. Protected by pci_tsm_rwsem */
>> +static DECLARE_BITMAP(cca_stream_ids, MAX_STREAM_ID);
>> +static int alloc_stream_id(struct pci_host_bridge *hb)
>> +{
>> +	int stream_id;
>> +
>> +redo_alloc:
>> +	stream_id = find_first_zero_bit(cca_stream_ids, MAX_STREAM_ID);
>> +	if (stream_id == MAX_STREAM_ID)
>> +		return stream_id;
>> +
>> +	if (ida_exists(&hb->ide_stream_ids_ida, stream_id)) {
>> +		/* mark the stream allocated in the global bitmap. */
>> +		set_bit(stream_id, cca_stream_ids);
>> +		goto redo_alloc;
>> +	}
>> +	return stream_id;
>
> Is 256 total an RMM limit, and/or does it require globally unique
> stream-ids? If not you could do what SEV-TIO does and just set stream-id
> == stream-index.
>

Yes, I’ll switch to that.
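
For comparison, the two schemes can be modelled in user space as below. This
is a rough sketch only: the arrays stand in for the global bitmap and the
host bridge IDA, and `model_alloc_stream_id` and `stream_id_from_index` are
hypothetical names, not kernel functions.

```c
#include <assert.h>
#include <stdbool.h>

#define MAX_STREAM_ID 256

/* Toy replacements for the global bitmap and the host bridge IDA. */
static bool global_ids[MAX_STREAM_ID];
static bool bridge_ids[MAX_STREAM_ID];

/*
 * Model of the quoted alloc_stream_id(): pick the lowest ID that is free
 * both globally and on the host bridge. IDs already claimed on the bridge
 * are marked in the global map so they are skipped next time. As in the
 * quoted code, the returned ID is left for the caller to mark as used.
 * Returns MAX_STREAM_ID when nothing is free.
 */
static int model_alloc_stream_id(void)
{
	for (int id = 0; id < MAX_STREAM_ID; id++) {
		if (global_ids[id])
			continue;
		if (bridge_ids[id]) {
			global_ids[id] = true;
			continue;
		}
		return id;
	}
	return MAX_STREAM_ID;
}

/*
 * The simpler scheme raised in review: reuse the stream index chosen when
 * the IDE stream was allocated as the stream ID. Only valid if IDs do not
 * have to be globally unique.
 */
static int stream_id_from_index(int stream_index)
{
	assert(stream_index >= 0 && stream_index < MAX_STREAM_ID);
	return stream_index;
}
```

With the second scheme the global bitmap and its locking comment go away
entirely, which is the main attraction.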

>
>> +}
>> +
>> +static inline bool cca_pdev_need_sel_ide_streams(struct pci_dev *pdev)
>> +{
>> +	return pci_pcie_type(pdev) == PCI_EXP_TYPE_ENDPOINT;
>> +}
>> +
>> +static int cca_tsm_connect(struct pci_dev *pdev)
>> +{
>> +	struct pci_dev *rp = pcie_find_root_port(pdev);
>> +	struct cca_host_pf0_ep_dsc *pf0_ep_dsc;
>> +	struct pci_ide *ide;
>> +	int ret, stream_id = 0;
>> +
>> +	/* Only function 0 supports connect in host */
>> +	if (WARN_ON(!is_pci_tsm_pf0(pdev)))
>> +		return -EIO;
>> +
>> +	pf0_ep_dsc = to_cca_pf0_ep_dsc(pdev);
>> +	if (cca_pdev_need_sel_ide_streams(pdev)) {
>> +		/* Allocate stream id */
>> +		stream_id = alloc_stream_id(pci_find_host_bridge(pdev->bus));
>> +		if (stream_id == MAX_STREAM_ID)
>> +			return -EBUSY;
>> +		set_bit(stream_id, cca_stream_ids);
>> +
>> +		ide = pci_ide_stream_alloc(pdev);
>> +		if (!ide) {
>> +			ret = -ENOMEM;
>> +			goto err_stream_alloc;
>> +		}
>> +
>> +		pf0_ep_dsc->sel_stream = ide;
>> +		ide->stream_id = stream_id;
>> +		ret = pci_ide_stream_register(ide);
>> +		if (ret)
>> +			goto err_stream;
>> +		/*
>> +		 * Configure IDE capability for target device
>> +		 *
>> +		 * Some test devices work only with DEFAULT_STREAM enabled.
>> +		 * For simplicity, enable DEFAULT_STREAM for all devices. A
>> +		 * future decent solution may be to have a quirk table to
>> +		 * specify which devices need DEFAULT_STREAM.
>> +		 */
>> +		ide->partner[PCI_IDE_EP].default_stream = 1;
>> +		pci_ide_stream_setup(pdev, ide);
>> +		pci_ide_stream_setup(rp, ide);
>> +
>> +		ret = tsm_ide_stream_register(ide);
>> +		if (ret)
>> +			goto err_tsm;
>> +
>> +		/*
>> +		 * Once ide is setup, enable the stream at the endpoint
>> +		 * Root port will be done by RMM
>> +		 */
>> +		pci_ide_stream_enable(pdev, ide);
>
> The end point of these patches follows the spec recommendation of
> delaying enable until after key programming.
>
>> +	}
>> +	return 0;
>
> Should this be making security claims to userspace without taking any
> action for non-endpoint devices that happen to be passed in?
>
> Thinking about a bisection case this should either fail here, print a
> message that is removed in the final enabling patch, or do the
> __maybe_unused arrangement to land all the CCA bits first and then do
> this hookup. Up to you.

Will do the latter, i.e., I’ll call tsm_register() only in the final
patch.

-aneesh

^ permalink raw reply

* Re: [PATCH v5 5/5] iommufd/vdevice: add TSM request ioctl
From: Alexey Kardashevskiy @ 2026-06-02  8:40 UTC (permalink / raw)
  To: Dan Williams (nvidia), Aneesh Kumar K.V (Arm), linux-coco, iommu,
	linux-kernel, kvm
  Cc: Bjorn Helgaas, Jason Gunthorpe, Joerg Roedel, Jonathan Cameron,
	Kevin Tian, Nicolin Chen, Samuel Ortiz, Steven Price,
	Suzuki K Poulose, Will Deacon, Xu Yilun, Shameer Kolothum,
	Paolo Bonzini, Tony Krowiak, Halil Pasic, Jason Herne,
	Harald Freudenberger, Holger Dengler, Heiko Carstens,
	Vasily Gorbik, Alexander Gordeev, Christian Borntraeger,
	Sven Schnelle, Alex Williamson, Matthew Rosato, Farhan Ali,
	Eric Farman, linux-s390
In-Reply-To: <6a168c8ea7d10_2129b2100e@djbw-dev.notmuch>



On 27/5/26 16:17, Dan Williams (nvidia) wrote:
> Alexey Kardashevskiy wrote:
>>
>>
>> On 26/5/26 01:48, Aneesh Kumar K.V (Arm) wrote:
>>> Add IOMMU_VDEVICE_TSM_REQUEST for issuing TSM guest request/response
>>> transactions against an iommufd vdevice.
>>>
>>> The ioctl takes a vdevice_id plus request/response user buffers and length
>>> fields, and forwards the request through tsm_guest_req() to the PCI TSM
>>> backend. This provides the host-side passthrough path used by CoCo guests
>>> for TSM device attestation and acceptance flows after the device has been
>>> bound to TSM.
>>>
>>> Also add the supporting tsm_guest_req() helper and associated TSM core
>>> interface definitions.
>>>
>>> Based on changes from: Alexey Kardashevskiy <aik@amd.com>
>>>
>>> Signed-off-by: Aneesh Kumar K.V (Arm) <aneesh.kumar@kernel.org>
>>> ---
>>>    drivers/iommu/iommufd/iommufd_private.h |  6 ++
>>>    drivers/iommu/iommufd/main.c            |  3 +
>>>    drivers/iommu/iommufd/tsm.c             | 68 +++++++++++++++++++++
>>>    drivers/virt/coco/tsm-core.c            | 39 ++++++++++++
>>>    include/linux/pci-tsm.h                 |  9 +--
>>>    include/linux/tsm.h                     | 25 ++++++++
>>>    include/uapi/linux/iommufd.h            | 80 +++++++++++++++++++++++++
>>>    7 files changed, 226 insertions(+), 4 deletions(-)
> [..]
>>> diff --git a/drivers/iommu/iommufd/tsm.c b/drivers/iommu/iommufd/tsm.c
>>> index 09ee668dbed9..342fbdb6a6b9 100644
>>> --- a/drivers/iommu/iommufd/tsm.c
>>> +++ b/drivers/iommu/iommufd/tsm.c
>>> @@ -60,3 +60,71 @@ int iommufd_vdevice_tsm_op_ioctl(struct iommufd_ucmd *ucmd)
>>>      iommufd_put_object(ucmd->ictx, &vdev->obj);
>>>      return rc;
>>>    }
>>> +
>>> +static bool iommufd_vdevice_tsm_req_scope_valid(u32 scope)
>>> +{
>>> +   if (scope > IOMMU_VDEVICE_TSM_REQ_SCOPE_PCI_LAST)
>>> +           return false;
>>> +
>>> +   switch (scope) {
>>> +   case IOMMU_VDEVICE_TSM_REQ_PCI_INFO:
>>> +   case IOMMU_VDEVICE_TSM_REQ_PCI_STATE_CHANGE:
>>> +   case IOMMU_VDEVICE_TSM_REQ_PCI_DEBUG_READ:
>>> +   case IOMMU_VDEVICE_TSM_REQ_PCI_DEBUG_WRITE:
>>
>> This scope thing still needs clarification.
>>
>> I have 3 types of requests to fit here, all go via VM -> KVM -> QEMU -> IOMMUFD -> TSM.
>>
>> 1) bind/unbind TDI <- moves to CONFIG_LOCKED, this is "OP";
>> 2) start/stop TDI <- moves to RUN, this is "GR"? Right now I route it via "OP";
>> 3) enable/disable MMIO/DMA <- no TDI state change, this is "GR" but which scope is it here?
> 
> The scope parameter was meant to enumerate a security model for classes
> of commands that are otherwise opaque to the kernel. However, none of
> the commands we are targeting are opaque (private specification with
> unknown effect). It now turns out there is no role for @scope for
> security.
> 
> Now a command family that iommufd can validate seems useful. As it
> stands this implementation aliases command codes across TSMs. Do we
> proceed with creating an actual shared command uapi for the truly shared
> commands:
> 
> TSM_REQ_TYPE_DEFAULT: Commands every arch needs
> TSM_REQ_READ_OBJECT
> TSM_REQ_REGEN_OBJECT
> TSM_REQ_OBJECT_INFO

These 3 are already in that netlink interface of the TSM (so common for all arches), right?

> TSM_REQ_VALIDATE_MMIO

SEV handles this in the KVM as this is where RMP and NPT are managed + opaque guest request to the TSM, I'd think it is the same for others.

> TSM_REQ_SET_TDI_STATE

This is a common one.

> TSM_REQ_TYPE_SEV: Commands only SEV needs
> TSM_REQ_SEV_ENABLE_DMA
> TSM_REQ_SEV_DISABLE_DMA

No change to the host-owned part of the IOMMU when TDX or CCA moves the device to secure? Or is it packed into those opaque requests to the TSM?

> ...or just observe that per CC arch commands are needed to setup the VM
> so per CC arch commands are needed to marshal device assignment support
> requests.
> 
> In that case pci_tsm_req_scope becomes tsm_req_type and is just:
> 
> TSM_REQ_TYPE_CCA
> TSM_REQ_TYPE_SEV
> TSM_REQ_TYPE_TDX
> 
> I am leaning towards the latter at this point.

Dunno, besides the DMA thing, these CCA/SEV/TDX types will only appear in WARN_ON of the arch TSM drivers and will not really be seen. If a wrong TSM driver is loaded (say, TDX on AMD), then something just went terribly wrong. Thanks,
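
To illustrate what the per-arch type would amount to in practice, here is a
minimal sketch. Everything in it is hypothetical: the enum only borrows the
names from the proposal above, and `tsm_backend` and `tsm_dispatch` are
made up for the example, not existing TSM core interfaces.

```c
#include <assert.h>
#include <errno.h>

/* Hypothetical request types, named after the proposal in the thread. */
enum tsm_req_type {
	TSM_REQ_TYPE_CCA,
	TSM_REQ_TYPE_SEV,
	TSM_REQ_TYPE_TDX,
};

/* Hypothetical descriptor for the TSM driver bound to a device. */
struct tsm_backend {
	enum tsm_req_type type;
	int (*guest_req)(const void *req, unsigned long len);
};

/*
 * Forward a guest request only if its type matches the loaded backend.
 * A mismatch is rejected rather than reinterpreted.
 */
static int tsm_dispatch(const struct tsm_backend *be, enum tsm_req_type type,
			const void *req, unsigned long len)
{
	assert(be && be->guest_req);

	if (type != be->type)
		return -EINVAL;
	return be->guest_req(req, len);
}

/* Trivial backend handler used to exercise the dispatch path. */
static int accept_all(const void *req, unsigned long len)
{
	(void)req;
	(void)len;
	return 0;
}
```

The type check ends up being a single sanity test, which is why it would
mostly show up as a WARN_ON in the arch drivers.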


-- 
Alexey


^ permalink raw reply

* RE: [PATCH v5 10/20] dma-direct: make dma_direct_map_phys() honor DMA_ATTR_CC_SHARED
From: Aneesh Kumar K.V @ 2026-06-02  6:10 UTC (permalink / raw)
  To: Michael Kelley, iommu@lists.linux.dev,
	linux-arm-kernel@lists.infradead.org,
	linux-kernel@vger.kernel.org, linux-coco@lists.linux.dev
  Cc: Robin Murphy, Marek Szyprowski, Will Deacon, Marc Zyngier,
	Steven Price, Suzuki K Poulose, Catalin Marinas, Jiri Pirko,
	Jason Gunthorpe, Mostafa Saleh, Petr Tesarik,
	Alexey Kardashevskiy, Dan Williams, Xu Yilun,
	linuxppc-dev@lists.ozlabs.org, linux-s390@vger.kernel.org,
	Madhavan Srinivasan, Michael Ellerman, Nicholas Piggin,
	Christophe Leroy (CS GROUP), Alexander Gordeev, Gerald Schaefer,
	Heiko Carstens, Vasily Gorbik, Christian Borntraeger,
	Sven Schnelle, x86@kernel.org, Jiri Pirko
In-Reply-To: <SN6PR02MB41574064D14D4A2734222C51D40B2@SN6PR02MB4157.namprd02.prod.outlook.com>

Michael Kelley <mhklinux@outlook.com> writes:

> From: Aneesh Kumar K.V (Arm) <aneesh.kumar@kernel.org> Sent: Thursday, May 21, 2026 9:28 PM
>> 
>> Teach dma_direct_map_phys() to select the DMA address encoding based on
>> DMA_ATTR_CC_SHARED.
>> 
>> Use phys_to_dma_unencrypted() for decrypted mappings and
>> phys_to_dma_encrypted() otherwise. If a device requires unencrypted DMA
>> but the source physical address is still encrypted, force the mapping
>> through swiotlb so the DMA address and backing memory attributes remain
>> consistent.
>> 
>> Update the arm64, x86, s390 and powerpc secure-guest setup to not use
>> swiotlb force option
>> 
>> Tested-by: Jiri Pirko <jiri@nvidia.com>
>> Signed-off-by: Aneesh Kumar K.V (Arm) <aneesh.kumar@kernel.org>

...

> With this patch removing SWIOTLB_FORCE from four places in
> kernel code, there are no remaining places where it is set.
> The test of SWIOTLB_FORCE could be removed from
> swiotlb_init_remap(), and its definition could be deleted
> from include/linux/swiotlb.h.
>

Sure, I’ll add that as a separate patch in the series.
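
For anyone skimming the thread, the selection logic from the commit message
can be summarised with the user-space model below. It is a sketch under
assumptions, not the patch itself: `MODEL_ATTR_CC_SHARED`,
`MODEL_SHARED_BIT`, `map_result` and `model_map_phys` are invented for the
example, and the model tags shared addresses with a made-up bit, whereas
real architectures differ in which encoding carries the tag.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical attribute bit and address tag, for illustration only. */
#define MODEL_ATTR_CC_SHARED	(1u << 0)
#define MODEL_SHARED_BIT	(1ull << 47)

struct map_result {
	uint64_t dma_addr;
	bool bounced;	/* true if the mapping must go through swiotlb */
};

/*
 * Model of the selection described in the commit message:
 *  - shared (decrypted) mapping: use the unencrypted address encoding
 *  - device needs unencrypted DMA but memory is still encrypted: bounce
 *  - otherwise: use the encrypted encoding
 * bounce_phys is the address of an already shared bounce buffer.
 */
static struct map_result model_map_phys(uint64_t phys, unsigned int attrs,
					bool dev_needs_unencrypted,
					uint64_t bounce_phys)
{
	struct map_result r = { 0, false };

	assert(!(phys & MODEL_SHARED_BIT));

	if (attrs & MODEL_ATTR_CC_SHARED) {
		r.dma_addr = phys | MODEL_SHARED_BIT;
	} else if (dev_needs_unencrypted) {
		r.dma_addr = bounce_phys | MODEL_SHARED_BIT;
		r.bounced = true;
	} else {
		r.dma_addr = phys;
	}
	return r;
}
```

Because bouncing is now decided per mapping, a global force option is no
longer needed to keep the DMA address and the memory attributes consistent.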

-aneesh

^ permalink raw reply

