Linux Confidential Computing Development
* Re: [PATCH v6 3/4] firmware: smccc: arm-cca-guest: Bind the TSM provider to an SMCCC device
From: Sudeep Holla @ 2026-06-08 12:32 UTC (permalink / raw)
  To: Aneesh Kumar K.V, Suzuki K Poulose
  Cc: linux-coco, Sudeep Holla, linux-arm-kernel, linux-kernel,
	Catalin Marinas, Greg KH, Jeremy Linton, Jonathan Cameron,
	Lorenzo Pieralisi, Mark Rutland, Will Deacon, Steven Price
In-Reply-To: <yq5aldcpz316.fsf@kernel.org>

On Mon, Jun 08, 2026 at 04:56:29PM +0530, Aneesh Kumar K.V wrote:
> Suzuki K Poulose <suzuki.poulose@arm.com> writes:
> 
> > On 08/06/2026 09:19, Aneesh Kumar K.V wrote:
> >> Sudeep Holla <sudeep.holla@kernel.org> writes:
> >> 
> >>> On Thu, Jun 04, 2026 at 06:56:28PM +0530, Aneesh Kumar K.V wrote:
> >>>> Sudeep Holla <sudeep.holla@kernel.org> writes:
> >>>>
> >>>> ...
> 
> ...
> 
> >>> I was trying to avoid conditional compilation altogether and hence the
> >>> reason for keeping it as simple as possible. Also IS_ENABLED(CONFIG_ARM64)
> >>> in above snippet must come as some condition to this generic probe.
> >>>
> >>> Adding any more logic or callback defeats the bus idea here if we need
> >>> to rely/depend on multiple conditional compilation or callbacks IMO.
> >>>
> >>> Let's see if it can work with what we are adding now and what we may
> >>> add in the near future, and then decide.
> >>>
> >> 
> >> If we move all the conditional checks to the driver probe path, then I
> >> think this can work. Something like the below:
> >> 
> >> struct smccc_device_info {
> >> 	u32 func_id;
> >> 	bool requires_smc;
> >> 	const char *device_name;
> >> };
> >> 
> >> static const struct smccc_device_info smccc_devices[] __initconst = {
> >> 	{
> >> 		.func_id        = ARM_SMCCC_TRNG_VERSION,
> >> 		.requires_smc   = false,
> >> 		.device_name    = "arm-smccc-trng",
> >> 	},
> >> 
> >> 	{
> >> 		.func_id        = RSI_ABI_VERSION,
> >
> > Don't we need parameters passed to this (Requested Interface version for 
> > e.g.) ? See more below.
> >
> 
> The idea is that we only check whether the function ID is supported. All
> other conditional logic should be handled in the driver probe path, as
> demonstrated by the changes in drivers/char/hw_random/arm_smccc_trng.c.
> 

+1. Yes, we just want to know whether the firmware is aware of that feature
before creating the `smccc_device` for it. The device probe can then perform a
more thorough, feature-specific check to determine whether the device/feature
is usable.

That is the main idea behind the approach I suggested. Please let me know if
you still see any issues or think this may not work.
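
To make the split concrete, here is a minimal user-space sketch of the two
stages, not the actual bus code. Everything in it is an assumption made for
illustration: fake_firmware_call() stands in for the SMCCC conduit, the
function ID values are placeholders, and the "major version in the upper 16
bits" layout is only one plausible encoding.

```c
#include <errno.h>
#include <stdbool.h>
#include <stdint.h>

#define FAKE_RET_NOT_SUPPORTED	(-1L)	/* plays the role of SMCCC_RET_NOT_SUPPORTED */

/* Placeholder function IDs (hypothetical values, for illustration only). */
#define FAKE_FN_KNOWN		0x84000050u
#define FAKE_FN_UNKNOWN		0x84000099u

/* Hypothetical firmware model: one known function, reporting version 1.0. */
static long fake_firmware_call(uint32_t func_id)
{
	if (func_id == FAKE_FN_KNOWN)
		return (1L << 16) | 0;	/* major 1, minor 0 */
	return FAKE_RET_NOT_SUPPORTED;
}

/* Stage 1 (bus): only ask whether the firmware knows the function ID. */
static bool bus_should_create_device(uint32_t func_id)
{
	return fake_firmware_call(func_id) != FAKE_RET_NOT_SUPPORTED;
}

/* Stage 2 (driver probe): the thorough, feature-specific check. */
static int driver_probe(uint32_t func_id, unsigned long min_major)
{
	long ver = fake_firmware_call(func_id);

	if (ver < 0)
		return -ENODEV;
	if (((unsigned long)ver >> 16) < min_major)
		return -ENODEV;	/* device exists, but is not usable */
	return 0;
}
```

With this model a device is created for FAKE_FN_KNOWN and never for
FAKE_FN_UNKNOWN, while a driver asking for a newer major version still fails
in its own probe without the bus needing to know why.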

-- 
Regards,
Sudeep


* Re: [PATCH v6 3/4] firmware: smccc: arm-cca-guest: Bind the TSM provider to an SMCCC device
From: Aneesh Kumar K.V @ 2026-06-08 11:26 UTC (permalink / raw)
  To: Suzuki K Poulose, Sudeep Holla
  Cc: linux-coco, linux-arm-kernel, linux-kernel, Catalin Marinas,
	Greg KH, Jeremy Linton, Jonathan Cameron, Lorenzo Pieralisi,
	Mark Rutland, Will Deacon, Steven Price
In-Reply-To: <e0071e90-c2f5-4d15-b3c6-fe05bf1464e4@arm.com>

Suzuki K Poulose <suzuki.poulose@arm.com> writes:

> On 08/06/2026 09:19, Aneesh Kumar K.V wrote:
>> Sudeep Holla <sudeep.holla@kernel.org> writes:
>> 
>>> On Thu, Jun 04, 2026 at 06:56:28PM +0530, Aneesh Kumar K.V wrote:
>>>> Sudeep Holla <sudeep.holla@kernel.org> writes:
>>>>
>>>> ...

...

>>> I was trying to avoid conditional compilation altogether and hence the
>>> reason for keeping it as simple as possible. Also IS_ENABLED(CONFIG_ARM64)
>>> in above snippet must come as some condition to this generic probe.
>>>
>>> Adding any more logic or callback defeats the bus idea here if we need
>>> to rely/depend on multiple conditional compilation or callbacks IMO.
>>>
>>> Let's see if it can work with what we are adding now and what we may
>>> add in the near future, and then decide.
>>>
>> 
>> If we move all the conditional checks to the driver probe path, then I
>> think this can work. Something like the below:
>> 
>> struct smccc_device_info {
>> 	u32 func_id;
>> 	bool requires_smc;
>> 	const char *device_name;
>> };
>> 
>> static const struct smccc_device_info smccc_devices[] __initconst = {
>> 	{
>> 		.func_id        = ARM_SMCCC_TRNG_VERSION,
>> 		.requires_smc   = false,
>> 		.device_name    = "arm-smccc-trng",
>> 	},
>> 
>> 	{
>> 		.func_id        = RSI_ABI_VERSION,
>
> Don't we need parameters passed to this (Requested Interface version for 
> e.g.) ? See more below.
>

The idea is that we only check whether the function ID is supported. All
other conditional logic should be handled in the driver probe path, as
demonstrated by the changes in drivers/char/hw_random/arm_smccc_trng.c.

>
>> 		.requires_smc   = true,
>> 		.device_name    = RSI_DEV_NAME,
>> 	},
>> };
>> 
>> static bool __init smccc_probe_smccc_device(const struct smccc_device_info *smccc_dev)
>> {
>> 	unsigned long ret;
>> 	struct arm_smccc_res res;
>> 
>> 	if (smccc_conduit == SMCCC_CONDUIT_NONE)
>> 		return false;
>> 
>> 	if (smccc_dev->requires_smc && smccc_conduit != SMCCC_CONDUIT_SMC)
>> 		return false;
>> 
>> 	arm_smccc_1_1_invoke(smccc_dev->func_id, &res);
>> 	ret = res.a0;
>> 
>> 	if ((s32)ret == SMCCC_RET_NOT_SUPPORTED)
>
> Is this a reliable check for all possible SMCCC services ? i.e., Are we 
> expected to get RET_NOT_SUPPORTED for any service for which the backend
> is not available ?
>

That is correct. That is what the SMCCC specification says.

> Also, as pointed out RSI_ABI_VERSION may return other errors based on 
> the input (requested version, e.g., RSI_ERROR_INPUT) and we may still go
> ahead and register the device ?
>

Correct, and the driver probe will check for the minimum and maximum supported versions.

>> 		return false;
>> 
>> 	return true;
>> }
>> 
>> static int __init smccc_devices_init(void)
>> {
>> 	struct arm_smccc_device *sdev;
>> 	const struct smccc_device_info *smccc_dev;
>> 
>> 	for (int i = 0; i < ARRAY_SIZE(smccc_devices); i++) {
>> 		smccc_dev = &smccc_devices[i];
>> 
>> 		if (!smccc_probe_smccc_device(smccc_dev))
>> 			continue;
>> 
>>                 sdev = arm_smccc_device_register(smccc_dev->device_name);
>>                 if (IS_ERR(sdev))
>>                         pr_err("%s: could not register device: %ld\n",
>>                                smccc_dev->device_name, PTR_ERR(sdev));
>> 
>> 	}
>> 
>> 	return 0;
>> }
>> device_initcall(smccc_devices_init);
>> 
>> with the diff to hw_random/smccc_trng
>> 
>> modified   arch/arm64/include/asm/archrandom.h
>> @@ -12,7 +12,7 @@
>>   
>>   extern bool smccc_trng_available;
>>   
>> -static inline bool __init smccc_probe_trng(void)
>> +static inline bool smccc_probe_trng(void)
>>   {
>>   	struct arm_smccc_res res;
>>   
>> modified   drivers/char/hw_random/arm_smccc_trng.c
>> @@ -19,6 +19,8 @@
>>   #include <linux/arm-smccc.h>
>>   #include <linux/arm-smccc-bus.h>
>>   
>> +#include <asm/archrandom.h>
>> +
>>   #ifdef CONFIG_ARM64
>>   #define ARM_SMCCC_TRNG_RND	ARM_SMCCC_TRNG_RND64
>>   #define MAX_BITS_PER_CALL	(3 * 64UL)
>> @@ -98,6 +100,10 @@ static int smccc_trng_probe(struct arm_smccc_device *sdev)
>>   {
>>   	struct hwrng *trng;
>>   
>> +	/* validate the minimum version requirement */
>> +	if (!smccc_probe_trng())
>> +		return -ENODEV;
>> +
>>   	trng = devm_kzalloc(&sdev->dev, sizeof(*trng), GFP_KERNEL);
>>   	if (!trng)
>>   		return -ENOMEM;
>> 
>> We can also move arch/arm64/include/asm/rsi_smc.h to
>> include/linux/arm-rsi-smccc.h. There was a suggestion to move these
>
> super minor nit: arm-smccc-rsi.h ?
>

-aneesh


* Re: [PATCH v14 32/44] KVM: arm64: Handle Realm PSCI requests
From: Steven Price @ 2026-06-08 11:15 UTC (permalink / raw)
  To: Gavin Shan, kvm, kvmarm
  Cc: Catalin Marinas, Marc Zyngier, Will Deacon, James Morse,
	Oliver Upton, Suzuki K Poulose, Zenghui Yu, linux-arm-kernel,
	linux-kernel, Joey Gouly, Alexandru Elisei, Christoffer Dall,
	Fuad Tabba, linux-coco, Ganapatrao Kulkarni, Shanker Donthineni,
	Alper Gun, Aneesh Kumar K . V, Emi Kisanuki, Vishal Annapurve,
	WeiLin.Chang, Lorenzo.Pieralisi2
In-Reply-To: <775a0d29-4d92-4ecc-96dd-5b0eaeff1528@redhat.com>

On 28/05/2026 07:55, Gavin Shan wrote:
> Hi Steve,
> 
> On 5/13/26 11:17 PM, Steven Price wrote:
>> The RMM needs to be informed of the target REC when a PSCI call is made
>> with an MPIDR argument.
>>
>> This requirement will be removed in a future release of the RMM 2.0
>> specification but is still required for v2.0-bet1.
>>
>> Co-developed-by: Suzuki K Poulose <suzuki.poulose@arm.com>
>> Signed-off-by: Suzuki K Poulose <suzuki.poulose@arm.com>
>> Signed-off-by: Steven Price <steven.price@arm.com>
>> ---
>> Changes since v13:
>>   * The ioctl KVM_ARM_VCPU_RMI_PSCI_COMPLETE has gone. The RMI call is
>>     made automatically just before entering the REC again.
>> Changes since v12:
>>   * Change return code for non-realms to -ENXIO to better represent that
>>     the ioctl is invalid for non-realms (checkpatch is insistent that
>>     "ENOSYS means 'invalid syscall nr' and nothing else").
>> Changes since v11:
>>   * RMM->RMI renaming.
>> Changes since v6:
>>   * Use vcpu_is_rec() rather than kvm_is_realm(vcpu->kvm).
>>   * Minor renaming/formatting fixes.
>> ---
>>   arch/arm64/include/asm/kvm_rmi.h |  3 ++
>>   arch/arm64/kvm/psci.c            | 15 ++++++++-
>>   arch/arm64/kvm/rmi.c             | 58 ++++++++++++++++++++++++++++++++
>>   3 files changed, 75 insertions(+), 1 deletion(-)
>>
>> diff --git a/arch/arm64/include/asm/kvm_rmi.h b/arch/arm64/include/
>> asm/kvm_rmi.h
>> index b65cfec10dee..eacf82a7467d 100644
>> --- a/arch/arm64/include/asm/kvm_rmi.h
>> +++ b/arch/arm64/include/asm/kvm_rmi.h
>> @@ -109,6 +109,9 @@ int realm_map_non_secure(struct realm *realm,
>>                unsigned long size,
>>                enum kvm_pgtable_prot prot,
>>                struct kvm_mmu_memory_cache *memcache);
>> +int realm_psci_complete(struct kvm_vcpu *source,
>> +            struct kvm_vcpu *target,
>> +            unsigned long status);
>>     static inline bool kvm_realm_is_private_address(struct realm *realm,
>>                           unsigned long addr)
>> diff --git a/arch/arm64/kvm/psci.c b/arch/arm64/kvm/psci.c
>> index 3b5dbe9a0a0e..a2cd55dc7b5b 100644
>> --- a/arch/arm64/kvm/psci.c
>> +++ b/arch/arm64/kvm/psci.c
>> @@ -103,7 +103,6 @@ static unsigned long kvm_psci_vcpu_on(struct
>> kvm_vcpu *source_vcpu)
>>         reset_state->reset = true;
>>       kvm_make_request(KVM_REQ_VCPU_RESET, vcpu);
>> -
> 
> This change isn't supposed to be part of this patch :-)

Whoops - indeed it isn't!

>>       /*
>>        * Make sure the reset request is observed if the RUNNABLE
>> mp_state is
>>        * observed.
>> @@ -142,6 +141,20 @@ static unsigned long
>> kvm_psci_vcpu_affinity_info(struct kvm_vcpu *vcpu)
>>       /* Ignore other bits of target affinity */
>>       target_affinity &= target_affinity_mask;
>>   +    if (vcpu_is_rec(vcpu)) {
>> +        struct kvm_vcpu *target_vcpu;
>> +
>> +        /* RMM supports only zero affinity level */
>> +        if (lowest_affinity_level != 0)
>> +            return PSCI_RET_INVALID_PARAMS;
>> +
>> +        target_vcpu = kvm_mpidr_to_vcpu(kvm, target_affinity);
>> +        if (!target_vcpu)
>> +            return PSCI_RET_INVALID_PARAMS;
>> +
>> +        return PSCI_RET_SUCCESS;
>> +    }
>> +
>>       /*
>>        * If one or more VCPU matching target affinity are running
>>        * then ON else OFF
>> diff --git a/arch/arm64/kvm/rmi.c b/arch/arm64/kvm/rmi.c
>> index 761b38a4071c..2b03e962ee41 100644
>> --- a/arch/arm64/kvm/rmi.c
>> +++ b/arch/arm64/kvm/rmi.c
>> @@ -3,6 +3,7 @@
>>    * Copyright (C) 2023-2025 ARM Ltd.
>>    */
>>   +#include <uapi/linux/psci.h>
>>   #include <linux/kvm_host.h>
>>     #include <asm/kvm_emulate.h>
>> @@ -127,6 +128,25 @@ static void free_rtt(phys_addr_t phys)
>>       kvm_account_pgtable_pages(phys_to_virt(phys), -1);
>>   }
>>   +int realm_psci_complete(struct kvm_vcpu *source, struct kvm_vcpu
>> *target,
>> +            unsigned long status)
>> +{
>> +    int ret;
>> +
>> +    /*
>> +     * XXX: RMM-v2.0 doesn't require the target REC address for
>> completing
>> +     * PSCI requests. Temporary hack until RMM implementation catches up
>> +     * to the full spec.
>> +     */
>> +    ret = rmi_psci_complete(virt_to_phys(source->arch.rec.rec_page),
>> +                virt_to_phys(target->arch.rec.rec_page),
>> +                status);
>> +    if (ret)
>> +        return -EINVAL;
> 
>         return -ENXIO;

Ack, although as the comment says this should be going away.

Thanks,
Steve

>> +
>> +    return 0;
>> +}
>> +
>>   static int realm_rtt_create(struct realm *realm,
>>                   unsigned long addr,
>>                   int level,
>> @@ -1004,6 +1024,41 @@ static void kvm_complete_ripas_change(struct
>> kvm_vcpu *vcpu)
>>       rec->run->exit.ripas_base = base;
>>   }
>>   +static void kvm_rec_complete_psci(struct kvm_vcpu *vcpu)
>> +{
>> +    struct rec_run *run = vcpu->arch.rec.run;
>> +    unsigned long status = PSCI_RET_DENIED;
>> +    unsigned long ret = vcpu_get_reg(vcpu, 0);
>> +    struct kvm_vcpu *target;
>> +
>> +    switch (run->exit.gprs[0]) {
>> +    /*
>> +     * XXX: RMM-v2.0 doesn't cause RMI_EXIT_PSCI for AFFINITY_INFO
>> +     * Temporary hack until tf-RMM gets the REC to MPIDR mapping via
>> +     * RD Auxiliary granules.
>> +     * For now always report SUCCESS
>> +     */
>> +    case PSCI_0_2_FN64_AFFINITY_INFO:
>> +        status = PSCI_RET_SUCCESS;
>> +        break;
>> +    case PSCI_0_2_FN64_CPU_ON: {
>> +        if (ret != PSCI_RET_SUCCESS &&
>> +            ret != PSCI_RET_ALREADY_ON)
>> +            status = PSCI_RET_DENIED;
>> +        else
>> +            status = PSCI_RET_SUCCESS;
>> +        break;
>> +    }
>> +    default:
>> +        return;
>> +    }
>> +
>> +    target = kvm_mpidr_to_vcpu(vcpu->kvm, run->exit.gprs[1]);
>> +    /* RMM makes sure that we don't get RMI_EXIT_PSCI for invalid
>> mpidrs */
>> +    if (target)
>> +        realm_psci_complete(vcpu, target, status);
>> +}
>> +
>>   /*
>>    * kvm_rec_pre_enter - Complete operations before entering a REC
>>    *
>> @@ -1028,6 +1083,9 @@ int kvm_rec_pre_enter(struct kvm_vcpu *vcpu)
>>           for (int i = 0; i < REC_RUN_GPRS; i++)
>>               rec->run->enter.gprs[i] = vcpu_get_reg(vcpu, i);
>>           break;
>> +    case RMI_EXIT_PSCI:
>> +        kvm_rec_complete_psci(vcpu);
>> +        break;
>>       case RMI_EXIT_RIPAS_CHANGE:
>>           kvm_complete_ripas_change(vcpu);
>>           break;
> 
> Thanks,
> Gavin
> 



* Re: [PATCH v14 29/44] arm64: RMI: Runtime faulting of memory
From: Steven Price @ 2026-06-08 10:56 UTC (permalink / raw)
  To: Gavin Shan, kvm, kvmarm
  Cc: Catalin Marinas, Marc Zyngier, Will Deacon, James Morse,
	Oliver Upton, Suzuki K Poulose, Zenghui Yu, linux-arm-kernel,
	linux-kernel, Joey Gouly, Alexandru Elisei, Christoffer Dall,
	Fuad Tabba, linux-coco, Ganapatrao Kulkarni, Shanker Donthineni,
	Alper Gun, Aneesh Kumar K . V, Emi Kisanuki, Vishal Annapurve,
	WeiLin.Chang, Lorenzo.Pieralisi2
In-Reply-To: <d08ca4d6-1e7c-473d-909e-89642bd8c4c2@redhat.com>

On 05/06/2026 12:20, Gavin Shan wrote:
> Hi Steve,
> 
> On 5/13/26 11:17 PM, Steven Price wrote:
>> At runtime if the realm guest accesses memory which hasn't yet been
>> mapped then KVM needs to either populate the region or fault the guest.
>>
>> For memory in the lower (protected) region of IPA a fresh page is
>> provided to the RMM which will zero the contents. For memory in the
>> upper (shared) region of IPA, the memory from the memslot is mapped
>> into the realm VM non secure.
>>
>> Signed-off-by: Steven Price <steven.price@arm.com>
>> ---
>> Changes since v13:
>>   * Numerous changes due to rebasing.
>>   * Fix addr_range_desc() to encode the correct block size.
>> Changes since v12:
>>   * Switch to RMM v2.0 range based APIs.
>> Changes since v11:
>>   * Adapt to upstream changes.
>> Changes since v10:
>>   * RME->RMI renaming.
>>   * Adapt to upstream gmem changes.
>> Changes since v9:
>>   * Fix call to kvm_stage2_unmap_range() in kvm_free_stage2_pgd() to set
>>     may_block to avoid stall warnings.
>>   * Minor coding style fixes.
>> Changes since v8:
>>   * Propagate the may_block flag.
>>   * Minor comments and coding style changes.
>> Changes since v7:
>>   * Remove redundant WARN_ONs for realm_create_rtt_levels() - it will
>>     internally WARN when necessary.
>> Changes since v6:
>>   * Handle PAGE_SIZE being larger than RMM granule size.
>>   * Some minor renaming following review comments.
>> Changes since v5:
>>   * Reduce use of struct page in preparation for supporting the RMM
>>     having a different page size to the host.
>>   * Handle a race when delegating a page where another CPU has faulted on
>>     the same page (and already delegated the physical page) but not yet
>>     mapped it. In this case simply return to the guest to either use the
>>     mapping from the other CPU (or refault if the race is lost).
>>   * The changes to populate_par_region() are moved into the previous
>>     patch where they belong.
>> Changes since v4:
>>   * Code cleanup following review feedback.
>>   * Drop the PTE_SHARED bit when creating unprotected page table entries.
>>     This is now set by the RMM and the host has no control of it and the
>>     spec requires the bit to be set to zero.
>> Changes since v2:
>>   * Avoid leaking memory if failing to map it in the realm.
>>   * Correctly mask RTT based on LPA2 flag (see rtt_get_phys()).
>>   * Adapt to changes in previous patches.
>> ---
>>   arch/arm64/include/asm/kvm_emulate.h |   8 ++
>>   arch/arm64/include/asm/kvm_rmi.h     |  12 ++
>>   arch/arm64/kvm/mmu.c                 | 128 ++++++++++++++++----
>>   arch/arm64/kvm/rmi.c                 | 173 +++++++++++++++++++++++++++
>>   4 files changed, 301 insertions(+), 20 deletions(-)
>>
> 
> [...]
> 
>> @@ -1604,27 +1641,52 @@ static int gmem_abort(const struct
>> kvm_s2_fault_desc *s2fd)
>>       bool write_fault, exec_fault;
>>       enum kvm_pgtable_walk_flags flags = KVM_PGTABLE_WALK_SHARED;
>>       enum kvm_pgtable_prot prot = KVM_PGTABLE_PROT_R;
>> -    struct kvm_pgtable *pgt = s2fd->vcpu->arch.hw_mmu->pgt;
>> +    struct kvm_vcpu *vcpu = s2fd->vcpu;
>> +    struct kvm_pgtable *pgt = vcpu->arch.hw_mmu->pgt;
>> +    gpa_t gpa = kvm_gpa_from_fault(vcpu->kvm, s2fd->fault_ipa);
>>       unsigned long mmu_seq;
>>       struct page *page;
>> -    struct kvm *kvm = s2fd->vcpu->kvm;
>> +    struct kvm *kvm = vcpu->kvm;
>>       void *memcache;
>>       kvm_pfn_t pfn;
>>       gfn_t gfn;
>>       int ret;
>>   -    memcache = get_mmu_memcache(s2fd->vcpu);
>> -    ret = topup_mmu_memcache(s2fd->vcpu, memcache);
>> +    if (kvm_is_realm(vcpu->kvm)) {
>> +        /* check for memory attribute mismatch */
>> +        bool is_priv_gfn = kvm_mem_is_private(kvm, gpa >> PAGE_SHIFT);
>> +        /*
>> +         * For Realms, the shared address is an alias of the private
>> +         * PA with the top bit set. Thus if the fault address matches
>> +         * the GPA then it is the private alias.
>> +         */
>> +        bool is_priv_fault = (gpa == s2fd->fault_ipa);
>> +
>> +        if (is_priv_gfn != is_priv_fault) {
>> +            kvm_prepare_memory_fault_exit(vcpu, gpa, PAGE_SIZE,
>> +                              kvm_is_write_fault(vcpu),
>> +                              false,
>> +                              is_priv_fault);
>> +            /*
>> +             * KVM_EXIT_MEMORY_FAULT requires a return code of
>> +             * -EFAULT, see the API documentation
>> +             */
>> +            return -EFAULT;
>> +        }
>> +    }
>> +
> 
> For a Realm, gmem_abort() is called by kvm_handle_guest_abort() only when
> we're faulting in the private (protected) space.
> 
>     if (kvm_slot_has_gmem(memslot) && !shared_ipa_fault(vcpu->kvm,
> fault_ipa))
>         ret = gmem_abort(&s2fd);
>     else
>         ret = user_mem_abort(&s2fd);
> 
> With the condition, this block of code can be simplified to handle conversion
> (shared -> private) instead of both directions.
> 
>     /* Convert the shared address to the private address for Realm */
>     if (kvm_is_realm(vcpu->kvm) &&
>         !kvm_mem_is_private(kvm, gpa >> PAGE_SHIFT)) {
>         /*
>          * KVM_EXIT_MEMORY_FAULT requires a return code of
>          * -EFAULT, see the API documentation
>          */
>         kvm_prepare_memory_fault_exit(vcpu, gpa, PAGE_SIZE,
>                                       kvm_is_write_fault(vcpu),
>                                       false, true);
>         return -EFAULT;
>     }
> 
> 
> [...]
> 
>> @@ -2396,7 +2475,7 @@ int kvm_handle_guest_abort(struct kvm_vcpu *vcpu)
>>                   !write_fault &&
>>                   !kvm_vcpu_trap_is_exec_fault(vcpu));
>>   -        if (kvm_slot_has_gmem(memslot))
>> +        if (kvm_slot_has_gmem(memslot) && !shared_ipa_fault(vcpu-
>> >kvm, fault_ipa))
>>               ret = gmem_abort(&s2fd);
>>           else
>>               ret = user_mem_abort(&s2fd);
> gmem_abort() is only called for faults in the protected (private) space.

You're absolutely correct - that's a nice simplification!
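
For anyone following along, the reduction can be checked mechanically. The
sketch below is a stand-alone illustration and not kernel code; the helper
names are made up for the example. It only encodes the one fact stated above,
that gmem_abort() is reached solely for faults in the private space, so
is_priv_fault is always true on that path.

```c
#include <stdbool.h>

/* Original form: report a mismatch in either direction. */
static bool mismatch_both_directions(bool is_priv_gfn, bool is_priv_fault)
{
	return is_priv_gfn != is_priv_fault;
}

/* Simplified form: valid only when the caller guarantees a private fault. */
static bool mismatch_private_fault_only(bool is_priv_gfn)
{
	return !is_priv_gfn;
}

/* Returns true if both forms agree for every input with a private fault. */
static bool checks_agree_for_private_faults(void)
{
	const bool gfn_states[] = { false, true };

	for (unsigned int i = 0; i < 2; i++) {
		if (mismatch_both_directions(gfn_states[i], true) !=
		    mismatch_private_fault_only(gfn_states[i]))
			return false;
	}
	return true;
}
```

The two forms only diverge for a shared fault, which is exactly the case the
caller already routes to user_mem_abort().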

Thanks,
Steve


* Re: [PATCH v14 29/44] arm64: RMI: Runtime faulting of memory
From: Steven Price @ 2026-06-08 10:56 UTC (permalink / raw)
  To: Suzuki K Poulose, Gavin Shan, kvm, kvmarm
  Cc: Catalin Marinas, Marc Zyngier, Will Deacon, James Morse,
	Oliver Upton, Zenghui Yu, linux-arm-kernel, linux-kernel,
	Joey Gouly, Alexandru Elisei, Christoffer Dall, Fuad Tabba,
	linux-coco, Ganapatrao Kulkarni, Shanker Donthineni, Alper Gun,
	Aneesh Kumar K . V, Emi Kisanuki, Vishal Annapurve, WeiLin.Chang,
	Lorenzo.Pieralisi2
In-Reply-To: <cecbd148-5d33-49c8-928f-572f71b3dd69@arm.com>

On 08/06/2026 10:30, Suzuki K Poulose wrote:
> On 05/06/2026 07:23, Gavin Shan wrote:
>> Hi Steve,
>>
>> On 5/13/26 11:17 PM, Steven Price wrote:
>>> At runtime if the realm guest accesses memory which hasn't yet been
>>> mapped then KVM needs to either populate the region or fault the guest.
>>>
>>> For memory in the lower (protected) region of IPA a fresh page is
>>> provided to the RMM which will zero the contents. For memory in the
>>> upper (shared) region of IPA, the memory from the memslot is mapped
>>> into the realm VM non secure.
>>>
>>> Signed-off-by: Steven Price <steven.price@arm.com>
>>> ---
>>> Changes since v13:
>>>   * Numerous changes due to rebasing.
>>>   * Fix addr_range_desc() to encode the correct block size.
>>> Changes since v12:
>>>   * Switch to RMM v2.0 range based APIs.
>>> Changes since v11:
>>>   * Adapt to upstream changes.
>>> Changes since v10:
>>>   * RME->RMI renaming.
>>>   * Adapt to upstream gmem changes.
>>> Changes since v9:
>>>   * Fix call to kvm_stage2_unmap_range() in kvm_free_stage2_pgd() to set
>>>     may_block to avoid stall warnings.
>>>   * Minor coding style fixes.
>>> Changes since v8:
>>>   * Propagate the may_block flag.
>>>   * Minor comments and coding style changes.
>>> Changes since v7:
>>>   * Remove redundant WARN_ONs for realm_create_rtt_levels() - it will
>>>     internally WARN when necessary.
>>> Changes since v6:
>>>   * Handle PAGE_SIZE being larger than RMM granule size.
>>>   * Some minor renaming following review comments.
>>> Changes since v5:
>>>   * Reduce use of struct page in preparation for supporting the RMM
>>>     having a different page size to the host.
>>>   * Handle a race when delegating a page where another CPU has
>>> faulted on
>>>     the same page (and already delegated the physical page) but not
>>> yet
>>>     mapped it. In this case simply return to the guest to either use the
>>>     mapping from the other CPU (or refault if the race is lost).
>>>   * The changes to populate_par_region() are moved into the previous
>>>     patch where they belong.
>>> Changes since v4:
>>>   * Code cleanup following review feedback.
>>>   * Drop the PTE_SHARED bit when creating unprotected page table
>>> entries.
>>>     This is now set by the RMM and the host has no control of it and the
>>>     spec requires the bit to be set to zero.
>>> Changes since v2:
>>>   * Avoid leaking memory if failing to map it in the realm.
>>>   * Correctly mask RTT based on LPA2 flag (see rtt_get_phys()).
>>>   * Adapt to changes in previous patches.
>>> ---
>>>   arch/arm64/include/asm/kvm_emulate.h |   8 ++
>>>   arch/arm64/include/asm/kvm_rmi.h     |  12 ++
>>>   arch/arm64/kvm/mmu.c                 | 128 ++++++++++++++++----
>>>   arch/arm64/kvm/rmi.c                 | 173 +++++++++++++++++++++++++++
>>>   4 files changed, 301 insertions(+), 20 deletions(-)
>>>

[...]

>>> diff --git a/arch/arm64/kvm/rmi.c b/arch/arm64/kvm/rmi.c
>>> index cae29fd3353c..761b38a4071c 100644
>>> --- a/arch/arm64/kvm/rmi.c
>>> +++ b/arch/arm64/kvm/rmi.c
>>> @@ -597,6 +597,179 @@ static int realm_data_map_init(struct kvm *kvm,
>>> unsigned long ipa,
>>>       return ret;
>>>   }
>>> +static unsigned long addr_range_desc(unsigned long phys, unsigned
>>> long size)
>>> +{
>>> +    unsigned long out = 0;
>>> +
>>> +    switch (size) {
>>> +    case P4D_SIZE:
>>> +        out = 3 | (1 << 2);
>>> +        break;
>>> +    case PUD_SIZE:
>>> +        out = 2 | (1 << 2);
>>> +        break;
>>> +    case PMD_SIZE:
>>> +        out = 1 | (1 << 2);
>>> +        break;
>>> +    case PAGE_SIZE:
>>> +        out = 0 | (1 << 2);
>>> +        break;
>>> +    default:
>>> +        /*
>>> +         * Only support mapping at the page level granularity when
>>> +         * it's an unusual length. This should get us back onto a
>>> larger
>>> +         * block size for the subsequent mappings.
>>> +         */
>>> +        out = 0 | ((MIN(size >> PAGE_SHIFT, PTRS_PER_PTE - 1)) << 2);
>>> +        break;
>>> +    }
>>> +
>>> +    WARN_ON(phys & ~PAGE_MASK);
>>> +
>>> +    out |= phys & PAGE_MASK;
>>> +
>>> +    return out;
>>> +}
>>> +
>>> +int realm_map_protected(struct kvm *kvm,
>>> +            unsigned long ipa,
>>> +            kvm_pfn_t pfn,
>>> +            unsigned long map_size,
>>> +            struct kvm_mmu_memory_cache *memcache)
>>> +{
>>> +    struct realm *realm = &kvm->arch.realm;
>>> +    phys_addr_t phys = __pfn_to_phys(pfn);
>>> +    phys_addr_t base_phys = phys;
>>> +    phys_addr_t rd = virt_to_phys(realm->rd);
>>> +    unsigned long base_ipa = ipa;
>>> +    unsigned long ipa_top = ipa + map_size;
>>> +    int ret = 0;
>>> +
>>> +    if (WARN_ON(!IS_ALIGNED(map_size, PAGE_SIZE) ||
>>> +            !IS_ALIGNED(ipa, map_size)))
>>> +        return -EINVAL;
>>> +
>>> +    if (rmi_delegate_range(phys, map_size)) {
>>> +        /*
>>> +         * It's likely we raced with another VCPU on the same
>>> +         * fault. Assume the other VCPU has handled the fault
>>> +         * and return to the guest.
>>> +         */
>>> +        return 0;
>>> +    }
>>> +
>>> +    while (ipa < ipa_top) {
>>> +        unsigned long flags = RMI_ADDR_TYPE_SINGLE;
>>> +        unsigned long range_desc = addr_range_desc(phys, ipa_top -
>>> ipa);
>>> +        unsigned long out_top;
>>> +
>>> +        ret = rmi_rtt_data_map(rd, ipa, ipa_top, flags, range_desc,
>>> +                       &out_top);
>>> +
>>> +        if (RMI_RETURN_STATUS(ret) == RMI_ERROR_RTT) {
>>> +            /* Create missing RTTs and retry */
>>> +            int level = RMI_RETURN_INDEX(ret);
>>> +
>>> +            WARN_ON(level == KVM_PGTABLE_LAST_LEVEL);
>>> +            ret = realm_create_rtt_levels(realm, ipa, level,
>>> +                              KVM_PGTABLE_LAST_LEVEL,
>>> +                              memcache);
> 
> Could we give the RMM a chance to make use of the Block mappings by
> creating the Missing RTTs to the level that may work for the current
> range_desc ? i.e., if the range_desc is a 2M block size, we could create
> tables up to L2 in the first go and if the RMM still needs RTT, we could
> go further down to the KVM_PGTABLE_LAST_LEVEL. I understand this is
> kind of an optimisation, so maybe we could defer it. (Same applies for
> the non_secure map below).

A simple change would be just to create one level at a time like this:

diff --git a/arch/arm64/kvm/rmi.c b/arch/arm64/kvm/rmi.c
index b79b96f7dffb..3f3ade1d3895 100644
--- a/arch/arm64/kvm/rmi.c
+++ b/arch/arm64/kvm/rmi.c
@@ -767,15 +767,15 @@ static int realm_map_protected(struct kvm *kvm,
 			/* Create missing RTTs and retry */
 			int level = RMI_RETURN_INDEX(ret);
 
-			WARN_ON(level == KVM_PGTABLE_LAST_LEVEL);
+			if (WARN_ON(level >= KVM_PGTABLE_LAST_LEVEL))
+				goto err_undelegate;
 			ret = realm_create_rtt_levels(realm, ipa, level,
-						      KVM_PGTABLE_LAST_LEVEL,
+						      level + 1,
 						      memcache);
 			if (ret)
 				goto err_undelegate;
 
-			ret = rmi_rtt_data_map(rd, ipa, ipa_top, flags,
-					       range_desc, &out_top);
+			continue;
 		}
 
 		if (WARN_ON(ret))
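
The effect of the one-level-at-a-time loop can be seen in a small user-space
model. This is a sketch under assumptions, not the driver code: struct
fake_rtt, fake_map() and the level numbering are invented for the
illustration, and the mock simply reports the deepest existing table level
when the requested block level has no table yet.

```c
#include <stdbool.h>

#define FAKE_LAST_LEVEL	3	/* stands in for KVM_PGTABLE_LAST_LEVEL */

struct fake_rtt {
	int deepest_table;	/* deepest level that currently has a table */
	int tables_created;	/* tables allocated by the retry loop */
};

/* Mock map call: fails and reports where the walk stopped if no table yet. */
static bool fake_map(const struct fake_rtt *rtt, int block_level,
		     int *fault_level)
{
	if (rtt->deepest_table < block_level) {
		*fault_level = rtt->deepest_table;
		return false;
	}
	return true;
}

/* Create exactly one missing level per failure, then retry the mapping. */
static int map_one_level_at_a_time(struct fake_rtt *rtt, int block_level)
{
	for (;;) {
		int level;

		if (fake_map(rtt, block_level, &level))
			return 0;

		if (level >= FAKE_LAST_LEVEL)
			return -1;	/* nothing deeper to create */

		rtt->deepest_table = level + 1;
		rtt->tables_created++;
	}
}
```

Starting with tables down to level 1, a level 2 block mapping succeeds after
creating a single table, whereas going straight to the last level would have
created two and ruled out the block mapping.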

Thanks,
Steve



* Re: [PATCH v5 05/20] dma-pool: track decrypted atomic pools and select them via attrs
From: Marek Szyprowski @ 2026-06-08 10:44 UTC (permalink / raw)
  To: Jason Gunthorpe, Aneesh Kumar K.V
  Cc: Michael Kelley, iommu@lists.linux.dev,
	linux-arm-kernel@lists.infradead.org,
	linux-kernel@vger.kernel.org, linux-coco@lists.linux.dev,
	Robin Murphy, Will Deacon, Marc Zyngier, Steven Price,
	Suzuki K Poulose, Catalin Marinas, Jiri Pirko, Mostafa Saleh,
	Petr Tesarik, Alexey Kardashevskiy, Dan Williams, Xu Yilun,
	linuxppc-dev@lists.ozlabs.org, linux-s390@vger.kernel.org,
	Madhavan Srinivasan, Michael Ellerman, Nicholas Piggin,
	Christophe Leroy (CS GROUP), Alexander Gordeev, Gerald Schaefer,
	Heiko Carstens, Vasily Gorbik, Christian Borntraeger,
	Sven Schnelle, x86@kernel.org, Jiri Pirko
In-Reply-To: <20260604182419.GC2487554@ziepe.ca>

On 04.06.2026 20:24, Jason Gunthorpe wrote:
> On Thu, Jun 04, 2026 at 08:27:36PM +0530, Aneesh Kumar K.V wrote:
>> I already sent a v6 in the hope of getting this merged for the next
>> merge window. Should I send a v7, or would you prefer that I do the
>> rename on top of v6?
> I think it is too late for such a major change, but this should be
> imagined to be for rc2ish next cycle. You also have to spell out how
> the pkvm patch will get sequenced as well, it would be best to push
> that it gets picked up right away.


I would like to give this a bit of time in linux-next so it is a bit too
late for v7.2, but before merging it I would also like to have this code
reviewed by someone with confidential computing knowledge.


Best regards
-- 
Marek Szyprowski, PhD
Samsung R&D Institute Poland



* Re: [PATCH 04/15] x86/virt/tdx: Enable the Extensions right after basic TDX Module init
From: Xu Yilun @ 2026-06-08 10:12 UTC (permalink / raw)
  To: Kishen Maloor
  Cc: kas, djbw, rick.p.edgecombe, x86, peter.fang, linux-coco,
	linux-kernel, kvm, sohil.mehta, yilun.xu, baolu.lu,
	zhenzhong.duan, xiaoyao.li
In-Reply-To: <f8b7aea2-228c-4e51-b79e-094da656f4c9@intel.com>

> > +static __init int init_tdx_ext(void)
> >   {
> >   	int ret;
> > @@ -1373,6 +1373,10 @@ static __init int init_tdx_module(void)
> >   	if (ret)
> >   		goto err_reset_pamts;
> > +	ret = init_tdx_ext();
> > +	if (ret)
> > +		goto err_reset_pamts;
> 
> Is it a reasonable policy to fail TDX bringup entirely upon failing
> initialization of extensions (which are "add-on features")?

mm.. I think TDX Extension is not strictly an add-on feature from OS
POV. It is still a fundamental TDX infrastructure. Host should not
look into the Module and create substates like Base-good-Extension-bad or
both-good. There are some considerations:

 - Extension cannot be explicitly configured by TDH.SYS.CONFIG, it is
   implicitly configured by TDX Module if an add-on feature requires it.

 - There is no enumeration of which SEAMCALLs are Extension-SEAMCALLs so
   Base-good-Extension-bad actually brings more chaos later.

So the series makes every effort to keep TDX bringup a stateless
process, with no intermediate state.

> 
> The handling of tdx_quote_init() in Patch 6 suggests a more
> best-effort approach.

TDX Quoting, however, is a clear self-contained add-on feature from OS POV.
Though I'm not sure whether a TDX platform is still a safe TCB when DICE is
available but failed, and whether that is good for a "best-effort" policy.
Maybe Peter could answer.
> 

^ permalink raw reply

* Re: [PATCH v6 06/11] x86/virt/tdx: Optimize tdx_pamt_get/put()
From: Yan Zhao @ 2026-06-08  9:50 UTC (permalink / raw)
  To: Dave Hansen
  Cc: Kiryl Shutsemau, Chao Gao, Edgecombe, Rick P, kvm@vger.kernel.org,
	linux-coco@lists.linux.dev, Huang, Kai, seanjc@google.com,
	mingo@redhat.com, linux-kernel@vger.kernel.org,
	pbonzini@redhat.com, nik.borisov@suse.com,
	linux-doc@vger.kernel.org, hpa@zytor.com, tglx@kernel.org,
	Annapurve, Vishal, bp@alien8.de, kirill.shutemov@linux.intel.com,
	x86@kernel.org
In-Reply-To: <572868d7-4794-4fec-b80f-97d8434d5fb6@intel.com>

On Fri, Jun 05, 2026 at 09:23:21AM -0700, Dave Hansen wrote:
> On 6/5/26 04:42, Kiryl Shutsemau wrote:
> >>> I don't see a reason why we can't keep the scoped_guard() on get side.
> >> One additional reason to drop scoped_guard() is that it mixes cleanup helpers
> >> with goto, which is discouraged. See [*]
> >>
> >>  :Lastly, given that the benefit of cleanup helpers is removal of “goto”, and
> >>  :that the “goto” statement can jump between scopes, the expectation is that
> >>  :usage of “goto” and cleanup helpers is never mixed in the same function.
> > Fair enough.
> > 
> > But it can also be addressed if we free the PAMT page array with the guard
> > too :P
> 
> How important is this patch? I see "Optimize" but I read "Optional".
This patch reduces the number of global pamt_lock acquisitions.

Reference testing data with/without the optimization:
(collected on my SPR test machine)

Boot/teardown of 1 TD (8 vcpus/8G memory) per iteration:
                |--------------|-------------|------------|
                |    avg (us)  |   max (us)  |   min (us) | 
                |  w/o  |  w/  |  w/o  | w/  | w/o  |  w/ |
----------------|-------|------|-------|-----|------|-----|
__tdx_pamt_get()|   2   |  0   |  578  | 505 |  2   |  0  |
__tdx_pamt_put()|   0   |  0   |  563  | 496 |  0   |  0  |
----------------|--------------|-------------|------------|

Boot/teardown of 5 TDs (each TD: 8 vcpus/8G memory) concurrently:
                |--------------|-------------|------------|
                |    avg (us)  |   max (us)  |   min (us) | 
                |  w/o  |  w/  |  w/o  | w/  | w/o  |  w/ |
----------------|-------|------|-------|-----|------|-----|
__tdx_pamt_get()|  15   |  0   |  1723 | 1386|  2   |  0  |
__tdx_pamt_put()|   0   |  0   |   562 |  733|  0   |  0  |
----------------|--------------|-------------|------------|


> If we're arguing about it, maybe we should just kick it out and focus on
> the more important bits.
DPAMT still works fine without this optimization. The optimization can reduce
the average time spent on the global lock, especially when there's high
contention.

^ permalink raw reply

* Re: [PATCH 02/15] x86/virt/tdx: Add extra memory to TDX Module for Extensions
From: Xu Yilun @ 2026-06-08  9:41 UTC (permalink / raw)
  To: Kishen Maloor
  Cc: kas, djbw, rick.p.edgecombe, x86, peter.fang, linux-coco,
	linux-kernel, kvm, sohil.mehta, yilun.xu, baolu.lu,
	zhenzhong.duan, xiaoyao.li
In-Reply-To: <f44d997e-49fe-4d48-84e3-e260bb9d3164@intel.com>

> > +static int tdx_ext_mem_add(struct page *root, unsigned int nr_pages)
> > +{
> > +	struct tdx_module_args args = {
> > +		.rcx = to_hpa_list_info(root, nr_pages),
> > +	};
> > +	u64 r;
> > +
> > +	tdx_clflush_hpa_list(root, nr_pages);
> > +
> > +	do {
> > +		/*
> > +		 * TDH_EXT_MEM_ADD is designed to use output parameter RCX to
> > +		 * override/update input parameter RCX, so the caller doesn't
> > +		 * have to do manual parameter update on retry call.
> > +		 */
> > +		r = seamcall_ret(TDH_EXT_MEM_ADD, &args);
> > +	} while (r == TDX_INTERRUPTED_RESUMABLE);
> 
> The retry loop compares the full return value against TDX_INTERRUPTED_RESUMABLE. Should
> it mask with TDX_SEAMCALL_STATUS_MASK first, in case the module sets any
> lower detail bits?

mm.. there is an existing case for TDX_INTERRUPTED_RESUMABLE which
doesn't do the mask:

  err = tdh_phymem_cache_wb(resume);
  switch (err) {
	case TDX_INTERRUPTED_RESUMABLE:
		continue;

I believe we don't mask it. TDX_INTERRUPTED_RESUMABLE should not carry
any lower bits according to its bit definition; if it does, that is a
problem we should not skip over.
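
To make the difference between the two checks concrete, here is a minimal
userspace sketch. The two constants are assumed values chosen for
illustration, and retry_strict()/retry_masked() are hypothetical helpers,
not functions from the series.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Assumed values, for illustration only. */
#define TDX_SEAMCALL_STATUS_MASK	0xFFFFFFFF00000000ULL
#define TDX_INTERRUPTED_RESUMABLE	0x8000000300000000ULL

/* Strict check: retry only when the full 64-bit value matches. */
static bool retry_strict(uint64_t r)
{
	return r == TDX_INTERRUPTED_RESUMABLE;
}

/* Masked check: the lower 32 detail bits are ignored. */
static bool retry_masked(uint64_t r)
{
	return (r & TDX_SEAMCALL_STATUS_MASK) == TDX_INTERRUPTED_RESUMABLE;
}

/* Usage: a status value with an unexpected low bit set. */
void retry_check_example(void)
{
	uint64_t odd = TDX_INTERRUPTED_RESUMABLE | 0x1;

	assert(retry_strict(TDX_INTERRUPTED_RESUMABLE));
	assert(!retry_strict(odd));	/* falls through to the error path */
	assert(retry_masked(odd));	/* would be retried without notice */
}
```

With the strict form, an unexpected detail bit ends the loop and shows up
as a failure, which matches the argument above that such a value should
not be skipped.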

> 
> Ditto for TDH.EXT.INIT in patch 3.
> 
> > +
> > +	if (r != TDX_SUCCESS)
> > +		return -EFAULT;
> > +
> > +	return 0;
> > +}
> > +
> > +static int tdx_ext_mem_setup(void)
> > +{
> > +	unsigned int nr_pages;
> > +	struct page *page;
> > +	u64 *root;
> > +	unsigned int i;
> > +	int ret;
> > +
> > +	nr_pages = tdx_sysinfo.ext.memory_pool_required_pages;
> > +	/*
> > +	 * memory_pool_required_pages == 0 means no need to add pages,
> > +	 * skip the memory setup.
> > +	 */
> > +	if (!nr_pages)
> > +		return 0;
> > +
> > +	root = kzalloc(PAGE_SIZE, GFP_KERNEL);
> > +	if (!root)
> > +		return -ENOMEM;
> > +
> > +	page = alloc_contig_pages(nr_pages, GFP_KERNEL, numa_mem_id(),
> > +				  &node_online_map);
> 
> The SEAMCALL takes a scatter list (HPA_LIST_INFO), so the module
> doesn't require contiguity. If the goal is just to avoid scattering
> pages across many 2MB regions, maybe dense, 2MB-aligned allocations should
> achieve that without a single pool-wide contiguous block.
> 
> > +	if (!page) {
> > +		ret = -ENOMEM;
> > +		goto out_free_root;
> > +	}
> > +
> > +	for (i = 0; i < nr_pages;) {
> > +		unsigned int nents = min(nr_pages - i,
> > +					 PAGE_SIZE / sizeof(*root));
> > +		int j;
> > +
> > +		for (j = 0; j < nents; j++)
> > +			root[j] = page_to_phys(page + i + j);
> 
> Would it be better to allocate per-batch (i.e. one root page's worth
> at a time) rather than the whole pool up front?
> 
> That way an intermediate TDH.EXT.MEM.ADD failure wouldn't leak
> all nr_pages. Also, a batch is up to 512 pages (= 2MB) and its allocation
> could be 2MB-aligned, addressing your fragmentation concern.

So IIUC allocating 2MB at a time has these pros:

  - A better chance of getting the memory.
  - Less memory wasted when TDH.EXT.MEM.ADD fails.

and this con:

  - It still fragments 4M & 1G memory regions.


I think first of all we should focus on the normal path, where Extension is
successfully initialized and memory is added. Note that this memory can
never be reclaimed in this case, so memory fragmentation becomes the
primary consideration.

On a TDX platform, a TDH.EXT.MEM.ADD failure is not expected to happen.
If it does, the TDX module is buggy, and from a Confidential Computing
POV we should not continue; we should change to a new module and
reboot. So wasting less memory doesn't actually matter much.

Also, the Extension initialization is done at bootup time, so we have a
good chance of getting the memory. If we really can't, it is a signal that
the system is not well configured for TDX, and failing earlier isn't such
a bad thing to me.

So for now I still think alloc_contig_pages() is better than 2M-by-2M
allocation.
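
For reference, the way one contiguous allocation is fed to the module in
root-page-sized batches can be sketched in userspace as below. This only
models the arithmetic of the quoted loop; PAGE_SIZE, count_batches() and
fill_root() are assumptions made for the sketch, not code from the patch.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define PAGE_SIZE		4096UL	/* assumed 4KB pages */
#define ENTRIES_PER_ROOT	(PAGE_SIZE / sizeof(uint64_t))

/* Number of add calls needed to hand over nr_pages pages. */
static unsigned int count_batches(unsigned int nr_pages)
{
	unsigned int i, batches = 0;

	for (i = 0; i < nr_pages;) {
		unsigned int nents = nr_pages - i;

		if (nents > ENTRIES_PER_ROOT)
			nents = ENTRIES_PER_ROOT;
		i += nents;
		batches++;
	}
	return batches;
}

/*
 * Fill one root page with the physical addresses of pages
 * [first, nr_pages) of a contiguous block starting at base_phys.
 * Returns the number of entries written.
 */
static unsigned int fill_root(uint64_t *root, uint64_t base_phys,
			      unsigned int first, unsigned int nr_pages)
{
	unsigned int nents = nr_pages - first;
	unsigned int j;

	if (nents > ENTRIES_PER_ROOT)
		nents = ENTRIES_PER_ROOT;
	for (j = 0; j < nents; j++)
		root[j] = base_phys + (uint64_t)(first + j) * PAGE_SIZE;
	return nents;
}

/* Usage: 513 pages need one full batch plus one single-entry batch. */
void batch_example(void)
{
	uint64_t root[ENTRIES_PER_ROOT];

	assert(count_batches(513) == 2);
	assert(fill_root(root, 0x100000000ULL, 512, 513) == 1);
	assert(root[0] == 0x100200000ULL);
}
```

Either allocation strategy ends up with the same batches of at most 512
entries; the difference discussed above is only where the pages come from.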

> 
> > +
> > +		ret = tdx_ext_mem_add(virt_to_page(root), nents);
> > +		/*
> > +		 * No SEAMCALLs to reclaim the added pages. For simple error
> > +		 * handling, leak all pages.
> > +		 */
> > +		WARN_ON_ONCE(ret);
> > +		if (ret)
> > +			break;
> > +
> > +		i += nents;
> > +	}
> > +
> > +	/*
> > +	 * Extensions memory can't be reclaimed once added, print out the
> > +	 * amount, stop tracking it and free the root page, no matter success
> > +	 * or failure.
> > +	 */
> > +	pr_info("%lu KB allocated for TDX Module Extensions\n",
> > +		nr_pages * PAGE_SIZE / 1024);
> > +
> > +out_free_root:
> > +	kfree(root);
> > +
> > +	return ret;
> > +}
> > +
> > +static int __maybe_unused init_tdx_ext(void)
> 
> Could this be named init_tdx_extensions() instead to disambiguate
> from tdx_ext_init() in patch 3?

Yes, sounds good to me.

I'm changing all Extensions to Extension, because the spec says "TDX
Module Extension". So I'll use init_tdx_extension().

Thanks,
Yilun

^ permalink raw reply

* Re: [PATCH v14 28/44] arm64: RMI: Create the realm descriptor
From: Steven Price @ 2026-06-08  9:56 UTC (permalink / raw)
  To: Gavin Shan, kvm, kvmarm
  Cc: Catalin Marinas, Marc Zyngier, Will Deacon, James Morse,
	Oliver Upton, Suzuki K Poulose, Zenghui Yu, linux-arm-kernel,
	linux-kernel, Joey Gouly, Alexandru Elisei, Christoffer Dall,
	Fuad Tabba, linux-coco, Ganapatrao Kulkarni, Shanker Donthineni,
	Alper Gun, Aneesh Kumar K . V, Emi Kisanuki, Vishal Annapurve,
	WeiLin.Chang, Lorenzo.Pieralisi2
In-Reply-To: <80b6ab2d-1a3e-4c15-b06b-00aaa23fcf74@redhat.com>

On 28/05/2026 06:51, Gavin Shan wrote:
> Hi Steve,
> 
> On 5/13/26 11:17 PM, Steven Price wrote:
>> Creating a realm involves first creating a realm descriptor (RD). This
>> involves passing the configuration information to the RMM. Do this as
>> part of realm_ensure_created() so that the realm is created when it is
>> first needed.
>>
>> Signed-off-by: Steven Price <steven.price@arm.com>
>> ---
>> Changes since v13:
>>   * The RMM no longer uses AUX granules, so no need to ask it how many it
>>     needs.
>>   * Adapted to other changes.
>> Changes since v12:
>>   * Since RMM page size is now equal to the host's page size various
>>     calculations are simplified.
>>   * Switch to using range based APIs to delegate/undelegate.
>>   * VMID handling is now handled entirely by the RMM.
>> ---
>>   arch/arm64/kvm/rmi.c | 88 +++++++++++++++++++++++++++++++++++++++++++-
>>   1 file changed, 86 insertions(+), 2 deletions(-)
>>
>> diff --git a/arch/arm64/kvm/rmi.c b/arch/arm64/kvm/rmi.c
>> index fb96bcaa73ed..cae29fd3353c 100644
>> --- a/arch/arm64/kvm/rmi.c
>> +++ b/arch/arm64/kvm/rmi.c
>> @@ -418,6 +418,77 @@ static void realm_unmap_shared_range(struct kvm
>> *kvm,
>>                    start, end);
>>   }
>>   +static int realm_create_rd(struct kvm *kvm)
>> +{
>> +    struct realm *realm = &kvm->arch.realm;
>> +    struct realm_params *params = realm->params;
>> +    void *rd = NULL;
>> +    phys_addr_t rd_phys, params_phys;
>> +    size_t pgd_size = kvm_pgtable_stage2_pgd_size(kvm->arch.mmu.vtcr);
>> +    int r;
>> +
>> +    realm->ia_bits = VTCR_EL2_IPA(kvm->arch.mmu.vtcr);
>> +
>> +    if (WARN_ON(realm->rd || !realm->params))
>> +        return -EEXIST;
>> +
>> +    rd = (void *)__get_free_page(GFP_KERNEL_ACCOUNT);
>> +    if (!rd)
>> +        return -ENOMEM;
>> +
>> +    rd_phys = virt_to_phys(rd);
>> +    if (rmi_delegate_page(rd_phys)) {
>> +        r = -ENXIO;
>> +        goto free_rd;
>> +    }
>> +
>> +    if (rmi_delegate_range(kvm->arch.mmu.pgd_phys, pgd_size)) {
>> +        r = -ENXIO;
>> +        goto out_undelegate_tables;
>> +    }
>> +
>> +    params->s2sz = VTCR_EL2_IPA(kvm->arch.mmu.vtcr);
>> +    params->rtt_level_start = get_start_level(realm);
>> +    params->rtt_num_start = pgd_size / PAGE_SIZE;
>> +    params->rtt_base = kvm->arch.mmu.pgd_phys;
>> +
>> +    if (kvm->arch.arm_pmu) {
>> +        params->pmu_num_ctrs = kvm->arch.nr_pmu_counters;
>> +        params->flags |= RMI_REALM_PARAM_FLAG_PMU;
>> +    }
>> +
>> +    if (kvm_lpa2_is_enabled())
>> +        params->flags |= RMI_REALM_PARAM_FLAG_LPA2;
>> +
>> +    params_phys = virt_to_phys(params);
>> +
>> +    if (rmi_realm_create(rd_phys, params_phys)) {
>> +        r = -ENXIO;
>> +        goto out_undelegate_tables;
>> +    }
>> +
>> +    realm->rd = rd;
>> +    kvm_set_realm_state(kvm, REALM_STATE_NEW);
>> +    /* The realm is up, free the parameters.  */
>> +    free_page((unsigned long)realm->params);
>> +    realm->params = NULL;
>> +
>> +    return 0;
>> +
>> +out_undelegate_tables:
>> +    if (WARN_ON(rmi_undelegate_range(kvm->arch.mmu.pgd_phys,
>> pgd_size))) {
>> +        /* Leak the pages if they cannot be returned */
>> +        kvm->arch.mmu.pgt = NULL;
>> +    }
> 
> In the latest RMM implementation (topics/rmm-v2.0-poc_2),
> rmi_delegate_range() works with the granularity of granule (4KB) and it
> can fail on any granule. For example, we have 16x granule as the root RTT
> and rmi_delegate_range() fails on the first granule, we're going to
> undelegate all these 16x granules, which were never delegated to RMM. It
> eventually leads to error and memory leakage.
> 
> For this, rmi_delegate_range() could be improved to return the number of
> granules that have been delegated. The return value can be used by the
> caller to handle the erroneous case by passing the correct range to
> rmi_undelegate_page().

Well spotted - yes the current situation where the entire region is
leaked if the delegate only partially completes is less than ideal! I'll
add a third argument to rmi_delegate_range() to return the top of the
region that was successfully delegated. The caller can then attempt an
undelegate on just the range which was delegated.
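
A rough userspace model of that idea is sketched below. Everything in it
is hypothetical: the mock_*() helpers, the fail_at fault injection and the
return codes only stand in for the RMI calls and are not the planned
kernel interface.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define GRANULE_SIZE	4096ULL
#define NR_GRANULES	16

static bool delegated[NR_GRANULES];	/* toy ownership state */
static int fail_at = -1;		/* granule index that fails, -1: none */

/*
 * Model of a range delegate. Ranges are assumed granule aligned and
 * within NR_GRANULES * GRANULE_SIZE. *top is set to the end of the
 * prefix that was delegated, whether or not the call succeeds.
 */
static int mock_delegate_range(uint64_t base, uint64_t size, uint64_t *top)
{
	uint64_t addr;

	for (addr = base; addr < base + size; addr += GRANULE_SIZE) {
		if ((int)(addr / GRANULE_SIZE) == fail_at)
			break;
		delegated[addr / GRANULE_SIZE] = true;
	}
	*top = addr;
	return addr == base + size ? 0 : -1;
}

/* Fails if asked to undelegate a granule that was never delegated. */
static int mock_undelegate_range(uint64_t base, uint64_t size)
{
	uint64_t addr;

	for (addr = base; addr < base + size; addr += GRANULE_SIZE) {
		if (!delegated[addr / GRANULE_SIZE])
			return -1;
		delegated[addr / GRANULE_SIZE] = false;
	}
	return 0;
}

static int delegate_or_rollback(uint64_t base, uint64_t size)
{
	uint64_t top;

	if (!mock_delegate_range(base, size, &top))
		return 0;
	/* Only [base, top) was handed over, so only that is returned. */
	if (mock_undelegate_range(base, top - base))
		return -2;	/* the real code would leak here */
	return -1;
}

/* Usage: delegation fails on granule 5, granules 0..4 are rolled back. */
void partial_delegate_example(void)
{
	fail_at = 5;
	assert(delegate_or_rollback(0, NR_GRANULES * GRANULE_SIZE) == -1);
	assert(!delegated[0] && !delegated[4]);
	fail_at = -1;
}
```

In this model, undelegating the full range after a partial failure would
hit a granule that was never delegated and fail, which is the case
described in the report above.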

Thanks,
Steve

>> +    if (WARN_ON(rmi_undelegate_page(rd_phys))) {
>> +        /* Leak the page if it isn't returned */
>> +        return r;
>> +    }
>> +free_rd:
>> +    free_page((unsigned long)rd);
>> +    return r;
>> +}
>> +
>>   static void realm_unmap_private_range(struct kvm *kvm,
>>                         unsigned long start,
>>                         unsigned long end,
>> @@ -647,8 +718,21 @@ static int realm_init_ipa_state(struct kvm *kvm,
>>     static int realm_ensure_created(struct kvm *kvm)
>>   {
>> -    /* Provided in later patch */
>> -    return -ENXIO;
>> +    int ret;
>> +
>> +    switch (kvm_realm_state(kvm)) {
>> +    case REALM_STATE_NONE:
>> +        break;
>> +    case REALM_STATE_NEW:
>> +        return 0;
>> +    case REALM_STATE_DEAD:
>> +        return -ENXIO;
>> +    default:
>> +        return -EBUSY;
>> +    }
>> +
>> +    ret = realm_create_rd(kvm);
>> +    return ret;
>>   }
>>     static int set_ripas_of_protected_regions(struct kvm *kvm)
> 
> Thanks,
> Gavin
> 


^ permalink raw reply

* Re: [PATCH v14 28/44] arm64: RMI: Create the realm descriptor
From: Steven Price @ 2026-06-08  9:49 UTC (permalink / raw)
  To: Wei-Lin Chang, kvm, kvmarm
  Cc: Catalin Marinas, Marc Zyngier, Will Deacon, James Morse,
	Oliver Upton, Suzuki K Poulose, Zenghui Yu, linux-arm-kernel,
	linux-kernel, Joey Gouly, Alexandru Elisei, Christoffer Dall,
	Fuad Tabba, linux-coco, Ganapatrao Kulkarni, Gavin Shan,
	Shanker Donthineni, Alper Gun, Aneesh Kumar K . V, Emi Kisanuki,
	Vishal Annapurve, Lorenzo.Pieralisi2
In-Reply-To: <wshgh4i6uqytq65rh3j2risam2y2evjnfyztoee46soemp5i4x@qhzj4lcs33yj>

On 26/05/2026 23:47, Wei-Lin Chang wrote:
> Hi,
> 
> On Wed, May 13, 2026 at 02:17:36PM +0100, Steven Price wrote:
>> Creating a realm involves first creating a realm descriptor (RD). This
>> involves passing the configuration information to the RMM. Do this as
>> part of realm_ensure_created() so that the realm is created when it is
>> first needed.
>>
>> Signed-off-by: Steven Price <steven.price@arm.com>
>> ---
>> Changes since v13:
>>  * The RMM no longer uses AUX granules, so no need to ask it how many it
>>    needs.
>>  * Adapted to other changes.
>> Changes since v12:
>>  * Since RMM page size is now equal to the host's page size various
>>    calculations are simplified.
>>  * Switch to using range based APIs to delegate/undelegate.
>>  * VMID handling is now handled entirely by the RMM.
>> ---
>>  arch/arm64/kvm/rmi.c | 88 +++++++++++++++++++++++++++++++++++++++++++-
>>  1 file changed, 86 insertions(+), 2 deletions(-)
>>
>> diff --git a/arch/arm64/kvm/rmi.c b/arch/arm64/kvm/rmi.c
>> index fb96bcaa73ed..cae29fd3353c 100644
>> --- a/arch/arm64/kvm/rmi.c
>> +++ b/arch/arm64/kvm/rmi.c
>> @@ -418,6 +418,77 @@ static void realm_unmap_shared_range(struct kvm *kvm,
>>  			     start, end);
>>  }
>>  
>> +static int realm_create_rd(struct kvm *kvm)
>> +{
>> +	struct realm *realm = &kvm->arch.realm;
>> +	struct realm_params *params = realm->params;
>> +	void *rd = NULL;
>> +	phys_addr_t rd_phys, params_phys;
>> +	size_t pgd_size = kvm_pgtable_stage2_pgd_size(kvm->arch.mmu.vtcr);
>> +	int r;
>> +
>> +	realm->ia_bits = VTCR_EL2_IPA(kvm->arch.mmu.vtcr);
>> +
>> +	if (WARN_ON(realm->rd || !realm->params))
>> +		return -EEXIST;
>> +
>> +	rd = (void *)__get_free_page(GFP_KERNEL_ACCOUNT);
>> +	if (!rd)
>> +		return -ENOMEM;
>> +
>> +	rd_phys = virt_to_phys(rd);
>> +	if (rmi_delegate_page(rd_phys)) {
>> +		r = -ENXIO;
>> +		goto free_rd;
>> +	}
>> +
>> +	if (rmi_delegate_range(kvm->arch.mmu.pgd_phys, pgd_size)) {
>> +		r = -ENXIO;
>> +		goto out_undelegate_tables;
>> +	}
>> +
>> +	params->s2sz = VTCR_EL2_IPA(kvm->arch.mmu.vtcr);
>> +	params->rtt_level_start = get_start_level(realm);
>> +	params->rtt_num_start = pgd_size / PAGE_SIZE;
>> +	params->rtt_base = kvm->arch.mmu.pgd_phys;
>> +
>> +	if (kvm->arch.arm_pmu) {
>> +		params->pmu_num_ctrs = kvm->arch.nr_pmu_counters;
>> +		params->flags |= RMI_REALM_PARAM_FLAG_PMU;
>> +	}
>> +
>> +	if (kvm_lpa2_is_enabled())
>> +		params->flags |= RMI_REALM_PARAM_FLAG_LPA2;
>> +
>> +	params_phys = virt_to_phys(params);
>> +
>> +	if (rmi_realm_create(rd_phys, params_phys)) {
>> +		r = -ENXIO;
>> +		goto out_undelegate_tables;
>> +	}
>> +
>> +	realm->rd = rd;
>> +	kvm_set_realm_state(kvm, REALM_STATE_NEW);
>> +	/* The realm is up, free the parameters.  */
>> +	free_page((unsigned long)realm->params);
>> +	realm->params = NULL;
>> +
>> +	return 0;
>> +
>> +out_undelegate_tables:
>> +	if (WARN_ON(rmi_undelegate_range(kvm->arch.mmu.pgd_phys, pgd_size))) {
>> +		/* Leak the pages if they cannot be returned */
>> +		kvm->arch.mmu.pgt = NULL;
>> +	}
>> +	if (WARN_ON(rmi_undelegate_page(rd_phys))) {
>> +		/* Leak the page if it isn't returned */
>> +		return r;
>> +	}
>> +free_rd:
>> +	free_page((unsigned long)rd);
>> +	return r;
>> +}
>> +
>>  static void realm_unmap_private_range(struct kvm *kvm,
>>  				      unsigned long start,
>>  				      unsigned long end,
>> @@ -647,8 +718,21 @@ static int realm_init_ipa_state(struct kvm *kvm,
>>  
>>  static int realm_ensure_created(struct kvm *kvm)
>>  {
>> -	/* Provided in later patch */
>> -	return -ENXIO;
>> +	int ret;
>> +
>> +	switch (kvm_realm_state(kvm)) {
>> +	case REALM_STATE_NONE:
>> +		break;
>> +	case REALM_STATE_NEW:
>> +		return 0;
>> +	case REALM_STATE_DEAD:
>> +		return -ENXIO;
>> +	default:
>> +		return -EBUSY;
>> +	}
>> +
>> +	ret = realm_create_rd(kvm);
>> +	return ret;
>>  }
> 
> I think ret can be simplified out.
Indeed.

Thanks,
Steve

> Thanks,
> Wei-Lin Chang
> 
>>  
>>  static int set_ripas_of_protected_regions(struct kvm *kvm)
>> -- 
>> 2.43.0
>>


^ permalink raw reply

* Re: [PATCH v14 26/44] arm64: RMI: Allow populating initial contents
From: Suzuki K Poulose @ 2026-06-08  9:41 UTC (permalink / raw)
  To: Steven Price, Gavin Shan, kvm, kvmarm
  Cc: Catalin Marinas, Marc Zyngier, Will Deacon, James Morse,
	Oliver Upton, Zenghui Yu, linux-arm-kernel, linux-kernel,
	Joey Gouly, Alexandru Elisei, Christoffer Dall, Fuad Tabba,
	linux-coco, Ganapatrao Kulkarni, Shanker Donthineni, Alper Gun,
	Aneesh Kumar K . V, Emi Kisanuki, Vishal Annapurve, WeiLin.Chang,
	Lorenzo.Pieralisi2
In-Reply-To: <0c71b4b8-ad0b-4a24-9f4a-180b2aaacdb6@arm.com>

On 08/06/2026 10:36, Steven Price wrote:
> On 28/05/2026 06:30, Gavin Shan wrote:
>> Hi Steve,
>>
>> On 5/13/26 11:17 PM, Steven Price wrote:
>>> The VMM needs to populate the realm with some data before starting (e.g.
>>> a kernel and initrd). This is measured by the RMM and used as part of
>>> the attestation later on.
>>>
>>> Signed-off-by: Steven Price <steven.price@arm.com>

...

>>> diff --git a/arch/arm64/kvm/rmi.c b/arch/arm64/kvm/rmi.c
>>> index a89873a5eb77..209087bcf399 100644
>>> --- a/arch/arm64/kvm/rmi.c
>>> +++ b/arch/arm64/kvm/rmi.c
>>> @@ -486,6 +486,75 @@ void kvm_realm_unmap_range(struct kvm *kvm,
>>> unsigned long start,
>>>            realm_unmap_private_range(kvm, start, end, may_block);
>>>    }
>>>    +static int realm_data_map_init(struct kvm *kvm, unsigned long ipa,
>>> +                   kvm_pfn_t dst_pfn, kvm_pfn_t src_pfn,
>>> +                   unsigned long flags)
>>> +{
>>> +    struct realm *realm = &kvm->arch.realm;
>>> +    phys_addr_t rd = virt_to_phys(realm->rd);
>>> +    phys_addr_t dst_phys, src_phys;
>>> +    int ret;
>>> +
>>> +    dst_phys = __pfn_to_phys(dst_pfn);
>>> +    src_phys = __pfn_to_phys(src_pfn);
>>> +
>>> +    if (rmi_delegate_page(dst_phys))
>>> +        return -ENXIO;
>>> +
>>> +    ret = rmi_rtt_data_map_init(rd, dst_phys, ipa, src_phys, flags);
>>> +    if (RMI_RETURN_STATUS(ret) == RMI_ERROR_RTT) {
>>> +        /* Create missing RTTs and retry */
>>> +        int level = RMI_RETURN_INDEX(ret);
>>> +
>>> +        KVM_BUG_ON(level == KVM_PGTABLE_LAST_LEVEL, kvm);
>>
>>          KVM_BUG_ON(level >= KVM_PGTABLE_LAST_LEVEL, kvm);
> 
> Ack.
> 

Thinking more about this, I guess a buggy VMM can trigger this
by populating twice? (level == KVM_PGTABLE_LAST_LEVEL). So should we
return the error back, rather than warning here and suppressing the error?


Suzuki

^ permalink raw reply

* Re: [PATCH v14 26/44] arm64: RMI: Allow populating initial contents
From: Steven Price @ 2026-06-08  9:36 UTC (permalink / raw)
  To: Gavin Shan, kvm, kvmarm
  Cc: Catalin Marinas, Marc Zyngier, Will Deacon, James Morse,
	Oliver Upton, Suzuki K Poulose, Zenghui Yu, linux-arm-kernel,
	linux-kernel, Joey Gouly, Alexandru Elisei, Christoffer Dall,
	Fuad Tabba, linux-coco, Ganapatrao Kulkarni, Shanker Donthineni,
	Alper Gun, Aneesh Kumar K . V, Emi Kisanuki, Vishal Annapurve,
	WeiLin.Chang, Lorenzo.Pieralisi2
In-Reply-To: <ea4be6c5-9506-4253-80c5-c76c9ac3b77d@redhat.com>

On 28/05/2026 06:30, Gavin Shan wrote:
> Hi Steve,
> 
> On 5/13/26 11:17 PM, Steven Price wrote:
>> The VMM needs to populate the realm with some data before starting (e.g.
>> a kernel and initrd). This is measured by the RMM and used as part of
>> the attestation later on.
>>
>> Signed-off-by: Steven Price <steven.price@arm.com>
>> ---
>> Changes since v13:
>>   * Rename realm_create_protected_data_page() to realm_data_map_init().
>> Changes since v12:
>>   * The ioctl now updates the structure with the amount populated rather
>>     than returning this through the ioctl return code.
>>   * Use the new RMM v2.0 range based RMI calls.
>>   * Adapt to upstream changes in kvm_gmem_populate().
>> Changes since v11:
>>   * The multiplex CAP is gone and there's a new ioctl which makes use of
>>     the generic kvm_gmem_populate() functionality.
>> Changes since v7:
>>   * Improve the error codes.
>>   * Other minor changes from review.
>> Changes since v6:
>>   * Handle host potentially having a larger page size than the RMM
>>     granule.
>>   * Drop historic "par" (protected address range) from
>>     populate_par_region() - it doesn't exist within the current
>>     architecture.
>>   * Add a cond_resched() call in kvm_populate_realm().
>> Changes since v5:
>>   * Refactor to use PFNs rather than tracking struct page in
>>     realm_create_protected_data_page().
>>   * Pull changes from a later patch (in the v5 series) for accessing
>>     pages from a guest memfd.
>>   * Do the populate in chunks to avoid holding locks for too long and
>>     triggering RCU stall warnings.
>> ---
>>   arch/arm64/include/asm/kvm_rmi.h |   4 ++
>>   arch/arm64/kvm/Kconfig           |   1 +
>>   arch/arm64/kvm/arm.c             |  13 ++++
>>   arch/arm64/kvm/rmi.c             | 106 +++++++++++++++++++++++++++++++
>>   4 files changed, 124 insertions(+)
>>
>> diff --git a/arch/arm64/include/asm/kvm_rmi.h b/arch/arm64/include/
>> asm/kvm_rmi.h
>> index 007249a13dbc..a2b6bc412a22 100644
>> --- a/arch/arm64/include/asm/kvm_rmi.h
>> +++ b/arch/arm64/include/asm/kvm_rmi.h
>> @@ -88,6 +88,10 @@ int kvm_rec_enter(struct kvm_vcpu *vcpu);
>>   int kvm_rec_pre_enter(struct kvm_vcpu *vcpu);
>>   int handle_rec_exit(struct kvm_vcpu *vcpu, int rec_run_status);
>>   +struct kvm_arm_rmi_populate;
>> +
>> +int kvm_arm_rmi_populate(struct kvm *kvm,
>> +             struct kvm_arm_rmi_populate *arg);
>>   void kvm_realm_unmap_range(struct kvm *kvm,
>>                  unsigned long ipa,
>>                  unsigned long size,
>> diff --git a/arch/arm64/kvm/Kconfig b/arch/arm64/kvm/Kconfig
>> index 4e16719fda22..d0cd011cf672 100644
>> --- a/arch/arm64/kvm/Kconfig
>> +++ b/arch/arm64/kvm/Kconfig
>> @@ -38,6 +38,7 @@ menuconfig KVM
>>       select GUEST_PERF_EVENTS if PERF_EVENTS
>>       select KVM_GUEST_MEMFD
>>       select KVM_GENERIC_MEMORY_ATTRIBUTES
>> +    select HAVE_KVM_ARCH_GMEM_POPULATE
>>       help
>>         Support hosting virtualized guest machines.
>>   diff --git a/arch/arm64/kvm/arm.c b/arch/arm64/kvm/arm.c
>> index ed88a203b892..073ba9181da9 100644
>> --- a/arch/arm64/kvm/arm.c
>> +++ b/arch/arm64/kvm/arm.c
>> @@ -2131,6 +2131,19 @@ int kvm_arch_vm_ioctl(struct file *filp,
>> unsigned int ioctl, unsigned long arg)
>>               return -EFAULT;
>>           return kvm_vm_ioctl_get_reg_writable_masks(kvm, &range);
>>       }
>> +    case KVM_ARM_RMI_POPULATE: {
>> +        struct kvm_arm_rmi_populate req;
>> +        int ret;
>> +
>> +        if (!kvm_is_realm(kvm))
>> +            return -ENXIO;
>> +        if (copy_from_user(&req, argp, sizeof(req)))
>> +            return -EFAULT;
>> +        ret = kvm_arm_rmi_populate(kvm, &req);
>> +        if (copy_to_user(argp, &req, sizeof(req)))
>> +            return -EFAULT;
>> +        return ret;
>> +    }
> 
> s/return ret/return 0; The variable 'ret' can be dropped.

kvm_arm_rmi_populate() may return an error though. E.g. if the
"reserved" field is set then it's kvm_arm_rmi_populate() that detects
that and returns -EINVAL.

>>       default:
>>           return -EINVAL;
>>       }
>> diff --git a/arch/arm64/kvm/rmi.c b/arch/arm64/kvm/rmi.c
>> index a89873a5eb77..209087bcf399 100644
>> --- a/arch/arm64/kvm/rmi.c
>> +++ b/arch/arm64/kvm/rmi.c
>> @@ -486,6 +486,75 @@ void kvm_realm_unmap_range(struct kvm *kvm,
>> unsigned long start,
>>           realm_unmap_private_range(kvm, start, end, may_block);
>>   }
>>   +static int realm_data_map_init(struct kvm *kvm, unsigned long ipa,
>> +                   kvm_pfn_t dst_pfn, kvm_pfn_t src_pfn,
>> +                   unsigned long flags)
>> +{
>> +    struct realm *realm = &kvm->arch.realm;
>> +    phys_addr_t rd = virt_to_phys(realm->rd);
>> +    phys_addr_t dst_phys, src_phys;
>> +    int ret;
>> +
>> +    dst_phys = __pfn_to_phys(dst_pfn);
>> +    src_phys = __pfn_to_phys(src_pfn);
>> +
>> +    if (rmi_delegate_page(dst_phys))
>> +        return -ENXIO;
>> +
>> +    ret = rmi_rtt_data_map_init(rd, dst_phys, ipa, src_phys, flags);
>> +    if (RMI_RETURN_STATUS(ret) == RMI_ERROR_RTT) {
>> +        /* Create missing RTTs and retry */
>> +        int level = RMI_RETURN_INDEX(ret);
>> +
>> +        KVM_BUG_ON(level == KVM_PGTABLE_LAST_LEVEL, kvm);
> 
>         KVM_BUG_ON(level >= KVM_PGTABLE_LAST_LEVEL, kvm);

Ack.

>> +        ret = realm_create_rtt_levels(realm, ipa, level,
>> +                          KVM_PGTABLE_LAST_LEVEL, NULL);
>> +        if (!ret) {
>> +            ret = rmi_rtt_data_map_init(rd, dst_phys, ipa, src_phys,
>> +                            flags);
>> +        }
>> +    }
>> +
>> +    if (ret) {
>> +        if (WARN_ON(rmi_undelegate_page(dst_phys))) {
>> +            /* Undelegate failed, so we leak the page */
>> +            get_page(pfn_to_page(dst_pfn));
>> +        }
>> +    }
>> +
> 
>     if (ret && WARN_ON(rmi_undelegate_page(dst_phys)) {
>         /* Leak the page that fails to be undelegated */
>         get_page(pfn_to_page(dst_pfn));
>     }

Ack

>> +    return ret;
>> +}
>> +
>> +static int populate_region_cb(struct kvm *kvm, gfn_t gfn, kvm_pfn_t pfn,
>> +                  struct page *src_page, void *opaque)
>> +{
>> +    unsigned long data_flags = *(unsigned long *)opaque;
>> +    phys_addr_t ipa = gfn_to_gpa(gfn);
>> +
>> +    if (!src_page)
>> +        return -EOPNOTSUPP;
>> +
>> +    return realm_data_map_init(kvm, ipa, pfn, page_to_pfn(src_page),
>> +                   data_flags);
>> +}
>> +
>> +static long populate_region(struct kvm *kvm,
>> +                gfn_t base_gfn,
>> +                unsigned long pages,
>> +                u64 uaddr,
>> +                unsigned long data_flags)
>> +{
>> +    long ret = 0;
>> +
>> +    mutex_lock(&kvm->slots_lock);
>> +    ret = kvm_gmem_populate(kvm, base_gfn, u64_to_user_ptr(uaddr),
>> pages,
>> +                populate_region_cb, &data_flags);
>> +    mutex_unlock(&kvm->slots_lock);
>> +
>> +    return ret;
>> +}
>> +
>>   enum ripas_action {
>>       RIPAS_INIT,
>>       RIPAS_SET,
>> @@ -574,6 +643,43 @@ static int realm_ensure_created(struct kvm *kvm)
>>       return -ENXIO;
>>   }
>>   +int kvm_arm_rmi_populate(struct kvm *kvm,
>> +             struct kvm_arm_rmi_populate *args)
>> +{
>> +    unsigned long data_flags = 0;
>> +    unsigned long ipa_start = args->base;
>> +    unsigned long ipa_end = ipa_start + args->size;
>> +    long pages_populated;
>> +    int ret;
>> +
>> +    if (args->reserved ||
>> +        (args->flags & ~KVM_ARM_RMI_POPULATE_FLAGS_MEASURE) ||
>> +        !IS_ALIGNED(ipa_start, PAGE_SIZE) ||
>> +        !IS_ALIGNED(ipa_end, PAGE_SIZE) ||
>> +        !IS_ALIGNED(args->source_uaddr, PAGE_SIZE))
>> +        return -EINVAL;
>> +
> 
> There are more conditions missed here:
> 
>     args->size == 0, return 0;
>     args->base + args->size < args->base, return -EINVAL;  // wrapped range

Good catch. args->size == 0 can trigger a WARN_ON currently. I'll put
the "return 0" after the realm_ensure_created() call so the behaviour
matches.

I don't think the wrapped range is quite such a problem - but detecting
it and rejecting it early seems like a good idea.
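
As a sketch of what the two extra checks amount to, assuming 4KB pages
and a hypothetical check_populate_range() helper (the real code would do
the size == 0 return after realm_ensure_created(), which is not modelled
here):

```c
#include <assert.h>
#include <errno.h>
#include <stdint.h>

#define PAGE_SIZE 4096ULL	/* assumed for this sketch */

/*
 * Returns -EINVAL for a wrapped or misaligned range, 0 when there is
 * nothing to populate, and 1 when there is work to do.
 */
static int check_populate_range(uint64_t base, uint64_t size)
{
	uint64_t end = base + size;	/* unsigned wrap is well defined */

	if (end < base)
		return -EINVAL;
	if ((base | end) & (PAGE_SIZE - 1))
		return -EINVAL;
	if (size == 0)
		return 0;
	return 1;
}

/* Usage: an empty range and a range that wraps past 2^64. */
void populate_range_example(void)
{
	assert(check_populate_range(0x1000, 0) == 0);
	assert(check_populate_range(0xFFFFFFFFFFFFF000ULL, 0x2000) == -EINVAL);
}
```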

>> +    ret = realm_ensure_created(kvm);
>> +    if (ret)
>> +        return ret;
>> +
>> +    if (args->flags & KVM_ARM_RMI_POPULATE_FLAGS_MEASURE)
>> +        data_flags |= RMI_MEASURE_CONTENT;
>> +
>> +    pages_populated = populate_region(kvm, gpa_to_gfn(ipa_start),
>> +                      args->size >> PAGE_SHIFT,
>> +                      args->source_uaddr, data_flags);
>> +
>> +    if (pages_populated < 0)
>> +        return pages_populated;
> 
> pages_populated is 'unsigned long', this function returns an 'int' value.

pages_populated is *signed* long. This is handling an error code - so if
it's negative we expect the error code to be between -1 and -MAX_ERRNO
which should easily fit within the 'int' return.

For positive values we continue below (encoding the potentially larger
number in the args outputs) and return 0.
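
The split between the two cases can be modelled in userspace like this.
finish_populate() is a hypothetical stand-in for the tail of
kvm_arm_rmi_populate(), and PAGE_SHIFT is assumed to be 12.

```c
#include <assert.h>
#include <errno.h>

#define PAGE_SHIFT 12	/* assumed 4KB pages */

/*
 * A negative count is an errno-style code and is passed through as int.
 * A non-negative count only adjusts the in/out arguments, so the wider
 * value never has to travel through the int return.
 */
static int finish_populate(long pages_populated,
			   unsigned long long *base,
			   unsigned long long *size)
{
	if (pages_populated < 0)
		return (int)pages_populated;

	*size -= (unsigned long long)pages_populated << PAGE_SHIFT;
	*base += (unsigned long long)pages_populated << PAGE_SHIFT;
	return 0;
}

/* Usage: three of eight pages populated, then an error. */
void finish_populate_example(void)
{
	unsigned long long base = 0x80000000ULL;
	unsigned long long size = 8ULL << PAGE_SHIFT;

	assert(finish_populate(3, &base, &size) == 0);
	assert(base == 0x80003000ULL);
	assert(size == (5ULL << PAGE_SHIFT));

	assert(finish_populate(-EINVAL, &base, &size) == -EINVAL);
	assert(size == (5ULL << PAGE_SHIFT));	/* untouched on error */
}
```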

Thanks,
Steve

>> +
>> +    args->size -= pages_populated << PAGE_SHIFT;
>> +    args->source_uaddr += pages_populated << PAGE_SHIFT;
>> +    args->base += pages_populated << PAGE_SHIFT;
>> +
>> +    return 0;
>> +}
>> +
>>   static void kvm_complete_ripas_change(struct kvm_vcpu *vcpu)
>>   {
>>       struct kvm *kvm = vcpu->kvm;
> 
> Thanks,
> Gavin
> 


^ permalink raw reply

* Re: [PATCH v14 29/44] arm64: RMI: Runtime faulting of memory
From: Suzuki K Poulose @ 2026-06-08  9:30 UTC (permalink / raw)
  To: Gavin Shan, Steven Price, kvm, kvmarm
  Cc: Catalin Marinas, Marc Zyngier, Will Deacon, James Morse,
	Oliver Upton, Zenghui Yu, linux-arm-kernel, linux-kernel,
	Joey Gouly, Alexandru Elisei, Christoffer Dall, Fuad Tabba,
	linux-coco, Ganapatrao Kulkarni, Shanker Donthineni, Alper Gun,
	Aneesh Kumar K . V, Emi Kisanuki, Vishal Annapurve, WeiLin.Chang,
	Lorenzo.Pieralisi2
In-Reply-To: <3359f788-07fa-41a1-9ac7-45c58577c1fa@redhat.com>

On 05/06/2026 07:23, Gavin Shan wrote:
> Hi Steve,
> 
> On 5/13/26 11:17 PM, Steven Price wrote:
>> At runtime if the realm guest accesses memory which hasn't yet been
>> mapped then KVM needs to either populate the region or fault the guest.
>>
>> For memory in the lower (protected) region of IPA a fresh page is
>> provided to the RMM which will zero the contents. For memory in the
>> upper (shared) region of IPA, the memory from the memslot is mapped
>> into the realm VM non secure.
>>
>> Signed-off-by: Steven Price <steven.price@arm.com>
>> ---
>> Changes since v13:
>>   * Numerous changes due to rebasing.
>>   * Fix addr_range_desc() to encode the correct block size.
>> Changes since v12:
>>   * Switch to RMM v2.0 range based APIs.
>> Changes since v11:
>>   * Adapt to upstream changes.
>> Changes since v10:
>>   * RME->RMI renaming.
>>   * Adapt to upstream gmem changes.
>> Changes since v9:
>>   * Fix call to kvm_stage2_unmap_range() in kvm_free_stage2_pgd() to set
>>     may_block to avoid stall warnings.
>>   * Minor coding style fixes.
>> Changes since v8:
>>   * Propagate the may_block flag.
>>   * Minor comments and coding style changes.
>> Changes since v7:
>>   * Remove redundant WARN_ONs for realm_create_rtt_levels() - it will
>>     internally WARN when necessary.
>> Changes since v6:
>>   * Handle PAGE_SIZE being larger than RMM granule size.
>>   * Some minor renaming following review comments.
>> Changes since v5:
>>   * Reduce use of struct page in preparation for supporting the RMM
>>     having a different page size to the host.
>>   * Handle a race when delegating a page where another CPU has faulted on
>>     the same page (and already delegated the physical page) but not yet
>>     mapped it. In this case simply return to the guest to either use the
>>     mapping from the other CPU (or refault if the race is lost).
>>   * The changes to populate_par_region() are moved into the previous
>>     patch where they belong.
>> Changes since v4:
>>   * Code cleanup following review feedback.
>>   * Drop the PTE_SHARED bit when creating unprotected page table entries.
>>     This is now set by the RMM and the host has no control of it and the
>>     spec requires the bit to be set to zero.
>> Changes since v2:
>>   * Avoid leaking memory if failing to map it in the realm.
>>   * Correctly mask RTT based on LPA2 flag (see rtt_get_phys()).
>>   * Adapt to changes in previous patches.
>> ---
>>   arch/arm64/include/asm/kvm_emulate.h |   8 ++
>>   arch/arm64/include/asm/kvm_rmi.h     |  12 ++
>>   arch/arm64/kvm/mmu.c                 | 128 ++++++++++++++++----
>>   arch/arm64/kvm/rmi.c                 | 173 +++++++++++++++++++++++++++
>>   4 files changed, 301 insertions(+), 20 deletions(-)
>>
>> diff --git a/arch/arm64/include/asm/kvm_emulate.h b/arch/arm64/ 
>> include/asm/kvm_emulate.h
>> index 2e69fe494716..8b6f9d26b5d8 100644
>> --- a/arch/arm64/include/asm/kvm_emulate.h
>> +++ b/arch/arm64/include/asm/kvm_emulate.h
>> @@ -712,6 +712,14 @@ static inline bool kvm_realm_is_created(struct 
>> kvm *kvm)
>>       return kvm_is_realm(kvm) && kvm_realm_state(kvm) != 
>> REALM_STATE_NONE;
>>   }
>> +static inline gpa_t kvm_gpa_from_fault(struct kvm *kvm, phys_addr_t ipa)
>> +{
>> +    if (!kvm_is_realm(kvm))
>> +        return ipa;
>> +
>> +    return ipa & ~BIT(kvm->arch.realm.ia_bits - 1);
>> +}
>> +
>>   static inline bool vcpu_is_rec(const struct kvm_vcpu *vcpu)
>>   {
>>       return kvm_is_realm(vcpu->kvm);
>> diff --git a/arch/arm64/include/asm/kvm_rmi.h b/arch/arm64/include/ 
>> asm/kvm_rmi.h
>> index a2b6bc412a22..b65cfec10dee 100644
>> --- a/arch/arm64/include/asm/kvm_rmi.h
>> +++ b/arch/arm64/include/asm/kvm_rmi.h
>> @@ -6,6 +6,7 @@
>>   #ifndef __ASM_KVM_RMI_H
>>   #define __ASM_KVM_RMI_H
>> +#include <asm/kvm_pgtable.h>
>>   #include <asm/rmi_smc.h>
>>   /**
>> @@ -97,6 +98,17 @@ void kvm_realm_unmap_range(struct kvm *kvm,
>>                  unsigned long size,
>>                  bool unmap_private,
>>                  bool may_block);
>> +int realm_map_protected(struct kvm *kvm,
>> +            unsigned long base_ipa,
>> +            kvm_pfn_t pfn,
>> +            unsigned long size,
>> +            struct kvm_mmu_memory_cache *memcache);
>> +int realm_map_non_secure(struct realm *realm,
>> +             unsigned long ipa,
>> +             kvm_pfn_t pfn,
>> +             unsigned long size,
>> +             enum kvm_pgtable_prot prot,
>> +             struct kvm_mmu_memory_cache *memcache);
>>   static inline bool kvm_realm_is_private_address(struct realm *realm,
>>                           unsigned long addr)
>> diff --git a/arch/arm64/kvm/mmu.c b/arch/arm64/kvm/mmu.c
>> index ac2a0f0106b0..776ffe56d17e 100644
>> --- a/arch/arm64/kvm/mmu.c
>> +++ b/arch/arm64/kvm/mmu.c
>> @@ -334,8 +334,15 @@ static void __unmap_stage2_range(struct 
>> kvm_s2_mmu *mmu, phys_addr_t start, u64
>>       lockdep_assert_held_write(&kvm->mmu_lock);
>>       WARN_ON(size & ~PAGE_MASK);
>> -    WARN_ON(stage2_apply_range(mmu, start, end, 
>> KVM_PGT_FN(kvm_pgtable_stage2_unmap),
>> -                   may_block));
>> +
>> +    if (kvm_is_realm(kvm)) {
>> +        kvm_realm_unmap_range(kvm, start, size, !only_shared,
>> +                      may_block);
>> +    } else {
>> +        WARN_ON(stage2_apply_range(mmu, start, end,
>> +                       KVM_PGT_FN(kvm_pgtable_stage2_unmap),
>> +                       may_block));
>> +    }
>>   }
>>   void kvm_stage2_unmap_range(struct kvm_s2_mmu *mmu, phys_addr_t start,
>> @@ -358,7 +365,10 @@ static void stage2_flush_memslot(struct kvm *kvm,
>>       phys_addr_t addr = memslot->base_gfn << PAGE_SHIFT;
>>       phys_addr_t end = addr + PAGE_SIZE * memslot->npages;
>> -    kvm_stage2_flush_range(&kvm->arch.mmu, addr, end);
>> +    if (kvm_is_realm(kvm))
>> +        kvm_realm_unmap_range(kvm, addr, end - addr, false, true);
>> +    else
>> +        kvm_stage2_flush_range(&kvm->arch.mmu, addr, end);
>>   }
>>   /**
>> @@ -1103,6 +1113,10 @@ void stage2_unmap_vm(struct kvm *kvm)
>>       struct kvm_memory_slot *memslot;
>>       int idx, bkt;
>> +    /* For realms this is handled by the RMM so nothing to do here */
>> +    if (kvm_is_realm(kvm))
>> +        return;
>> +
>>       idx = srcu_read_lock(&kvm->srcu);
>>       mmap_read_lock(current->mm);
>>       write_lock(&kvm->mmu_lock);
>> @@ -1528,6 +1542,29 @@ static bool kvm_vma_mte_allowed(struct 
>> vm_area_struct *vma)
>>       return vma->vm_flags & VM_MTE_ALLOWED;
>>   }
>> +static int realm_map_ipa(struct kvm *kvm, phys_addr_t ipa,
>> +             kvm_pfn_t pfn, unsigned long map_size,
>> +             enum kvm_pgtable_prot prot,
>> +             struct kvm_mmu_memory_cache *memcache)
>> +{
>> +    struct realm *realm = &kvm->arch.realm;
>> +
>> +    /*
>> +     * Write permission is required for now even though it's possible to
>> +     * map unprotected pages (granules) as read-only. It's impossible to
>> +     * map protected pages (granules) as read-only.
>> +     */
>> +    if (WARN_ON(!(prot & KVM_PGTABLE_PROT_W)))
>> +        return -EFAULT;
>> +
> 
> I'm a bit concerned about this. KVM_PGTABLE_PROT_W is not set in @prot
> if the stage2 fault is raised by a memory read. With -EFAULT returned
> to the VMM (e.g. QEMU), the vCPU stops executing and the system won't
> work any more.
> 
>> +    ipa = ALIGN_DOWN(ipa, PAGE_SIZE);
>> +    if (!kvm_realm_is_private_address(realm, ipa))
>> +        return realm_map_non_secure(realm, ipa, pfn, map_size, prot,
>> +                        memcache);
>> +
>> +    return realm_map_protected(kvm, ipa, pfn, map_size, memcache);
>> +}
>> +
>>   static bool kvm_vma_is_cacheable(struct vm_area_struct *vma)
>>   {
>>       switch (FIELD_GET(PTE_ATTRINDX_MASK, pgprot_val(vma->vm_page_prot))) {
>> @@ -1604,27 +1641,52 @@ static int gmem_abort(const struct 
>> kvm_s2_fault_desc *s2fd)
>>       bool write_fault, exec_fault;
>>       enum kvm_pgtable_walk_flags flags = KVM_PGTABLE_WALK_SHARED;
>>       enum kvm_pgtable_prot prot = KVM_PGTABLE_PROT_R;
>> -    struct kvm_pgtable *pgt = s2fd->vcpu->arch.hw_mmu->pgt;
>> +    struct kvm_vcpu *vcpu = s2fd->vcpu;
>> +    struct kvm_pgtable *pgt = vcpu->arch.hw_mmu->pgt;
>> +    gpa_t gpa = kvm_gpa_from_fault(vcpu->kvm, s2fd->fault_ipa);
>>       unsigned long mmu_seq;
>>       struct page *page;
>> -    struct kvm *kvm = s2fd->vcpu->kvm;
>> +    struct kvm *kvm = vcpu->kvm;
>>       void *memcache;
>>       kvm_pfn_t pfn;
>>       gfn_t gfn;
>>       int ret;
>> -    memcache = get_mmu_memcache(s2fd->vcpu);
>> -    ret = topup_mmu_memcache(s2fd->vcpu, memcache);
>> +    if (kvm_is_realm(vcpu->kvm)) {
>> +        /* check for memory attribute mismatch */
>> +        bool is_priv_gfn = kvm_mem_is_private(kvm, gpa >> PAGE_SHIFT);
>> +        /*
>> +         * For Realms, the shared address is an alias of the private
>> +         * PA with the top bit set. Thus if the fault address matches
>> +         * the GPA then it is the private alias.
>> +         */
>> +        bool is_priv_fault = (gpa == s2fd->fault_ipa);
>> +
>> +        if (is_priv_gfn != is_priv_fault) {
>> +            kvm_prepare_memory_fault_exit(vcpu, gpa, PAGE_SIZE,
>> +                              kvm_is_write_fault(vcpu),
>> +                              false,
>> +                              is_priv_fault);
>> +            /*
>> +             * KVM_EXIT_MEMORY_FAULT requires a return code of
>> +             * -EFAULT, see the API documentation
>> +             */
>> +            return -EFAULT;
>> +        }
>> +    }
>> +
>> +    memcache = get_mmu_memcache(vcpu);
>> +    ret = topup_mmu_memcache(vcpu, memcache);
>>       if (ret)
>>           return ret;
>>       if (s2fd->nested)
>>           gfn = kvm_s2_trans_output(s2fd->nested) >> PAGE_SHIFT;
>>       else
>> -        gfn = s2fd->fault_ipa >> PAGE_SHIFT;
>> +        gfn = gpa >> PAGE_SHIFT;
>> -    write_fault = kvm_is_write_fault(s2fd->vcpu);
>> -    exec_fault = kvm_vcpu_trap_is_exec_fault(s2fd->vcpu);
>> +    write_fault = kvm_is_write_fault(vcpu);
>> +    exec_fault = kvm_vcpu_trap_is_exec_fault(vcpu);
>>       VM_WARN_ON_ONCE(write_fault && exec_fault);
>> @@ -1634,7 +1696,7 @@ static int gmem_abort(const struct 
>> kvm_s2_fault_desc *s2fd)
>>       ret = kvm_gmem_get_pfn(kvm, s2fd->memslot, gfn, &pfn, &page, NULL);
>>       if (ret) {
>> -        kvm_prepare_memory_fault_exit(s2fd->vcpu, s2fd->fault_ipa, 
>> PAGE_SIZE,
>> +        kvm_prepare_memory_fault_exit(vcpu, gpa, PAGE_SIZE,
>>                             write_fault, exec_fault, false);
>>           return ret;
>>       }
>> @@ -1654,14 +1716,20 @@ static int gmem_abort(const struct 
>> kvm_s2_fault_desc *s2fd)
>>       kvm_fault_lock(kvm);
>>       if (mmu_invalidate_retry(kvm, mmu_seq)) {
>>           ret = -EAGAIN;
>> -        goto out_unlock;
>> +        goto out_release_page;
>> +    }
>> +
>> +    if (kvm_is_realm(kvm)) {
>> +        ret = realm_map_ipa(kvm, s2fd->fault_ipa, pfn,
>> +                    PAGE_SIZE, KVM_PGTABLE_PROT_R | 
>> KVM_PGTABLE_PROT_W, memcache);
>> +        goto out_release_page;
>>       }
>>       ret = KVM_PGT_FN(kvm_pgtable_stage2_map)(pgt, s2fd->fault_ipa, 
>> PAGE_SIZE,
>>                            __pfn_to_phys(pfn), prot,
>>                            memcache, flags);
>> -out_unlock:
>> +out_release_page:
>>       kvm_release_faultin_page(kvm, page, !!ret, prot & 
>> KVM_PGTABLE_PROT_W);
>>       kvm_fault_unlock(kvm);
>> @@ -1847,7 +1915,7 @@ static int kvm_s2_fault_get_vma_info(const 
>> struct kvm_s2_fault_desc *s2fd,
>>        * mapping size to ensure we find the right PFN and lay down the
>>        * mapping in the right place.
>>        */
>> -    s2vi->gfn = ALIGN_DOWN(s2fd->fault_ipa, s2vi->vma_pagesize) >> 
>> PAGE_SHIFT;
>> +    s2vi->gfn = kvm_gpa_from_fault(kvm, ALIGN_DOWN(s2fd->fault_ipa, 
>> s2vi->vma_pagesize)) >> PAGE_SHIFT;
>>       s2vi->mte_allowed = kvm_vma_mte_allowed(vma);
>> @@ -2056,6 +2124,9 @@ static int kvm_s2_fault_map(const struct 
>> kvm_s2_fault_desc *s2fd,
>>           prot &= ~KVM_NV_GUEST_MAP_SZ;
>>           ret = KVM_PGT_FN(kvm_pgtable_stage2_relax_perms)(pgt, 
>> gfn_to_gpa(gfn),
>>                                    prot, flags);
>> +    } else if (kvm_is_realm(kvm)) {
>> +        ret = realm_map_ipa(kvm, s2fd->fault_ipa, pfn, mapping_size,
>> +                    prot, memcache);
>>       } else {
>>           ret = KVM_PGT_FN(kvm_pgtable_stage2_map)(pgt, 
>> gfn_to_gpa(gfn), mapping_size,
>>                                __pfn_to_phys(pfn), prot,
> 
> For the kvm_is_realm() case, do we need to adjust 's2fd->fault_ipa' for
> huge pages? In kvm_s2_fault_map(), @gfn and @pfn may have been adjusted by
> transparent_hugepage_adjust() to be aligned to the huge page size. If that
> adjustment happened, we need to align s2fd->fault_ipa down to the huge
> page size as well.
> 
> 
>> @@ -2214,6 +2285,13 @@ int kvm_handle_guest_sea(struct kvm_vcpu *vcpu)
>>       return 0;
>>   }
>> +static bool shared_ipa_fault(struct kvm *kvm, phys_addr_t fault_ipa)
>> +{
>> +    gpa_t gpa = kvm_gpa_from_fault(kvm, fault_ipa);
>> +
>> +    return (gpa != fault_ipa);
>> +}
>> +
>>   /**
>>    * kvm_handle_guest_abort - handles all 2nd stage aborts
>>    * @vcpu:    the VCPU pointer
>> @@ -2324,8 +2402,9 @@ int kvm_handle_guest_abort(struct kvm_vcpu *vcpu)
>>           nested = &nested_trans;
>>       }
>> -    gfn = ipa >> PAGE_SHIFT;
>> +    gfn = kvm_gpa_from_fault(vcpu->kvm, ipa) >> PAGE_SHIFT;
>>       memslot = gfn_to_memslot(vcpu->kvm, gfn);
>> +
>>       hva = gfn_to_hva_memslot_prot(memslot, gfn, &writable);
>>       write_fault = kvm_is_write_fault(vcpu);
>>       if (kvm_is_error_hva(hva) || (write_fault && !writable)) {
>> @@ -2368,7 +2447,7 @@ int kvm_handle_guest_abort(struct kvm_vcpu *vcpu)
>>            * of the page size.
>>            */
>>           ipa |= FAR_TO_FIPA_OFFSET(kvm_vcpu_get_hfar(vcpu));
>> -        ret = io_mem_abort(vcpu, ipa);
>> +        ret = io_mem_abort(vcpu, kvm_gpa_from_fault(vcpu->kvm, ipa));
>>           goto out_unlock;
>>       }
>> @@ -2396,7 +2475,7 @@ int kvm_handle_guest_abort(struct kvm_vcpu *vcpu)
>>                   !write_fault &&
>>                   !kvm_vcpu_trap_is_exec_fault(vcpu));
>> -        if (kvm_slot_has_gmem(memslot))
>> +        if (kvm_slot_has_gmem(memslot) && !shared_ipa_fault(vcpu->kvm, fault_ipa))
>>               ret = gmem_abort(&s2fd);
>>           else
>>               ret = user_mem_abort(&s2fd);
>> @@ -2433,6 +2512,10 @@ bool kvm_age_gfn(struct kvm *kvm, struct 
>> kvm_gfn_range *range)
>>       if (!kvm->arch.mmu.pgt || kvm_vm_is_protected(kvm))
>>           return false;
>> +    /* We don't support aging for Realms */
>> +    if (kvm_is_realm(kvm))
>> +        return true;
>> +
>>       return KVM_PGT_FN(kvm_pgtable_stage2_test_clear_young)(kvm->arch.mmu.pgt,
>>                              range->start << PAGE_SHIFT,
>>                              size, true);
>> @@ -2449,6 +2532,10 @@ bool kvm_test_age_gfn(struct kvm *kvm, struct 
>> kvm_gfn_range *range)
>>       if (!kvm->arch.mmu.pgt || kvm_vm_is_protected(kvm))
>>           return false;
>> +    /* We don't support aging for Realms */
>> +    if (kvm_is_realm(kvm))
>> +        return true;
>> +
>>       return KVM_PGT_FN(kvm_pgtable_stage2_test_clear_young)(kvm->arch.mmu.pgt,
>>                              range->start << PAGE_SHIFT,
>>                              size, false);
>> @@ -2628,10 +2715,11 @@ int kvm_arch_prepare_memory_region(struct kvm 
>> *kvm,
>>           return -EFAULT;
>>       /*
>> -     * Only support guest_memfd backed memslots with mappable memory, 
>> since
>> -     * there aren't any CoCo VMs that support only private memory on 
>> arm64.
>> +     * Only support guest_memfd backed memslots with mappable memory,
>> +     * unless the guest is a CCA realm guest.
>>        */
>> -    if (kvm_slot_has_gmem(new) && !kvm_memslot_is_gmem_only(new))
>> +    if (kvm_slot_has_gmem(new) && !kvm_memslot_is_gmem_only(new) &&
>> +        !kvm_is_realm(kvm))
>>           return -EINVAL;
>>       hva = new->userspace_addr;
>> diff --git a/arch/arm64/kvm/rmi.c b/arch/arm64/kvm/rmi.c
>> index cae29fd3353c..761b38a4071c 100644
>> --- a/arch/arm64/kvm/rmi.c
>> +++ b/arch/arm64/kvm/rmi.c
>> @@ -597,6 +597,179 @@ static int realm_data_map_init(struct kvm *kvm, 
>> unsigned long ipa,
>>       return ret;
>>   }
>> +static unsigned long addr_range_desc(unsigned long phys, unsigned 
>> long size)
>> +{
>> +    unsigned long out = 0;
>> +
>> +    switch (size) {
>> +    case P4D_SIZE:
>> +        out = 3 | (1 << 2);
>> +        break;
>> +    case PUD_SIZE:
>> +        out = 2 | (1 << 2);
>> +        break;
>> +    case PMD_SIZE:
>> +        out = 1 | (1 << 2);
>> +        break;
>> +    case PAGE_SIZE:
>> +        out = 0 | (1 << 2);
>> +        break;
>> +    default:
>> +        /*
>> +         * Only support mapping at the page level granularity when
>> +         * it's an unusual length. This should get us back onto a larger
>> +         * block size for the subsequent mappings.
>> +         */
>> +        out = 0 | ((MIN(size >> PAGE_SHIFT, PTRS_PER_PTE - 1)) << 2);
>> +        break;
>> +    }
>> +
>> +    WARN_ON(phys & ~PAGE_MASK);
>> +
>> +    out |= phys & PAGE_MASK;
>> +
>> +    return out;
>> +}
>> +
>> +int realm_map_protected(struct kvm *kvm,
>> +            unsigned long ipa,
>> +            kvm_pfn_t pfn,
>> +            unsigned long map_size,
>> +            struct kvm_mmu_memory_cache *memcache)
>> +{
>> +    struct realm *realm = &kvm->arch.realm;
>> +    phys_addr_t phys = __pfn_to_phys(pfn);
>> +    phys_addr_t base_phys = phys;
>> +    phys_addr_t rd = virt_to_phys(realm->rd);
>> +    unsigned long base_ipa = ipa;
>> +    unsigned long ipa_top = ipa + map_size;
>> +    int ret = 0;
>> +
>> +    if (WARN_ON(!IS_ALIGNED(map_size, PAGE_SIZE) ||
>> +            !IS_ALIGNED(ipa, map_size)))
>> +        return -EINVAL;
>> +
>> +    if (rmi_delegate_range(phys, map_size)) {
>> +        /*
>> +         * It's likely we raced with another VCPU on the same
>> +         * fault. Assume the other VCPU has handled the fault
>> +         * and return to the guest.
>> +         */
>> +        return 0;
>> +    }
>> +
>> +    while (ipa < ipa_top) {
>> +        unsigned long flags = RMI_ADDR_TYPE_SINGLE;
>> +        unsigned long range_desc = addr_range_desc(phys, ipa_top - ipa);
>> +        unsigned long out_top;
>> +
>> +        ret = rmi_rtt_data_map(rd, ipa, ipa_top, flags, range_desc,
>> +                       &out_top);
>> +
>> +        if (RMI_RETURN_STATUS(ret) == RMI_ERROR_RTT) {
>> +            /* Create missing RTTs and retry */
>> +            int level = RMI_RETURN_INDEX(ret);
>> +
>> +            WARN_ON(level == KVM_PGTABLE_LAST_LEVEL);
>> +            ret = realm_create_rtt_levels(realm, ipa, level,
>> +                              KVM_PGTABLE_LAST_LEVEL,
>> +                              memcache);

Could we give the RMM a chance to use block mappings by creating the
missing RTTs only down to the level that may work for the current
range_desc? I.e., if the range_desc is a 2M block size, we could create
tables up to L2 on the first attempt, and if the RMM still needs an RTT, we
could go further down to KVM_PGTABLE_LAST_LEVEL. I understand this is
an optimisation, so maybe we could defer it. (The same applies to
the non_secure map below.)
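
To make the suggestion concrete, here is a toy user-space model of the staged
retry, assuming a 4K granule (2M blocks at level 2, 1G blocks at level 1).
Everything in it is hypothetical: mock_map() stands in for the RMM call,
target_level_for_size() and map_with_staged_rtts() are illustrative names,
and none of this is the RMI API.

```c
#include <assert.h>

#define LAST_LEVEL 3	/* stands in for KVM_PGTABLE_LAST_LEVEL */

/* Assumes a 4K granule: 1G blocks at level 1, 2M at level 2, else pages. */
static int target_level_for_size(unsigned long size)
{
	if (size >= (1UL << 30))
		return 1;
	if (size >= (1UL << 21))
		return 2;
	return LAST_LEVEL;
}

/*
 * Toy model of the RMM: tables exist down to have_level, and the map
 * only succeeds once tables reach need_level. Returns 0 on success,
 * or -1 meaning "RTT missing".
 */
static int mock_map(int have_level, int need_level)
{
	return have_level >= need_level ? 0 : -1;
}

/*
 * Two-step retry: first create tables only down to the block level the
 * range descriptor asks for; if the map still fails, go all the way to
 * the last level. Returns the deepest table level that had to exist
 * for the map to succeed, or -1 if it still failed.
 */
static int map_with_staged_rtts(unsigned long size, int have_level,
				int rmm_need_level)
{
	int target = target_level_for_size(size);

	if (mock_map(have_level, rmm_need_level) == 0)
		return have_level;

	if (have_level < target)
		have_level = target;		/* first attempt: block level */
	if (mock_map(have_level, rmm_need_level) == 0)
		return have_level;

	have_level = LAST_LEVEL;		/* fallback: page level */
	if (mock_map(have_level, rmm_need_level) == 0)
		return have_level;

	return -1;
}
```

In this model a 2M request that the RMM can satisfy with a block stops at
level 2 and never allocates the level 3 table; only when the RMM insists
does the fallback go down to the last level.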


>> +            if (ret)
>> +                goto err_undelegate;
>> +
>> +            ret = rmi_rtt_data_map(rd, ipa, ipa_top, flags,
>> +                           range_desc, &out_top);
>> +        }
>> +
>> +        if (WARN_ON(ret))
>> +            goto err_undelegate;
>> +
>> +        phys += out_top - ipa;
>> +        ipa = out_top;
>> +    }
>> +
>> +    return 0;
>> +
>> +err_undelegate:
>> +    realm_unmap_private_range(kvm, base_ipa, ipa, true);
>> +    if (WARN_ON(rmi_undelegate_range(base_phys, map_size))) {
>> +        /* Page can't be returned to NS world so is lost */
>> +        get_page(phys_to_page(base_phys));
>> +    }
>> +    return -ENXIO;
>> +}
>> +
>> +int realm_map_non_secure(struct realm *realm,
>> +             unsigned long ipa,
>> +             kvm_pfn_t pfn,
>> +             unsigned long size,
>> +             enum kvm_pgtable_prot prot,
>> +             struct kvm_mmu_memory_cache *memcache)
>> +{
>> +    unsigned long attr, flags = 0;
>> +    phys_addr_t rd = virt_to_phys(realm->rd);
>> +    phys_addr_t phys = __pfn_to_phys(pfn);
>> +    unsigned long ipa_top = ipa + size;
>> +    int ret;
>> +
>> +    if (WARN_ON(!IS_ALIGNED(size, PAGE_SIZE) ||
>> +            !IS_ALIGNED(ipa, size)))
>> +        return -EINVAL;
>> +
>> +    switch (prot & (KVM_PGTABLE_PROT_DEVICE | 
>> KVM_PGTABLE_PROT_NORMAL_NC)) {
>> +    case KVM_PGTABLE_PROT_DEVICE | KVM_PGTABLE_PROT_NORMAL_NC:
>> +        return -EINVAL;
>> +    case KVM_PGTABLE_PROT_DEVICE:
>> +        attr = MT_S2_FWB_DEVICE_nGnRE;
>> +        break;
>> +    case KVM_PGTABLE_PROT_NORMAL_NC:
>> +        attr = MT_S2_FWB_NORMAL_NC;
>> +        break;
>> +    default:
>> +        attr = MT_S2_FWB_NORMAL;
>> +    }
>> +
>> +    flags |= FIELD_PREP(RMI_RTT_UNPROT_MAP_FLAGS_MEMATTR, attr);
>> +
>> +    if (prot & KVM_PGTABLE_PROT_R)
>> +        flags |= FIELD_PREP(RMI_RTT_UNPROT_MAP_FLAGS_S2AP, 
>> RMI_S2AP_DIRECT_READ);
>> +    if (prot & KVM_PGTABLE_PROT_W)
>> +        flags |= FIELD_PREP(RMI_RTT_UNPROT_MAP_FLAGS_S2AP, 
>> RMI_S2AP_DIRECT_WRITE);
>> +
>> +    flags |= RMI_ADDR_TYPE_SINGLE;
>> +
>> +    while (ipa < ipa_top) {
>> +        unsigned long range_desc = addr_range_desc(phys, ipa_top - ipa);
>> +        unsigned long out_top;
>> +
>> +        ret = rmi_rtt_unprot_map(rd, ipa, ipa_top, flags, range_desc,
>> +                     &out_top);
>> +
>> +        if (RMI_RETURN_STATUS(ret) == RMI_ERROR_RTT) {
>> +            /* Create missing RTTs and retry */
>> +            int level = RMI_RETURN_INDEX(ret);
>> +
>> +            WARN_ON(level == KVM_PGTABLE_LAST_LEVEL);
>> +            ret = realm_create_rtt_levels(realm, ipa, level,
>> +                              KVM_PGTABLE_LAST_LEVEL,

^^ Same as above.

Suzuki


>> +                              memcache);
>> +            if (ret)
>> +                return ret;
>> +
>> +            ret = rmi_rtt_unprot_map(rd, ipa, ipa_top, flags,
>> +                         range_desc, &out_top);
>> +        }
>> +
>> +        if (WARN_ON(ret))
>> +            return ret;
>> +
>> +        phys += out_top - ipa;
>> +        ipa = out_top;
>> +    }
>> +
>> +    return 0;
>> +}
>> +
>>   static int populate_region_cb(struct kvm *kvm, gfn_t gfn, kvm_pfn_t 
>> pfn,
>>                     struct page *src_page, void *opaque)
>>   {
> 
> Thanks,
> Gavin
> 



* Re: [PATCH v6 06/11] x86/virt/tdx: Optimize tdx_pamt_get/put()
From: Kiryl Shutsemau @ 2026-06-08  9:14 UTC (permalink / raw)
  To: Dave Hansen
  Cc: Chao Gao, Edgecombe, Rick P, kvm@vger.kernel.org,
	linux-coco@lists.linux.dev, Huang, Kai, Zhao, Yan Y,
	seanjc@google.com, mingo@redhat.com, linux-kernel@vger.kernel.org,
	pbonzini@redhat.com, nik.borisov@suse.com,
	linux-doc@vger.kernel.org, hpa@zytor.com, tglx@kernel.org,
	Annapurve, Vishal, bp@alien8.de, kirill.shutemov@linux.intel.com,
	x86@kernel.org
In-Reply-To: <572868d7-4794-4fec-b80f-97d8434d5fb6@intel.com>

On Fri, Jun 05, 2026 at 09:23:21AM -0700, Dave Hansen wrote:
> On 6/5/26 04:42, Kiryl Shutsemau wrote:
> >>> I don't see a reason why we can't keep the scoped_guard() on get side.
> >> One additional reason to drop scoped_guard() is that it mixes cleanup helpers
> >> with goto, which is discouraged. See [*]
> >>
> >>  :Lastly, given that the benefit of cleanup helpers is removal of “goto”, and
> >>  :that the “goto” statement can jump between scopes, the expectation is that
> >>  :usage of “goto” and cleanup helpers is never mixed in the same function.
> > Fair enough.
> > 
> > But it can also be addressed if we free the PAMT page array with the guard
> > too :P
> 
> How important is this patch? I see "Optimize" but I read "Optional".
> 
> If we're arguing about it, maybe we should just kick it out and focus on
> the more important bits.

I don't think it is optional for anything outside of a test setup.

Without the optimization, we have all KVM memory allocations serialized
on a single spinlock. And we do alloc_pamt_array()/free_pamt_array() all
the time too.

And since the lock is global, it is an easy DoS attack vector: one guest
can do a shared->private->shared conversion loop and make every guest on
the host suffer.

-- 
  Kiryl Shutsemau / Kirill A. Shutemov


* Re: [PATCH v6 3/4] firmware: smccc: arm-cca-guest: Bind the TSM provider to an SMCCC device
From: Suzuki K Poulose @ 2026-06-08  9:03 UTC (permalink / raw)
  To: Aneesh Kumar K.V, Sudeep Holla
  Cc: linux-coco, linux-arm-kernel, linux-kernel, Catalin Marinas,
	Greg KH, Jeremy Linton, Jonathan Cameron, Lorenzo Pieralisi,
	Mark Rutland, Will Deacon, Steven Price
In-Reply-To: <yq5ao6hlzbpa.fsf@kernel.org>

On 08/06/2026 09:19, Aneesh Kumar K.V wrote:
> Sudeep Holla <sudeep.holla@kernel.org> writes:
> 
>> On Thu, Jun 04, 2026 at 06:56:28PM +0530, Aneesh Kumar K.V wrote:
>>> Sudeep Holla <sudeep.holla@kernel.org> writes:
>>>
>>> ...
>>>
>>>> +static const struct smccc_device_info smccc_devices[] __initconst = {
>>>> +       {
>>>> +               .func_id        = ARM_SMCCC_TRNG_VERSION,
>>>> +               .requires_smc   = false,
>>>> +               .min_return     = ARM_SMCCC_TRNG_MIN_VERSION,
>>>> +               .device_name    = "arm-smccc-trng",
>>>> +       },
>>>> +};
>>>> +
>>>> +static bool __init
>>>> +smccc_probe_smccc_device(const struct smccc_device_info *smccc_dev)
>>>> +{
>>>> +       struct arm_smccc_res res;
>>>> +       unsigned long ret;
>>>> +
>>>> +       if (!IS_ENABLED(CONFIG_ARM64))
>>>> +               return false;
>>>> +
>>>> +       if (smccc_conduit == SMCCC_CONDUIT_NONE)
>>>> +               return false;
>>>> +
>>>> +       if (smccc_dev->requires_smc && smccc_conduit != SMCCC_CONDUIT_SMC)
>>>> +               return false;
>>>> +
>>>> +       arm_smccc_1_1_invoke(smccc_dev->func_id, &res);
>>>> +       ret = res.a0;
>>>> +
>>>> +       if ((s32)ret < 0)
>>>> +               return false;
>>>> +
>>>> +       return ret >= smccc_dev->min_return;
>>>> +}
>>>> +
>>>>
>>>
>>> I am not sure we want the check to be as simple as ret < 0. Some
>>> function IDs may return input errors based on the supplied arguments
>>> (for example, RMI_ERROR_INPUT). In those cases, we would likely want
>>> this to be handled via a callback.
>>>
>>
>> As I mentioned in response to Suzuki, we can defer that to probe of
>> that device. If *_VERSION succeeds, SMCCC core can add that device and
>> leave the rest to the core keeping the core and bus layer simple IMO.
>>
>>> We also want to use conditional compilation for some function IDs.
>>> Given the callback approach and the #ifdefs, I wonder whether what we
>>> currently have is actually simpler and more flexible.
>>>
>>
>> I was trying to avoid conditional compilation altogether and hence the
>> reason for keeping it as simple as possible. Also IS_ENABLED(CONFIG_ARM64)
>> in above snippet must come as some condition to this generic probe.
>>
>> Adding any more logic or callback defeats the bus idea here if we need
>> to rely/depend on multiple conditional compilation or callbacks IMO.
>>
>> Let's see if it can work with what we are adding now and may add in
>> the near future, and then decide.
>>
> 
> If we move all the conditional checks to the driver probe path, then I
> think this can work. Something like the below:
> 
> struct smccc_device_info {
> 	u32 func_id;
> 	bool requires_smc;
> 	const char *device_name;
> };
> 
> static const struct smccc_device_info smccc_devices[] __initconst = {
> 	{
> 		.func_id        = ARM_SMCCC_TRNG_VERSION,
> 		.requires_smc   = false,
> 		.device_name    = "arm-smccc-trng",
> 	},
> 
> 	{
> 		.func_id        = RSI_ABI_VERSION,

Don't we need to pass parameters to this (e.g., the requested interface
version)? See more below.


> 		.requires_smc   = true,
> 		.device_name    = RSI_DEV_NAME,
> 	},
> };
> 
> static bool __init smccc_probe_smccc_device(const struct smccc_device_info *smccc_dev)
> {
> 	unsigned long ret;
> 	struct arm_smccc_res res;
> 
> 	if (smccc_conduit == SMCCC_CONDUIT_NONE)
> 		return false;
> 
> 	if (smccc_dev->requires_smc && smccc_conduit != SMCCC_CONDUIT_SMC)
> 		return false;
> 
> 	arm_smccc_1_1_invoke(smccc_dev->func_id, &res);
> 	ret = res.a0;
> 
> 	if ((s32)ret == SMCCC_RET_NOT_SUPPORTED)

Is this a reliable check for all possible SMCCC services? I.e., are we
expected to get RET_NOT_SUPPORTED for any service whose backend
is not available?

Also, as pointed out, RSI_ABI_VERSION may return other errors based on
the input (e.g., RSI_ERROR_INPUT for the requested version), and we may
still go ahead and register the device?
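
A small user-space sketch of the concern, under stated assumptions: the
generic test only rejects SMCCC_RET_NOT_SUPPORTED (-1), so a service that
reports an input error as some other value in a0 still looks "present".
generic_probe_accepts() and strict_probe_accepts() are hypothetical helpers,
and the value 1 used for an input error in the usage note is illustrative
only, not taken from a header.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define SMCCC_RET_NOT_SUPPORTED (-1)	/* value defined by SMCCC */

/*
 * The check from the quoted snippet: anything other than
 * NOT_SUPPORTED in the low 32 bits of a0 counts as "device present".
 */
static bool generic_probe_accepts(unsigned long a0)
{
	return (int32_t)a0 != SMCCC_RET_NOT_SUPPORTED;
}

/*
 * Hypothetical stricter variant: the caller also states which a0 value
 * means success for this particular service.
 */
static bool strict_probe_accepts(unsigned long a0, unsigned long success_value)
{
	return generic_probe_accepts(a0) && a0 == success_value;
}
```

With an illustrative input-error value of 1, generic_probe_accepts(1) is true
while strict_probe_accepts(1, 0) is false, which is the case where the device
would be registered even though the version call was rejected. Whether that
is acceptable depends on the driver probe repeating the check.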

> 		return false;
> 
> 	return true;
> }
> 
> static int __init smccc_devices_init(void)
> {
> 	struct arm_smccc_device *sdev;
> 	const struct smccc_device_info *smccc_dev;
> 
> 	for (int i = 0; i < ARRAY_SIZE(smccc_devices); i++) {
> 		smccc_dev = &smccc_devices[i];
> 
> 		if (!smccc_probe_smccc_device(smccc_dev))
> 			continue;
> 
>                 sdev = arm_smccc_device_register(smccc_dev->device_name);
>                 if (IS_ERR(sdev))
>                         pr_err("%s: could not register device: %ld\n",
>                                smccc_dev->device_name, PTR_ERR(sdev));
> 
> 	}
> 
> 	return 0;
> }
> device_initcall(smccc_devices_init);
> 
> with the diff to hw_random/smccc_trng
> 
> modified   arch/arm64/include/asm/archrandom.h
> @@ -12,7 +12,7 @@
>   
>   extern bool smccc_trng_available;
>   
> -static inline bool __init smccc_probe_trng(void)
> +static inline bool smccc_probe_trng(void)
>   {
>   	struct arm_smccc_res res;
>   
> modified   drivers/char/hw_random/arm_smccc_trng.c
> @@ -19,6 +19,8 @@
>   #include <linux/arm-smccc.h>
>   #include <linux/arm-smccc-bus.h>
>   
> +#include <asm/archrandom.h>
> +
>   #ifdef CONFIG_ARM64
>   #define ARM_SMCCC_TRNG_RND	ARM_SMCCC_TRNG_RND64
>   #define MAX_BITS_PER_CALL	(3 * 64UL)
> @@ -98,6 +100,10 @@ static int smccc_trng_probe(struct arm_smccc_device *sdev)
>   {
>   	struct hwrng *trng;
>   
> +	/* validate the minimum version requirement */
> +	if (!smccc_probe_trng())
> +		return -ENODEV;
> +
>   	trng = devm_kzalloc(&sdev->dev, sizeof(*trng), GFP_KERNEL);
>   	if (!trng)
>   		return -ENOMEM;
> 
> We can also move arch/arm64/include/asm/rsi_smc.h to
> include/linux/arm-rsi-smccc.h. There was a suggestion to move these

super minor nit: arm-smccc-rsi.h ?

Cheers
Suzuki


> firmware interfaces out of architecture-specific code:
> 
> https://lore.kernel.org/all/agsNO9cc7H-b0H8L@willie-the-truck
> 
> This will also avoid the #ifdef CONFIG_ARM64
> 
> -aneesh



* Re: [PATCH v7 10/42] KVM: guest_memfd: Ensure pages are not in use before conversion
From: Vlastimil Babka (SUSE) @ 2026-06-08  8:55 UTC (permalink / raw)
  To: ackerleytng, aik, andrew.jones, binbin.wu, brauner, chao.p.peng,
	david, ira.weiny, jmattson, jthoughton, michael.roth, oupton,
	pankaj.gupta, qperret, rick.p.edgecombe, rientjes, shivankg,
	steven.price, tabba, willy, wyihan, yan.y.zhao, forkloop,
	pratyush, suzuki.poulose, aneesh.kumar, liam, Paolo Bonzini,
	Sean Christopherson, Thomas Gleixner, Ingo Molnar,
	Borislav Petkov, Dave Hansen, x86, H. Peter Anvin, Steven Rostedt,
	Masami Hiramatsu, Mathieu Desnoyers, Jonathan Corbet, Shuah Khan,
	Shuah Khan, Vishal Annapurve, Andrew Morton, Chris Li,
	Kairui Song, Kemeng Shi, Nhat Pham, Baoquan He, Barry Song,
	Axel Rasmussen, Yuanchu Xie, Wei Xu, Youngjun Park, Qi Zheng,
	Shakeel Butt, Kiryl Shutsemau, Jason Gunthorpe
  Cc: kvm, linux-kernel, linux-trace-kernel, linux-doc, linux-kselftest,
	linux-mm, linux-coco
In-Reply-To: <20260522-gmem-inplace-conversion-v7-10-2f0fae496530@google.com>

On 5/23/26 02:17, Ackerley Tng via B4 Relay wrote:
> From: Ackerley Tng <ackerleytng@google.com>
> 
> When converting memory to private in guest_memfd, it is necessary to ensure
> that the pages are not currently being accessed by any other part of the
> kernel or userspace to avoid any current user writing to guest private
> memory.
> 
> guest_memfd checks for unexpected refcounts to determine whether a page is
> still in use. The only expected refcounts after unmapping the range
> requested for conversion are those that are held by guest_memfd itself.

Is it sufficient to only check the refcount, and not also freeze it (i.e.
using folio_ref_freeze())? Without freezing, anything (e.g. compaction's
pfn-based scanner) could do a speculative folio_try_get(), and the checked
refcount becomes stale.

Might be ok if we know that no such speculative increment can result in
actually touching the page contents, and the extra refcount and something
inspecting the struct folio won't interfere with anything else. Then it
could be just a comment mentioning why it's safe.

IIRC the compaction's scanning can result in a migration here so it's
probably ok?
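
To make the check-versus-freeze distinction concrete, here is a minimal
user-space sketch using C11 atomics. It is only a model: struct model_ref and
the model_ref_*() helpers are hypothetical names, not kernel APIs. It assumes
the usual semantics, namely that a freeze succeeds only when the count equals
the expected value and then sets it to zero, and that a speculative get
refuses to increment a zero count.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>

/* User-space stand-in for a folio refcount (hypothetical, not struct folio). */
struct model_ref {
	atomic_int count;
};

/* Plain check: only observes the count; a later try_get can still succeed. */
bool model_ref_check(struct model_ref *ref, int expected)
{
	return atomic_load(&ref->count) == expected;
}

/* Freeze: atomically replace 'expected' with 0, or fail and change nothing. */
bool model_ref_freeze(struct model_ref *ref, int expected)
{
	assert(expected > 0);
	return atomic_compare_exchange_strong(&ref->count, &expected, 0);
}

/* Unfreeze: restore the count once the critical operation is done. */
void model_ref_unfreeze(struct model_ref *ref, int count)
{
	atomic_store(&ref->count, count);
}

/* Speculative get: increment unless the count is 0 (i.e. frozen or free). */
bool model_ref_try_get(struct model_ref *ref)
{
	int old = atomic_load(&ref->count);

	while (old != 0) {
		/* On failure 'old' is reloaded with the current value. */
		if (atomic_compare_exchange_weak(&ref->count, &old, old + 1))
			return true;
	}
	return false;
}
```

In this model a successful model_ref_check() says nothing about the next
instant, because model_ref_try_get() can still raise the count afterwards.
After a successful model_ref_freeze(), every model_ref_try_get() fails until
model_ref_unfreeze() is called.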

> Update the kvm_memory_attributes2 structure to include an error_offset
> field. This allows KVM to report the exact offset where a conversion
> failed to userspace. If the safety check fails, return -EAGAIN and copy
> the error_offset back to userspace so that it can potentially retry the
> operation or handle the failure gracefully.
> 
> Suggested-by: David Hildenbrand <david@kernel.org>
> Co-developed-by: Vishal Annapurve <vannapurve@google.com>
> Signed-off-by: Vishal Annapurve <vannapurve@google.com>
> Reviewed-by: Fuad Tabba <tabba@google.com>
> Signed-off-by: Ackerley Tng <ackerleytng@google.com>
> ---
>  include/uapi/linux/kvm.h |  3 ++-
>  virt/kvm/guest_memfd.c   | 68 ++++++++++++++++++++++++++++++++++++++++++++----
>  2 files changed, 65 insertions(+), 6 deletions(-)
> 
> diff --git a/include/uapi/linux/kvm.h b/include/uapi/linux/kvm.h
> index e6bbf68a83813..0b55258573d3d 100644
> --- a/include/uapi/linux/kvm.h
> +++ b/include/uapi/linux/kvm.h
> @@ -1658,7 +1658,8 @@ struct kvm_memory_attributes2 {
>  	__u64 size;
>  	__u64 attributes;
>  	__u64 flags;
> -	__u64 reserved[12];
> +	__u64 error_offset;
> +	__u64 reserved[11];
>  };
>  
>  #define KVM_MEMORY_ATTRIBUTE_PRIVATE           (1ULL << 3)
> diff --git a/virt/kvm/guest_memfd.c b/virt/kvm/guest_memfd.c
> index 426917d22a2b6..2767992955752 100644
> --- a/virt/kvm/guest_memfd.c
> +++ b/virt/kvm/guest_memfd.c
> @@ -572,9 +572,45 @@ static int kvm_gmem_mas_preallocate(struct ma_state *mas, u64 attributes,
>  	return mas_preallocate(mas, xa_mk_value(attributes), GFP_KERNEL);
>  }
>  
> +static bool kvm_gmem_is_safe_for_conversion(struct inode *inode, pgoff_t start,
> +					    size_t nr_pages, pgoff_t *err_index)
> +{
> +	struct address_space *mapping = inode->i_mapping;
> +	const int filemap_get_folios_refcount = 1;
> +	pgoff_t last = start + nr_pages - 1;
> +	struct folio_batch fbatch;
> +	bool safe = true;
> +	pgoff_t next;
> +	int i;
> +
> +	folio_batch_init(&fbatch);
> +
> +	next = start;
> +	while (safe && filemap_get_folios(mapping, &next, last, &fbatch)) {
> +
> +		for (i = 0; i < folio_batch_count(&fbatch); ++i) {
> +			struct folio *folio = fbatch.folios[i];
> +
> +			if (folio_ref_count(folio) !=
> +			    folio_nr_pages(folio) + filemap_get_folios_refcount) {
> +				safe = false;
> +				*err_index = max(start, folio->index);
> +				break;
> +			}
> +		}
> +
> +		folio_batch_release(&fbatch);
> +		cond_resched();
> +	}
> +
> +	return safe;
> +}
> +
>  static int __kvm_gmem_set_attributes(struct inode *inode, pgoff_t start,
> -				     size_t nr_pages, uint64_t attrs)
> +				     size_t nr_pages, uint64_t attrs,
> +				     pgoff_t *err_index)
>  {
> +	bool to_private = attrs & KVM_MEMORY_ATTRIBUTE_PRIVATE;
>  	struct address_space *mapping = inode->i_mapping;
>  	struct gmem_inode *gi = GMEM_I(inode);
>  	pgoff_t end = start + nr_pages;
> @@ -588,8 +624,21 @@ static int __kvm_gmem_set_attributes(struct inode *inode, pgoff_t start,
>  
>  	mas_init(&mas, mt, start);
>  	r = kvm_gmem_mas_preallocate(&mas, attrs, start, nr_pages);
> -	if (r)
> +	if (r) {
> +		*err_index = start;
>  		goto out;
> +	}
> +
> +	if (to_private) {
> +		unmap_mapping_pages(mapping, start, nr_pages, false);
> +
> +		if (!kvm_gmem_is_safe_for_conversion(inode, start, nr_pages,
> +						     err_index)) {
> +			mas_destroy(&mas);
> +			r = -EAGAIN;
> +			goto out;
> +		}
> +	}
>  
>  	/*
>  	 * From this point on guest_memfd has performed necessary
> @@ -609,9 +658,10 @@ static long kvm_gmem_set_attributes(struct file *file, void __user *argp)
>  	struct gmem_file *f = file->private_data;
>  	struct inode *inode = file_inode(file);
>  	struct kvm_memory_attributes2 attrs;
> +	pgoff_t err_index;
>  	size_t nr_pages;
>  	pgoff_t index;
> -	int i;
> +	int i, r;
>  
>  	if (copy_from_user(&attrs, argp, sizeof(attrs)))
>  		return -EFAULT;
> @@ -635,8 +685,16 @@ static long kvm_gmem_set_attributes(struct file *file, void __user *argp)
>  
>  	nr_pages = attrs.size >> PAGE_SHIFT;
>  	index = attrs.offset >> PAGE_SHIFT;
> -	return __kvm_gmem_set_attributes(inode, index, nr_pages,
> -					 attrs.attributes);
> +	r = __kvm_gmem_set_attributes(inode, index, nr_pages, attrs.attributes,
> +				      &err_index);
> +	if (r) {
> +		attrs.error_offset = ((uint64_t)err_index) << PAGE_SHIFT;
> +
> +		if (copy_to_user(argp, &attrs, sizeof(attrs)))
> +			return -EFAULT;
> +	}
> +
> +	return r;
>  }
>  
>  static long kvm_gmem_ioctl(struct file *file, unsigned int ioctl,
> 


^ permalink raw reply

* Re: [PATCH v14 24/44] KVM: arm64: Handle realm MMIO emulation
From: Steven Price @ 2026-06-08  8:49 UTC (permalink / raw)
  To: Gavin Shan, kvm, kvmarm
  Cc: Catalin Marinas, Marc Zyngier, Will Deacon, James Morse,
	Oliver Upton, Suzuki K Poulose, Zenghui Yu, linux-arm-kernel,
	linux-kernel, Joey Gouly, Alexandru Elisei, Christoffer Dall,
	Fuad Tabba, linux-coco, Ganapatrao Kulkarni, Shanker Donthineni,
	Alper Gun, Aneesh Kumar K . V, Emi Kisanuki, Vishal Annapurve,
	WeiLin.Chang, Lorenzo.Pieralisi2
In-Reply-To: <8b648b59-c411-4126-be18-686d2927f24a@redhat.com>

On 28/05/2026 06:03, Gavin Shan wrote:
> Hi Steve,
> 
> On 5/13/26 11:17 PM, Steven Price wrote:
>> MMIO emulation for a realm cannot be done directly with the VM's
>> registers as they are protected from the host. However, for emulatable
>> data aborts, the RMM uses GPRS[0] to provide the read/written value.
>> We can transfer this from/to the equivalent VCPU's register entry and
>> then depend on the generic MMIO handling code in KVM.
>>
>> For a MMIO read, the value is placed in the shared RecExit structure
>> during kvm_handle_mmio_return() rather than in the VCPU's register
>> entry.
>>
>> Signed-off-by: Steven Price <steven.price@arm.com>
>> Reviewed-by: Gavin Shan <gshan@redhat.com>
>> Reviewed-by: Suzuki K Poulose <suzuki.poulose@arm.com>
>> ---
>> Changes since v7:
>>   * New comment for rec_exit_sync_dabt() explaining the call to
>>     vcpu_set_reg().
>> Changes since v5:
>>   * Inject SEA to the guest if an emulatable MMIO access triggers a data
>>     abort.
>>   * kvm_handle_mmio_return() - disable kvm_incr_pc() for a REC (as the PC
>>     isn't under the host's control) and move the REC_ENTER_EMULATED_MMIO
>>     flag setting to this location (as that tells the RMM to skip the
>>     instruction).
>> ---
>>   arch/arm64/kvm/inject_fault.c |  4 +++-
>>   arch/arm64/kvm/mmio.c         | 16 ++++++++++++----
>>   arch/arm64/kvm/rmi-exit.c     | 14 ++++++++++++++
>>   3 files changed, 29 insertions(+), 5 deletions(-)
>>
>> diff --git a/arch/arm64/kvm/inject_fault.c b/arch/arm64/kvm/
>> inject_fault.c
>> index 89982bd3345f..6492397b73d7 100644
>> --- a/arch/arm64/kvm/inject_fault.c
>> +++ b/arch/arm64/kvm/inject_fault.c
>> @@ -228,7 +228,9 @@ static void inject_abt32(struct kvm_vcpu *vcpu,
>> bool is_pabt, u32 addr)
>>     static void __kvm_inject_sea(struct kvm_vcpu *vcpu, bool iabt, u64
>> addr)
>>   {
>> -    if (vcpu_el1_is_32bit(vcpu))
>> +    if (unlikely(vcpu_is_rec(vcpu)))
>> +        vcpu->arch.rec.run->enter.flags |= REC_ENTER_FLAG_INJECT_SEA;
>> +    else if (vcpu_el1_is_32bit(vcpu))
>>           inject_abt32(vcpu, iabt, addr);
>>       else
>>           inject_abt64(vcpu, iabt, addr);
>> diff --git a/arch/arm64/kvm/mmio.c b/arch/arm64/kvm/mmio.c
>> index e2285ed8c91d..6a8cb927fcca 100644
>> --- a/arch/arm64/kvm/mmio.c
>> +++ b/arch/arm64/kvm/mmio.c
>> @@ -6,6 +6,7 @@
>>     #include <linux/kvm_host.h>
>>   #include <asm/kvm_emulate.h>
>> +#include <asm/rmi_smc.h>
>>   #include <trace/events/kvm.h>
>>     #include "trace.h"
>> @@ -138,14 +139,21 @@ int kvm_handle_mmio_return(struct kvm_vcpu *vcpu)
>>           trace_kvm_mmio(KVM_TRACE_MMIO_READ, len, run->mmio.phys_addr,
>>                      &data);
>>           data = vcpu_data_host_to_guest(vcpu, data, len);
>> -        vcpu_set_reg(vcpu, kvm_vcpu_dabt_get_rd(vcpu), data);
>> +
>> +        if (vcpu_is_rec(vcpu))
>> +            vcpu->arch.rec.run->enter.gprs[0] = data;
>> +        else
>> +            vcpu_set_reg(vcpu, kvm_vcpu_dabt_get_rd(vcpu), data);
>>       }
>>         /*
>>        * The MMIO instruction is emulated and should not be re-executed
>>        * in the guest.
>>        */
>> -    kvm_incr_pc(vcpu);
>> +    if (vcpu_is_rec(vcpu))
>> +        vcpu->arch.rec.run->enter.flags |= REC_ENTER_FLAG_EMULATED_MMIO;
>> +    else
>> +        kvm_incr_pc(vcpu);
>>         return 1;
>>   }
>> @@ -167,14 +175,14 @@ int io_mem_abort(struct kvm_vcpu *vcpu,
>> phys_addr_t fault_ipa)
>>        * No valid syndrome? Ask userspace for help if it has
>>        * volunteered to do so, and bail out otherwise.
>>        *
>> -     * In the protected VM case, there isn't much userspace can do
>> +     * In the protected/realm VM case, there isn't much userspace can do
>>        * though, so directly deliver an exception to the guest.
>>        */
>>       if (!kvm_vcpu_dabt_isvalid(vcpu)) {
>>           trace_kvm_mmio_nisv(*vcpu_pc(vcpu), esr,
>>                       kvm_vcpu_get_hfar(vcpu), fault_ipa);
>>   -        if (vcpu_is_protected(vcpu))
>> +        if (vcpu_is_protected(vcpu) || vcpu_is_rec(vcpu))
>>               return kvm_inject_sea_dabt(vcpu, kvm_vcpu_get_hfar(vcpu));
>>             if (test_bit(KVM_ARCH_FLAG_RETURN_NISV_IO_ABORT_TO_USER,
>> diff --git a/arch/arm64/kvm/rmi-exit.c b/arch/arm64/kvm/rmi-exit.c
>> index e7c51b6cf6ce..8ec0d179eba2 100644
>> --- a/arch/arm64/kvm/rmi-exit.c
>> +++ b/arch/arm64/kvm/rmi-exit.c
>> @@ -25,6 +25,20 @@ static int rec_exit_reason_notimpl(struct kvm_vcpu
>> *vcpu)
>>     static int rec_exit_sync_dabt(struct kvm_vcpu *vcpu)
>>   {
>> +    struct realm_rec *rec = &vcpu->arch.rec;
>> +
>> +    /*
>> +     * In the case of a write, copy over gprs[0] to the target GPR,
>> +     * preparing to handle MMIO write fault. The content to be
>> written has
>> +     * been saved to gprs[0] by the RMM (even if another register was
>> used
>> +     * by the guest). In the case of normal memory access this is
>> redundant
>> +     * (the guest will replay the instruction), but the overhead is
>> +     * minimal.
>> +     */
>> +    if (kvm_vcpu_dabt_iswrite(vcpu) && kvm_vcpu_dabt_isvalid(vcpu))
>> +        vcpu_set_reg(vcpu, kvm_vcpu_dabt_get_rd(vcpu),
>> +                 rec->run->exit.gprs[0]);
>> +
> 
> { } is needed here.

Indeed - I'm surprised checkpatch didn't manage to flag that. I'll fix.

Thanks,
Steve

>>       return kvm_handle_guest_abort(vcpu);
>>   }
>>   
> 
> Thanks,
> Gavin
> 


^ permalink raw reply

* Re: [PATCH v7 14/42] KVM: guest_memfd: Handle lru_add fbatch refcounts during conversion safety check
From: Vlastimil Babka (SUSE) @ 2026-06-08  8:45 UTC (permalink / raw)
  To: ackerleytng, aik, andrew.jones, binbin.wu, brauner, chao.p.peng,
	david, ira.weiny, jmattson, jthoughton, michael.roth, oupton,
	pankaj.gupta, qperret, rick.p.edgecombe, rientjes, shivankg,
	steven.price, tabba, willy, wyihan, yan.y.zhao, forkloop,
	pratyush, suzuki.poulose, aneesh.kumar, liam, Paolo Bonzini,
	Sean Christopherson, Thomas Gleixner, Ingo Molnar,
	Borislav Petkov, Dave Hansen, x86, H. Peter Anvin, Steven Rostedt,
	Masami Hiramatsu, Mathieu Desnoyers, Jonathan Corbet, Shuah Khan,
	Shuah Khan, Vishal Annapurve, Andrew Morton, Chris Li,
	Kairui Song, Kemeng Shi, Nhat Pham, Baoquan He, Barry Song,
	Axel Rasmussen, Yuanchu Xie, Wei Xu, Youngjun Park, Qi Zheng,
	Shakeel Butt, Kiryl Shutsemau, Jason Gunthorpe
  Cc: kvm, linux-kernel, linux-trace-kernel, linux-doc, linux-kselftest,
	linux-mm, linux-coco
In-Reply-To: <20260522-gmem-inplace-conversion-v7-14-2f0fae496530@google.com>

On 5/23/26 02:17, Ackerley Tng via B4 Relay wrote:
> From: Ackerley Tng <ackerleytng@google.com>
> 
> When checking if a guest_memfd folio is safe for conversion, its refcount
> is examined. A folio may be present in a per-CPU lru_add fbatch, which
> temporarily increases its refcount. This can lead to a false positive,
> incorrectly indicating that the folio is in use and preventing the
> conversion, even if it is otherwise safe. The conversion process might not
> be on the same CPU that holds the folio in its fbatch, making a simple
> per-CPU check insufficient.
> 
> To address this, drain all CPUs' lru_add fbatches if an unexpectedly high
> refcount is encountered during the safety check. This is performed at most
> once per conversion request. Draining only if the folio in question may be
> lru cached.
> 
> guest_memfd folios are unevictable, so they can only reside in the lru_add
> fbatch. If the folio's refcount is still unsafe after draining, then the
> conversion is truly deemed unsafe.
> 
> Reviewed-by: Fuad Tabba <tabba@google.com>
> Signed-off-by: Ackerley Tng <ackerleytng@google.com>

Acked-by: Vlastimil Babka (SUSE) <vbabka@kernel.org>
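
For readers following along, the "drain at most once, then recheck" policy
from the commit message can be modeled in user space as below. This is a
simplified sketch, not the patch's code: struct model_folio, its batch_refs
field and both functions are hypothetical, and the model drains on any
mismatch rather than first testing whether the folio may be LRU cached.

```c
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical model: batch_refs counts references held by per-CPU batches. */
struct model_folio {
	int refcount;
	int batch_refs;
};

/* Model of draining every CPU's batch: drop all batch-held references. */
void model_drain_all(struct model_folio *folios, size_t n)
{
	for (size_t i = 0; i < n; i++) {
		folios[i].refcount -= folios[i].batch_refs;
		folios[i].batch_refs = 0;
	}
}

/*
 * Return true if every folio has exactly 'expected' references. On the first
 * mismatch, drain once and recheck; a mismatch after the drain is final and
 * its index is reported through err_index.
 */
bool model_safe_for_conversion(struct model_folio *folios, size_t n,
			       int expected, size_t *err_index)
{
	bool drained = false;

	for (size_t i = 0; i < n; i++) {
		if (folios[i].refcount == expected)
			continue;

		if (!drained) {
			model_drain_all(folios, n);
			drained = true;
			if (folios[i].refcount == expected)
				continue;
		}

		*err_index = i;
		return false;
	}
	return true;
}
```

The 'drained' flag is what bounds the cost: however many folios look busy,
the expensive all-CPU drain happens at most once per request.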


^ permalink raw reply

* Re: [PATCH v6 3/4] firmware: smccc: arm-cca-guest: Bind the TSM provider to an SMCCC device
From: Sudeep Holla @ 2026-06-08  8:39 UTC (permalink / raw)
  To: Aneesh Kumar K.V
  Cc: linux-coco, linux-arm-kernel, linux-kernel, Sudeep Holla,
	Catalin Marinas, Greg KH, Jeremy Linton, Jonathan Cameron,
	Lorenzo Pieralisi, Mark Rutland, Will Deacon, Steven Price,
	Suzuki K Poulose
In-Reply-To: <yq5ao6hlzbpa.fsf@kernel.org>

On Mon, Jun 08, 2026 at 01:49:13PM +0530, Aneesh Kumar K.V wrote:
> Sudeep Holla <sudeep.holla@kernel.org> writes:
> 
> > On Thu, Jun 04, 2026 at 06:56:28PM +0530, Aneesh Kumar K.V wrote:
> >> Sudeep Holla <sudeep.holla@kernel.org> writes:
> >> 
> >> ...
> >> 
> >> > +static const struct smccc_device_info smccc_devices[] __initconst = {
> >> > +       {
> >> > +               .func_id        = ARM_SMCCC_TRNG_VERSION,
> >> > +               .requires_smc   = false,
> >> > +               .min_return     = ARM_SMCCC_TRNG_MIN_VERSION,
> >> > +               .device_name    = "arm-smccc-trng",
> >> > +       },
> >> > +};
> >> > +
> >> > +static bool __init
> >> > +smccc_probe_smccc_device(const struct smccc_device_info *smccc_dev)
> >> > +{
> >> > +       struct arm_smccc_res res;
> >> > +       unsigned long ret;
> >> > +
> >> > +       if (!IS_ENABLED(CONFIG_ARM64))
> >> > +               return false;
> >> > +
> >> > +       if (smccc_conduit == SMCCC_CONDUIT_NONE)
> >> > +               return false;
> >> > +
> >> > +       if (smccc_dev->requires_smc && smccc_conduit != SMCCC_CONDUIT_SMC)
> >> > +               return false;
> >> > +
> >> > +       arm_smccc_1_1_invoke(smccc_dev->func_id, &res);
> >> > +       ret = res.a0;
> >> > +
> >> > +       if ((s32)ret < 0)
> >> > +               return false;
> >> > +
> >> > +       return ret >= smccc_dev->min_return;
> >> > +}
> >> > +
> >> >
> >> 
> >> I am not sure we want the check to be as simple as ret < 0. Some
> >> function IDs may return input errors based on the supplied arguments
> >> (for example, RMI_ERROR_INPUT). In those cases, we would likely want
> >> this to be handled via a callback.
> >> 
> >
> > As I mentioned in response to Suzuki, we can defer that to probe of
> > that device. If *_VERSION succeeds, SMCCC core can add that device and
> > leave the rest to the core, keeping the core and bus layer simple IMO.
> >
> >> We also want to use conditional compilation for some function IDs.
> >> Given the callback approach and the #ifdefs, I wonder whether what we
> >> currently have is actually simpler and more flexible.
> >> 
> >
> > I was trying to avoid conditional compilation altogether and hence the
> > reason for keeping it as simple as possible. Also IS_ENABLED(CONFIG_ARM64)
> > in above snippet must come as some condition to this generic probe.
> >
> > Adding any more logic or callback defeats the bus idea here if we need
> > to rely/depend on multiple conditional compilation or callbacks IMO.
> >
> > Let's see if it can work with what we are adding now and may add in
> > near future and then decide.
> >
> 
> If we move all the conditional checks to the driver probe path, then I
> think this can work. Something like the below:
> 

Sounds good to me.

[...]

> We can also move arch/arm64/include/asm/rsi_smc.h to
> include/linux/arm-rsi-smccc.h. There was a suggestion to move these
> firmware interfaces out of architecture-specific code:
> 
> https://lore.kernel.org/all/agsNO9cc7H-b0H8L@willie-the-truck
>

Ah OK, sorry I had missed this.

-- 
Regards,
Sudeep

^ permalink raw reply

* Re: [PATCH v13 15/22] KVM: selftests: Call KVM_TDX_INIT_VCPU when creating a new TDX vcpu
From: Binbin Wu @ 2026-06-08  8:34 UTC (permalink / raw)
  To: Lisa Wang
  Cc: Andrew Jones, Ackerley Tng, Chao Gao, Chenyi Qiang, Dave Hansen,
	Erdem Aktas, Ira Weiny, Isaku Yamahata, Kiryl Shutsemau,
	linux-kselftest, Paolo Bonzini, Pratik R. Sampat, Reinette Chatre,
	Rick Edgecombe, Roger Wang, Ryan Afranji, Sagi Shahar,
	Sean Christopherson, Shuah Khan, Oliver Upton,
	Jeremiah McReynolds, kvm, linux-coco, linux-kernel, x86
In-Reply-To: <20260521-tdx-selftests-v13-v13-15-6983ae4c3a4d@google.com>



On 5/22/2026 7:16 AM, Lisa Wang wrote:
[...]
> diff --git a/tools/testing/selftests/kvm/include/x86/tdx/tdx_util.h b/tools/testing/selftests/kvm/include/x86/tdx/tdx_util.h
> index 9660ea9d2f31..4d01f806b37d 100644
> --- a/tools/testing/selftests/kvm/include/x86/tdx/tdx_util.h
> +++ b/tools/testing/selftests/kvm/include/x86/tdx/tdx_util.h
> @@ -39,6 +39,30 @@ static inline bool is_tdx_vm(struct kvm_vm *vm)
>  	__TEST_ASSERT_VM_VCPU_IOCTL(!ret, #cmd,	ret, vm);		\
>  })
>  
> +#define __tdx_vcpu_ioctl(vcpu, cmd, _flags, arg)			\
> +({									\
> +	int r;								\
> +									\
> +	union {								\
> +		struct kvm_tdx_cmd c;					\
> +		unsigned long raw;					\
> +	} tdx_cmd = { .c = {						\
> +		.id = (cmd),						\
> +		.flags = (u32)(_flags),				\
> +		.data = (u64)(arg),				\

Nit:
The two lines' backslashes are misaligned.

> +	} };								\
> +									\
> +	r = __vcpu_ioctl(vcpu, KVM_MEMORY_ENCRYPT_OP, &tdx_cmd.raw);	\
> +	r ?: tdx_cmd.c.hw_error;					\

Similar issue: the upper bits of hw_error are truncated here.
That said, the TDX KVM code currently never sets hw_error for the vCPU version.
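
A small stand-alone sketch of the truncation, assuming hw_error is a 64-bit
field whose value ends up in an int as in the macro above. The function names
and the sample value are made up for illustration; they are not from the
selftests.

```c
#include <stdbool.h>
#include <stdint.h>

/*
 * Mirrors "r ?: hw_error" when the result lands in an int: only the low
 * 32 bits of hw_error survive the narrowing.
 */
int narrow_status(int r, uint64_t hw_error)
{
	uint32_t low = (uint32_t)hw_error;	/* well-defined: modulo 2^32 */

	return r ? r : (int)low;
}

/* Hypothetical alternative: report failure without narrowing hw_error. */
bool wide_failed(int r, uint64_t hw_error)
{
	return r != 0 || hw_error != 0;
}
```

With an illustrative hw_error of 0x8000000000000000, narrow_status(0, ...)
evaluates to 0 and the failure is silently lost, while wide_failed(0, ...)
still reports it.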

> +})
> +
> +#define tdx_vcpu_ioctl(vcpu, cmd, flags, arg)				\
> +({									\
> +	int ret = __tdx_vcpu_ioctl(vcpu, cmd, flags, arg);		\
> +									\
> +	__TEST_ASSERT_VM_VCPU_IOCTL(!ret, #cmd,	ret, (vcpu)->vm);	\
> +})
> +
>  void tdx_init_vm(struct kvm_vm *vm, u64 attributes);
>  void tdx_vm_setup_boot_code_region(struct kvm_vm *vm);
>  void tdx_vm_setup_boot_parameters_region(struct kvm_vm *vm, u32 nr_runnable_vcpus);

^ permalink raw reply

* [PATCH 3/4] x86/msr: Switch wrmsrl() users to wrmsrq()
From: Juergen Gross @ 2026-06-08  8:28 UTC (permalink / raw)
  To: linux-kernel, x86, linux-perf-users, kvm, linux-coco,
	linux-hyperv, linux-pm
  Cc: Juergen Gross, Peter Zijlstra, Ingo Molnar,
	Arnaldo Carvalho de Melo, Namhyung Kim, Mark Rutland,
	Alexander Shishkin, Jiri Olsa, Ian Rogers, Adrian Hunter,
	James Clark, Thomas Gleixner, Borislav Petkov, Dave Hansen,
	H. Peter Anvin, Tony Luck, Reinette Chatre, Dave Martin,
	James Morse, Babu Moger, Sean Christopherson, Paolo Bonzini,
	Kiryl Shutsemau, Rick Edgecombe, K. Y. Srinivasan, Haiyang Zhang,
	Wei Liu, Dexuan Cui, Long Li, Rafael J. Wysocki, Artem Bityutskiy,
	Artem Bityutskiy, Len Brown
In-Reply-To: <20260608082809.3492719-1-jgross@suse.com>

wrmsrl() is a deprecated synonym for wrmsrq(). Switch its users to
wrmsrq().

Signed-off-by: Juergen Gross <jgross@suse.com>
---
 arch/x86/events/amd/uncore.c          | 2 +-
 arch/x86/events/intel/core.c          | 4 ++--
 arch/x86/kernel/cpu/resctrl/monitor.c | 2 +-
 arch/x86/kernel/process_64.c          | 2 +-
 arch/x86/kvm/pmu.c                    | 6 +++---
 arch/x86/kvm/vmx/tdx.c                | 6 +++---
 drivers/hv/mshv_vtl_main.c            | 2 +-
 drivers/idle/intel_idle.c             | 2 +-
 8 files changed, 13 insertions(+), 13 deletions(-)

diff --git a/arch/x86/events/amd/uncore.c b/arch/x86/events/amd/uncore.c
index 98ef4bf9911a..7dc6af4231cc 100644
--- a/arch/x86/events/amd/uncore.c
+++ b/arch/x86/events/amd/uncore.c
@@ -975,7 +975,7 @@ static void amd_uncore_umc_read(struct perf_event *event)
 	 * that the counter never gets a chance to saturate.
 	 */
 	if (new & BIT_ULL(63 - COUNTER_SHIFT)) {
-		wrmsrl(hwc->event_base, 0);
+		wrmsrq(hwc->event_base, 0);
 		local64_set(&hwc->prev_count, 0);
 	} else {
 		local64_set(&hwc->prev_count, new);
diff --git a/arch/x86/events/intel/core.c b/arch/x86/events/intel/core.c
index dd1e3aa75ee9..e9baa64dc962 100644
--- a/arch/x86/events/intel/core.c
+++ b/arch/x86/events/intel/core.c
@@ -3166,12 +3166,12 @@ static void intel_pmu_config_acr(int idx, u64 mask, u32 reload)
 	}
 
 	if (cpuc->acr_cfg_b[idx] != mask) {
-		wrmsrl(msr_b + msr_offset, mask);
+		wrmsrq(msr_b + msr_offset, mask);
 		cpuc->acr_cfg_b[idx] = mask;
 	}
 	/* Only need to update the reload value when there is a valid config value. */
 	if (mask && cpuc->acr_cfg_c[idx] != reload) {
-		wrmsrl(msr_c + msr_offset, reload);
+		wrmsrq(msr_c + msr_offset, reload);
 		cpuc->acr_cfg_c[idx] = reload;
 	}
 }
diff --git a/arch/x86/kernel/cpu/resctrl/monitor.c b/arch/x86/kernel/cpu/resctrl/monitor.c
index c5ed0bc1f831..e4918c32a822 100644
--- a/arch/x86/kernel/cpu/resctrl/monitor.c
+++ b/arch/x86/kernel/cpu/resctrl/monitor.c
@@ -532,7 +532,7 @@ static void resctrl_abmc_config_one_amd(void *info)
 {
 	union l3_qos_abmc_cfg *abmc_cfg = info;
 
-	wrmsrl(MSR_IA32_L3_QOS_ABMC_CFG, abmc_cfg->full);
+	wrmsrq(MSR_IA32_L3_QOS_ABMC_CFG, abmc_cfg->full);
 }
 
 /*
diff --git a/arch/x86/kernel/process_64.c b/arch/x86/kernel/process_64.c
index b85e715ebb30..d44afbe005bb 100644
--- a/arch/x86/kernel/process_64.c
+++ b/arch/x86/kernel/process_64.c
@@ -708,7 +708,7 @@ __switch_to(struct task_struct *prev_p, struct task_struct *next_p)
 
 	/* Reset hw history on AMD CPUs */
 	if (cpu_feature_enabled(X86_FEATURE_AMD_WORKLOAD_CLASS))
-		wrmsrl(MSR_AMD_WORKLOAD_HRST, 0x1);
+		wrmsrq(MSR_AMD_WORKLOAD_HRST, 0x1);
 
 	return prev_p;
 }
diff --git a/arch/x86/kvm/pmu.c b/arch/x86/kvm/pmu.c
index e218352e3423..aee70e5dc15d 100644
--- a/arch/x86/kvm/pmu.c
+++ b/arch/x86/kvm/pmu.c
@@ -1313,14 +1313,14 @@ static void kvm_pmu_load_guest_pmcs(struct kvm_vcpu *vcpu)
 		pmc = &pmu->gp_counters[i];
 
 		if (pmc->counter != rdpmc(i))
-			wrmsrl(gp_counter_msr(i), pmc->counter);
-		wrmsrl(gp_eventsel_msr(i), pmc->eventsel_hw);
+			wrmsrq(gp_counter_msr(i), pmc->counter);
+		wrmsrq(gp_eventsel_msr(i), pmc->eventsel_hw);
 	}
 	for (i = 0; i < pmu->nr_arch_fixed_counters; i++) {
 		pmc = &pmu->fixed_counters[i];
 
 		if (pmc->counter != rdpmc(INTEL_PMC_FIXED_RDPMC_BASE | i))
-			wrmsrl(fixed_counter_msr(i), pmc->counter);
+			wrmsrq(fixed_counter_msr(i), pmc->counter);
 	}
 }
 
diff --git a/arch/x86/kvm/vmx/tdx.c b/arch/x86/kvm/vmx/tdx.c
index 04ce321ebdf3..cb50e23c39ca 100644
--- a/arch/x86/kvm/vmx/tdx.c
+++ b/arch/x86/kvm/vmx/tdx.c
@@ -823,7 +823,7 @@ static void tdx_prepare_switch_to_host(struct kvm_vcpu *vcpu)
 		return;
 
 	++vcpu->stat.host_state_reload;
-	wrmsrl(MSR_KERNEL_GS_BASE, vt->msr_host_kernel_gs_base);
+	wrmsrq(MSR_KERNEL_GS_BASE, vt->msr_host_kernel_gs_base);
 
 	vt->guest_state_loaded = false;
 }
@@ -1048,10 +1048,10 @@ static void tdx_load_host_xsave_state(struct kvm_vcpu *vcpu)
 
 	/*
 	 * Likewise, even if a TDX hosts didn't support XSS both arms of
-	 * the comparison would be 0 and the wrmsrl would be skipped.
+	 * the comparison would be 0 and the wrmsrq would be skipped.
 	 */
 	if (kvm_host.xss != (kvm_tdx->xfam & kvm_caps.supported_xss))
-		wrmsrl(MSR_IA32_XSS, kvm_host.xss);
+		wrmsrq(MSR_IA32_XSS, kvm_host.xss);
 }
 
 #define TDX_DEBUGCTL_PRESERVED (DEBUGCTLMSR_BTF | \
diff --git a/drivers/hv/mshv_vtl_main.c b/drivers/hv/mshv_vtl_main.c
index f5d27f28d6ad..0d3d4161974f 100644
--- a/drivers/hv/mshv_vtl_main.c
+++ b/drivers/hv/mshv_vtl_main.c
@@ -596,7 +596,7 @@ static int mshv_vtl_get_set_reg(struct hv_register_assoc *regs, bool set)
 		} else {
 			/* Handle MSRs */
 			if (set)
-				wrmsrl(reg_table[i].msr_addr, *reg64);
+				wrmsrq(reg_table[i].msr_addr, *reg64);
 			else
 				rdmsrq(reg_table[i].msr_addr, *reg64);
 		}
diff --git a/drivers/idle/intel_idle.c b/drivers/idle/intel_idle.c
index 15c698291b32..67d5993c7387 100644
--- a/drivers/idle/intel_idle.c
+++ b/drivers/idle/intel_idle.c
@@ -2379,7 +2379,7 @@ static void intel_c1_demotion_toggle(void *enable)
 		msr_val |= NHM_C1_AUTO_DEMOTE | SNB_C1_AUTO_UNDEMOTE;
 	else
 		msr_val &= ~(NHM_C1_AUTO_DEMOTE | SNB_C1_AUTO_UNDEMOTE);
-	wrmsrl(MSR_PKG_CST_CONFIG_CONTROL, msr_val);
+	wrmsrq(MSR_PKG_CST_CONFIG_CONTROL, msr_val);
 }
 
 static ssize_t intel_c1_demotion_store(struct device *dev,
-- 
2.54.0


^ permalink raw reply related

* [PATCH 0/4] x86/msr: Get rid of rdmsrl() and wrmsrl()
From: Juergen Gross @ 2026-06-08  8:28 UTC (permalink / raw)
  To: linux-kernel, x86, linux-perf-users, linux-hyperv, linux-pm, kvm,
	linux-coco
  Cc: Juergen Gross, Peter Zijlstra, Ingo Molnar,
	Arnaldo Carvalho de Melo, Namhyung Kim, Mark Rutland,
	Alexander Shishkin, Jiri Olsa, Ian Rogers, Adrian Hunter,
	James Clark, Thomas Gleixner, Borislav Petkov, Dave Hansen,
	H. Peter Anvin, Tony Luck, Reinette Chatre, Dave Martin,
	James Morse, Babu Moger, K. Y. Srinivasan, Haiyang Zhang, Wei Liu,
	Dexuan Cui, Long Li, Rafael J. Wysocki, Artem Bityutskiy,
	Artem Bityutskiy, Len Brown, Sean Christopherson, Paolo Bonzini,
	Kiryl Shutsemau, Rick Edgecombe

rdmsrl() and wrmsrl() are deprecated aliases of rdmsrq() and wrmsrq().

Switch all users and remove the deprecated variants.

Juergen Gross (4):
  x86/msr: Switch rdmsrl() users to rdmsrq()
  x86/msr: Remove rdmsrl()
  x86/msr: Switch wrmsrl() users to wrmsrq()
  x86/msr: Remove wrmsrl()

 arch/x86/events/amd/uncore.c          | 4 ++--
 arch/x86/events/intel/core.c          | 4 ++--
 arch/x86/include/asm/msr.h            | 5 -----
 arch/x86/kernel/cpu/resctrl/monitor.c | 4 ++--
 arch/x86/kernel/process_64.c          | 2 +-
 arch/x86/kvm/pmu.c                    | 6 +++---
 arch/x86/kvm/vmx/tdx.c                | 6 +++---
 drivers/hv/mshv_vtl_main.c            | 4 ++--
 drivers/idle/intel_idle.c             | 6 +++---
 9 files changed, 18 insertions(+), 23 deletions(-)

-- 
2.54.0


^ permalink raw reply

* Re: [PATCH v6 3/4] firmware: smccc: arm-cca-guest: Bind the TSM provider to an SMCCC device
From: Aneesh Kumar K.V @ 2026-06-08  8:19 UTC (permalink / raw)
  To: Sudeep Holla
  Cc: linux-coco, linux-arm-kernel, linux-kernel, Catalin Marinas,
	Sudeep Holla, Greg KH, Jeremy Linton, Jonathan Cameron,
	Lorenzo Pieralisi, Mark Rutland, Will Deacon, Steven Price,
	Suzuki K Poulose
In-Reply-To: <20260604-juicy-daft-starling-3eec1f@sudeepholla>

Sudeep Holla <sudeep.holla@kernel.org> writes:

> On Thu, Jun 04, 2026 at 06:56:28PM +0530, Aneesh Kumar K.V wrote:
>> Sudeep Holla <sudeep.holla@kernel.org> writes:
>> 
>> ...
>> 
>> > +static const struct smccc_device_info smccc_devices[] __initconst = {
>> > +       {
>> > +               .func_id        = ARM_SMCCC_TRNG_VERSION,
>> > +               .requires_smc   = false,
>> > +               .min_return     = ARM_SMCCC_TRNG_MIN_VERSION,
>> > +               .device_name    = "arm-smccc-trng",
>> > +       },
>> > +};
>> > +
>> > +static bool __init
>> > +smccc_probe_smccc_device(const struct smccc_device_info *smccc_dev)
>> > +{
>> > +       struct arm_smccc_res res;
>> > +       unsigned long ret;
>> > +
>> > +       if (!IS_ENABLED(CONFIG_ARM64))
>> > +               return false;
>> > +
>> > +       if (smccc_conduit == SMCCC_CONDUIT_NONE)
>> > +               return false;
>> > +
>> > +       if (smccc_dev->requires_smc && smccc_conduit != SMCCC_CONDUIT_SMC)
>> > +               return false;
>> > +
>> > +       arm_smccc_1_1_invoke(smccc_dev->func_id, &res);
>> > +       ret = res.a0;
>> > +
>> > +       if ((s32)ret < 0)
>> > +               return false;
>> > +
>> > +       return ret >= smccc_dev->min_return;
>> > +}
>> > +
>> >
>> 
>> I am not sure we want the check to be as simple as ret < 0. Some
>> function IDs may return input errors based on the supplied arguments
>> (for example, RMI_ERROR_INPUT). In those cases, we would likely want
>> this to be handled via a callback.
>> 
>
> As I mentioned in response to Suzuki, we can defer that to probe of
> that device. If *_VERSION succeeds, SMCCC core can add that device and
> leave the rest to the core, keeping the core and bus layer simple IMO.
>
>> We also want to use conditional compilation for some function IDs.
>> Given the callback approach and the #ifdefs, I wonder whether what we
>> currently have is actually simpler and more flexible.
>> 
>
> I was trying to avoid conditional compilation altogether and hence the
> reason for keeping it as simple as possible. Also IS_ENABLED(CONFIG_ARM64)
> in above snippet must come as some condition to this generic probe.
>
> Adding any more logic or callback defeats the bus idea here if we need
> to rely/depend on multiple conditional compilation or callbacks IMO.
>
> Let's see if it can work with what we are adding now and may add in
> near future and then decide.
>

If we move all the conditional checks to the driver probe path, then I
think this can work. Something like the below:

struct smccc_device_info {
	u32 func_id;
	bool requires_smc;
	const char *device_name;
};

static const struct smccc_device_info smccc_devices[] __initconst = {
	{
		.func_id        = ARM_SMCCC_TRNG_VERSION,
		.requires_smc   = false,
		.device_name    = "arm-smccc-trng",
	},

	{
		.func_id        = RSI_ABI_VERSION,
		.requires_smc   = true,
		.device_name    = RSI_DEV_NAME,
	},
};

static bool __init smccc_probe_smccc_device(const struct smccc_device_info *smccc_dev)
{
	unsigned long ret;
	struct arm_smccc_res res;

	if (smccc_conduit == SMCCC_CONDUIT_NONE)
		return false;

	if (smccc_dev->requires_smc && smccc_conduit != SMCCC_CONDUIT_SMC)
		return false;

	arm_smccc_1_1_invoke(smccc_dev->func_id, &res);
	ret = res.a0;

	if ((s32)ret == SMCCC_RET_NOT_SUPPORTED)
		return false;

	return true;
}

static int __init smccc_devices_init(void)
{
	struct arm_smccc_device *sdev;
	const struct smccc_device_info *smccc_dev;

	for (int i = 0; i < ARRAY_SIZE(smccc_devices); i++) {
		smccc_dev = &smccc_devices[i];

		if (!smccc_probe_smccc_device(smccc_dev))
			continue;

		sdev = arm_smccc_device_register(smccc_dev->device_name);
		if (IS_ERR(sdev))
			pr_err("%s: could not register device: %ld\n",
			       smccc_dev->device_name, PTR_ERR(sdev));
	}

	return 0;
}
device_initcall(smccc_devices_init);
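
To illustrate how the relaxed check above differs from the earlier
"(s32)ret < 0" test, here is a user-space model of the two predicates. The
function names and MODEL_RET_NOT_SUPPORTED are placeholders for this sketch;
the -1 value is assumed to match SMCCC_RET_NOT_SUPPORTED.

```c
#include <stdbool.h>
#include <stdint.h>

/* Assumed to match SMCCC_RET_NOT_SUPPORTED; placeholder name for the model. */
#define MODEL_RET_NOT_SUPPORTED	(-1)

/* Earlier proposal: any value that is negative as a 32-bit quantity fails. */
bool rejects_any_negative(unsigned long a0)
{
	return ((uint32_t)a0 & 0x80000000u) != 0;
}

/* Relaxed check: only "not supported" stops the device from being created. */
bool rejects_only_not_supported(unsigned long a0)
{
	return (uint32_t)a0 == (uint32_t)MODEL_RET_NOT_SUPPORTED;
}
```

For a hypothetical a0 of -3 (some argument-related error), the first
predicate rejects the device in the core, while the second lets it be
registered so that the driver's own probe can decide what to do.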

with the diff to hw_random/smccc_trng

modified   arch/arm64/include/asm/archrandom.h
@@ -12,7 +12,7 @@
 
 extern bool smccc_trng_available;
 
-static inline bool __init smccc_probe_trng(void)
+static inline bool smccc_probe_trng(void)
 {
 	struct arm_smccc_res res;
 
modified   drivers/char/hw_random/arm_smccc_trng.c
@@ -19,6 +19,8 @@
 #include <linux/arm-smccc.h>
 #include <linux/arm-smccc-bus.h>
 
+#include <asm/archrandom.h>
+
 #ifdef CONFIG_ARM64
 #define ARM_SMCCC_TRNG_RND	ARM_SMCCC_TRNG_RND64
 #define MAX_BITS_PER_CALL	(3 * 64UL)
@@ -98,6 +100,10 @@ static int smccc_trng_probe(struct arm_smccc_device *sdev)
 {
 	struct hwrng *trng;
 
+	/* validate the minimum version requirement */
+	if (!smccc_probe_trng())
+		return -ENODEV;
+
 	trng = devm_kzalloc(&sdev->dev, sizeof(*trng), GFP_KERNEL);
 	if (!trng)
 		return -ENOMEM;
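
To illustrate the split proposed above (a generic, table-driven presence
check at init time, plus a driver-side version check at probe time), here
is a minimal user-space sketch. It is not kernel code: the conduit and the
firmware call are mocked, the function IDs are illustrative values, the
RSI device name is a placeholder, and the minimum-version encoding
(major << 16) is an assumption made for the example.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define SMCCC_RET_NOT_SUPPORTED	(-1)
#define TRNG_MIN_VERSION	0x10000UL	/* assumed encoding of v1.0 */

/* Illustrative function IDs, used only by the mock below. */
#define FID_TRNG_VERSION	0x84000050u
#define FID_RSI_VERSION		0xC4000190u

enum conduit { CONDUIT_NONE, CONDUIT_SMC, CONDUIT_HVC };

struct device_info {
	uint32_t func_id;
	bool requires_smc;
	const char *device_name;
};

/* Hypothetical stand-in for arm_smccc_1_1_invoke(): returns a0 only. */
typedef unsigned long (*invoke_fn)(uint32_t func_id);

/* Mock firmware: an old TRNG (below v1.0) and an RSI that answers. */
static unsigned long mock_firmware(uint32_t func_id)
{
	switch (func_id) {
	case FID_TRNG_VERSION:
		return 0x9;
	case FID_RSI_VERSION:
		return 0x10000;
	default:
		return (unsigned long)SMCCC_RET_NOT_SUPPORTED;
	}
}

/* Generic step: only asks whether the function exists on this conduit. */
static bool generic_probe(const struct device_info *info, enum conduit c,
			  invoke_fn invoke)
{
	assert(info && invoke);

	if (c == CONDUIT_NONE)
		return false;
	if (info->requires_smc && c != CONDUIT_SMC)
		return false;

	return (int32_t)invoke(info->func_id) != SMCCC_RET_NOT_SUPPORTED;
}

/* Driver-side step: reject error codes and versions below the minimum. */
static bool trng_version_ok(unsigned long a0)
{
	if ((int32_t)a0 < 0)
		return false;

	return a0 >= TRNG_MIN_VERSION;
}
```

In this sketch the generic probe accepts the mocked TRNG because the call
is implemented, and only the driver-side check rejects it for being older
than the assumed minimum. That mirrors the intent of returning -ENODEV
from smccc_trng_probe() rather than teaching the bus code about versions.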

We can also move arch/arm64/include/asm/rsi_smc.h to
include/linux/arm-rsi-smccc.h. There was a suggestion to move these
firmware interfaces out of architecture-specific code:

https://lore.kernel.org/all/agsNO9cc7H-b0H8L@willie-the-truck

This will also avoid the #ifdef CONFIG_ARM64.

-aneesh


* Re: [PATCH v13 13/22] KVM: selftests: Set first memory region as shared if guest_memfd
From: Binbin Wu @ 2026-06-08  8:03 UTC (permalink / raw)
  To: Lisa Wang
  Cc: Andrew Jones, Ackerley Tng, Chao Gao, Chenyi Qiang, Dave Hansen,
	Erdem Aktas, Ira Weiny, Isaku Yamahata, Kiryl Shutsemau,
	linux-kselftest, Paolo Bonzini, Pratik R. Sampat, Reinette Chatre,
	Rick Edgecombe, Roger Wang, Ryan Afranji, Sagi Shahar,
	Sean Christopherson, Shuah Khan, Oliver Upton,
	Jeremiah McReynolds, kvm, linux-coco, linux-kernel, x86
In-Reply-To: <20260521-tdx-selftests-v13-v13-13-6983ae4c3a4d@google.com>

On 5/22/2026 7:16 AM, Lisa Wang wrote:
> Set the initial state of the first memory region as shared if it is
> backed by guest_memfd, so that the KVM selftest framework functions can
> populate mmap()-ed guest_memfd memory the same way memory from other
> memory providers are populated.
> 
> For CoCo VMs, pages that need to be private are explicitly set to
> private before executing the VM.
> 
> Signed-off-by: Lisa Wang <wyihan@google.com>
> ---
>  tools/testing/selftests/kvm/lib/kvm_util.c | 16 ++++++++++------
>  1 file changed, 10 insertions(+), 6 deletions(-)
> 
> diff --git a/tools/testing/selftests/kvm/lib/kvm_util.c b/tools/testing/selftests/kvm/lib/kvm_util.c
> index 9a29540fff40..1bab7d76a59c 100644
> --- a/tools/testing/selftests/kvm/lib/kvm_util.c
> +++ b/tools/testing/selftests/kvm/lib/kvm_util.c
> @@ -484,8 +484,10 @@ struct kvm_vm *__vm_create(struct vm_shape shape, u32 nr_runnable_vcpus,
>  	u64 nr_pages = vm_nr_pages_required(shape.mode, nr_runnable_vcpus,
>  						 nr_extra_pages);
>  	struct userspace_mem_region *slot0;
> +	u64 gmem_flags = 0;
>  	struct kvm_vm *vm;
> -	int i, flags;
> +	int flags = 0;
> +	int i;
>  
>  	kvm_set_files_rlimit(nr_runnable_vcpus);
>  
> @@ -495,14 +497,16 @@ struct kvm_vm *__vm_create(struct vm_shape shape, u32 nr_runnable_vcpus,
>  	vm = ____vm_create(shape);
>  
>  	/*
> -	 * Force GUEST_MEMFD for the primary memory region if necessary, e.g.
> -	 * for CoCo VMs that require GUEST_MEMFD backed private memory.
> +	 * Force GUEST_MEMFD for the primary memory region if necessary, and
> +	 * initialize it as shared so the selftest framework can populate it
> +	 * exactly like other memory providers.
>  	 */
> -	flags = 0;
> -	if (is_guest_memfd_required(shape))
> +	if (is_guest_memfd_required(shape)) {
>  		flags |= KVM_MEM_GUEST_MEMFD;
> +		gmem_flags |= GUEST_MEMFD_FLAG_INIT_SHARED;
> +	}
>  
> -	vm_userspace_mem_region_add(vm, VM_MEM_SRC_ANONYMOUS, 0, 0, nr_pages, flags);
> +	vm_mem_add(vm, VM_MEM_SRC_ANONYMOUS, 0, 0, nr_pages, flags, -1, 0, gmem_flags);

The build failed due to this:

lib/kvm_util.c: In function ‘__vm_create’:
lib/kvm_util.c:507:9: error: too many arguments to function ‘vm_mem_add’
  507 |         vm_mem_add(vm, VM_MEM_SRC_ANONYMOUS, 0, 0, nr_pages, flags, -1, 0, gmem_flags);
      |         ^~~~~~~~~~
In file included from lib/kvm_util.c:9:
include/kvm_util.h:714:6: note: declared here
  714 | void vm_mem_add(struct kvm_vm *vm, enum vm_mem_backing_src_type src_type,
      |      ^~~~~~~~~~
lib/kvm_util.c: At top level:
cc1: note: unrecognized command-line option ‘-Wno-gnu-variable-sized-type-not-at-end’ may have been intended to silence earlier diagnostics

It seems the patch set never adds a gmem_flags parameter to vm_mem_add(),
so the declaration in kvm_util.h still takes one argument fewer than this
call passes.
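
For illustration only, here is a self-contained sketch of what plumbing
the extra flags through could look like. Everything in it is hypothetical:
region_add_sketch(), primary_region_flags() and struct region_sketch are
made-up stand-ins rather than the selftest API, and the two flag macros
are local placeholders for KVM_MEM_GUEST_MEMFD and
GUEST_MEMFD_FLAG_INIT_SHARED.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Local placeholders for the real flag bits; values are illustrative. */
#define SKETCH_KVM_MEM_GUEST_MEMFD	(1u << 2)
#define SKETCH_GMEM_FLAG_INIT_SHARED	(1ull << 1)

/* Hypothetical record of what a region was created with. */
struct region_sketch {
	uint64_t gpa;
	uint32_t slot;
	uint64_t npages;
	uint32_t flags;
	int gmem_fd;
	uint64_t gmem_offset;
	uint64_t gmem_flags;
};

/*
 * Hypothetical helper whose last parameter carries the guest_memfd
 * creation flags, so a caller can pass them without an arity mismatch.
 */
static struct region_sketch region_add_sketch(uint64_t gpa, uint32_t slot,
					      uint64_t npages, uint32_t flags,
					      int gmem_fd, uint64_t gmem_offset,
					      uint64_t gmem_flags)
{
	struct region_sketch r = {
		.gpa = gpa,
		.slot = slot,
		.npages = npages,
		.flags = flags,
		.gmem_fd = gmem_fd,
		.gmem_offset = gmem_offset,
		.gmem_flags = gmem_flags,
	};

	assert(npages > 0);
	return r;
}

/* Mirrors the flag selection in the quoted hunk of __vm_create(). */
static void primary_region_flags(bool gmem_required, uint32_t *flags,
				 uint64_t *gmem_flags)
{
	*flags = 0;
	*gmem_flags = 0;

	if (gmem_required) {
		*flags |= SKETCH_KVM_MEM_GUEST_MEMFD;
		*gmem_flags |= SKETCH_GMEM_FLAG_INIT_SHARED;
	}
}
```

Whether the real fix is a new parameter or a separate helper is up to the
series; the sketch only shows that the declaration and every caller have
to change together.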

>  	for (i = 0; i < NR_MEM_REGIONS; i++)
>  		vm->memslots[i] = 0;
>  
> 

