xen-devel.lists.xenproject.org archive mirror
* Limitations for Running Xen on KVM Arm64
@ 2025-10-30  6:12 haseeb.ashraf
  2025-10-30 13:41 ` haseeb.ashraf
                   ` (2 more replies)
  0 siblings, 3 replies; 18+ messages in thread
From: haseeb.ashraf @ 2025-10-30  6:12 UTC (permalink / raw)
  To: xen-devel@lists.xenproject.org


Hello Xen development community,

I wanted to discuss the limitations that I have faced while running Xen on KVM on Arm64 machines. I hope I am using the right mailing list.

The biggest limitation is the costly emulation of the tlbi vmalls12e1is instruction in KVM. The cost grows exponentially with the IPA size that KVM exposes to the VM hosting Xen: with the IPA size reduced to 40 bits the issue is barely observable, but with a 48-bit IPA the emulation is 256x more costly. Xen uses this instruction very frequently, the instruction is trapped and emulated by KVM, and performance is nowhere near bare-metal hardware. With a 48-bit IPA, creating a domU with just 128M of RAM can take up to 200 minutes. I have identified two places in Xen that are problematic with respect to the usage of this instruction, and I hope to either reduce its frequency or use a more targeted TLBI instruction instead of invalidating all stage-1 and stage-2 translations.
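The 256x factor is just the growth of the page count with IPA bits; a minimal sketch of the arithmetic (helper names invented for illustration, assuming 4K pages):

```c
#include <assert.h>
#include <stdint.h>

/* Illustrative arithmetic only (not KVM code): the emulation of
 * vmalls12e1is walks the whole IPA space, so its cost scales with the
 * number of addressable pages, which doubles per extra IPA bit. */
static uint64_t pages_in_ipa_space(unsigned int ipa_bits,
                                   unsigned int page_shift)
{
    return UINT64_C(1) << (ipa_bits - page_shift);
}

/* 40-bit vs 48-bit IPA with 4K pages: 2^(48-40) = 256x more pages. */
static uint64_t cost_ratio(unsigned int big_ipa, unsigned int small_ipa)
{
    return pages_in_ipa_space(big_ipa, 12) /
           pages_in_ipa_space(small_ipa, 12);
}
```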


  1. During the creation of a domU, the domU memory is first mapped into dom0, the images are copied in, and the memory is then unmapped. During unmapping, the TLB entries are invalidated one by one for each page removed by the XENMEM_remove_from_physmap hypercall. Here is the code where the decision to flush the TLBs is made on removal of a mapping (the diff shows my experiment of skipping the flush):

diff --git a/xen/arch/arm/mmu/p2m.c b/xen/arch/arm/mmu/p2m.c
index 7642dbc7c5..e96ff92314 100644
--- a/xen/arch/arm/mmu/p2m.c
+++ b/xen/arch/arm/mmu/p2m.c
@@ -1103,7 +1103,8 @@ static int __p2m_set_entry(struct p2m_domain *p2m,

    if ( removing_mapping )
        /* Flush can be deferred if the entry is removed */
-        p2m->need_flush |= !!lpae_is_valid(orig_pte);
+        //p2m->need_flush |= !!lpae_is_valid(orig_pte);
+        p2m->need_flush |= false;
    else
    {
        lpae_t pte = mfn_to_p2m_entry(smfn, t, a);

  2. This can be optimized either by introducing a batched version of this hypercall, i.e. XENMEM_remove_from_physmap_batch, and flushing the TLBs only once for all the pages being removed, or by using a TLBI instruction that invalidates only the intended range of addresses instead of all stage-1 and stage-2 translations. I understand that no single TLBI instruction performs both stage-1 and stage-2 invalidation for a given address range, but perhaps a combination of instructions could be used, such as:

// switch to the target VMID
tlbi rvae1, x0      // x0 = guest VA: first invalidate stage-1 TLB by VA for the current VMID
tlbi ripas2e1, x1   // x1 = guest IPA: then invalidate stage-2 TLB by IPA range for the current VMID
dsb ish
isb
// switch back the VMID
  3. This is the part I am not quite sure about, and I was hoping that someone with Arm expertise could sign off on it so that I can work on the implementation in Xen. This would be an optimization not only for virtualized hardware but also for Xen on arm64 machines in general.


  4. The second place in Xen where this instruction is problematic is when multiple vCPUs of the same domain are juggled on a single pCPU: the TLBs are invalidated every time a different vCPU runs on the pCPU. I do not know how this can be optimized; any help here is appreciated.

diff --git a/xen/arch/arm/mmu/p2m.c b/xen/arch/arm/mmu/p2m.c
index 7642dbc7c5..e96ff92314 100644
--- a/xen/arch/arm/mmu/p2m.c
+++ b/xen/arch/arm/mmu/p2m.c
@@ -247,7 +247,7 @@ void p2m_restore_state(struct vcpu *n)
      * when running multiple vCPU of the same domain on a single pCPU.
      */
     if ( *last_vcpu_ran != INVALID_VCPU_ID && *last_vcpu_ran != n->vcpu_id )
-        flush_guest_tlb_local();
+        ; // flush_guest_tlb_local();

     *last_vcpu_ran = n->vcpu_id;
 }
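The batched-hypercall idea above can be sketched as follows. This is a hypothetical illustration, not the Xen p2m API: the structure and function names are invented, and XENMEM_remove_from_physmap_batch does not exist yet.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Invented names, not Xen code. The point: record that a flush is
 * needed while unmapping, and issue a single TLB invalidation for the
 * whole batch instead of one per page. */
struct p2m { bool need_flush; };

static unsigned int flushes; /* counts invalidations, for illustration */

static void p2m_remove_one(struct p2m *p2m, unsigned long gfn)
{
    (void)gfn;               /* ... clear the stage-2 entry for gfn ... */
    p2m->need_flush = true;  /* defer: do not flush yet */
}

static void p2m_flush_if_needed(struct p2m *p2m)
{
    if (p2m->need_flush) {
        /* one TLBI (by range, or VMID-wide) covering the whole batch */
        flushes++;
        p2m->need_flush = false;
    }
}

static void remove_from_physmap_batch(struct p2m *p2m,
                                      const unsigned long *gfns, size_t n)
{
    for (size_t i = 0; i < n; i++)
        p2m_remove_one(p2m, gfns[i]);
    p2m_flush_if_needed(p2m);  /* one flush instead of n */
}
```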

Thanks & Regards,
Haseeb Ashraf


^ permalink raw reply related	[flat|nested] 18+ messages in thread

* Re: Limitations for Running Xen on KVM Arm64
  2025-10-30  6:12 Limitations for Running Xen on KVM Arm64 haseeb.ashraf
@ 2025-10-30 13:41 ` haseeb.ashraf
  2025-10-30 18:33   ` Mohamed Mediouni
  2025-10-31 15:17 ` Mohamed Mediouni
  2025-11-01  2:04 ` Demi Marie Obenour
  2 siblings, 1 reply; 18+ messages in thread
From: haseeb.ashraf @ 2025-10-30 13:41 UTC (permalink / raw)
  To: xen-devel@lists.xenproject.org; +Cc: julien@xen.org, Volodymyr_Babchuk@epam.com


Adding julien@xen.org and replying to the questions he asked over #XenDevel:matrix.org.

> Can you add some details on why the implementation cannot be optimized in KVM? Asking because I have never seen such an issue when running Xen on QEMU (without nested virt enabled).

AFAIK, when Xen runs on QEMU without virtualization, instructions are emulated by QEMU, while with KVM an instruction should ideally run directly on hardware except in special cases (those trapped via FGT/CGT), such as this one: KVM maintains shadow page tables for each VM, traps these instructions, and emulates them with a callback such as handle_vmalls12e1is(). The way this callback is implemented, it has to iterate over the whole address space and clean up the page tables, which is a costly operation. Regardless, this should still be optimized in Xen, as invalidating a selective range would be much better than invalidating an entire 48-bit address space.

> Some details about your platform and use case would be helpful. I am interested to know whether you are using all the features for nested virt.

I am using AWS G4. My use case is to run Xen as a guest hypervisor. Yes, most of the features are enabled, except VHE and those disabled by KVM.

Regards,
Haseeb Ashraf

^ permalink raw reply related	[flat|nested] 18+ messages in thread

* Re: Limitations for Running Xen on KVM Arm64
  2025-10-30 13:41 ` haseeb.ashraf
@ 2025-10-30 18:33   ` Mohamed Mediouni
  2025-10-30 23:55     ` Julien Grall
  0 siblings, 1 reply; 18+ messages in thread
From: Mohamed Mediouni @ 2025-10-30 18:33 UTC (permalink / raw)
  To: haseeb.ashraf
  Cc: xen-devel@lists.xenproject.org, julien@xen.org,
	Volodymyr_Babchuk@epam.com



> On 30. Oct 2025, at 14:41, haseeb.ashraf@siemens.com wrote:
> 
> Adding @julien@xen.org and replying to his questions he asked over #XenDevel:matrix.org.
> 
> can you add some details why the implementation cannot be optimized in KVM? Asking because I have never seen such issue when running Xen on QEMU (without nested virt enabled).
> AFAIK when Xen is run on QEMU without virtualization, then instructions are emulated in QEMU while with KVM, ideally the instruction should run directly on hardware except in some special cases (those trapped by FGT/CGT). Such as this one where KVM maintains shadow page tables for each VM. It traps these instructions and emulates them with callback such as handle_vmalls12e1is(). The way this callback is implemented, it has to iterate over the whole address space and clean-up the page tables which is a costly operation. Regardless of this, it should still be optimized in Xen as invalidating a selective range would be much better than invalidating a whole range of 48-bit address space.
> Some details about your platform and use case would be helpful. I am interested to know whether you are using all the features for nested virt.
> I am using AWS G4. My use case is to run Xen as guest hypervisor. Yes, most of the features are enabled except VHE or those which are disabled by KVM.


Hello,

You mean Graviton4 (for reference to others, from a bare metal instance)? Interesting to see people caring about nested virt there :) - and hopefully using it wasn’t too much of a pain for you to deal with.

> 
> ; switch to current VMID
> tlbi rvae1, guest_vaddr ; first invalidate stage-1 TLB by guest VA for current VMID
> tlbi ripas2e1, guest_paddr ; then invalidate stage-2 TLB by IPA range for current VMID
> dsb ish
> isb
> ; switch back the VMID
>     • This is where I am not quite sure and I was hoping that if someone with Arm expertise could sign off on this so that I can work on its implementation in Xen. This will be an optimization not only for virtualized hardware but also in general for Xen on arm64 machines.
> 

Note that the documentation for TLBIP RIPAS2E1 says:

> The invalidation is not required to apply to caching structures that combine stage 1 and stage 2 translation table entries.
>     • The second place in Xen where this is problematic is when multiple vCPUs of the same domain juggle on single pCPU, TLBs are invalidated everytime a different vCPU runs on a pCPU. I do not know how this can be optimized. Any support on this is appreciated.


One way to handle this is to make every TLB invalidate issued within the VM a broadcast TLB invalidate (HCR_EL2.FB is what you’re looking for) and then forego that TLB maintenance, as it is no longer necessary. This should not have a practical performance impact.
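As a sketch of this suggestion: the bit position is from the Arm architecture definition of HCR_EL2 (FB is bit 9), while the helper name is invented; a real hypervisor would write the result to HCR_EL2 at EL2.

```c
#include <assert.h>
#include <stdint.h>

/* Sketch only: HCR_EL2.FB (Force Broadcast) is bit 9. When set, TLB,
 * instruction-cache, and branch-predictor maintenance issued by the
 * guest in the local form is upgraded to the Inner Shareable
 * (broadcast) form, so the hypervisor does not need to flush when a
 * vCPU migrates to a different pCPU. */
#define HCR_FB (UINT64_C(1) << 9)

static uint64_t hcr_with_forced_broadcast(uint64_t hcr)
{
    return hcr | HCR_FB;
}
```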

Thank you,
-Mohamed
> 
> 
> diff --git a/xen/arch/arm/mmu/p2m.c b/xen/arch/arm/mmu/p2m.c
> index 7642dbc7c5..e96ff92314 100644
> --- a/xen/arch/arm/mmu/p2m.c
> +++ b/xen/arch/arm/mmu/p2m.c
> @@ -247,7 +247,7 @@ void p2m_restore_state(struct vcpu *n)
>       * when running multiple vCPU of the same domain on a single pCPU.
>       */
>      if ( *last_vcpu_ran != INVALID_VCPU_ID && *last_vcpu_ran != n->vcpu_id )
> -        flush_guest_tlb_local();
> +        ; // flush_guest_tlb_local();
>       *last_vcpu_ran = n->vcpu_id;
>  } 
> 
> Thanks & Regards,
> Haseeb Ashraf




^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: Limitations for Running Xen on KVM Arm64
  2025-10-30 18:33   ` Mohamed Mediouni
@ 2025-10-30 23:55     ` Julien Grall
  2025-10-31  0:20       ` Mohamed Mediouni
  0 siblings, 1 reply; 18+ messages in thread
From: Julien Grall @ 2025-10-30 23:55 UTC (permalink / raw)
  To: Mohamed Mediouni, haseeb.ashraf
  Cc: xen-devel@lists.xenproject.org, Volodymyr_Babchuk@epam.com

Hi Mohamed,

On 30/10/2025 18:33, Mohamed Mediouni wrote:
> 
> 
>> On 30. Oct 2025, at 14:41, haseeb.ashraf@siemens.com wrote:
>>
>> Adding @julien@xen.org and replying to his questions he asked over #XenDevel:matrix.org.
>>
>> can you add some details why the implementation cannot be optimized in KVM? Asking because I have never seen such issue when running Xen on QEMU (without nested virt enabled).
>> AFAIK when Xen is run on QEMU without virtualization, then instructions are emulated in QEMU while with KVM, ideally the instruction should run directly on hardware except in some special cases (those trapped by FGT/CGT). Such as this one where KVM maintains shadow page tables for each VM. It traps these instructions and emulates them with callback such as handle_vmalls12e1is(). The way this callback is implemented, it has to iterate over the whole address space and clean-up the page tables which is a costly operation. Regardless of this, it should still be optimized in Xen as invalidating a selective range would be much better than invalidating a whole range of 48-bit address space.
>> Some details about your platform and use case would be helpful. I am interested to know whether you are using all the features for nested virt.
>> I am using AWS G4. My use case is to run Xen as guest hypervisor. Yes, most of the features are enabled except VHE or those which are disabled by KVM.
> 
> 
> Hello,
> 
> You mean Graviton4 (for reference to others, from a bare metal instance)? Interesting to see people caring about nested virt there :) - and hopefully using it wasn’t too much of a pain for you to deal with.
> 
>>
>> ; switch to current VMID
>> tlbi rvae1, guest_vaddr ; first invalidate stage-1 TLB by guest VA for current VMID
>> tlbi ripas2e1, guest_paddr ; then invalidate stage-2 TLB by IPA range for current VMID
>> dsb ish
>> isb
>> ; switch back the VMID
>>      • This is where I am not quite sure and I was hoping that if someone with Arm expertise could sign off on this so that I can work on its implementation in Xen. This will be an optimization not only for virtualized hardware but also in general for Xen on arm64 machines.
>>
> 
> Note that the documentation says
> 
>> The invalidation is not required to apply to caching structures that combine stage 1 and stage 2 translation table entries.
> 
> for TLBIP RIPAS2E1
>>      • The second place in Xen where this is problematic is when multiple vCPUs of the same domain juggle on single pCPU, TLBs are invalidated everytime a different vCPU runs on a pCPU. I do not know how this can be optimized. Any support on this is appreciated.
> 
> 
> One way to handle this is every invalidate within the VM a broadcast TLB invalidate (HCR_EL2.FB is what you’re looking for) and then forego that TLB maintenance as it’s no longer necessary. This should not have a practical performance impact.

To confirm my understanding, you are suggesting to rely on the L2 guest 
to send the TLB flush. Did I understand correctly? If so, wouldn't this 
open a security hole, because a misbehaving guest may never send the 
TLB flush?

Cheers,

-- 
Julien Grall



^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: Limitations for Running Xen on KVM Arm64
  2025-10-30 23:55     ` Julien Grall
@ 2025-10-31  0:20       ` Mohamed Mediouni
  2025-10-31  0:38         ` Mohamed Mediouni
  2025-10-31  9:18         ` Julien Grall
  0 siblings, 2 replies; 18+ messages in thread
From: Mohamed Mediouni @ 2025-10-31  0:20 UTC (permalink / raw)
  To: Julien Grall
  Cc: haseeb.ashraf, xen-devel@lists.xenproject.org,
	Volodymyr_Babchuk@epam.com



> On 31. Oct 2025, at 00:55, Julien Grall <julien@xen.org> wrote:
> 
> Hi Mohamed,
> 
> On 30/10/2025 18:33, Mohamed Mediouni wrote:
>>> On 30. Oct 2025, at 14:41, haseeb.ashraf@siemens.com wrote:
>>> 
>>> Adding @julien@xen.org and replying to his questions he asked over #XenDevel:matrix.org.
>>> 
>>> can you add some details why the implementation cannot be optimized in KVM? Asking because I have never seen such issue when running Xen on QEMU (without nested virt enabled).
>>> AFAIK when Xen is run on QEMU without virtualization, then instructions are emulated in QEMU while with KVM, ideally the instruction should run directly on hardware except in some special cases (those trapped by FGT/CGT). Such as this one where KVM maintains shadow page tables for each VM. It traps these instructions and emulates them with callback such as handle_vmalls12e1is(). The way this callback is implemented, it has to iterate over the whole address space and clean-up the page tables which is a costly operation. Regardless of this, it should still be optimized in Xen as invalidating a selective range would be much better than invalidating a whole range of 48-bit address space.
>>> Some details about your platform and use case would be helpful. I am interested to know whether you are using all the features for nested virt.
>>> I am using AWS G4. My use case is to run Xen as guest hypervisor. Yes, most of the features are enabled except VHE or those which are disabled by KVM.
>> Hello,
>> You mean Graviton4 (for reference to others, from a bare metal instance)? Interesting to see people caring about nested virt there :) - and hopefully using it wasn’t too much of a pain for you to deal with.
>>> 
>>> ; switch to current VMID
>>> tlbi rvae1, guest_vaddr ; first invalidate stage-1 TLB by guest VA for current VMID
>>> tlbi ripas2e1, guest_paddr ; then invalidate stage-2 TLB by IPA range for current VMID
>>> dsb ish
>>> isb
>>> ; switch back the VMID
>>>     • This is where I am not quite sure and I was hoping that if someone with Arm expertise could sign off on this so that I can work on its implementation in Xen. This will be an optimization not only for virtualized hardware but also in general for Xen on arm64 machines.
>>> 
>> Note that the documentation says
>>> The invalidation is not required to apply to caching structures that combine stage 1 and stage 2 translation table entries.
>> for TLBIP RIPAS2E1
>>>     • The second place in Xen where this is problematic is when multiple vCPUs of the same domain juggle on single pCPU, TLBs are invalidated everytime a different vCPU runs on a pCPU. I do not know how this can be optimized. Any support on this is appreciated.
>> One way to handle this is every invalidate within the VM a broadcast TLB invalidate (HCR_EL2.FB is what you’re looking for) and then forego that TLB maintenance as it’s no longer necessary. This should not have a practical performance impact.
> 
> To confirm my understanding, you are suggesting to rely on the L2 guest to send the TLB flush. Did I understanding correctly? If so, wouldn't this open a security hole because a misbehaving guest may never send the TLB flush?
> 
Hello,

HCR_EL2.FB can be used to make every TLB invalidate the guest issues (which is a stage-1 one) a broadcast TLB invalidate.

If a TLB invalidate wasn’t issued, then the cached stage-1 translations could already have been out of date on the core the VM was running on in the first place.

If a core-local TLB invalidate was issued, this bit forces it to become a broadcast, so that you don’t have to worry about flushing TLBs when moving a vCPU between different pCPUs. KVM operates with this bit set.

As for the hypervisor, it is responsible for issuing the appropriate TLB invalidates when it changes stage-2 mappings. This includes a stage-2 TLB invalidate and, if the CPU core keeps combined TLB entries, further maintenance. Whether a core does that can be queried through FEAT_nTLBPA.

On processors without FEAT_nTLBPA, it should be assumed that the TLB contains non-coherent caching structures, and the corresponding stage-1 maintenance should also be done when invalidating stage-2 entries.
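The FEAT_nTLBPA check described here could be sketched like this; the field position ([51:48] in ID_AA64MMFR1_EL1) is from the Arm ID-register layout, and the helper names are invented. On hardware the register would be read with "mrs xN, ID_AA64MMFR1_EL1"; here it is a plain parameter.

```c
#include <assert.h>
#include <stdint.h>

/* ID_AA64MMFR1_EL1.nTLBPA is the 4-bit field at bits [51:48]. A
 * non-zero value means the intermediate caching of translation walks
 * includes no non-coherent physical-address caches, so the extra
 * stage-1 maintenance on a stage-2 invalidate can be skipped. */
static unsigned int ntlbpa_field(uint64_t id_aa64mmfr1)
{
    return (unsigned int)((id_aa64mmfr1 >> 48) & 0xf);
}

static int need_stage1_flush_with_stage2(uint64_t id_aa64mmfr1)
{
    /* nTLBPA == 0 (e.g. Neoverse V2): assume non-coherent combined
     * structures and do the stage-1 maintenance as well. */
    return ntlbpa_field(id_aa64mmfr1) == 0;
}
```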

Thank you,
-Mohamed
> Cheers,
> 
> -- 
> Julien Grall
> 



^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: Limitations for Running Xen on KVM Arm64
  2025-10-31  0:20       ` Mohamed Mediouni
@ 2025-10-31  0:38         ` Mohamed Mediouni
  2025-10-31  9:18         ` Julien Grall
  1 sibling, 0 replies; 18+ messages in thread
From: Mohamed Mediouni @ 2025-10-31  0:38 UTC (permalink / raw)
  To: Julien Grall
  Cc: haseeb.ashraf, xen-devel@lists.xenproject.org,
	Volodymyr_Babchuk@epam.com



> On 31. Oct 2025, at 01:20, Mohamed Mediouni <mohamed@unpredictable.fr> wrote:
> 
> 
> 
>> On 31. Oct 2025, at 00:55, Julien Grall <julien@xen.org> wrote:
>> 
>> Hi Mohamed,
>> 
>> On 30/10/2025 18:33, Mohamed Mediouni wrote:
>>>> On 30. Oct 2025, at 14:41, haseeb.ashraf@siemens.com wrote:
>>>> 
>>>> Adding @julien@xen.org and replying to his questions he asked over #XenDevel:matrix.org.
>>>> 
>>>> can you add some details why the implementation cannot be optimized in KVM? Asking because I have never seen such issue when running Xen on QEMU (without nested virt enabled).
>>>> AFAIK when Xen is run on QEMU without virtualization, then instructions are emulated in QEMU while with KVM, ideally the instruction should run directly on hardware except in some special cases (those trapped by FGT/CGT). Such as this one where KVM maintains shadow page tables for each VM. It traps these instructions and emulates them with callback such as handle_vmalls12e1is(). The way this callback is implemented, it has to iterate over the whole address space and clean-up the page tables which is a costly operation. Regardless of this, it should still be optimized in Xen as invalidating a selective range would be much better than invalidating a whole range of 48-bit address space.
>>>> Some details about your platform and use case would be helpful. I am interested to know whether you are using all the features for nested virt.
>>>> I am using AWS G4. My use case is to run Xen as guest hypervisor. Yes, most of the features are enabled except VHE or those which are disabled by KVM.
>>> Hello,
>>> You mean Graviton4 (for reference to others, from a bare metal instance)? Interesting to see people caring about nested virt there :) - and hopefully using it wasn’t too much of a pain for you to deal with.
>>>> 
>>>> ; switch to current VMID
>>>> tlbi rvae1, guest_vaddr ; first invalidate stage-1 TLB by guest VA for current VMID
>>>> tlbi ripas2e1, guest_paddr ; then invalidate stage-2 TLB by IPA range for current VMID
>>>> dsb ish
>>>> isb
>>>> ; switch back the VMID
>>>>    • This is where I am not quite sure and I was hoping that if someone with Arm expertise could sign off on this so that I can work on its implementation in Xen. This will be an optimization not only for virtualized hardware but also in general for Xen on arm64 machines.
>>>> 
>>> Note that the documentation says
>>>> The invalidation is not required to apply to caching structures that combine stage 1 and stage 2 translation table entries.
>>> for TLBIP RIPAS2E1
>>>>    • The second place in Xen where this is problematic is when multiple vCPUs of the same domain juggle on single pCPU, TLBs are invalidated everytime a different vCPU runs on a pCPU. I do not know how this can be optimized. Any support on this is appreciated.
>>> One way to handle this is every invalidate within the VM a broadcast TLB invalidate (HCR_EL2.FB is what you’re looking for) and then forego that TLB maintenance as it’s no longer necessary. This should not have a practical performance impact.
>> 
>> To confirm my understanding, you are suggesting to rely on the L2 guest to send the TLB flush. Did I understanding correctly? If so, wouldn't this open a security hole because a misbehaving guest may never send the TLB flush?
>> 
> Hello,
> 
> HCR_EL2.FB can be used to make every TLB invalidate the guest issues (which is a stage1 one) a broadcast TLB invalidate.
> 
> If a TLB invalidate wasn’t issued, then well the cached stage1 translations could have been out of date on the core the VM was running on in the first place.
> 
> If a core-local TLB invalidate was issued, this bit forces it to become a broadcast, so that you don’t have to worry about flushing TLBs when moving a vCPU between different pCPUs. KVM operates with this bit set.
> 
> As of the hypervisor, it’s responsible to issue the appropriate TLB invalidates as necessary if it changes stage2 mappings. This includes a stage-2 TLB invalidate and further necessary maintenance if the CPU core does do combined TLB entries. Whether a CPU core does that can be queried through FEAT_nTLBPA.
> 
> On processors without FEAT_nTLBPA, it should be assumed that there are non-coherent caching structures within the TLB. And as such also do the corresponding stage-1 maintenance when invalidating stage2 entries.
> 
> Thank you,
> -Mohamed

On the Neoverse V3 core for example, there’s this note in the TRM:

https://developer.arm.com/documentation/107734/0002/AArch64-registers/AArch64-Identification-registers-summary/ID-AA64MMFR1-EL1--AArch64-Memory-Model-Feature-Register-1?lang=en

> nTLBPA: The intermediate caching of translation table walks does not include non-coherent physical translation caches.

Which means that the heavyweight stage-1 flush accompanying a stage-2 invalidate is no longer necessary on that core.

On Neoverse V2, this bit is defined as RES0 instead, and as such invalidating the whole of stage 1 is in practice necessary on Neoverse V2 when doing stage-2 invalidates from a hypervisor (short of more heavyweight tracking…).

What KVM currently does today in arch/arm64/kvm/hyp/nvhe/tlb.c (~line 158, __kvm_tlb_flush_vmid_ipa): it just always flushes stage 1 when doing a stage-2 flush.

Thank you,
-Mohamed





^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: Limitations for Running Xen on KVM Arm64
  2025-10-31  0:20       ` Mohamed Mediouni
  2025-10-31  0:38         ` Mohamed Mediouni
@ 2025-10-31  9:18         ` Julien Grall
  2025-10-31 11:54           ` Mohamed Mediouni
  2025-10-31 13:01           ` haseeb.ashraf
  1 sibling, 2 replies; 18+ messages in thread
From: Julien Grall @ 2025-10-31  9:18 UTC (permalink / raw)
  To: Mohamed Mediouni
  Cc: haseeb.ashraf, xen-devel@lists.xenproject.org,
	Volodymyr_Babchuk@epam.com



On 31/10/2025 00:20, Mohamed Mediouni wrote:
> 
> 
>> On 31. Oct 2025, at 00:55, Julien Grall <julien@xen.org> wrote:
>>
>> Hi Mohamed,
>>
>> On 30/10/2025 18:33, Mohamed Mediouni wrote:
>>>> On 30. Oct 2025, at 14:41, haseeb.ashraf@siemens.com wrote:
>>>>
>>>> Adding @julien@xen.org and replying to his questions he asked over #XenDevel:matrix.org.
>>>>
>>>> can you add some details why the implementation cannot be optimized in KVM? Asking because I have never seen such issue when running Xen on QEMU (without nested virt enabled).
>>>> AFAIK when Xen is run on QEMU without virtualization, then instructions are emulated in QEMU while with KVM, ideally the instruction should run directly on hardware except in some special cases (those trapped by FGT/CGT). Such as this one where KVM maintains shadow page tables for each VM. It traps these instructions and emulates them with callback such as handle_vmalls12e1is(). The way this callback is implemented, it has to iterate over the whole address space and clean-up the page tables which is a costly operation. Regardless of this, it should still be optimized in Xen as invalidating a selective range would be much better than invalidating a whole range of 48-bit address space.
>>>> Some details about your platform and use case would be helpful. I am interested to know whether you are using all the features for nested virt.
>>>> I am using AWS G4. My use case is to run Xen as guest hypervisor. Yes, most of the features are enabled except VHE or those which are disabled by KVM.
>>> Hello,
>>> You mean Graviton4 (for reference to others, from a bare metal instance)? Interesting to see people caring about nested virt there :) - and hopefully using it wasn’t too much of a pain for you to deal with.
>>>>
>>>> ; switch to current VMID
>>>> tlbi rvae1, guest_vaddr ; first invalidate stage-1 TLB by guest VA for current VMID
>>>> tlbi ripas2e1, guest_paddr ; then invalidate stage-2 TLB by IPA range for current VMID
>>>> dsb ish
>>>> isb
>>>> ; switch back the VMID
>>>>      • This is where I am not quite sure and I was hoping that if someone with Arm expertise could sign off on this so that I can work on its implementation in Xen. This will be an optimization not only for virtualized hardware but also in general for Xen on arm64 machines.
>>>>
>>> Note that the documentation says
>>>> The invalidation is not required to apply to caching structures that combine stage 1 and stage 2 translation table entries.
>>> for TLBIP RIPAS2E1
>>>>      • The second place in Xen where this is problematic is when multiple vCPUs of the same domain juggle on single pCPU, TLBs are invalidated everytime a different vCPU runs on a pCPU. I do not know how this can be optimized. Any support on this is appreciated.
>>> One way to handle this is every invalidate within the VM a broadcast TLB invalidate (HCR_EL2.FB is what you’re looking for) and then forego that TLB maintenance as it’s no longer necessary. This should not have a practical performance impact.
>>
>> To confirm my understanding, you are suggesting to rely on the L2 guest to send the TLB flush. Did I understanding correctly? If so, wouldn't this open a security hole because a misbehaving guest may never send the TLB flush?
>>
> Hello,
> 
> HCR_EL2.FB can be used to make every TLB invalidate the guest issues (which is a stage1 one) a broadcast TLB invalidate.

Xen already sets HCR_EL2.FB. But I believe this is only solving the 
problem where the vCPU is moved to another pCPU. This doesn't solve the 
problem where two vCPUs from the same VM is sharing the same pCPU.

Per the Arm Arm, each CPU has its own private TLBs. So we have to 
flush between vCPUs of the same domain to avoid translations from vCPU 1 
"leaking" to vCPU 2 (they may have conflicting page-tables).

KVM has similar logic; see "last_vcpu_ran" and 
"__kvm_flush_cpu_context()". That said... they are using "vmalle1" 
whereas we are using "vmalls12e1". So maybe we can relax it. Not sure if 
this would make any difference for the performance though.

Cheers,

-- 
Julien Grall



^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: Limitations for Running Xen on KVM Arm64
  2025-10-31  9:18         ` Julien Grall
@ 2025-10-31 11:54           ` Mohamed Mediouni
  2025-11-01 17:20             ` Julien Grall
  2025-10-31 13:01           ` haseeb.ashraf
  1 sibling, 1 reply; 18+ messages in thread
From: Mohamed Mediouni @ 2025-10-31 11:54 UTC (permalink / raw)
  To: Julien Grall
  Cc: haseeb.ashraf, xen-devel@lists.xenproject.org,
	Volodymyr_Babchuk@epam.com



> On 31. Oct 2025, at 10:18, Julien Grall <julien@xen.org> wrote:
> 
> 
> 
> On 31/10/2025 00:20, Mohamed Mediouni wrote:
>>> On 31. Oct 2025, at 00:55, Julien Grall <julien@xen.org> wrote:
>>> 
>>> Hi Mohamed,
>>> 
>>> On 30/10/2025 18:33, Mohamed Mediouni wrote:
>>>>> On 30. Oct 2025, at 14:41, haseeb.ashraf@siemens.com wrote:
>>>>> 
>>>>> Adding @julien@xen.org and replying to the questions he asked over #XenDevel:matrix.org.
>>>>> 
>>>>> can you add some details why the implementation cannot be optimized in KVM? Asking because I have never seen such issue when running Xen on QEMU (without nested virt enabled).
>>>>> AFAIK when Xen is run on QEMU without virtualization, then instructions are emulated in QEMU, while with KVM, ideally the instruction should run directly on hardware except in some special cases (those trapped by FGT/CGT), such as this one, where KVM maintains shadow page tables for each VM. It traps these instructions and emulates them with a callback such as handle_vmalls12e1is(). The way this callback is implemented, it has to iterate over the whole address space and clean up the page tables, which is a costly operation. Regardless of this, it should still be optimized in Xen, as invalidating a selective range would be much better than invalidating a whole range of 48-bit address space.
>>>>> Some details about your platform and use case would be helpful. I am interested to know whether you are using all the features for nested virt.
>>>>> I am using AWS G4. My use case is to run Xen as guest hypervisor. Yes, most of the features are enabled except VHE or those which are disabled by KVM.
>>>> Hello,
>>>> You mean Graviton4 (for reference to others, from a bare metal instance)? Interesting to see people caring about nested virt there :) - and hopefully using it wasn’t too much of a pain for you to deal with.
>>>>> 
>>>>> ; switch to current VMID
>>>>> tlbi rvae1, guest_vaddr ; first invalidate stage-1 TLB by guest VA for current VMID
>>>>> tlbi ripas2e1, guest_paddr ; then invalidate stage-2 TLB by IPA range for current VMID
>>>>> dsb ish
>>>>> isb
>>>>> ; switch back the VMID
>>>>>     • This is where I am not quite sure and I was hoping that someone with Arm expertise could sign off on this so that I can work on its implementation in Xen. This will be an optimization not only for virtualized hardware but also in general for Xen on arm64 machines.
>>>>> 
>>>> Note that the documentation says
>>>>> The invalidation is not required to apply to caching structures that combine stage 1 and stage 2 translation table entries.
>>>> for TLBIP RIPAS2E1
>>>>>     • The second place in Xen where this is problematic is when multiple vCPUs of the same domain juggle on a single pCPU: TLBs are invalidated every time a different vCPU runs on a pCPU. I do not know how this can be optimized. Any support on this is appreciated.
>>>> One way to handle this is to make every invalidate within the VM a broadcast TLB invalidate (HCR_EL2.FB is what you’re looking for) and then forego that TLB maintenance as it’s no longer necessary. This should not have a practical performance impact.
>>> 
>>> To confirm my understanding, you are suggesting to rely on the L2 guest to send the TLB flush. Did I understand correctly? If so, wouldn't this open a security hole because a misbehaving guest may never send the TLB flush?
>>> 
>> Hello,
>> HCR_EL2.FB can be used to make every TLB invalidate the guest issues (which is a stage1 one) a broadcast TLB invalidate.
> 
> Xen already sets HCR_EL2.FB. But I believe this is only solving the problem where the vCPU is moved to another pCPU. This doesn't solve the problem where two vCPUs from the same VM are sharing the same pCPU.
> 
> Per the Arm Arm, each CPU has its own private TLBs. So we have to flush between vCPUs of the same domain to avoid translations from vCPU 1 "leaking" to vCPU 2 (they may have conflicting page-tables).
Hm… it depends on whether the VM uses CnP or not (and whether the HW supports it)… (Linux does…)
> KVM has a similar logic see "last_vcpu_ran" and "__kvm_flush_cpu_context()". That said... they are using "vmalle1" whereas we are using "vmalls12e1". So maybe we can relax it. Not sure if this would make any difference for the performance though.
vmalle1 avoids the problem here (because it only invalidates stage-1 translations). 
> Cheers,
> 
> -- 
> Julien Grall
> 
> 




* Re: Limitations for Running Xen on KVM Arm64
  2025-10-31  9:18         ` Julien Grall
  2025-10-31 11:54           ` Mohamed Mediouni
@ 2025-10-31 13:01           ` haseeb.ashraf
  2025-11-01 18:23             ` Julien Grall
  1 sibling, 1 reply; 18+ messages in thread
From: haseeb.ashraf @ 2025-10-31 13:01 UTC (permalink / raw)
  To: Julien Grall, Mohamed Mediouni
  Cc: xen-devel@lists.xenproject.org, Volodymyr_Babchuk@epam.com,
	Driscoll, Dan, Bachtel, Andrew, fahad.arslan@siemens.com,
	noor.ahsan@siemens.com, brian.sheppard@siemens.com


Hello,

Thanks for your reply.

You mean Graviton4 (for reference to others, from a bare metal instance)? Interesting to see people caring about nested virt there :) - and hopefully using it wasn’t too much of a pain for you to deal with.
Yes, I am using Graviton4 (r8g.metal-24xl). Nope, it wasn't much of an issue to use G4.
KVM has a similar logic see "last_vcpu_ran" and "__kvm_flush_cpu_context()". That said... they are using "vmalle1" whereas we are using "vmalls12e1". So maybe we can relax it. Not sure if this would make any difference for the performance though.
I have seen no such performance issue with nested KVM. For Xen, if this can be relaxed from vmalls12e1 to vmalle1, this would still be a huge performance improvement. I used Ftrace to get execution time of each of these handler functions:
handle_vmalls12e1is() min-max = 1464441 - 9495486 us
handle_tlbi_el1() min-max = 10 - 27 us

So, to summarize: using HCR_EL2.FB (which Xen already enables?) and then using vmalle1 instead of vmalls12e1 should resolve issue 2 for vCPUs switching on pCPUs.

Coming back to issue 1, what do you think about creating a batch version of the hypercall XENMEM_remove_from_physmap (other batch versions exist, such as for XENMEM_add_to_physmap) and doing the TLB invalidation only once per hypercall? I just realized that ripas2e1 is a range TLBI instruction which is only supported from Armv8.4 onwards, indicated by ID_AA64ISAR0_EL1.TLB == 2. So, on older architectures, full stage-2 invalidation would be required. For an architecture-independent solution, creating a batch version seems to be a better way.
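As a side note, that capability check boils down to decoding one ID register field. Below is a minimal sketch (the macro and helper names are hypothetical; the field layout, ID_AA64ISAR0_EL1.TLB in bits [59:56] with value 2 indicating range TLBI support, follows the Arm ARM):

```c
#include <stdbool.h>
#include <stdint.h>

/* ID_AA64ISAR0_EL1.TLB lives in bits [59:56]; value 2 indicates
 * support for the outer-shareable and range TLBI instructions. */
#define ID_AA64ISAR0_TLB_SHIFT  56
#define ID_AA64ISAR0_TLB_MASK   0xfULL
#define ID_AA64ISAR0_TLB_RANGE  2

/* Hypothetical helper: decide whether the range TLBI instructions
 * (e.g. TLBI RIPAS2E1IS) may be used, given a raw ID_AA64ISAR0_EL1
 * value. */
static bool supports_tlbi_range(uint64_t isar0)
{
    return ((isar0 >> ID_AA64ISAR0_TLB_SHIFT) & ID_AA64ISAR0_TLB_MASK)
           == ID_AA64ISAR0_TLB_RANGE;
}
```

On real hardware the raw value would come from `mrs` on ID_AA64ISAR0_EL1; it is passed in here so the decode logic can be checked in isolation.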

Regards,
Haseeb
________________________________
From: Julien Grall <julien@xen.org>
Sent: Friday, October 31, 2025 2:18 PM
To: Mohamed Mediouni <mohamed@unpredictable.fr>
Cc: Ashraf, Haseeb (DI SW EDA HAV SLS EPS RTOS LIN) <haseeb.ashraf@siemens.com>; xen-devel@lists.xenproject.org <xen-devel@lists.xenproject.org>; Volodymyr_Babchuk@epam.com <Volodymyr_Babchuk@epam.com>
Subject: Re: Limitations for Running Xen on KVM Arm64



On 31/10/2025 00:20, Mohamed Mediouni wrote:
>
>
>> On 31. Oct 2025, at 00:55, Julien Grall <julien@xen.org> wrote:
>>
>> Hi Mohamed,
>>
>> On 30/10/2025 18:33, Mohamed Mediouni wrote:
>>>> On 30. Oct 2025, at 14:41, haseeb.ashraf@siemens.com wrote:
>>>>
>>>> Adding @julien@xen.org and replying to his questions he asked over #XenDevel:matrix.org.
>>>>
>>>> can you add some details why the implementation cannot be optimized in KVM? Asking because I have never seen such issue when running Xen on QEMU (without nested virt enabled).
>>>> AFAIK when Xen is run on QEMU without virtualization, then instructions are emulated in QEMU while with KVM, ideally the instruction should run directly on hardware except in some special cases (those trapped by FGT/CGT). Such as this one where KVM maintains shadow page tables for each VM. It traps these instructions and emulates them with callback such as handle_vmalls12e1is(). The way this callback is implemented, it has to iterate over the whole address space and clean-up the page tables which is a costly operation. Regardless of this, it should still be optimized in Xen as invalidating a selective range would be much better than invalidating a whole range of 48-bit address space.
>>>> Some details about your platform and use case would be helpful. I am interested to know whether you are using all the features for nested virt.
>>>> I am using AWS G4. My use case is to run Xen as guest hypervisor. Yes, most of the features are enabled except VHE or those which are disabled by KVM.
>>> Hello,
>>> You mean Graviton4 (for reference to others, from a bare metal instance)? Interesting to see people caring about nested virt there :) - and hopefully using it wasn’t too much of a pain for you to deal with.
>>>>
>>>> ; switch to current VMID
>>>> tlbi rvae1, guest_vaddr ; first invalidate stage-1 TLB by guest VA for current VMID
>>>> tlbi ripas2e1, guest_paddr ; then invalidate stage-2 TLB by IPA range for current VMID
>>>> dsb ish
>>>> isb
>>>> ; switch back the VMID
>>>>      • This is where I am not quite sure and I was hoping that someone with Arm expertise could sign off on this so that I can work on its implementation in Xen. This will be an optimization not only for virtualized hardware but also in general for Xen on arm64 machines.
>>>>
>>> Note that the documentation says
>>>> The invalidation is not required to apply to caching structures that combine stage 1 and stage 2 translation table entries.
>>> for TLBIP RIPAS2E1
>>>>      • The second place in Xen where this is problematic is when multiple vCPUs of the same domain juggle on a single pCPU: TLBs are invalidated every time a different vCPU runs on a pCPU. I do not know how this can be optimized. Any support on this is appreciated.
>>> One way to handle this is to make every invalidate within the VM a broadcast TLB invalidate (HCR_EL2.FB is what you’re looking for) and then forego that TLB maintenance as it’s no longer necessary. This should not have a practical performance impact.
>>
>> To confirm my understanding, you are suggesting to rely on the L2 guest to send the TLB flush. Did I understand correctly? If so, wouldn't this open a security hole because a misbehaving guest may never send the TLB flush?
>>
> Hello,
>
> HCR_EL2.FB can be used to make every TLB invalidate the guest issues (which is a stage1 one) a broadcast TLB invalidate.

Xen already sets HCR_EL2.FB. But I believe this is only solving the
problem where the vCPU is moved to another pCPU. This doesn't solve the
problem where two vCPUs from the same VM are sharing the same pCPU.

Per the Arm Arm, each CPU has its own private TLBs. So we have to
flush between vCPUs of the same domain to avoid translations from vCPU 1
"leaking" to vCPU 2 (they may have conflicting page-tables).

KVM has similar logic; see "last_vcpu_ran" and
"__kvm_flush_cpu_context()". That said... they are using "vmalle1"
whereas we are using "vmalls12e1". So maybe we can relax it. Not sure if
this would make any difference for the performance though.

Cheers,

--
Julien Grall




* Re: Limitations for Running Xen on KVM Arm64
  2025-10-30  6:12 Limitations for Running Xen on KVM Arm64 haseeb.ashraf
  2025-10-30 13:41 ` haseeb.ashraf
@ 2025-10-31 15:17 ` Mohamed Mediouni
  2025-11-01  2:04 ` Demi Marie Obenour
  2 siblings, 0 replies; 18+ messages in thread
From: Mohamed Mediouni @ 2025-10-31 15:17 UTC (permalink / raw)
  To: haseeb.ashraf; +Cc: xen-devel@lists.xenproject.org



> On 30. Oct 2025, at 07:12, haseeb.ashraf@siemens.com wrote:
> 
>     • This can be optimized by either introducing a batch version of this hypercall i.e., XENMEM_remove_from_physmap_batch and flushing TLBs only once for all pages being removed
> OR
> by using a TLBI instruction that only invalidates the intended range of addresses instead of the whole stage-1 and stage-2 translations. I understand that a single TLBI instruction does not exist that can perform both stage-1 and stage-2 invalidations for a given address range but maybe a combination of instructions can be used such as:
> ; switch to current VMID
> tlbi rvae1, guest_vaddr ; first invalidate stage-1 TLB by guest VA for current VMID
> tlbi ripas2e1, guest_paddr ; then invalidate stage-2 TLB by IPA range for current VMID
> dsb ish
> isb
> ; switch back the VMID
>     • This is where I am not quite sure and I was hoping that if someone with Arm expertise could sign off on this so that I can work on its implementation in Xen. This will be an optimization not only for virtualized hardware but also in general for Xen on arm64 machines.


There’s no visibility on what’s going on at stage-1. We don’t know the guest VAs that map to the given IPA so doing the full stage-1 TLB flush is the only option if FEAT_nTLBPA isn’t present (and FEAT_nTLBPA is not present on Neoverse V2).

If FEAT_nTLBPA is present (such as Neoverse V3), then you don’t need the stage-1 TLB invalidate in this code path.

> So, on older architectures, full stage-2 invalidation would be required. For an architecture independent solution, creating a batch version seems to be a better way.

Might as well have both, although the range invalidate for stage-2 is most likely enough to resolve performance issues in your case.




* Re: Limitations for Running Xen on KVM Arm64
  2025-10-30  6:12 Limitations for Running Xen on KVM Arm64 haseeb.ashraf
  2025-10-30 13:41 ` haseeb.ashraf
  2025-10-31 15:17 ` Mohamed Mediouni
@ 2025-11-01  2:04 ` Demi Marie Obenour
  2 siblings, 0 replies; 18+ messages in thread
From: Demi Marie Obenour @ 2025-11-01  2:04 UTC (permalink / raw)
  To: haseeb.ashraf@siemens.com, xen-devel@lists.xenproject.org



On 10/30/25 02:12, haseeb.ashraf@siemens.com wrote:
> Hello Xen development community,
> 
> I wanted to discuss the limitations that I have faced while running Xen on KVM on Arm64 machines. I hope I am using the right mailing list.
> 
> The biggest limitation is the costly emulation of instruction tlbi vmalls12e1is in KVM. The cost is exponentially proportional to the IPA size exposed by KVM for VM hosting Xen. If I reduce the IPA size to 40-bits in KVM, then this issue is not much observable but with the IPA size of 48-bits, it is 256x more costly than the former one. Xen uses this instruction too frequently and this instruction is trapped and emulated by KVM, and performance is not as good as on bare-metal hardware. With 48-bit IPA, it can take up to 200 minutes for domu creation with just 128M RAM. I have identified two places in Xen which are problematic w.r.t the usage of this instruction and hoping to reduce the frequency of this instruction or use a more relevant TLBI instruction instead of invalidating whole stage-1 and stage-2 translations.

Why the exponential scaling?  It should be possible for KVM to fall
back to a full TLB flush, which should be O(1) in the size of the
address space.  It might have terrible constant factors though.
-- 
Sincerely,
Demi Marie Obenour (she/her/hers)



* Re: Limitations for Running Xen on KVM Arm64
  2025-10-31 11:54           ` Mohamed Mediouni
@ 2025-11-01 17:20             ` Julien Grall
  0 siblings, 0 replies; 18+ messages in thread
From: Julien Grall @ 2025-11-01 17:20 UTC (permalink / raw)
  To: Mohamed Mediouni
  Cc: haseeb.ashraf, xen-devel@lists.xenproject.org,
	Volodymyr_Babchuk@epam.com

Hi,

On 31/10/2025 11:54, Mohamed Mediouni wrote:
>> Per the Arm Arm each CPU have their own private TLBs. So we have to flush between vCPU of the same domains to avoid translations from vCPU 1 to "leak" to the vCPU 2 (they may have confliected page-tables).
> Hm… it depends on whether the VM uses CnP or not (and whether the HW supports it)… (Linux does…)

Skimming through the Arm Arm, it seems that CnP is a per page-table/ASID 
decision. So I think it would be difficult to take advantage of this 
knowledge in Xen unless we start trapping access to TTBRn_EL1, which is 
likely going to be expensive.

Obviously, if someone trusts and knows their VM then they could rely
on it. But that's not something I would want to accept in upstream
Xen at the moment.

>> KVM has a similar logic see "last_vcpu_ran" and "__kvm_flush_cpu_context()". That said... they are using "vmalle1" whereas we are using "vmalls12e1". So maybe we can relax it. Not sure if this would make any difference for the performance though.
> vmalle1 avoids the problem here (because it only invalidates stage-1 translations).

I saw Haseeb provided some good numbers. I think switching to vmalle1 is 
a no-brainer.

Cheers,

-- 
Julien Grall




* Re: Limitations for Running Xen on KVM Arm64
  2025-10-31 13:01           ` haseeb.ashraf
@ 2025-11-01 18:23             ` Julien Grall
  2025-11-03 13:09               ` haseeb.ashraf
  0 siblings, 1 reply; 18+ messages in thread
From: Julien Grall @ 2025-11-01 18:23 UTC (permalink / raw)
  To: haseeb.ashraf@siemens.com, Mohamed Mediouni
  Cc: xen-devel@lists.xenproject.org, Volodymyr_Babchuk@epam.com,
	Driscoll, Dan, Bachtel, Andrew, fahad.arslan@siemens.com,
	noor.ahsan@siemens.com, brian.sheppard@siemens.com,
	Stefano Stabellini, Bertrand Marquis, Michal Orzel

(+ the other Arm maintainers)

On 31/10/2025 13:01, haseeb.ashraf@siemens.com wrote:
> Hello,

Hi,

Before answering the rest, would you be able to configure your e-mail 
client to quote with '>' and avoid top-posting? Otherwise, it will 
become quite difficult to follow the conversation after a few rounds.

> I have seen no such performance issue with nested KVM. For Xen, if this 
> can be relaxed from |vmalls12e1| to |vmalle1|, this would still be a 
> huge performance improvement. I used Ftrace to get execution time of 
> each of these handler functions:
> handle_vmalls12e1is() min-max = 1464441 - 9495486 us

To clarify, Xen is using the local TLB version. So it should be 
vmalls12e1. But it looks like KVM will treat it the same way and I 
wonder whether this could be optimized? (I don't know much about the KVM 
implementation though).

> 
> So, to summarize using HCR_EL2.FB (which Xen already enables?) and then 
> using vmalle1 instead of vmalls12e1 should resolve the issue-2 for vCPUs 
> switching on pCPUs.

I don't think HCR_EL2.FB would matter here.

> 
> Coming back to issue-1, what do you think about creating a batch version 
> of hypercall XENMEM_remove_from_physmap (other batch versions exist such 
> as for XENMEM_add_to_physmap) and doing the TLB invalidation only once 
> per this hypercall?

Before going into batching, do you have any data showing how often 
XENMEM_remove_from_physmap is called in your setup? Similarly, I would be 
interested to know the number of TLB flushes within one hypercall and 
whether the regions unmapped were contiguous.

In your previous e-mail you wrote:

 > During the creation of domu, first the domu memory is mapped onto 
dom0 domain, images are copied into it, and it is then unmapped. During 
unmapping, the TLB translations are invalidated one by one for each page 
being unmapped in XENMEM_remove_from_physmap hypercall. Here is the code 
snippet where the decision to flush TLBs is being made during removal of 
mapping.

Don't we map only the memory that is needed to copy the binaries? If 
not, then I would suggest looking at that first.

I am asking because even with batching, we may still send a few TLB 
flushes because:
    * We need to avoid long-running operations, so the hypercall may 
restart. So we will have to flush at minimum before every restart.
    * The current way we handle batching is that we process one item at 
a time. As this may free memory (either leaf or intermediate 
page-tables), we will need to flush the TLBs first to prevent the domain 
accessing the wrong memory. This could be solved by keeping track of the 
list of memory to free. But this is going to require some work and I am 
not entirely sure it is worth it at the moment.

> I just realized that ripas2e1 is a range TLBI 
> instruction which is only supported after Armv8.4 indicated 
> by ID_AA64ISAR0_EL1.TLB == 2. So, on older architectures, full stage-2 
> invalidation would be required. For an architecture independent 
> solution, creating a batch version seems to be a better way.

I don't think we necessarily need a full stage-2 invalidation for 
processors not supporting range TLBI. We could use a series of TLBI 
IPAS2E1IS, which I think is what TLBI range is meant to replace (so long 
as the addresses are contiguous in the given address space).

On the KVM side, it would be worth looking at whether the implementation 
can be optimized. Is this really walking block by block? Can it skip 
over large holes (e.g. if we know a level 1 entry doesn't exist, then we 
can increment by 1GB)?
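To illustrate the kind of skip being suggested, here is a sketch of the address arithmetic (illustrative only; the helper name is made up, assuming a 4K granule where a level-1 entry covers 1GiB):

```c
#include <stdint.h>

#define SZ_1G (1ULL << 30)

/* Hypothetical helper: if the level-1 entry covering 'addr' is absent,
 * the page-table walk can resume at the next 1GiB boundary instead of
 * stepping through every block/page underneath the hole. */
static uint64_t skip_level1_hole(uint64_t addr)
{
    return (addr | (SZ_1G - 1)) + 1;
}
```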

Cheers,

-- 
Julien Grall




* Re: Limitations for Running Xen on KVM Arm64
  2025-11-01 18:23             ` Julien Grall
@ 2025-11-03 13:09               ` haseeb.ashraf
  2025-11-03 14:30                 ` Julien Grall
  0 siblings, 1 reply; 18+ messages in thread
From: haseeb.ashraf @ 2025-11-03 13:09 UTC (permalink / raw)
  To: Julien Grall, Mohamed Mediouni
  Cc: xen-devel@lists.xenproject.org, Volodymyr_Babchuk@epam.com,
	Driscoll, Dan, Bachtel, Andrew, fahad.arslan@siemens.com,
	noor.ahsan@siemens.com, brian.sheppard@siemens.com,
	Stefano Stabellini, Bertrand Marquis, Michal Orzel

Hi,

> To clarify, Xen is using the local TLB version. So it should be vmalls12e1.
If I understood correctly, won't HCR_EL2.FB make a local TLB invalidate a broadcast one?

Mohamed mentioned this in earlier email:
> If a core-local TLB invalidate was issued, this bit forces it to become a broadcast, so that you don’t have to worry about flushing TLBs when moving a vCPU between different pCPUs. KVM operates with this bit set.

Can you explain in what scenario exactly we can use vmalle1?

> Before going into batching, do you have any data showing how often XENMEM_remove_from_physmap is called in your setup? Similarly, I would be interested to know the number of TLB flushes within one hypercall and whether the regions unmapped were contiguous.
The number of times XENMEM_remove_from_physmap is invoked depends upon the size of each binary. Each hypercall invokes a TLB instruction once. If I use a persistent rootfs, then this hypercall is invoked almost 7458 times (+8 approx), which is equal to the sum of the kernel and DTB image pages:
domainbuilder: detail: xc_dom_alloc_segment:   kernel       : 0x40000000 -> 0x41d1f200  (pfn 0x40000 + 0x1d20 pages)
domainbuilder: detail: xc_dom_alloc_segment:   devicetree   : 0x48000000 -> 0x4800188d  (pfn 0x48000 + 0x2 pages)

And if I use a ramdisk image, then this hypercall is invoked almost 222815 times (+8 approx), which is equal to the sum of the kernel, ramdisk and DTB image 4K pages.
domainbuilder: detail: xc_dom_alloc_segment:   kernel       : 0x40000000 -> 0x41d1f200  (pfn 0x40000 + 0x1d20 pages)
domainbuilder: detail: xc_dom_alloc_segment:   module0      : 0x48000000 -> 0x7c93d000  (pfn 0x48000 + 0x3493d pages)
domainbuilder: detail: xc_dom_alloc_segment:   devicetree   : 0x7c93d000 -> 0x7c93e8d9  (pfn 0x7c93d + 0x2 pages)

You can see the address ranges in the above logs; the addresses seem contiguous in this address space, and at best we can reduce the number of calls to 3, one at the end of each image being removed from the physmap.
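For what it's worth, the invocation counts above can be cross-checked against the page counts in those logs (a quick arithmetic sketch; the enum and helper names are made up):

```c
#include <stdbool.h>

/* Page counts taken from the xc_dom_alloc_segment log lines above. */
enum {
    KERNEL_PAGES  = 0x1d20,   /* kernel */
    DTB_PAGES     = 0x2,      /* devicetree */
    RAMDISK_PAGES = 0x3493d,  /* module0 (ramdisk) */
};

/* One XENMEM_remove_from_physmap call, and hence one full TLB flush,
 * is issued per page unmapped. */
static unsigned long unmap_flushes(bool with_ramdisk)
{
    unsigned long n = KERNEL_PAGES + DTB_PAGES;

    if ( with_ramdisk )
        n += RAMDISK_PAGES;
    return n;
}
```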

> we may still send a few TLB flushes because:
> * We need to avoid long-running operations, so the hypercall may restart. So we will have to flush at minimum before every restart
> * The current way we handle batching is that we process one item at a time. As this may free memory (either leaf or intermediate page-tables), we will need to flush the TLBs first to prevent the domain accessing the wrong memory. This could be solved by keeping track of the list of memory to free. But this is going to require some work and I am not entirely sure it is worth it at the moment.
I think now you have the figures showing that 222815 TLB flushes are too many and that a few flushes would still be a lot better. Fewer than 10 flushes are barely noticeable.

> We could use a series of TLBI IPAS2E1IS, which I think is what TLBI range is meant to replace (so long as the addresses are contiguous in the given space).
Isn't IPAS2E1IS a range TLBI instruction? My understanding is that this instruction is only available on processors with range TLBI support, but I could be wrong. I saw its KVM emulation, which does a full invalidation if range TLBI is not supported (https://github.com/torvalds/linux/blob/master/arch/arm64/kvm/hyp/pgtable.c#L647).

> On the KVM side, it would be worth looking at whether the implementation can be optimized. Is this really walking block by block? Can it skip over large hole (e.g. if we know a level 1 entry doesn't exist, then we can increment by 1GB).
Yes, this should also be looked from KVM side. I think to solve this problem, we need this optimized on both places in Xen and in KVM because Xen is invoking this instruction too many times and unless KVM can provide performance close to bare-metal tlbi, this would still be a problem.

Regards,
Haseeb


* Re: Limitations for Running Xen on KVM Arm64
  2025-11-03 13:09               ` haseeb.ashraf
@ 2025-11-03 14:30                 ` Julien Grall
  2025-11-04  7:50                   ` haseeb.ashraf
  0 siblings, 1 reply; 18+ messages in thread
From: Julien Grall @ 2025-11-03 14:30 UTC (permalink / raw)
  To: haseeb.ashraf@siemens.com, Mohamed Mediouni
  Cc: xen-devel@lists.xenproject.org, Volodymyr_Babchuk@epam.com,
	Driscoll, Dan, Bachtel, Andrew, fahad.arslan@siemens.com,
	noor.ahsan@siemens.com, brian.sheppard@siemens.com,
	Stefano Stabellini, Bertrand Marquis, Michal Orzel



On 03/11/2025 13:09, haseeb.ashraf@siemens.com wrote:
> Hi,

Hi,
> 
>> To clarify, Xen is using the local TLB version. So it should be vmalls12e1.
> If I understood correctly, won't HCR_EL2.FB make a local TLB invalidate a broadcast one?

HCR_EL2.FB only applies to EL1. So it depends on who is setting it in 
this situation. If it is Xen, then it would only apply to its VM. If it 
is KVM, then it would also apply to the nested Xen.

> Can you explain in what scenario exactly we can use vmalle1?

We can use vmalle1 in Xen for the situation we discussed. I was only 
pointing out that the implementation in KVM seems suboptimal.

> 
>> Before going into batching, do you have any data showing how often XENMEM_remove_from_physmap is called in your setup? Similarly, I would be interested to know the number of TLB flushes within one hypercall and whether the regions unmapped were contiguous.
> The number of times XENMEM_remove_from_physmap is invoked depends upon the size of each binary. Each hypercall invokes TLB instruction once. If I use persistent rootfs, then this hypercall is invoked almost 7458 times (+8 approx) which is equal to sum of kernel and DTB image pages:
> domainbuilder: detail: xc_dom_alloc_segment:   kernel       : 0x40000000 -> 0x41d1f200  (pfn 0x40000 + 0x1d20 pages)
> domainbuilder: detail: xc_dom_alloc_segment:   devicetree   : 0x48000000 -> 0x4800188d  (pfn 0x48000 + 0x2 pages)
> 
> And if I use ramdisk image, then this hypercall is invoked almost 222815 times (+8 approx) which is equal to sum of kernel, ramdisk and DTB image 4k pages.
> domainbuilder: detail: xc_dom_alloc_segment:   kernel       : 0x40000000 -> 0x41d1f200  (pfn 0x40000 + 0x1d20 pages)
> domainbuilder: detail: xc_dom_alloc_segment:   module0      : 0x48000000 -> 0x7c93d000  (pfn 0x48000 + 0x3493d pages)
> domainbuilder: detail: xc_dom_alloc_segment:   devicetree   : 0x7c93d000 -> 0x7c93e8d9  (pfn 0x7c93d + 0x2 pages)
> 
> You can see the address ranges in above logs, the addresses seem contiguous in this address space and at best we can reduce the number of calls to 3, each at the end of every image when removed from physmap.

Thanks for the log. I haven't looked at the toolstack code. Does this 
mean only one ioctl call will be issued per blob?

> 
>> we may still send a few TLB flushes because:
>> * We need to avoid long-running operations, so the hypercall may restart. So we will have to flush at minimum before every restart
>> * The current way we handle batching is that we process one item at a time. As this may free memory (either leaf or intermediate page-tables), we will need to flush the TLBs first to prevent the domain accessing the wrong memory. This could be solved by keeping track of the list of memory to free. But this is going to require some work and I am not entirely sure it is worth it at the moment.
> I think now you have the figures showing that 222815 TLB flushes are too many and that a few flushes would still be a lot better. Fewer than 10 flushes are barely noticeable.

I agree this is too much but this is going to require quite a bit of 
work (as I said we would need to keep track of pages to be freed before 
the TLB flush).

At least to me, it feels like switching to TLBI range (or a series of 
IPAS2E1IS) is an easier win. But if you feel like doing the larger 
rework, I would be happy to have a look to check whether it would be an 
acceptable change for upstream.

> 
>> We could use a series of TLBI IPAS2E1IS, which I think is what TLBI range is meant to replace (so long as the addresses are contiguous in the given space).
> Isn't IPAS2E1IS a range tlbi instruction? My understanding is that this instruction is available on processors with range TLBI support, I could be wrong. I saw its KVM emulation which does full invalidation if range TLBI is not supported (https://github.com/torvalds/linux/blob/master/arch/arm64/kvm/hyp/pgtable.c#L647).

IPAS2E1IS only allows you to invalidate one address at a time and is 
available on all processors. The R version is only available when the 
processor supports TLBI range and allows you to invalidate multiple 
contiguous addresses.
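A sketch of the fallback loop described above (illustrative only; `tlbi_ipa_fn` stands in for an inline-asm wrapper around `tlbi ipas2e1is`, and the barriers are noted in comments rather than executed):

```c
#include <stddef.h>
#include <stdint.h>

#define PAGE_SHIFT 12

/* Stand-in for an inline-asm wrapper around "tlbi ipas2e1is, <ipa>". */
typedef void (*tlbi_ipa_fn)(uint64_t ipa);

/* Sketch: invalidate a contiguous IPA range one page at a time, as a
 * fallback when FEAT_TLBIRANGE (the R variants) is not implemented.
 * Returns the number of TLBI instructions that would be issued. */
static size_t flush_ipa_range(uint64_t ipa, size_t nr_pages, tlbi_ipa_fn op)
{
    size_t i;

    for ( i = 0; i < nr_pages; i++ )
        if ( op )
            op(ipa + ((uint64_t)i << PAGE_SHIFT));
    /* On real hardware: dsb ish; isb; after the loop. */
    return nr_pages;
}
```

With range TLBI absent, an unmap of N contiguous pages costs N of these per-address invalidates instead of one vmalls12e1is, trading instruction count for much cheaper individual operations.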

Cheers,

-- 
Julien Grall




* Re: Limitations for Running Xen on KVM Arm64
  2025-11-03 14:30                 ` Julien Grall
@ 2025-11-04  7:50                   ` haseeb.ashraf
  2025-11-05 13:39                     ` haseeb.ashraf
  0 siblings, 1 reply; 18+ messages in thread
From: haseeb.ashraf @ 2025-11-04  7:50 UTC (permalink / raw)
  To: Julien Grall, Mohamed Mediouni
  Cc: xen-devel@lists.xenproject.org, Volodymyr_Babchuk@epam.com,
	Driscoll, Dan, Bachtel, Andrew, fahad.arslan@siemens.com,
	noor.ahsan@siemens.com, brian.sheppard@siemens.com,
	Stefano Stabellini, Bertrand Marquis, Michal Orzel

[-- Attachment #1: Type: text/plain, Size: 1112 bytes --]

Hi,

> Does this mean only one ioctl call will be issued per blob?
Yes, a single IOCTL_PRIVCMD_MMAPBATCH_V2 ioctl is issued to add all pages to the physmap; all pages are then removed from the physmap as a result of munmap().

> At least to me, it feels like switching to TLBI range (or a series of IPAS2E1IS) is an easier win. But if you feel like doing the larger rework, I would be happy to have a look to check whether it would be an acceptable change for upstream.
Thank you. Yes, I agree. I just wanted a solution that also works for older CPUs. A series of IPAS2E1IS can work for older CPUs, but there will be a lot of invocations (222815 * 4K, using the same example). However, each invocation should be much less costly than VMALLS12E1IS, so it still seems like a viable solution. I shall evaluate this and let you know.

> IPAS2E1IS only allows you to invalidate one address at the time and is available on all processors. The R version is only available when the processor support TLBI range and allow you to invalidate multiple contiguous address.
Thanks, got it.

Regards,
Haseeb




* Re: Limitations for Running Xen on KVM Arm64
  2025-11-04  7:50                   ` haseeb.ashraf
@ 2025-11-05 13:39                     ` haseeb.ashraf
  2025-11-05 17:44                       ` Julien Grall
  0 siblings, 1 reply; 18+ messages in thread
From: haseeb.ashraf @ 2025-11-05 13:39 UTC (permalink / raw)
  To: Julien Grall, Mohamed Mediouni
  Cc: xen-devel@lists.xenproject.org, Volodymyr_Babchuk@epam.com,
	Driscoll, Dan, Bachtel, Andrew, fahad.arslan@siemens.com,
	noor.ahsan@siemens.com, brian.sheppard@siemens.com,
	Stefano Stabellini, Bertrand Marquis, Michal Orzel

[-- Attachment #1: Type: text/plain, Size: 610 bytes --]

Hi,

I have sent out a patch using IPAS2E1IS. The R version, RIPAS2E1IS, would only help if we had to invalidate more than one page at a time, and that is not possible unless a batched version of the hypercall is implemented, because otherwise only one page is removed per hypercall. With IPAS2E1IS the number of invocations is still the same as with VMALLS12E1IS, but each execution is much cheaper. With ftrace I got:
handle_ipas2e1is: min-max: 17.580 - 68.260 us.

Thanks again for your great suggestions. Please review my patch; you should have received it by email.

Regards,
Haseeb



* Re: Limitations for Running Xen on KVM Arm64
  2025-11-05 13:39                     ` haseeb.ashraf
@ 2025-11-05 17:44                       ` Julien Grall
  0 siblings, 0 replies; 18+ messages in thread
From: Julien Grall @ 2025-11-05 17:44 UTC (permalink / raw)
  To: haseeb.ashraf@siemens.com, Mohamed Mediouni
  Cc: xen-devel@lists.xenproject.org, Volodymyr_Babchuk@epam.com,
	Driscoll, Dan, Bachtel, Andrew, fahad.arslan@siemens.com,
	noor.ahsan@siemens.com, brian.sheppard@siemens.com,
	Stefano Stabellini, Bertrand Marquis, Michal Orzel



On 05/11/2025 13:39, haseeb.ashraf@siemens.com wrote:
> Hi,

Hi Haseeb,

> I have sent out a patch using IPAS2E1IS. The R version, RIPAS2E1IS, would only help if we had to invalidate more than one page at a time, and that is not possible unless a batched version of the hypercall is implemented, because otherwise only one page is removed per hypercall.

I have only briefly looked at your patch. You have the following loop:

     /* Invalidate stage-2 TLB entries by IPA range */
     for ( i = 0; i < page_count; i++ )
     {
         flush_guest_tlb_one_s2(ipa);
         ipa += 1UL << PAGE_SHIFT;
     }

With RIPAS2E1IS, you would be able to replace this loop with a single 
instruction. This may not have much value in your setup because you are 
unmapping 4KB at a time. But there are other hypercalls (such as 
XENMEM_decrease_reservation) where you can remove larger mappings.

So I think there is some value in using the range version, although I 
would be fine if this is not handled in your current patch.

> Thanks again for your great suggestions. Please review my patch, you should've received an email.

I will add it in my list of reviews.

Cheers,

-- 
Julien Grall




end of thread, other threads:[~2025-11-05 17:44 UTC | newest]

Thread overview: 18+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2025-10-30  6:12 Limitations for Running Xen on KVM Arm64 haseeb.ashraf
2025-10-30 13:41 ` haseeb.ashraf
2025-10-30 18:33   ` Mohamed Mediouni
2025-10-30 23:55     ` Julien Grall
2025-10-31  0:20       ` Mohamed Mediouni
2025-10-31  0:38         ` Mohamed Mediouni
2025-10-31  9:18         ` Julien Grall
2025-10-31 11:54           ` Mohamed Mediouni
2025-11-01 17:20             ` Julien Grall
2025-10-31 13:01           ` haseeb.ashraf
2025-11-01 18:23             ` Julien Grall
2025-11-03 13:09               ` haseeb.ashraf
2025-11-03 14:30                 ` Julien Grall
2025-11-04  7:50                   ` haseeb.ashraf
2025-11-05 13:39                     ` haseeb.ashraf
2025-11-05 17:44                       ` Julien Grall
2025-10-31 15:17 ` Mohamed Mediouni
2025-11-01  2:04 ` Demi Marie Obenour

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).