xen-devel.lists.xenproject.org archive mirror
From: Julien Grall <julien@xen.org>
To: "haseeb.ashraf@siemens.com" <haseeb.ashraf@siemens.com>,
	Mohamed Mediouni <mohamed@unpredictable.fr>
Cc: "xen-devel@lists.xenproject.org" <xen-devel@lists.xenproject.org>,
	"Volodymyr_Babchuk@epam.com" <Volodymyr_Babchuk@epam.com>,
	"Driscoll, Dan" <dan.driscoll@siemens.com>,
	"Bachtel, Andrew" <andrew.bachtel@siemens.com>,
	"fahad.arslan@siemens.com" <fahad.arslan@siemens.com>,
	"noor.ahsan@siemens.com" <noor.ahsan@siemens.com>,
	"brian.sheppard@siemens.com" <brian.sheppard@siemens.com>,
	Stefano Stabellini <sstabellini@kernel.org>,
	Bertrand Marquis <Bertrand.Marquis@arm.com>,
	Michal Orzel <michal.orzel@amd.com>
Subject: Re: Limitations for Running Xen on KVM Arm64
Date: Mon, 3 Nov 2025 14:30:24 +0000
Message-ID: <0fd2b8e4-bdea-4d01-a2dd-8d2e4b37090d@xen.org>
In-Reply-To: <KL1PR0601MB45883069D3725975B49761D0E6C7A@KL1PR0601MB4588.apcprd06.prod.outlook.com>



On 03/11/2025 13:09, haseeb.ashraf@siemens.com wrote:
> Hi,

Hi,
> 
>> To clarify, Xen is using the local TLB version. So it should be vmalls12e1.
> If I understood correctly, won't HCR_EL2.FB make a local TLB operation a broadcast one?

HCR_EL2.FB only applies to EL1. So it depends on who is setting it in 
this situation. If it is Xen, then it would only apply to its VMs. If it 
is KVM, then it would also apply to the nested Xen.
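
For reference, the bit in question, roughly (a sketch for illustration, 
not the actual Xen or KVM code; the bit position is from the Arm ARM):

#define HCR_FB (1UL << 9) /* Force Broadcast of EL1 TLB maintenance */

static void enable_forced_broadcast(void)
{
    unsigned long hcr;

    asm volatile("mrs %0, hcr_el2" : "=r" (hcr));
    hcr |= HCR_FB;   /* EL1-local TLB maintenance becomes broadcast */
    asm volatile("msr hcr_el2, %0; isb" : : "r" (hcr));
}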

> Can you explain in what scenario exactly we can use vmalle1?

We can use vmalle1 in Xen for the situation we discussed. I was only 
pointing out that the implementation in KVM seems suboptimal.
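
To make the difference concrete, the two local variants look roughly 
like this (a sketch with the usual barrier sequence, not Xen's actual 
helpers):

static inline void flush_guest_tlb_s1_local(void)
{
    /* Stage-1 entries only, current VMID, this PE. */
    asm volatile("dsb nshst; tlbi vmalle1; dsb nsh; isb" ::: "memory");
}

static inline void flush_guest_tlb_s1s2_local(void)
{
    /* Stage-1 and stage-2 entries, current VMID, this PE. */
    asm volatile("dsb nshst; tlbi vmalls12e1; dsb nsh; isb" ::: "memory");
}

The *is forms of the same operations broadcast to the Inner Shareable 
domain instead.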

> 
>> Before going into batching, do you have any data showing how often XENMEM_remove_from_physmap is called in your setup? Similar, I would be interested to know the number of TLBs flush within one hypercalls and whether the regions unmapped were contiguous.
> The number of times XENMEM_remove_from_physmap is invoked depends upon the size of each binary. Each hypercall invokes the TLB invalidation instruction once. If I use a persistent rootfs, then this hypercall is invoked almost 7458 times (+8 approx), which is equal to the sum of the kernel and DTB image pages:
> domainbuilder: detail: xc_dom_alloc_segment:   kernel       : 0x40000000 -> 0x41d1f200  (pfn 0x40000 + 0x1d20 pages)
> domainbuilder: detail: xc_dom_alloc_segment:   devicetree   : 0x48000000 -> 0x4800188d  (pfn 0x48000 + 0x2 pages)
> 
> And if I use a ramdisk image, then this hypercall is invoked almost 222815 times (+8 approx), which is equal to the sum of the kernel, ramdisk and DTB image 4K pages.
> domainbuilder: detail: xc_dom_alloc_segment:   kernel       : 0x40000000 -> 0x41d1f200  (pfn 0x40000 + 0x1d20 pages)
> domainbuilder: detail: xc_dom_alloc_segment:   module0      : 0x48000000 -> 0x7c93d000  (pfn 0x48000 + 0x3493d pages)
> domainbuilder: detail: xc_dom_alloc_segment:   devicetree   : 0x7c93d000 -> 0x7c93e8d9  (pfn 0x7c93d + 0x2 pages)
> 
> You can see the address ranges in the above logs; the addresses seem contiguous in this address space, and at best we could reduce the number of calls to 3, one at the end of each image when it is removed from the physmap.

Thanks for the log. I haven't looked at the toolstack code. Does this 
mean only one ioctl call will be issued per blob?
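
FWIW, the counts line up with the segment sizes in your logs:

    7458 = 0x1d20 + 0x2            (kernel + DTB pages)
  222815 = 0x1d20 + 0x3493d + 0x2  (kernel + ramdisk + DTB pages)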

> 
>> we may still send a few TLBs because:
>> * We need to avoid long-running operations, so the hypercall may restart. So we will have to flush at a minimum before every restart.
>> * The current way we handle batching is that we will process one item at a time. As this may free memory (either leaf or intermediate page-tables), we will need to flush the TLBs first to prevent the domain accessing the wrong memory. This could be solved by keeping track of the list of memory to free. But this is going to require some work and I am not entirely sure it is worth it at the moment.
> I think you now have the figures showing that 222815 TLB flushes are too many, and a few flushes would be a lot better. Fewer than 10 flushes are barely noticeable.

I agree this is too much, but it is going to require quite a bit of 
work (as I said, we would need to keep track of the pages to be freed 
before the TLB flush).
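
To sketch the deferred-free idea (every name below is hypothetical, 
none of this is existing Xen code):

struct deferred_pages {
    struct page_info *pages[64];
    unsigned int count;
};

/* Hypothetical helper: unmap the GFN but stash the freed page(s) in
 * 'defer' instead of returning them to the allocator straight away. */
static int p2m_remove_deferred(struct domain *d, gfn_t gfn,
                               struct deferred_pages *defer);

static void remove_range_deferred(struct domain *d, gfn_t start,
                                  unsigned int nr)
{
    struct deferred_pages defer = { .count = 0 };
    unsigned int i;

    for ( i = 0; i < nr; i++ )
        p2m_remove_deferred(d, gfn_add(start, i), &defer);

    /* One TLB flush for the whole batch... */
    flush_guest_tlb();

    /* ... and only now is it safe to free the pages. */
    while ( defer.count )
        free_domheap_page(defer.pages[--defer.count]);
}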

At least to me, it feels like switching to TLBI range (or a series of 
IPAS2E1IS) is an easier win. But if you feel like doing the larger 
rework, I would be happy to have a look and check whether it would be 
an acceptable change for upstream.

> 
>> We could use a series of TLBI IPAS2E1IS, which I think is what TLBI range is meant to replace (so long as the addresses are contiguous in the given space).
> Isn't IPAS2E1IS a range TLBI instruction? My understanding is that this instruction is only available on processors with range TLBI support, but I could be wrong. I saw its KVM emulation, which does a full invalidation if range TLBI is not supported (https://github.com/torvalds/linux/blob/master/arch/arm64/kvm/hyp/pgtable.c#L647).

IPAS2E1IS only allows you to invalidate one address at a time and is 
available on all processors. The R version is only available when the 
processor supports TLBI range and allows you to invalidate multiple 
contiguous addresses.
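
To illustrate (operand encoding simplified, see the Arm ARM for the 
exact field layout; this is a sketch, not Xen's actual implementation):

static void flush_ipa_range(unsigned long ipa, unsigned long nr_pages)
{
    unsigned long i;

    asm volatile("dsb ishst" ::: "memory");

    /* Without FEAT_TLBIRANGE: one TLBI per 4K page. */
    for ( i = 0; i < nr_pages; i++ )
    {
        unsigned long arg = (ipa >> 12) + i; /* Xt = IPA >> 12 (simplified) */

        asm volatile("tlbi ipas2e1is, %0" : : "r" (arg));
    }

    /* With FEAT_TLBIRANGE, a single "tlbi ripas2e1is, Xt", with the
     * base address and range encoded in Xt, replaces the loop above. */
    asm volatile("dsb ish; isb" ::: "memory");
}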

Cheers,

-- 
Julien Grall




Thread overview: 18+ messages
2025-10-30  6:12 Limitations for Running Xen on KVM Arm64 haseeb.ashraf
2025-10-30 13:41 ` haseeb.ashraf
2025-10-30 18:33   ` Mohamed Mediouni
2025-10-30 23:55     ` Julien Grall
2025-10-31  0:20       ` Mohamed Mediouni
2025-10-31  0:38         ` Mohamed Mediouni
2025-10-31  9:18         ` Julien Grall
2025-10-31 11:54           ` Mohamed Mediouni
2025-11-01 17:20             ` Julien Grall
2025-10-31 13:01           ` haseeb.ashraf
2025-11-01 18:23             ` Julien Grall
2025-11-03 13:09               ` haseeb.ashraf
2025-11-03 14:30                 ` Julien Grall [this message]
2025-11-04  7:50                   ` haseeb.ashraf
2025-11-05 13:39                     ` haseeb.ashraf
2025-11-05 17:44                       ` Julien Grall
2025-10-31 15:17 ` Mohamed Mediouni
2025-11-01  2:04 ` Demi Marie Obenour
