* Re: [PATCH v3 00/16] KVM: arm64: nv: Shadow stage-2 page table handling
[not found] ` <171878647493.242213.9111337124987897859.b4-ty@linux.dev>
@ 2024-11-21 8:11 ` Ganapatrao Kulkarni
2024-11-21 16:44 ` Marc Zyngier
0 siblings, 1 reply; 9+ messages in thread
From: Ganapatrao Kulkarni @ 2024-11-21 8:11 UTC (permalink / raw)
To: Oliver Upton, kvm, Marc Zyngier, kvmarm, linux-arm-kernel
Cc: Christoffer Dall, Suzuki K Poulose, Alexandru Elisei, James Morse,
Joey Gouly, Zenghui Yu
Hi Marc,
On 19-06-2024 02:11 pm, Oliver Upton wrote:
> On Fri, 14 Jun 2024 15:45:36 +0100, Marc Zyngier wrote:
>> Here's the third version of the shadow stage-2 handling for NV
>> support on arm64.
>>
>> * From v2 [2]
>>
>> - Simplified the S2 walker by dropping a bunch of redundant
>> fields from the walker info structure
>>
>> [...]
>
> Applied to kvmarm/next, thanks!
>
> [01/16] KVM: arm64: nv: Support multiple nested Stage-2 mmu structures
> https://git.kernel.org/kvmarm/kvmarm/c/4f128f8e1aaa
> [02/16] KVM: arm64: nv: Implement nested Stage-2 page table walk logic
> https://git.kernel.org/kvmarm/kvmarm/c/61e30b9eef7f
> [03/16] KVM: arm64: nv: Handle shadow stage 2 page faults
> https://git.kernel.org/kvmarm/kvmarm/c/fd276e71d1e7
> [04/16] KVM: arm64: nv: Unmap/flush shadow stage 2 page tables
> https://git.kernel.org/kvmarm/kvmarm/c/ec14c272408a
> [05/16] KVM: arm64: nv: Add Stage-1 EL2 invalidation primitives
> https://git.kernel.org/kvmarm/kvmarm/c/82e86326ec58
> [06/16] KVM: arm64: nv: Handle EL2 Stage-1 TLB invalidation
> https://git.kernel.org/kvmarm/kvmarm/c/67fda56e76da
> [07/16] KVM: arm64: nv: Handle TLB invalidation targeting L2 stage-1
> https://git.kernel.org/kvmarm/kvmarm/c/8e236efa4cd2
> [08/16] KVM: arm64: nv: Handle TLBI VMALLS12E1{,IS} operations
> https://git.kernel.org/kvmarm/kvmarm/c/e6c9a3015ff2
> [09/16] KVM: arm64: nv: Handle TLBI ALLE1{,IS} operations
> https://git.kernel.org/kvmarm/kvmarm/c/5cfb6cec62f2
> [10/16] KVM: arm64: nv: Handle TLBI IPAS2E1{,IS} operations
> https://git.kernel.org/kvmarm/kvmarm/c/70109bcd701e
> [11/16] KVM: arm64: nv: Handle FEAT_TTL hinted TLB operations
> https://git.kernel.org/kvmarm/kvmarm/c/d1de1576dc21
> [12/16] KVM: arm64: nv: Tag shadow S2 entries with guest's leaf S2 level
> https://git.kernel.org/kvmarm/kvmarm/c/b1a3a94812b9
> [13/16] KVM: arm64: nv: Invalidate TLBs based on shadow S2 TTL-like information
> https://git.kernel.org/kvmarm/kvmarm/c/809b2e6013a5
> [14/16] KVM: arm64: nv: Add handling of outer-shareable TLBI operations
> https://git.kernel.org/kvmarm/kvmarm/c/0cb8aae22676
> [15/16] KVM: arm64: nv: Add handling of range-based TLBI operations
> https://git.kernel.org/kvmarm/kvmarm/c/5d476ca57d7d
> [16/16] KVM: arm64: nv: Add handling of NXS-flavoured TLBI operations
> https://git.kernel.org/kvmarm/kvmarm/c/0feec7769a63
>
> --
> Best,
> Oliver
IIRC, most of the patches that are specific to NV have been merged
upstream. However, I do see that some of the vGIC and timer related
patches are still in your private NV repository. Can these patches be
prioritized for upstreaming, so that we can have the first working
version of NV on mainline?
--
Thanks,
Ganapat/GK
^ permalink raw reply [flat|nested] 9+ messages in thread
* Re: [PATCH v3 00/16] KVM: arm64: nv: Shadow stage-2 page table handling
2024-11-21 8:11 ` [PATCH v3 00/16] KVM: arm64: nv: Shadow stage-2 page table handling Ganapatrao Kulkarni
@ 2024-11-21 16:44 ` Marc Zyngier
2024-11-22 16:54 ` Ganapatrao Kulkarni
0 siblings, 1 reply; 9+ messages in thread
From: Marc Zyngier @ 2024-11-21 16:44 UTC (permalink / raw)
To: Ganapatrao Kulkarni
Cc: Oliver Upton, kvm, kvmarm, linux-arm-kernel, Christoffer Dall,
Suzuki K Poulose, Alexandru Elisei, James Morse, Joey Gouly,
Zenghui Yu
On Thu, 21 Nov 2024 08:11:00 +0000,
Ganapatrao Kulkarni <gankulkarni@os.amperecomputing.com> wrote:
Hi Ganapatrao,
> IIRC, most of the patches that are specific to NV have been merged
> upstream. However, I do see that some of the vGIC and timer related
> patches are still in your private NV repository. Can these patches be
> prioritized for upstreaming, so that we can have the first working
> version of NV on mainline?
Who is *we*?
Things get upstreamed when we (people doing the actual work) decide
they are ready. At the moment, they are not.
Also, while I enjoy working on NV, this isn't *my* priority.
Thanks,
M.
--
Without deviation from the norm, progress is not possible.
* Re: [PATCH v3 00/16] KVM: arm64: nv: Shadow stage-2 page table handling
2024-11-21 16:44 ` Marc Zyngier
@ 2024-11-22 16:54 ` Ganapatrao Kulkarni
2024-11-22 19:04 ` Marc Zyngier
0 siblings, 1 reply; 9+ messages in thread
From: Ganapatrao Kulkarni @ 2024-11-22 16:54 UTC (permalink / raw)
To: Marc Zyngier
Cc: Oliver Upton, kvm, kvmarm, linux-arm-kernel, Christoffer Dall,
Suzuki K Poulose, Alexandru Elisei, James Morse, Joey Gouly,
Zenghui Yu
On 21-11-2024 10:14 pm, Marc Zyngier wrote:
> On Thu, 21 Nov 2024 08:11:00 +0000,
> Ganapatrao Kulkarni <gankulkarni@os.amperecomputing.com> wrote:
>
> Hi Ganapatrao,
>
>> IIRC, most of the patches that are specific to NV have been merged
>> upstream. However, I do see that some of the vGIC and timer related
>> patches are still in your private NV repository. Can these patches be
>> prioritized for upstreaming, so that we can have the first working
>> version of NV on mainline?
>
> Who is *we*?
>
> Things get upstreamed when we (people doing the actual work) decide
> they are ready. At the moment, they are not.
>
> Also, while I enjoy working on NV, this isn't *my* priority.
Sure, I understand that it's not your priority right now. I'm happy to
spend some time on it; please do let us know which areas/patches need
attention before the code is ready to merge.
--
Thanks,
Ganapat/GK
* Re: [PATCH v3 00/16] KVM: arm64: nv: Shadow stage-2 page table handling
2024-11-22 16:54 ` Ganapatrao Kulkarni
@ 2024-11-22 19:04 ` Marc Zyngier
2024-11-23 9:49 ` Marc Zyngier
0 siblings, 1 reply; 9+ messages in thread
From: Marc Zyngier @ 2024-11-22 19:04 UTC (permalink / raw)
To: Ganapatrao Kulkarni
Cc: Oliver Upton, kvm, kvmarm, linux-arm-kernel, Christoffer Dall,
Suzuki K Poulose, Alexandru Elisei, James Morse, Joey Gouly,
Zenghui Yu
On Fri, 22 Nov 2024 16:54:16 +0000,
Ganapatrao Kulkarni <gankulkarni@os.amperecomputing.com> wrote:
>
>
>
> On 21-11-2024 10:14 pm, Marc Zyngier wrote:
> > On Thu, 21 Nov 2024 08:11:00 +0000,
> > Ganapatrao Kulkarni <gankulkarni@os.amperecomputing.com> wrote:
> >
> > Hi Ganapatrao,
> >
> >> IIRC, most of the patches that are specific to NV have been merged
> >> upstream. However, I do see that some of the vGIC and timer related
> >> patches are still in your private NV repository. Can these patches be
> >> prioritized for upstreaming, so that we can have the first working
> >> version of NV on mainline?
> >
> > Who is *we*?
> >
> > Things get upstreamed when we (people doing the actual work) decide
> > they are ready. At the moment, they are not.
> >
> > Also, while I enjoy working on NV, this isn't *my* priority.
>
> Sure, I understand that it's not your priority right now. I'm happy
> to spend some time on it; please do let us know which areas/patches
> need attention before the code is ready to merge.
Please understand that NV isn't special, and while there is still a
bunch of things that need to be merged, it is the whole of KVM/arm64
that needs attention.
For example, there is the debug series from Oliver, the feature
handling from Fuad. They may not have NV written all over them, but
they do have an impact on the NV behaviour one way or another.
By paying attention to these series, you would help with the
groundwork that is required before we can actually enable NV. This is
what matters now, not the next 50 or so NV-specific patches.
Thanks,
M.
--
Without deviation from the norm, progress is not possible.
* Re: [PATCH v3 00/16] KVM: arm64: nv: Shadow stage-2 page table handling
2024-11-22 19:04 ` Marc Zyngier
@ 2024-11-23 9:49 ` Marc Zyngier
2024-12-05 11:50 ` Darren Hart
0 siblings, 1 reply; 9+ messages in thread
From: Marc Zyngier @ 2024-11-23 9:49 UTC (permalink / raw)
To: Ganapatrao Kulkarni
Cc: Oliver Upton, kvm, kvmarm, linux-arm-kernel, Christoffer Dall,
Suzuki K Poulose, Alexandru Elisei, James Morse, Joey Gouly,
Zenghui Yu
On Fri, 22 Nov 2024 19:04:47 +0000,
Marc Zyngier <maz@kernel.org> wrote:
>
> On Fri, 22 Nov 2024 16:54:16 +0000,
> Ganapatrao Kulkarni <gankulkarni@os.amperecomputing.com> wrote:
> >
> >
> >
> > On 21-11-2024 10:14 pm, Marc Zyngier wrote:
> > > On Thu, 21 Nov 2024 08:11:00 +0000,
> > > Ganapatrao Kulkarni <gankulkarni@os.amperecomputing.com> wrote:
> > >
> > > Hi Ganapatrao,
> > >
> > >> IIRC, most of the patches that are specific to NV have been merged
> > >> upstream. However, I do see that some of the vGIC and timer related
> > >> patches are still in your private NV repository. Can these patches be
> > >> prioritized for upstreaming, so that we can have the first working
> > >> version of NV on mainline?
> > >
> > > Who is *we*?
> > >
> > > Things get upstreamed when we (people doing the actual work) decide
> > > they are ready. At the moment, they are not.
> > >
> > > Also, while I enjoy working on NV, this isn't *my* priority.
> >
> > Sure, I understand that it's not your priority right now. I'm happy
> > to spend some time on it; please do let us know which areas/patches
> > need attention before the code is ready to merge.
>
> Please understand that NV isn't special, and while there is still a
> bunch of things that need to be merged, it is the whole of KVM/arm64
> that needs attention.
>
> For example, there is the debug series from Oliver, the feature
> handling from Fuad. They may not have NV written all over them, but
> they do have an impact on the NV behaviour one way or another.
>
> By paying attention to these series, you would help with the
> groundwork that is required before we can actually enable NV. This is
> what matters now, not the next 50 or so NV-specific patches.
And one last thing, while I think of it: NV is very unlikely to get
merged without a testing infrastructure. Which means that the current
selftests must run at EL2, and that *new* selftests must be created to
test the NV implementation (all the trap behaviours, for example).
So if you (and I assume your employer, as you keep using the plural)
want to help move NV support upstream, this is an area where you
could help and make a massive difference.
Thanks,
M.
--
Without deviation from the norm, progress is not possible.
* Re: [PATCH v3 00/16] KVM: arm64: nv: Shadow stage-2 page table handling
2024-11-23 9:49 ` Marc Zyngier
@ 2024-12-05 11:50 ` Darren Hart
0 siblings, 0 replies; 9+ messages in thread
From: Darren Hart @ 2024-12-05 11:50 UTC (permalink / raw)
To: Marc Zyngier
Cc: Ganapatrao Kulkarni, Oliver Upton, kvm, kvmarm, linux-arm-kernel,
Christoffer Dall, Suzuki K Poulose, Alexandru Elisei, James Morse,
Joey Gouly, Zenghui Yu
On Sat, Nov 23, 2024 at 09:49:08AM +0000, Marc Zyngier wrote:
> On Fri, 22 Nov 2024 19:04:47 +0000,
> Marc Zyngier <maz@kernel.org> wrote:
> >
> > On Fri, 22 Nov 2024 16:54:16 +0000,
> > Ganapatrao Kulkarni <gankulkarni@os.amperecomputing.com> wrote:
> > >
> > >
> > >
> > > On 21-11-2024 10:14 pm, Marc Zyngier wrote:
> > > > On Thu, 21 Nov 2024 08:11:00 +0000,
> > > > Ganapatrao Kulkarni <gankulkarni@os.amperecomputing.com> wrote:
> > > >
> > > > Hi Ganapatrao,
> > > >
> > > >> IIRC, most of the patches that are specific to NV have been merged
> > > >> upstream. However, I do see that some of the vGIC and timer related
> > > >> patches are still in your private NV repository. Can these patches be
> > > >> prioritized for upstreaming, so that we can have the first working
> > > >> version of NV on mainline?
> > > >
> > > > Who is *we*?
> > > >
> > > > Things get upstreamed when we (people doing the actual work) decide
> > > > they are ready. At the moment, they are not.
> > > >
> > > > Also, while I enjoy working on NV, this isn't *my* priority.
> > >
> > > Sure, I understand that it's not your priority right now. I'm happy
> > > to spend some time on it; please do let us know which areas/patches
> > > need attention before the code is ready to merge.
> >
> > Please understand that NV isn't special, and while there is still a
> > bunch of things that need to be merged, it is the whole of KVM/arm64
> > that needs attention.
> >
> > For example, there is the debug series from Oliver, the feature
> > handling from Fuad. They may not have NV written all over them, but
> > they do have an impact on the NV behaviour one way or another.
> >
> > By paying attention to these series, you would help with the
> > groundwork that is required before we can actually enable NV. This is
> > what matters now, not the next 50 or so NV-specific patches.
>
> And one last thing, while I think of it: NV is very unlikely to get
> merged without a testing infrastructure. Which means that the current
> selftests must run at EL2, and that *new* selftests must be created to
> test the NV implementation (all the trap behaviours, for example).
>
> So if you (and I assume your employer, as you keep using the plural)
> want to help move NV support upstream, this is an area where you
> could help and make a massive difference.
Appreciate the specific examples, Marc, thank you. I'll work with my team
at Ampere in this direction, specifically on the testing infrastructure
work and on engaging with the NV-adjacent KVM patch series.
--
Darren Hart
Ampere Computing / Linux Enabling
* Re: [PATCH v3 02/16] KVM: arm64: nv: Implement nested Stage-2 page table walk logic
[not found] ` <20240614144552.2773592-3-maz@kernel.org>
@ 2026-02-04 8:28 ` Zenghui Yu
2026-02-06 11:05 ` Marc Zyngier
0 siblings, 1 reply; 9+ messages in thread
From: Zenghui Yu @ 2026-02-04 8:28 UTC (permalink / raw)
To: Marc Zyngier
Cc: kvmarm, kvm, linux-arm-kernel, James Morse, Suzuki K Poulose,
Oliver Upton, Zenghui Yu, Joey Gouly, Alexandru Elisei,
Christoffer Dall, Ganapatrao Kulkarni, wanghaibin.wang
Hi Marc,
[ chewing through the NV code.. ;-) ]
On 6/14/24 10:45 PM, Marc Zyngier wrote:
> From: Christoffer Dall <christoffer.dall@linaro.org>
>
> Based on the pseudo-code in the ARM ARM, implement a stage 2 software
> page table walker.
[...]
> +static u32 compute_fsc(int level, u32 fsc)
> +{
> + return fsc | (level & 0x3);
> +}
> +
> +static int get_ia_size(struct s2_walk_info *wi)
> +{
> + return 64 - wi->t0sz;
> +}
> +
> +static int check_base_s2_limits(struct s2_walk_info *wi,
> + int level, int input_size, int stride)
> +{
> + int start_size, ia_size;
> +
> + ia_size = get_ia_size(wi);
> +
> + /* Check translation limits */
> + switch (BIT(wi->pgshift)) {
> + case SZ_64K:
> + if (level == 0 || (level == 1 && ia_size <= 42))
It looks broken, as the pseudocode checks the limits based on the
*implemented PA size*, rather than on ia_size, which is essentially the
input address size (64 - T0SZ) programmed by the L1 hypervisor. They're
different.
We can probably get the implemented PA size by:
AArch64.PAMax()
{
parange = get_idreg_field_enum(kvm, ID_AA64MMFR0_EL1, PARANGE);
return id_aa64mmfr0_parange_to_phys_shift(parange);
}
Not sure if I've read the spec correctly.
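(For illustration, here is how that mapping could be sketched as stand-alone C. The table follows the ID_AA64MMFR0_EL1.PARange encodings from the ARM ARM; the helper name is made up, and the clamping of unknown encodings is my simplification, not what the kernel does:)

```c
#include <assert.h>

/*
 * Illustrative stand-in for AArch64.PAMax(): map the PARange field
 * (bits [3:0] of ID_AA64MMFR0_EL1, as exposed to L1) to a PA size in
 * bits. Encodings per the ARM ARM; unknown/future encodings are
 * clamped to 52 bits here as a simplification.
 */
static unsigned int parange_to_pa_bits(unsigned int parange)
{
	static const unsigned int bits[] = { 32, 36, 40, 42, 44, 48, 52 };

	if (parange >= sizeof(bits) / sizeof(bits[0]))
		return 52;
	return bits[parange];
}
```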
> + return -EFAULT;
> + break;
> + case SZ_16K:
> + if (level == 0 || (level == 1 && ia_size <= 40))
> + return -EFAULT;
> + break;
> + case SZ_4K:
> + if (level < 0 || (level == 0 && ia_size <= 42))
> + return -EFAULT;
> + break;
> + }
> +
> + /* Check input size limits */
> + if (input_size > ia_size)
This is always false for the current code. ;-)
> + return -EFAULT;
> +
> + /* Check number of entries in starting level table */
> + start_size = input_size - ((3 - level) * stride + wi->pgshift);
> + if (start_size < 1 || start_size > stride + 4)
> + return -EFAULT;
> +
> + return 0;
> +}
> +
> +/* Check if output is within boundaries */
> +static int check_output_size(struct s2_walk_info *wi, phys_addr_t output)
> +{
> + unsigned int output_size = wi->max_oa_bits;
> +
> + if (output_size != 48 && (output & GENMASK_ULL(47, output_size)))
> + return -1;
> +
> + return 0;
> +}
> +
> +/*
> + * This is essentially a C-version of the pseudo code from the ARM ARM
> + * AArch64.TranslationTableWalk function. I strongly recommend looking at
> + * that pseudocode in trying to understand this.
> + *
> + * Must be called with the kvm->srcu read lock held
> + */
> +static int walk_nested_s2_pgd(phys_addr_t ipa,
> + struct s2_walk_info *wi, struct kvm_s2_trans *out)
> +{
> + int first_block_level, level, stride, input_size, base_lower_bound;
> + phys_addr_t base_addr;
> + unsigned int addr_top, addr_bottom;
> + u64 desc; /* page table entry */
> + int ret;
> + phys_addr_t paddr;
> +
> + switch (BIT(wi->pgshift)) {
> + default:
> + case SZ_64K:
> + case SZ_16K:
> + level = 3 - wi->sl;
> + first_block_level = 2;
> + break;
> + case SZ_4K:
> + level = 2 - wi->sl;
> + first_block_level = 1;
> + break;
> + }
> +
> + stride = wi->pgshift - 3;
> + input_size = get_ia_size(wi);
> + if (input_size > 48 || input_size < 25)
> + return -EFAULT;
> +
> + ret = check_base_s2_limits(wi, level, input_size, stride);
> + if (WARN_ON(ret))
> + return ret;
> +
> + base_lower_bound = 3 + input_size - ((3 - level) * stride +
> + wi->pgshift);
> + base_addr = wi->baddr & GENMASK_ULL(47, base_lower_bound);
> +
> + if (check_output_size(wi, base_addr)) {
> + out->esr = compute_fsc(level, ESR_ELx_FSC_ADDRSZ);
With a wrongly programmed base address, we should report the ADDRSZ
fault at level 0 (as per R_BFHQH and the pseudocode). It's easy to fix.
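Something like this, perhaps (a stand-alone sketch reusing compute_fsc() from the patch; the ESR_ELx_FSC_ADDRSZ value of 0x00 reflects my reading of the arm64 ESR definitions, so treat it as an assumption):

```c
#include <assert.h>

#define ESR_ELx_FSC_ADDRSZ	0x00	/* address size fault, level in low bits */

/* From the patch: fold the faulting level into the low bits of the FSC. */
static unsigned int compute_fsc(int level, unsigned int fsc)
{
	return fsc | (level & 0x3);
}

/*
 * A bad VTTBR base address is a property of the walk's starting point,
 * so the address size fault is reported at level 0 rather than at the
 * starting level of the walk.
 */
static unsigned int baddr_addrsz_fsc(void)
{
	return compute_fsc(0, ESR_ELx_FSC_ADDRSZ);
}
```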
> +static void vtcr_to_walk_info(u64 vtcr, struct s2_walk_info *wi)
> +{
> + wi->t0sz = vtcr & TCR_EL2_T0SZ_MASK;
> +
> + switch (vtcr & VTCR_EL2_TG0_MASK) {
> + case VTCR_EL2_TG0_4K:
> + wi->pgshift = 12; break;
> + case VTCR_EL2_TG0_16K:
> + wi->pgshift = 14; break;
> + case VTCR_EL2_TG0_64K:
> + default: /* IMPDEF: treat any other value as 64k */
> + wi->pgshift = 16; break;
> + }
> +
> + wi->sl = FIELD_GET(VTCR_EL2_SL0_MASK, vtcr);
> + /* Global limit for now, should eventually be per-VM */
> + wi->max_oa_bits = min(get_kvm_ipa_limit(),
^^^
Should we use AArch64.PAMax() instead? The output address size is
never larger than the implemented PA size.
Now I'm wondering if we can let kvm_get_pa_bits() just return PAMax
(based on the exposed (to-L1) ID_AA64MMFR0_EL1.PARange value) and use it
when possible.
Thanks,
Zenghui
* Re: [PATCH v3 02/16] KVM: arm64: nv: Implement nested Stage-2 page table walk logic
2026-02-04 8:28 ` [PATCH v3 02/16] KVM: arm64: nv: Implement nested Stage-2 page table walk logic Zenghui Yu
@ 2026-02-06 11:05 ` Marc Zyngier
2026-02-08 18:34 ` Zenghui Yu
0 siblings, 1 reply; 9+ messages in thread
From: Marc Zyngier @ 2026-02-06 11:05 UTC (permalink / raw)
To: Zenghui Yu
Cc: kvmarm, kvm, linux-arm-kernel, James Morse, Suzuki K Poulose,
Oliver Upton, Zenghui Yu, Joey Gouly, Alexandru Elisei,
Christoffer Dall, Ganapatrao Kulkarni, wanghaibin.wang
On Wed, 04 Feb 2026 08:28:57 +0000,
Zenghui Yu <zenghui.yu@linux.dev> wrote:
>
> Hi Marc,
>
> [ chewing through the NV code.. ;-) ]
You fool! :)
>
> On 6/14/24 10:45 PM, Marc Zyngier wrote:
> > From: Christoffer Dall <christoffer.dall@linaro.org>
> >
> > Based on the pseudo-code in the ARM ARM, implement a stage 2 software
> > page table walker.
>
> [...]
>
> > +static u32 compute_fsc(int level, u32 fsc)
> > +{
> > + return fsc | (level & 0x3);
> > +}
> > +
> > +static int get_ia_size(struct s2_walk_info *wi)
> > +{
> > + return 64 - wi->t0sz;
> > +}
> > +
> > +static int check_base_s2_limits(struct s2_walk_info *wi,
> > + int level, int input_size, int stride)
> > +{
> > + int start_size, ia_size;
> > +
> > + ia_size = get_ia_size(wi);
> > +
> > + /* Check translation limits */
> > + switch (BIT(wi->pgshift)) {
> > + case SZ_64K:
> > + if (level == 0 || (level == 1 && ia_size <= 42))
>
> It looks broken as the pseudocode checks the limits based on
> *implemented PA size*, rather than on ia_size, which is essentially the
> input address size (64 - T0SZ) programmed by L1 hypervisor. They're
> different.
>
> We can probably get the implemented PA size by:
>
> AArch64.PAMax()
> {
> parange = get_idreg_field_enum(kvm, ID_AA64MMFR0_EL1, PARANGE);
> return id_aa64mmfr0_parange_to_phys_shift(parange);
> }
>
> Not sure if I've read the spec correctly.
I think that's also the way I read AArch64_S2InvalidSL(), which more
or less mirrors the above.
The question is what should we limit it to? Is it PARange, as you
suggest? Or the IPA range defined by userspace at VM creation (the
type argument, which ends up in kvm->arch.mmu.pgt->ia_bits)?
I think this is the former, but we probably also need to handle the
latter on actual access (when reading the descriptor). A failure to read
the descriptor (because it is outside of a memslot) should result in a
SEA being injected into the guest.
>
> > + return -EFAULT;
> > + break;
> > + case SZ_16K:
> > + if (level == 0 || (level == 1 && ia_size <= 40))
> > + return -EFAULT;
> > + break;
> > + case SZ_4K:
> > + if (level < 0 || (level == 0 && ia_size <= 42))
> > + return -EFAULT;
> > + break;
> > + }
> > +
> > + /* Check input size limits */
> > + if (input_size > ia_size)
>
> This is always false for the current code. ;-)
Yup. At least that doesn't introduce any extra bug! :)
>
> > + return -EFAULT;
> > +
> > + /* Check number of entries in starting level table */
> > + start_size = input_size - ((3 - level) * stride + wi->pgshift);
> > + if (start_size < 1 || start_size > stride + 4)
> > + return -EFAULT;
> > +
> > + return 0;
> > +}
> > +
> > +/* Check if output is within boundaries */
> > +static int check_output_size(struct s2_walk_info *wi, phys_addr_t output)
> > +{
> > + unsigned int output_size = wi->max_oa_bits;
> > +
> > + if (output_size != 48 && (output & GENMASK_ULL(47, output_size)))
> > + return -1;
> > +
> > + return 0;
> > +}
> > +
> > +/*
> > + * This is essentially a C-version of the pseudo code from the ARM ARM
> > + * AArch64.TranslationTableWalk function. I strongly recommend looking at
> > + * that pseudocode in trying to understand this.
> > + *
> > + * Must be called with the kvm->srcu read lock held
> > + */
> > +static int walk_nested_s2_pgd(phys_addr_t ipa,
> > + struct s2_walk_info *wi, struct kvm_s2_trans *out)
> > +{
> > + int first_block_level, level, stride, input_size, base_lower_bound;
> > + phys_addr_t base_addr;
> > + unsigned int addr_top, addr_bottom;
> > + u64 desc; /* page table entry */
> > + int ret;
> > + phys_addr_t paddr;
> > +
> > + switch (BIT(wi->pgshift)) {
> > + default:
> > + case SZ_64K:
> > + case SZ_16K:
> > + level = 3 - wi->sl;
> > + first_block_level = 2;
> > + break;
> > + case SZ_4K:
> > + level = 2 - wi->sl;
> > + first_block_level = 1;
> > + break;
> > + }
> > +
> > + stride = wi->pgshift - 3;
> > + input_size = get_ia_size(wi);
> > + if (input_size > 48 || input_size < 25)
> > + return -EFAULT;
> > +
> > + ret = check_base_s2_limits(wi, level, input_size, stride);
> > + if (WARN_ON(ret))
> > + return ret;
> > +
> > + base_lower_bound = 3 + input_size - ((3 - level) * stride +
> > + wi->pgshift);
> > + base_addr = wi->baddr & GENMASK_ULL(47, base_lower_bound);
> > +
> > + if (check_output_size(wi, base_addr)) {
> > + out->esr = compute_fsc(level, ESR_ELx_FSC_ADDRSZ);
>
> With a wrongly programmed base address, we should report the ADDRSZ
> fault at level 0 (as per R_BFHQH and the pseudocode). It's easy to fix.
>
Yup. Although this rule describes S1 rather than S2 (we don't seem to
have anything saying the same thing for S2), I expect the
behaviour to be exactly the same.
> > +static void vtcr_to_walk_info(u64 vtcr, struct s2_walk_info *wi)
> > +{
> > + wi->t0sz = vtcr & TCR_EL2_T0SZ_MASK;
> > +
> > + switch (vtcr & VTCR_EL2_TG0_MASK) {
> > + case VTCR_EL2_TG0_4K:
> > + wi->pgshift = 12; break;
> > + case VTCR_EL2_TG0_16K:
> > + wi->pgshift = 14; break;
> > + case VTCR_EL2_TG0_64K:
> > + default: /* IMPDEF: treat any other value as 64k */
> > + wi->pgshift = 16; break;
> > + }
> > +
> > + wi->sl = FIELD_GET(VTCR_EL2_SL0_MASK, vtcr);
> > + /* Global limit for now, should eventually be per-VM */
> > + wi->max_oa_bits = min(get_kvm_ipa_limit(),
> ^^^
>
> Should we use AArch64.PAMax() instead? As the output address size is
> never larger than the implemented PA size.
>
> Now I'm wondering if we can let kvm_get_pa_bits() just return PAMax
> (based on the exposed (to-L1) ID_AA64MMFR0_EL1.PARange value) and use it
> when possible.
Yes, that was the plan all along, but I got sidetracked.
Thanks,
M.
--
Without deviation from the norm, progress is not possible.
* Re: [PATCH v3 02/16] KVM: arm64: nv: Implement nested Stage-2 page table walk logic
2026-02-06 11:05 ` Marc Zyngier
@ 2026-02-08 18:34 ` Zenghui Yu
0 siblings, 0 replies; 9+ messages in thread
From: Zenghui Yu @ 2026-02-08 18:34 UTC (permalink / raw)
To: Marc Zyngier
Cc: kvmarm, kvm, linux-arm-kernel, James Morse, Suzuki K Poulose,
Oliver Upton, Zenghui Yu, Joey Gouly, Alexandru Elisei,
Christoffer Dall, Ganapatrao Kulkarni, wanghaibin.wang
On 2/6/26 7:05 PM, Marc Zyngier wrote:
> On Wed, 04 Feb 2026 08:28:57 +0000,
> Zenghui Yu <zenghui.yu@linux.dev> wrote:
> >
> > Hi Marc,
> >
> > [ chewing through the NV code.. ;-) ]
>
> You fool! :)
>
> >
> > On 6/14/24 10:45 PM, Marc Zyngier wrote:
> > > From: Christoffer Dall <christoffer.dall@linaro.org>
> > >
> > > Based on the pseudo-code in the ARM ARM, implement a stage 2 software
> > > page table walker.
> >
> > [...]
> >
> > > +static u32 compute_fsc(int level, u32 fsc)
> > > +{
> > > + return fsc | (level & 0x3);
> > > +}
> > > +
> > > +static int get_ia_size(struct s2_walk_info *wi)
> > > +{
> > > + return 64 - wi->t0sz;
> > > +}
> > > +
> > > +static int check_base_s2_limits(struct s2_walk_info *wi,
> > > + int level, int input_size, int stride)
> > > +{
> > > + int start_size, ia_size;
> > > +
> > > + ia_size = get_ia_size(wi);
> > > +
> > > + /* Check translation limits */
> > > + switch (BIT(wi->pgshift)) {
> > > + case SZ_64K:
> > > + if (level == 0 || (level == 1 && ia_size <= 42))
> >
> > It looks broken as the pseudocode checks the limits based on
> > *implemented PA size*, rather than on ia_size, which is essentially the
> > input address size (64 - T0SZ) programmed by L1 hypervisor. They're
> > different.
> >
> > We can probably get the implemented PA size by:
> >
> > AArch64.PAMax()
> > {
> > parange = get_idreg_field_enum(kvm, ID_AA64MMFR0_EL1, PARANGE);
> > return id_aa64mmfr0_parange_to_phys_shift(parange);
> > }
> >
> > Not sure if I've read the spec correctly.
>
> I think that's also the way I read AArch64_S2InvalidSL(), which more
> or less mirrors the above.
>
> The question is what should we limit it to? Is it PARange, as you
> suggest? Or the IPA range defined by userspace at VM creation (the
> type argument, which ends up in kvm->arch.mmu.pgt->ia_bits)?
>
> I think this is the former, but we probably also need to handle the
> later on actual access (when reading the descriptor). Failure to read
> the descriptor (because it is outside of a memslot) should result in a
> SEA being injected in the guest.
Yup. Do you suggest something like:

diff --git a/arch/arm64/kvm/nested.c b/arch/arm64/kvm/nested.c
index cdeeb8f09e72..cdd900305047 100644
--- a/arch/arm64/kvm/nested.c
+++ b/arch/arm64/kvm/nested.c
@@ -293,8 +293,10 @@ static int walk_nested_s2_pgd(struct kvm_vcpu *vcpu, phys_addr_t ipa,
 	paddr = base_addr | index;
 	ret = read_guest_s2_desc(vcpu, paddr, &desc, wi);
-	if (ret < 0)
+	if (ret < 0) {
+		out->esr = ESR_ELx_FSC_SEA_TTW(level);
 		return ret;
+	}
 	new_desc = desc;
The current code doesn't specify the FSC when we fail to read the
descriptor.
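(For reference, ESR_ELx_FSC_SEA_TTW() encodes a synchronous external abort on a translation table walk with the level in the low bits; a stand-alone sketch, with the 0x14 base taken from my reading of arch/arm64/include/asm/esr.h:)

```c
#include <assert.h>

/*
 * FSC for a synchronous external abort on a translation table walk at
 * level n (0..3): 0b0101nn, i.e. 0x14..0x17. Mirrors the
 * ESR_ELx_FSC_SEA_TTW() macro in the arm64 ESR definitions.
 */
#define ESR_ELx_FSC_SEA_TTW(n)	(0x14 + (n))
```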
> >
> > > + return -EFAULT;
> > > + break;
> > > + case SZ_16K:
> > > + if (level == 0 || (level == 1 && ia_size <= 40))
> > > + return -EFAULT;
> > > + break;
> > > + case SZ_4K:
> > > + if (level < 0 || (level == 0 && ia_size <= 42))
> > > + return -EFAULT;
> > > + break;
> > > + }
> > > +
> > > + /* Check input size limits */
> > > + if (input_size > ia_size)
> >
> > This is always false for the current code. ;-)
>
> Yup. At least that doesn't introduce any extra bug! :)
>
> >
> > > + return -EFAULT;
> > > +
> > > + /* Check number of entries in starting level table */
> > > + start_size = input_size - ((3 - level) * stride + wi->pgshift);
> > > + if (start_size < 1 || start_size > stride + 4)
> > > + return -EFAULT;
> > > +
> > > + return 0;
> > > +}
> > > +
> > > +/* Check if output is within boundaries */
> > > +static int check_output_size(struct s2_walk_info *wi, phys_addr_t output)
> > > +{
> > > + unsigned int output_size = wi->max_oa_bits;
> > > +
> > > + if (output_size != 48 && (output & GENMASK_ULL(47, output_size)))
> > > + return -1;
> > > +
> > > + return 0;
> > > +}
> > > +
> > > +/*
> > > + * This is essentially a C-version of the pseudo code from the ARM ARM
> > > + * AArch64.TranslationTableWalk function. I strongly recommend looking at
> > > + * that pseudocode in trying to understand this.
> > > + *
> > > + * Must be called with the kvm->srcu read lock held
> > > + */
> > > +static int walk_nested_s2_pgd(phys_addr_t ipa,
> > > + struct s2_walk_info *wi, struct kvm_s2_trans *out)
> > > +{
> > > + int first_block_level, level, stride, input_size, base_lower_bound;
> > > + phys_addr_t base_addr;
> > > + unsigned int addr_top, addr_bottom;
> > > + u64 desc; /* page table entry */
> > > + int ret;
> > > + phys_addr_t paddr;
> > > +
> > > + switch (BIT(wi->pgshift)) {
> > > + default:
> > > + case SZ_64K:
> > > + case SZ_16K:
> > > + level = 3 - wi->sl;
> > > + first_block_level = 2;
> > > + break;
> > > + case SZ_4K:
> > > + level = 2 - wi->sl;
> > > + first_block_level = 1;
> > > + break;
> > > + }
> > > +
> > > + stride = wi->pgshift - 3;
> > > + input_size = get_ia_size(wi);
> > > + if (input_size > 48 || input_size < 25)
> > > + return -EFAULT;
> > > +
> > > + ret = check_base_s2_limits(wi, level, input_size, stride);
> > > + if (WARN_ON(ret))
> > > + return ret;
> > > +
> > > + base_lower_bound = 3 + input_size - ((3 - level) * stride +
> > > + wi->pgshift);
> > > + base_addr = wi->baddr & GENMASK_ULL(47, base_lower_bound);
> > > +
> > > + if (check_output_size(wi, base_addr)) {
> > > + out->esr = compute_fsc(level, ESR_ELx_FSC_ADDRSZ);
> >
> > With a wrongly programmed base address, we should report the ADDRSZ
> > fault at level 0 (as per R_BFHQH and the pseudocode). It's easy to fix.
> >
>
> Yup. Although this rule describes S1 rather than S2 (we don't seem to
> have anything saying the same thing for S2), I expect the
> behaviour to be exactly the same.
The rule does describe S2. :)
R_BFHQH:
" When an Address size fault is generated, the reported fault code
indicates one of the following:
If the fault was generated due to the TTBR_ELx used in the translation
having nonzero address bits above the OA size, then a fault at level
0. "
where TTBR_ELx is the general name which also applies to VTTBR_EL2.
AArch64.S2Walk also clearly describes this behaviour (I read DDI0487G.b
J1-8122).
> > > +static void vtcr_to_walk_info(u64 vtcr, struct s2_walk_info *wi)
> > > +{
> > > + wi->t0sz = vtcr & TCR_EL2_T0SZ_MASK;
> > > +
> > > + switch (vtcr & VTCR_EL2_TG0_MASK) {
> > > + case VTCR_EL2_TG0_4K:
> > > + wi->pgshift = 12; break;
> > > + case VTCR_EL2_TG0_16K:
> > > + wi->pgshift = 14; break;
> > > + case VTCR_EL2_TG0_64K:
> > > + default: /* IMPDEF: treat any other value as 64k */
> > > + wi->pgshift = 16; break;
> > > + }
> > > +
> > > + wi->sl = FIELD_GET(VTCR_EL2_SL0_MASK, vtcr);
> > > + /* Global limit for now, should eventually be per-VM */
> > > + wi->max_oa_bits = min(get_kvm_ipa_limit(),
> > ^^^
> >
> > Should we use AArch64.PAMax() instead? As the output address size is
> > never larger than the implemented PA size.
> >
> > Now I'm wondering if we can let kvm_get_pa_bits() just return PAMax
> > (based on the exposed (to-L1) ID_AA64MMFR0_EL1.PARange value) and use it
> > when possible.
>
> Yes, that was the plan all along, but I got sidetracked.
OK. The concern is that PARange is writable from userspace, so
init_nested_s2_mmu() cannot get a fixed PAMax value (kvm_get_pa_bits())
until AA64MMFR0_EL1 has become immutable (i.e., the VM has started).
I need to think it over.
Thanks,
Zenghui