From: Marc Zyngier <maz@kernel.org>
To: Alexandru Elisei <alexandru.elisei@arm.com>
Cc: kvmarm@lists.linux.dev, linux-arm-kernel@lists.infradead.org,
	kvm@vger.kernel.org, James Morse <james.morse@arm.com>,
	Suzuki K Poulose <suzuki.poulose@arm.com>,
	Oliver Upton <oliver.upton@linux.dev>,
	Zenghui Yu <yuzenghui@huawei.com>,
	Joey Gouly <joey.gouly@arm.com>,
	Anshuman Khandual <anshuman.khandual@arm.com>,
	Przemyslaw Gaj <pgaj@cadence.com>
Subject: Re: [PATCH v2 13/17] KVM: arm64: nv: Add SW walker for AT S1 emulation
Date: Sat, 10 Aug 2024 11:16:15 +0100
Message-ID: <867cco1y4w.wl-maz@kernel.org>
In-Reply-To: <ZrYO9SK52rHhGvEd@arm.com>

Hi Alex,

On Fri, 09 Aug 2024 13:43:33 +0100,
Alexandru Elisei <alexandru.elisei@arm.com> wrote:
> 
> Hi Marc,
> 
> Finally managed to go through the patch. Bunch of nitpicks from me (can be
> safely ignored), and some corner cases where KVM deviates from the spec.

Thanks for taking the time to go through this mess.

[...]

> > +		break;
> > +	default:
> > +		return (vcpu_el2_e2h_is_set(vcpu) &&
> > +			vcpu_el2_tge_is_set(vcpu)) ? TR_EL20 : TR_EL10;
> 
> This also looks correct to me. Following the pseudocode was not trivial, so I'm
> leaving this here in case I made a mistake somewhere.
> 
> For the S1E0* variants: AArch64.AT(el_in=EL0) => TranslationRegime(el=EL0) =>
> ELIsInHost(el=EL0); ELIsInHost(el=EL0) is true if {E2H,TGE} = {1,1}, and in this
> case TranslationRegime(el=EL0) returns Regime_EL20, otherwise Regime_EL10.
> 
> For the S1E1* variants: AArch64.AT(el_in=EL1), where:
> 
> - if ELIsInHost(el=EL0) is true, then 'el' is changed to EL2, where
>   ELIsInHost(el=EL0) is true if {E2H,TGE} = {1,1}. In this case,
>   TranslationRegime(el=EL2) will return Regime_EL20.
> 
> - if ELIsInHost(el=EL0) is false, then 'el' remains EL1, and
>   TranslationRegime(el=EL1) returns Regime_EL10.

Yup. Somehow, the C version is far easier to understand!
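
For completeness, the whole helper reads roughly like the sketch below.
Only the default arm is quoted above, so treat the S1E2 handling here as
an assumption rather than the exact hunk:

static enum trans_regime compute_translation_regime(struct kvm_vcpu *vcpu,
						    u32 op)
{
	switch (op) {
	case OP_AT_S1E2R:
	case OP_AT_S1E2W:
		/* EL2 accessors: EL2&0 when E2H is set, plain EL2 otherwise */
		return vcpu_el2_e2h_is_set(vcpu) ? TR_EL20 : TR_EL2;
	default:
		/* EL1/EL0 accessors: EL2&0 only when {E2H,TGE} == {1,1} */
		return (vcpu_el2_e2h_is_set(vcpu) &&
			vcpu_el2_tge_is_set(vcpu)) ? TR_EL20 : TR_EL10;
	}
}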

> 
> > +	}
> > +}
> > +
> > +static int setup_s1_walk(struct kvm_vcpu *vcpu, u32 op, struct s1_walk_info *wi,
> > +			 struct s1_walk_result *wr, u64 va)
> > +{
> > +	u64 sctlr, tcr, tg, ps, ia_bits, ttbr;
> > +	unsigned int stride, x;
> > +	bool va55, tbi, lva, as_el0;
> > +
> > +	wi->regime = compute_translation_regime(vcpu, op);
> > +	as_el0 = (op == OP_AT_S1E0R || op == OP_AT_S1E0W);
> > +
> > +	va55 = va & BIT(55);
> > +
> > +	if (wi->regime == TR_EL2 && va55)
> > +		goto addrsz;
> > +
> > +	wi->s2 = wi->regime == TR_EL10 && (__vcpu_sys_reg(vcpu, HCR_EL2) & HCR_VM);
> 
> According to AArch64.NSS2TTWParams(), stage 2 is enabled if HCR_EL2.VM or
> HCR_EL2.DC.
> 
> From the description of HCR_EL2.DC, when the bit is set:
> 
> 'The PE behaves as if the value of the HCR_EL2.VM field is 1 for all purposes
> other than returning the value of a direct read of HCR_EL2.'

Yup, good point. And as you noticed further down, the HCR_EL2.DC
handling is currently a mess.
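
The setup line would then grow a DC check, something along these lines
(untested, and subject to the DC rework further down):

	wi->s2 = wi->regime == TR_EL10 &&
		 (__vcpu_sys_reg(vcpu, HCR_EL2) & (HCR_VM | HCR_DC));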

[...]

> > +	/* Let's put the MMU disabled case aside immediately */
> > +	if (!(sctlr & SCTLR_ELx_M) ||
> > +	    (__vcpu_sys_reg(vcpu, HCR_EL2) & HCR_DC)) {
> 
> I think the condition doesn't match AArch64.S1Enabled():
> 
> - if regime is Regime_EL20 or Regime_EL2, then S1 is disabled if and only if
>   SCTLR_EL2.M is not set; it doesn't matter if HCR_EL2.DC is set, because "[..]
>   this field has no effect on the EL2, EL2&0, and EL3 translation regimes."
>   (HCR_EL2.DC bit field description).
> 
> - if regime is Regime_EL10, then S1 is disabled if SCTLR_EL1.M == 0 or
>   HCR_EL2.TGE == 1 or the effective value of HCR_EL2.DC* is 1.
> 
> From the description of HCR_EL2.TGE, when the bit is set:
> 
> 'If EL1 is using AArch64, the SCTLR_EL1.M field is treated as being 0 for all
> purposes other than returning the result of a direct read of SCTLR_EL1.'
> 
> From the description of HCR_EL2.DC, when the bit is set:
> 
> 'When EL1 is using AArch64, the PE behaves as if the value of the SCTLR_EL1.M
> field is 0 for all purposes other than returning the value of a direct read of
> SCTLR_EL1.'
> 
> *The description of HCR_EL2.DC states:
> 
> 'When the Effective value of HCR_EL2.{E2H,TGE} is {1,1}, this field behaves as
> 0 for all purposes other than a direct read of the value of this field.'
> 
> But if {E2H,TGE} = {1,1} then the regime is Regime_EL20, which ignores the DC
> bit.  If we're looking at the DC bit, then that means that the regime is EL10,
> ({E2H,TGE} != {1,1} in compute_translation_regime()) and the effective value of
> HCR_EL2.DC is the same as the DC bit.

Yup. That's what I've stashed on top:

@@ -136,12 +137,22 @@ static int setup_s1_walk(struct kvm_vcpu *vcpu, u32 op, struct s1_walk_info *wi,
 	va = (u64)sign_extend64(va, 55);
 
 	/* Let's put the MMU disabled case aside immediately */
-	if (!(sctlr & SCTLR_ELx_M) ||
-	    (__vcpu_sys_reg(vcpu, HCR_EL2) & HCR_DC)) {
+	switch (wi->regime) {
+	case TR_EL10:
+		if (__vcpu_sys_reg(vcpu, HCR_EL2) & HCR_DC)
+			wr->level = S1_MMU_DISABLED;
+		fallthrough;
+	case TR_EL2:
+	case TR_EL20:
+		if (!(sctlr & SCTLR_ELx_M))
+			wr->level = S1_MMU_DISABLED;
+		break;
+	}
+
+	if (wr->level == S1_MMU_DISABLED) {
 		if (va >= BIT(kvm_get_pa_bits(vcpu->kvm)))
 			goto addrsz;
 
-		wr->level = S1_MMU_DISABLED;
 		wr->pa = va;
 		return 0;
 	}

[...]

> > +	/* R_PLCGL, R_YXNYW */
> > +	if ((!kvm_has_feat_enum(vcpu->kvm, ID_AA64MMFR2_EL1, ST, 48_47) &&
> > +	     wi->txsz > 39) ||
> > +	    (wi->pgshift == 16 && wi->txsz > 47) || wi->txsz > 48)
> > +		goto transfault_l0;
> 
> It's probably just me, but I find this hard to parse. There are three cases where
> the condition (!FEAT_TTST && TxSZ > 39) evaluates to false, but the other two
> conditions only make sense to check when FEAT_TTST is implemented and
> wi->txsz > 39.
> 
> I find this easier to read:
> 
> 	if (!kvm_has_feat_enum(vcpu->kvm, ID_AA64MMFR2_EL1, ST, 48_47)) {
> 		if (wi->txsz > 39)
> 			goto transfault_l0;
> 	} else {
> 		if (wi->txsz > 48 || (wi->pgshift == 16 && wi->txsz > 47))
> 			goto transfault_l0;
> 	}
> 
> What do you think?

Sure, happy to rewrite it like this if it helps.

[...]

> > +	/* R_YYVYV, I_THCZK */
> > +	if ((!va55 && va > GENMASK(ia_bits - 1, 0)) ||
> > +	    (va55 && va < GENMASK(63, ia_bits)))
> > +		goto transfault_l0;
> > +
> > +	/* R_ZFSYQ */
> 
> This is rather pedantic, but I think that should be I_ZFSYQ.

Well, these references are supposed to help. If they are incorrect,
they serve no purpose. Thanks for spotting this.

[...]

> > +		if (!(desc & 1) || ((desc & 3) == 1 && level == 3))
> > +			goto transfault;
> 
> The check for block mapping at level 3 is replicated below, when the level of
> the block is checked for correctness.
> 
> Also, would you consider rewriting the valid descriptor check to
> (desc & BIT(0)), to match the block level check?
> 
> > +
> > +		/* We found a leaf, handle that */
> > +		if ((desc & 3) == 1 || level == 3)
> > +			break;
> 
> Here we know that (desc & 1), do you think rewriting the check to !(desc &
> BIT(1)), again matching the block level check, would be a good idea?

Other than the BIT() stuff, I'm not completely clear what you are
asking for. Are you objecting to the fact that the code is slightly
redundant? If so, I may be able to clean things up for you:

@@ -309,13 +323,19 @@ static int walk_s1(struct kvm_vcpu *vcpu, struct s1_walk_info *wi,
 		else
 			desc = le64_to_cpu((__force __le64)desc);
 
-		if (!(desc & 1) || ((desc & 3) == 1 && level == 3))
+		/* Invalid descriptor */
+		if (!(desc & BIT(0)))
 			goto transfault;
 
-		/* We found a leaf, handle that */
-		if ((desc & 3) == 1 || level == 3)
+		/* Block mapping, check validity down the line */
+		if (!(desc & BIT(1)))
+			break;
+
+		/* Page mapping */
+		if (level == 3)
 			break;
 
+		/* Table handling */
 		if (!wi->hpd) {
 			wr->APTable  |= FIELD_GET(S1_TABLE_AP, desc);
 			wr->UXNTable |= FIELD_GET(PMD_TABLE_UXN, desc);

Do you like this one better? Each bit only gets tested once.

[...]

> > +
> > +	if (check_output_size(desc & GENMASK(47, va_bottom), wi))
> > +		goto addrsz;
> > +
> > +	wr->failed = false;
> > +	wr->level = level;
> > +	wr->desc = desc;
> > +	wr->pa = desc & GENMASK(47, va_bottom);
> > +	if (va_bottom > 12)
> > +		wr->pa |= va & GENMASK_ULL(va_bottom - 1, 12);
> 
> I'm looking at StageOA(), and I don't think this matches what happens when the
> contiguous bit is set and the contiguous OA isn't aligned as per Table D8-98.
> Yes, I know, that's something super niche and unlikely to happen in practice.
> 
> Let's assume 4K pages, level = 3 (so va_bottom = 12), the first page in the
> contiguous OA range is 0x23_000 (so not aligned to 64K), and the input address
> that maps to the first page from the contiguous OA range is 0xf0_000 (aligned to
> 64K).
> 
> According to the present code:
> 
> wr->pa = 0x23_000
> 
> According to StageOA():
> 
> tsize  = 12
> csize  = 4
> oa     = 0x23_000 & GENMASK_ULL(47, 16) | 0xf0_000 & GENMASK_ULL(15, 0) = 0x20_000
> wr->pa = oa & ~GENMASK_ULL(11, 0) = 0x20_000
> 
> If stage 1 is correctly programmed and the OA is aligned to the required
> alignment, the two outputs match.

Huh, that's another nice catch. I had the (dumb) idea that if we
didn't use the Contiguous bit as a TLB hint, we didn't need to do
anything special when it came to the alignment of the OA.

But clearly, the spec says that this alignment must be honoured, and
there is no way out of it. Which means that the S2 walker also has a
bit of a problem on that front.

> On a different topic, but still regarding wr->pa. I guess the code aligns wr->pa
> to 4K because that's how the OA in PAR_EL1 is reported.
> 
> Would it make more sense to have wr->pa represent the real output address, i.e,
> also contain the 12 least significant bits of the input address?  It wouldn't
> change how PAR_EL1 is computed (bits 11:0 are already masked out), but it would
> make wr->pa consistent with what the output address of a given input address
> should be (according to StageOA()).

Yup. That'd be consistent with the way wr->pa is reported when S1 is disabled.

I ended up with this (untested):

@@ -354,12 +374,24 @@ static int walk_s1(struct kvm_vcpu *vcpu, struct s1_walk_info *wi,
 	if (check_output_size(desc & GENMASK(47, va_bottom), wi))
 		goto addrsz;
 
+	if (desc & PTE_CONT) {
+		switch (BIT(wi->pgshift)) {
+		case SZ_4K:
+			va_bottom += 4;
+			break;
+		case SZ_16K:
+			va_bottom += level == 2 ? 5 : 7;
+			break;
+		case SZ_64K:
+			va_bottom += 5;
+		}
+	}
+
 	wr->failed = false;
 	wr->level = level;
 	wr->desc = desc;
 	wr->pa = desc & GENMASK(47, va_bottom);
-	if (va_bottom > 12)
-		wr->pa |= va & GENMASK_ULL(va_bottom - 1, 12);
+	wr->pa |= va & GENMASK_ULL(va_bottom - 1, 0);
 
 	return 0;
 

Note that I'm still checking for the address size before the
contiguous bit alignment, as per R_JHQPP.

[...]

> > +		if (!(__vcpu_sys_reg(vcpu, HCR_EL2) & HCR_DC)) {
> > +			par |= FIELD_PREP(SYS_PAR_EL1_ATTR, 0); /* nGnRnE */
> > +			par |= FIELD_PREP(SYS_PAR_EL1_SH, 0b10); /* OS */
> 
> s/0b10/ATTR_OSH ?
> 
> > +		} else {
> > +			par |= FIELD_PREP(SYS_PAR_EL1_ATTR,
> > +					  MEMATTR(WbRaWa, WbRaWa));
> > +			par |= FIELD_PREP(SYS_PAR_EL1_SH, 0b00); /* NS */
> 
> s/0b00/ATTR_NSH ?
> 
> > +		}
> 
> HCR_EL2.DC applies only to Regime_EL10, I think the bit should be ignored for
> the EL2 and EL20 regimes.

Yup, now fixed.
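
Concretely, the DC test in the S1-off path is now gated on the regime,
roughly (same structure as the hunk above, with the regime check added):

	if (regime != TR_EL10 ||
	    !(__vcpu_sys_reg(vcpu, HCR_EL2) & HCR_DC)) {
		par |= FIELD_PREP(SYS_PAR_EL1_ATTR, 0); /* nGnRnE */
		par |= FIELD_PREP(SYS_PAR_EL1_SH, 0b10); /* OS */
	} else {
		par |= FIELD_PREP(SYS_PAR_EL1_ATTR,
				  MEMATTR(WbRaWa, WbRaWa));
		par |= FIELD_PREP(SYS_PAR_EL1_SH, 0b00); /* NS */
	}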

> 
> > +	} else {
> > +		u64 mair, sctlr;
> > +		u8 sh;
> > +
> > +		mair = (regime == TR_EL10 ?
> > +			vcpu_read_sys_reg(vcpu, MAIR_EL1) :
> > +			vcpu_read_sys_reg(vcpu, MAIR_EL2));
> > +
> > +		mair >>= FIELD_GET(PTE_ATTRINDX_MASK, wr->desc) * 8;
> > +		mair &= 0xff;
> > +
> > +		sctlr = (regime == TR_EL10 ?
> > +			 vcpu_read_sys_reg(vcpu, SCTLR_EL1) :
> > +			 vcpu_read_sys_reg(vcpu, SCTLR_EL2));
> > +
> > +		/* Force NC for memory if SCTLR_ELx.C is clear */
> > +		if (!(sctlr & SCTLR_EL1_C) && !MEMATTR_IS_DEVICE(mair))
> > +			mair = MEMATTR(NC, NC);
> > +
> > +		par  = FIELD_PREP(SYS_PAR_EL1_ATTR, mair);
> > +		par |= wr->pa & GENMASK_ULL(47, 12);
> > +
> > +		sh = compute_sh(mair, wr->desc);
> > +		par |= FIELD_PREP(SYS_PAR_EL1_SH, sh);
> > +	}
> 
> When PAR_EL1.F = 0 and !FEAT_RME, bit 11 (NSE) is RES1, according to the
> description of the register and EncodePAR().

Fixed.
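
For the non-faulting path, that boils down to unconditionally setting
bit 11, something like:

	par |= BIT(11);		/* PAR_EL1.NSE, RES1 when F == 0 && !FEAT_RME */

(KVM doesn't expose FEAT_RME to guests, so treating NSE as RES1 is fine.)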

[...]

> >  void __kvm_at_s1e01(struct kvm_vcpu *vcpu, u32 op, u64 vaddr)
> >  {
> >  	u64 par = __kvm_at_s1e01_fast(vcpu, op, vaddr);
> >  
> > +	/*
> > +	 * If we see a permission fault at S1, then the fast path
> > +	 * succeedded. By failing. Otherwise, take a walk on the wild
> > +	 * side.
> 
> This is rather ambiguous. Maybe something along the lines would be more helpful:
> 
> 'If PAR_EL1 encodes a permission fault, we know for sure that the hardware
> translation table walker was able to walk the stage 1 tables and there's nothing
> else that KVM needs to do. On the other hand, if the AT instruction failed for
> any other reason, then KVM must walk the guest stage 1 tables (and possibly the
> virtual stage 2 tables) to emulate the instruction.'

Sure. I've adopted a slightly less verbose version of this:

@@ -930,9 +966,12 @@ void __kvm_at_s1e01(struct kvm_vcpu *vcpu, u32 op, u64 vaddr)
 	u64 par = __kvm_at_s1e01_fast(vcpu, op, vaddr);
 
 	/*
-	 * If we see a permission fault at S1, then the fast path
-	 * succeedded. By failing. Otherwise, take a walk on the wild
-	 * side.
+	 * If PAR_EL1 reports that AT failed on a S1 permission fault, we
+	 * know for sure that the PTW was able to walk the S1 tables and
+	 * there's nothing else to do.
+	 *
+	 * If AT failed for any other reason, then we must walk the guest S1
+	 * to emulate the instruction.
 	 */
 	if ((par & SYS_PAR_EL1_F) && !par_check_s1_perm_fault(par))
 		par = handle_at_slow(vcpu, op, vaddr);


I'll retest things over the weekend and post a new version early next
week.

Thanks again for your review, this is awesome work!

	M.

-- 
Without deviation from the norm, progress is not possible.


