linux-arm-kernel.lists.infradead.org archive mirror
* [PATCH 00/12] KVM: arm64: nv: Add support for address translation instructions
@ 2024-06-25 13:34 Marc Zyngier
  2024-06-25 13:35 ` [PATCH 01/12] arm64: Add missing APTable and TCR_ELx.HPD masks Marc Zyngier
                   ` (11 more replies)
  0 siblings, 12 replies; 50+ messages in thread
From: Marc Zyngier @ 2024-06-25 13:34 UTC (permalink / raw)
  To: kvmarm, linux-arm-kernel, kvm
  Cc: James Morse, Suzuki K Poulose, Oliver Upton, Zenghui Yu,
	Joey Gouly

Another task that a hypervisor supporting NV on arm64 has to deal with
is to emulate the AT instruction, because we multiplex all the S1
translations on a single set of registers, and the guest S2 is never
truly resident on the CPU.

So given that we lie about page tables, we also have to lie about
translation instructions, hence the emulation. Things are made
complicated by the fact that guest S1 page tables can be swapped out,
and that our shadow S2 is likely to be incomplete. So while using AT
to emulate AT is tempting (and useful), it is not always going to
work, and we thus need a fallback in the shape of a SW S1 walker.
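
Roughly, the resulting flow looks like this (a simplified sketch, not the
actual code -- emulate_at_with_hw() is a made-up placeholder for the
AT-based fast path):

	/* Fast path: use the real AT instruction on the shadow context */
	par = emulate_at_with_hw(vcpu, op, vaddr);

	/* HW walk failed (missing shadow S2, swapped-out S1, ...)? */
	if (par & SYS_PAR_EL1_F)
		par = handle_at_slow(vcpu, op, vaddr);	/* SW S1 walker */

	vcpu_write_sys_reg(vcpu, par, PAR_EL1);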

This series is built in 4 basic blocks:

- Add missing definition and basic reworking

- Dumb emulation of all relevant AT instructions using AT instructions

- Add a SW S1 walker that is using our S2 walker

- Add FEAT_ATS1A support, which is almost trivial

This has been tested by comparing the output of a HW walker with the
output of the SW one. Obviously, this isn't bullet proof, and I'm
pretty sure there are some nasties in there.

In a departure from my usual habit, this series is on top of
kvmarm/next, as it depends on the NV S2 shadow code.

Joey Gouly (1):
  KVM: arm64: make kvm_at() take an OP_AT_*

Marc Zyngier (11):
  arm64: Add missing APTable and TCR_ELx.HPD masks
  arm64: Add PAR_EL1 field description
  KVM: arm64: nv: Turn upper_attr for S2 walk into the full descriptor
  KVM: arm64: nv: Honor absence of FEAT_PAN2
  KVM: arm64: nv: Add basic emulation of AT S1E{0,1}{R,W}[P]
  KVM: arm64: nv: Add basic emulation of AT S1E2{R,W}
  KVM: arm64: nv: Add emulation of AT S12E{0,1}{R,W}
  KVM: arm64: nv: Make ps_to_output_size() generally available
  KVM: arm64: nv: Add SW walker for AT S1 emulation
  KVM: arm64: nv: Plumb handling of AT S1* traps from EL2
  KVM: arm64: nv: Add support for FEAT_ATS1A

 arch/arm64/include/asm/kvm_arm.h       |    1 +
 arch/arm64/include/asm/kvm_asm.h       |    6 +-
 arch/arm64/include/asm/kvm_nested.h    |   18 +-
 arch/arm64/include/asm/pgtable-hwdef.h |    7 +
 arch/arm64/include/asm/sysreg.h        |   19 +
 arch/arm64/kvm/Makefile                |    2 +-
 arch/arm64/kvm/at.c                    | 1007 ++++++++++++++++++++++++
 arch/arm64/kvm/emulate-nested.c        |    2 +
 arch/arm64/kvm/hyp/include/hyp/fault.h |    2 +-
 arch/arm64/kvm/nested.c                |   26 +-
 arch/arm64/kvm/sys_regs.c              |   60 ++
 11 files changed, 1125 insertions(+), 25 deletions(-)
 create mode 100644 arch/arm64/kvm/at.c

-- 
2.39.2



^ permalink raw reply	[flat|nested] 50+ messages in thread

* [PATCH 01/12] arm64: Add missing APTable and TCR_ELx.HPD masks
  2024-06-25 13:34 [PATCH 00/12] KVM: arm64: nv: Add support for address translation instructions Marc Zyngier
@ 2024-06-25 13:35 ` Marc Zyngier
  2024-07-12  8:32   ` Anshuman Khandual
  2024-06-25 13:35 ` [PATCH 02/12] arm64: Add PAR_EL1 field description Marc Zyngier
                   ` (10 subsequent siblings)
  11 siblings, 1 reply; 50+ messages in thread
From: Marc Zyngier @ 2024-06-25 13:35 UTC (permalink / raw)
  To: kvmarm, linux-arm-kernel, kvm
  Cc: James Morse, Suzuki K Poulose, Oliver Upton, Zenghui Yu,
	Joey Gouly

Although Linux doesn't make use of hierarchical permissions (TFFT!),
KVM needs to know where the various bits related to this feature
live in the TCR_ELx registers as well as in the page tables.

Add the missing bits.
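
For context, the SW S1 walker added later in the series will consume these
along the following lines (a simplified sketch; va55, tcr, desc and aptable
are walker-local state, not names added by this patch):

	/* Hierarchical permissions only apply when the relevant HPD bit is clear */
	hpd = kvm_has_feat(kvm, ID_AA64MMFR1_EL1, HPDS, IMP) &&
	      (va55 ? FIELD_GET(TCR_HPD1, tcr) : FIELD_GET(TCR_HPD0, tcr));

	/* ...otherwise table descriptors accumulate APTable restrictions */
	if (!hpd)
		aptable |= FIELD_GET(PMD_TABLE_AP, desc);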

Signed-off-by: Marc Zyngier <maz@kernel.org>
---
 arch/arm64/include/asm/kvm_arm.h       | 1 +
 arch/arm64/include/asm/pgtable-hwdef.h | 7 +++++++
 2 files changed, 8 insertions(+)

diff --git a/arch/arm64/include/asm/kvm_arm.h b/arch/arm64/include/asm/kvm_arm.h
index b2adc2c6c82a5..c93ee1036cb09 100644
--- a/arch/arm64/include/asm/kvm_arm.h
+++ b/arch/arm64/include/asm/kvm_arm.h
@@ -108,6 +108,7 @@
 /* TCR_EL2 Registers bits */
 #define TCR_EL2_DS		(1UL << 32)
 #define TCR_EL2_RES1		((1U << 31) | (1 << 23))
+#define TCR_EL2_HPD		(1 << 24)
 #define TCR_EL2_TBI		(1 << 20)
 #define TCR_EL2_PS_SHIFT	16
 #define TCR_EL2_PS_MASK		(7 << TCR_EL2_PS_SHIFT)
diff --git a/arch/arm64/include/asm/pgtable-hwdef.h b/arch/arm64/include/asm/pgtable-hwdef.h
index 9943ff0af4c96..f75c9a7e6bd68 100644
--- a/arch/arm64/include/asm/pgtable-hwdef.h
+++ b/arch/arm64/include/asm/pgtable-hwdef.h
@@ -146,6 +146,7 @@
 #define PMD_SECT_UXN		(_AT(pmdval_t, 1) << 54)
 #define PMD_TABLE_PXN		(_AT(pmdval_t, 1) << 59)
 #define PMD_TABLE_UXN		(_AT(pmdval_t, 1) << 60)
+#define PMD_TABLE_AP		(_AT(pmdval_t, 3) << 61)
 
 /*
  * AttrIndx[2:0] encoding (mapping attributes defined in the MAIR* registers).
@@ -307,6 +308,12 @@
 #define TCR_TCMA1		(UL(1) << 58)
 #define TCR_DS			(UL(1) << 59)
 
+#define TCR_HPD0_SHIFT		41
+#define TCR_HPD0		BIT(TCR_HPD0_SHIFT)
+
+#define TCR_HPD1_SHIFT		42
+#define TCR_HPD1		BIT(TCR_HPD1_SHIFT)
+
 /*
  * TTBR.
  */
-- 
2.39.2



^ permalink raw reply related	[flat|nested] 50+ messages in thread

* [PATCH 02/12] arm64: Add PAR_EL1 field description
  2024-06-25 13:34 [PATCH 00/12] KVM: arm64: nv: Add support for address translation instructions Marc Zyngier
  2024-06-25 13:35 ` [PATCH 01/12] arm64: Add missing APTable and TCR_ELx.HPD masks Marc Zyngier
@ 2024-06-25 13:35 ` Marc Zyngier
  2024-07-12  7:06   ` Anshuman Khandual
  2024-06-25 13:35 ` [PATCH 03/12] KVM: arm64: nv: Turn upper_attr for S2 walk into the full descriptor Marc Zyngier
                   ` (9 subsequent siblings)
  11 siblings, 1 reply; 50+ messages in thread
From: Marc Zyngier @ 2024-06-25 13:35 UTC (permalink / raw)
  To: kvmarm, linux-arm-kernel, kvm
  Cc: James Morse, Suzuki K Poulose, Oliver Upton, Zenghui Yu,
	Joey Gouly

As KVM is about to grow a full emulation for the AT instructions,
add the layout of the PAR_EL1 register in its non-D128 configuration.

Note that the constants are a bit ugly, as the register has two
layouts, based on the state of the F bit.
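
For instance, a consumer is expected to pick the layout based on F before
decoding anything else; something like (just an illustrative sketch):

	u64 par = read_sysreg_par();

	if (par & SYS_PAR_EL1_F) {
		/* Fault layout: only the fault status fields are meaningful */
		u8 fst = FIELD_GET(SYS_PAR_EL1_FST, par);
		bool s2 = par & SYS_PAR_EL1_S;
	} else {
		/* Success layout: output address and memory attributes are valid */
		u64 pa   = par & SYS_PAR_EL1_PA;
		u8  attr = FIELD_GET(SYS_PAR_EL1_ATTR, par);
	}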

Signed-off-by: Marc Zyngier <maz@kernel.org>
---
 arch/arm64/include/asm/sysreg.h | 18 ++++++++++++++++++
 1 file changed, 18 insertions(+)

diff --git a/arch/arm64/include/asm/sysreg.h b/arch/arm64/include/asm/sysreg.h
index be41528194569..15c073359c9e9 100644
--- a/arch/arm64/include/asm/sysreg.h
+++ b/arch/arm64/include/asm/sysreg.h
@@ -325,7 +325,25 @@
 #define SYS_PAR_EL1			sys_reg(3, 0, 7, 4, 0)
 
 #define SYS_PAR_EL1_F			BIT(0)
+/* When PAR_EL1.F == 1 */
 #define SYS_PAR_EL1_FST			GENMASK(6, 1)
+#define SYS_PAR_EL1_PTW			BIT(8)
+#define SYS_PAR_EL1_S			BIT(9)
+#define SYS_PAR_EL1_AssuredOnly		BIT(12)
+#define SYS_PAR_EL1_TopLevel		BIT(13)
+#define SYS_PAR_EL1_Overlay		BIT(14)
+#define SYS_PAR_EL1_DirtyBit		BIT(15)
+#define SYS_PAR_EL1_F1_IMPDEF		GENMASK_ULL(63, 48)
+#define SYS_PAR_EL1_F1_RES0		(BIT(7) | BIT(10) | GENMASK_ULL(47, 16))
+#define SYS_PAR_EL1_RES1		BIT(11)
+/* When PAR_EL1.F == 0 */
+#define SYS_PAR_EL1_SH			GENMASK_ULL(8, 7)
+#define SYS_PAR_EL1_NS			BIT(9)
+#define SYS_PAR_EL1_F0_IMPDEF		BIT(10)
+#define SYS_PAR_EL1_NSE			BIT(11)
+#define SYS_PAR_EL1_PA			GENMASK_ULL(51, 12)
+#define SYS_PAR_EL1_ATTR		GENMASK_ULL(63, 56)
+#define SYS_PAR_EL1_F0_RES0		(GENMASK_ULL(6, 1) | GENMASK_ULL(55, 52))
 
 /*** Statistical Profiling Extension ***/
 #define PMSEVFR_EL1_RES0_IMP	\
-- 
2.39.2



^ permalink raw reply related	[flat|nested] 50+ messages in thread

* [PATCH 03/12] KVM: arm64: nv: Turn upper_attr for S2 walk into the full descriptor
  2024-06-25 13:34 [PATCH 00/12] KVM: arm64: nv: Add support for address translation instructions Marc Zyngier
  2024-06-25 13:35 ` [PATCH 01/12] arm64: Add missing APTable and TCR_ELx.HPD masks Marc Zyngier
  2024-06-25 13:35 ` [PATCH 02/12] arm64: Add PAR_EL1 field description Marc Zyngier
@ 2024-06-25 13:35 ` Marc Zyngier
  2024-06-25 13:35 ` [PATCH 04/12] KVM: arm64: nv: Honor absence of FEAT_PAN2 Marc Zyngier
                   ` (8 subsequent siblings)
  11 siblings, 0 replies; 50+ messages in thread
From: Marc Zyngier @ 2024-06-25 13:35 UTC (permalink / raw)
  To: kvmarm, linux-arm-kernel, kvm
  Cc: James Morse, Suzuki K Poulose, Oliver Upton, Zenghui Yu,
	Joey Gouly

The upper_attr attribute has been badly named, as most of the time
it carries the full "last walked descriptor" rather than just the
upper attributes.

Rename it to "desc" and make it contain the full 64-bit descriptor.
This will be used by the S1 PTW.

Signed-off-by: Marc Zyngier <maz@kernel.org>
---
 arch/arm64/include/asm/kvm_nested.h |  4 ++--
 arch/arm64/kvm/nested.c             | 12 ++++++------
 2 files changed, 8 insertions(+), 8 deletions(-)

diff --git a/arch/arm64/include/asm/kvm_nested.h b/arch/arm64/include/asm/kvm_nested.h
index 5b06c31035a24..b2fe759964d83 100644
--- a/arch/arm64/include/asm/kvm_nested.h
+++ b/arch/arm64/include/asm/kvm_nested.h
@@ -85,7 +85,7 @@ struct kvm_s2_trans {
 	bool readable;
 	int level;
 	u32 esr;
-	u64 upper_attr;
+	u64 desc;
 };
 
 static inline phys_addr_t kvm_s2_trans_output(struct kvm_s2_trans *trans)
@@ -115,7 +115,7 @@ static inline bool kvm_s2_trans_writable(struct kvm_s2_trans *trans)
 
 static inline bool kvm_s2_trans_executable(struct kvm_s2_trans *trans)
 {
-	return !(trans->upper_attr & BIT(54));
+	return !(trans->desc & BIT(54));
 }
 
 extern int kvm_walk_nested_s2(struct kvm_vcpu *vcpu, phys_addr_t gipa,
diff --git a/arch/arm64/kvm/nested.c b/arch/arm64/kvm/nested.c
index 96029a95d1062..73544e0e64dcb 100644
--- a/arch/arm64/kvm/nested.c
+++ b/arch/arm64/kvm/nested.c
@@ -256,7 +256,7 @@ static int walk_nested_s2_pgd(phys_addr_t ipa,
 		/* Check for valid descriptor at this point */
 		if (!(desc & 1) || ((desc & 3) == 1 && level == 3)) {
 			out->esr = compute_fsc(level, ESR_ELx_FSC_FAULT);
-			out->upper_attr = desc;
+			out->desc = desc;
 			return 1;
 		}
 
@@ -266,7 +266,7 @@ static int walk_nested_s2_pgd(phys_addr_t ipa,
 
 		if (check_output_size(wi, desc)) {
 			out->esr = compute_fsc(level, ESR_ELx_FSC_ADDRSZ);
-			out->upper_attr = desc;
+			out->desc = desc;
 			return 1;
 		}
 
@@ -278,7 +278,7 @@ static int walk_nested_s2_pgd(phys_addr_t ipa,
 
 	if (level < first_block_level) {
 		out->esr = compute_fsc(level, ESR_ELx_FSC_FAULT);
-		out->upper_attr = desc;
+		out->desc = desc;
 		return 1;
 	}
 
@@ -289,13 +289,13 @@ static int walk_nested_s2_pgd(phys_addr_t ipa,
 
 	if (check_output_size(wi, desc)) {
 		out->esr = compute_fsc(level, ESR_ELx_FSC_ADDRSZ);
-		out->upper_attr = desc;
+		out->desc = desc;
 		return 1;
 	}
 
 	if (!(desc & BIT(10))) {
 		out->esr = compute_fsc(level, ESR_ELx_FSC_ACCESS);
-		out->upper_attr = desc;
+		out->desc = desc;
 		return 1;
 	}
 
@@ -307,7 +307,7 @@ static int walk_nested_s2_pgd(phys_addr_t ipa,
 	out->readable = desc & (0b01 << 6);
 	out->writable = desc & (0b10 << 6);
 	out->level = level;
-	out->upper_attr = desc & GENMASK_ULL(63, 52);
+	out->desc = desc;
 	return 0;
 }
 
-- 
2.39.2



^ permalink raw reply related	[flat|nested] 50+ messages in thread

* [PATCH 04/12] KVM: arm64: nv: Honor absence of FEAT_PAN2
  2024-06-25 13:34 [PATCH 00/12] KVM: arm64: nv: Add support for address translation instructions Marc Zyngier
                   ` (2 preceding siblings ...)
  2024-06-25 13:35 ` [PATCH 03/12] KVM: arm64: nv: Turn upper_attr for S2 walk into the full descriptor Marc Zyngier
@ 2024-06-25 13:35 ` Marc Zyngier
  2024-07-12  8:40   ` Anshuman Khandual
  2024-06-25 13:35 ` [PATCH 05/12] KVM: arm64: make kvm_at() take an OP_AT_* Marc Zyngier
                   ` (7 subsequent siblings)
  11 siblings, 1 reply; 50+ messages in thread
From: Marc Zyngier @ 2024-06-25 13:35 UTC (permalink / raw)
  To: kvmarm, linux-arm-kernel, kvm
  Cc: James Morse, Suzuki K Poulose, Oliver Upton, Zenghui Yu,
	Joey Gouly

If our guest has been configured without PAN2, make sure that
AT S1E1{R,W}P will generate an UNDEF.

Signed-off-by: Marc Zyngier <maz@kernel.org>
---
 arch/arm64/kvm/sys_regs.c | 4 ++++
 1 file changed, 4 insertions(+)

diff --git a/arch/arm64/kvm/sys_regs.c b/arch/arm64/kvm/sys_regs.c
index 832c6733db307..06c39f191b5ec 100644
--- a/arch/arm64/kvm/sys_regs.c
+++ b/arch/arm64/kvm/sys_regs.c
@@ -4585,6 +4585,10 @@ void kvm_calculate_traps(struct kvm_vcpu *vcpu)
 						HFGITR_EL2_TLBIRVAAE1OS	|
 						HFGITR_EL2_TLBIRVAE1OS);
 
+	if (!kvm_has_feat(kvm, ID_AA64MMFR1_EL1, PAN, PAN2))
+		kvm->arch.fgu[HFGITR_GROUP] |= (HFGITR_EL2_ATS1E1RP |
+						HFGITR_EL2_ATS1E1WP);
+
 	if (!kvm_has_feat(kvm, ID_AA64MMFR3_EL1, S1PIE, IMP))
 		kvm->arch.fgu[HFGxTR_GROUP] |= (HFGxTR_EL2_nPIRE0_EL1 |
 						HFGxTR_EL2_nPIR_EL1);
-- 
2.39.2



^ permalink raw reply related	[flat|nested] 50+ messages in thread

* [PATCH 05/12] KVM: arm64: make kvm_at() take an OP_AT_*
  2024-06-25 13:34 [PATCH 00/12] KVM: arm64: nv: Add support for address translation instructions Marc Zyngier
                   ` (3 preceding siblings ...)
  2024-06-25 13:35 ` [PATCH 04/12] KVM: arm64: nv: Honor absence of FEAT_PAN2 Marc Zyngier
@ 2024-06-25 13:35 ` Marc Zyngier
  2024-07-12  8:52   ` Anshuman Khandual
  2024-06-25 13:35 ` [PATCH 06/12] KVM: arm64: nv: Add basic emulation of AT S1E{0,1}{R,W}[P] Marc Zyngier
                   ` (6 subsequent siblings)
  11 siblings, 1 reply; 50+ messages in thread
From: Marc Zyngier @ 2024-06-25 13:35 UTC (permalink / raw)
  To: kvmarm, linux-arm-kernel, kvm
  Cc: James Morse, Suzuki K Poulose, Oliver Upton, Zenghui Yu,
	Joey Gouly

From: Joey Gouly <joey.gouly@arm.com>

To allow using newer instructions that current assemblers don't know about,
replace the `at` instruction with the underlying SYS instruction.
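
For reference, AT S1E1R, <Xt> is the SYS instruction with op0=1, op1=0,
CRn=7, CRm=8, op2=0 (i.e. s1_0_c7_c8_0), so emitting the OP_AT_* encoding
through the generic msr_s/.inst path produces exactly the same instruction
as the "at s1e1r, <Xt>" mnemonic, without needing the assembler to know
about the newer operations. A call site then simply looks like:

	/* same encoding the assembler would emit for "at s1e1r, <far>" */
	fail = __kvm_at(OP_AT_S1E1R, far);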

Signed-off-by: Joey Gouly <joey.gouly@arm.com>
Cc: Marc Zyngier <maz@kernel.org>
Cc: Oliver Upton <oliver.upton@linux.dev>
Cc: Catalin Marinas <catalin.marinas@arm.com>
Cc: Will Deacon <will@kernel.org>
Reviewed-by: Marc Zyngier <maz@kernel.org>
Signed-off-by: Marc Zyngier <maz@kernel.org>
---
 arch/arm64/include/asm/kvm_asm.h       | 3 ++-
 arch/arm64/kvm/hyp/include/hyp/fault.h | 2 +-
 2 files changed, 3 insertions(+), 2 deletions(-)

diff --git a/arch/arm64/include/asm/kvm_asm.h b/arch/arm64/include/asm/kvm_asm.h
index 2181a11b9d925..25f49f5fc4a63 100644
--- a/arch/arm64/include/asm/kvm_asm.h
+++ b/arch/arm64/include/asm/kvm_asm.h
@@ -10,6 +10,7 @@
 #include <asm/hyp_image.h>
 #include <asm/insn.h>
 #include <asm/virt.h>
+#include <asm/sysreg.h>
 
 #define ARM_EXIT_WITH_SERROR_BIT  31
 #define ARM_EXCEPTION_CODE(x)	  ((x) & ~(1U << ARM_EXIT_WITH_SERROR_BIT))
@@ -259,7 +260,7 @@ extern u64 __kvm_get_mdcr_el2(void);
 	asm volatile(							\
 	"	mrs	%1, spsr_el2\n"					\
 	"	mrs	%2, elr_el2\n"					\
-	"1:	at	"at_op", %3\n"					\
+	"1:	" __msr_s(at_op, "%3") "\n"				\
 	"	isb\n"							\
 	"	b	9f\n"						\
 	"2:	msr	spsr_el2, %1\n"					\
diff --git a/arch/arm64/kvm/hyp/include/hyp/fault.h b/arch/arm64/kvm/hyp/include/hyp/fault.h
index 9e13c1bc2ad54..487c06099d6fc 100644
--- a/arch/arm64/kvm/hyp/include/hyp/fault.h
+++ b/arch/arm64/kvm/hyp/include/hyp/fault.h
@@ -27,7 +27,7 @@ static inline bool __translate_far_to_hpfar(u64 far, u64 *hpfar)
 	 * saved the guest context yet, and we may return early...
 	 */
 	par = read_sysreg_par();
-	if (!__kvm_at("s1e1r", far))
+	if (!__kvm_at(OP_AT_S1E1R, far))
 		tmp = read_sysreg_par();
 	else
 		tmp = SYS_PAR_EL1_F; /* back to the guest */
-- 
2.39.2



^ permalink raw reply related	[flat|nested] 50+ messages in thread

* [PATCH 06/12] KVM: arm64: nv: Add basic emulation of AT S1E{0,1}{R,W}[P]
  2024-06-25 13:34 [PATCH 00/12] KVM: arm64: nv: Add support for address translation instructions Marc Zyngier
                   ` (4 preceding siblings ...)
  2024-06-25 13:35 ` [PATCH 05/12] KVM: arm64: make kvm_at() take an OP_AT_* Marc Zyngier
@ 2024-06-25 13:35 ` Marc Zyngier
  2024-06-25 13:35 ` [PATCH 07/12] KVM: arm64: nv: Add basic emulation of AT S1E2{R,W} Marc Zyngier
                   ` (5 subsequent siblings)
  11 siblings, 0 replies; 50+ messages in thread
From: Marc Zyngier @ 2024-06-25 13:35 UTC (permalink / raw)
  To: kvmarm, linux-arm-kernel, kvm
  Cc: James Morse, Suzuki K Poulose, Oliver Upton, Zenghui Yu,
	Joey Gouly

Emulating AT instructions is one of the tasks devolved to the host
hypervisor when NV is on.

Here, we take the basic approach of emulating AT S1E{0,1}{R,W}[P]
using the AT instructions themselves. While this mostly works,
it doesn't *always* work:

- S1 page tables can be swapped out

- shadow S2 can be incomplete and not contain mappings for
  the S1 page tables

We are not trying to handle these cases here, and defer them to
a later patch. Suitable comments indicate where we are in dire
need of better handling.
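
For reference, the guest-side sequence being emulated is roughly (a sketch,
not code from this patch):

	asm volatile("at s1e1r, %0" : : "r" (va));
	isb();
	par = read_sysreg(par_el1);	/* must reflect the guest's view, not the host's */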

Co-developed-by: Jintack Lim <jintack.lim@linaro.org>
Signed-off-by: Jintack Lim <jintack.lim@linaro.org>
Signed-off-by: Marc Zyngier <maz@kernel.org>
---
 arch/arm64/include/asm/kvm_asm.h |   1 +
 arch/arm64/kvm/Makefile          |   2 +-
 arch/arm64/kvm/at.c              | 197 +++++++++++++++++++++++++++++++
 3 files changed, 199 insertions(+), 1 deletion(-)
 create mode 100644 arch/arm64/kvm/at.c

diff --git a/arch/arm64/include/asm/kvm_asm.h b/arch/arm64/include/asm/kvm_asm.h
index 25f49f5fc4a63..9b6c9f4f4d885 100644
--- a/arch/arm64/include/asm/kvm_asm.h
+++ b/arch/arm64/include/asm/kvm_asm.h
@@ -236,6 +236,7 @@ extern void __kvm_tlb_flush_vmid(struct kvm_s2_mmu *mmu);
 extern int __kvm_tlbi_s1e2(struct kvm_s2_mmu *mmu, u64 va, u64 sys_encoding);
 
 extern void __kvm_timer_set_cntvoff(u64 cntvoff);
+extern void __kvm_at_s1e01(struct kvm_vcpu *vcpu, u32 op, u64 vaddr);
 
 extern int __kvm_vcpu_run(struct kvm_vcpu *vcpu);
 
diff --git a/arch/arm64/kvm/Makefile b/arch/arm64/kvm/Makefile
index a6497228c5a8c..8a3ae76b4da22 100644
--- a/arch/arm64/kvm/Makefile
+++ b/arch/arm64/kvm/Makefile
@@ -14,7 +14,7 @@ kvm-y += arm.o mmu.o mmio.o psci.o hypercalls.o pvtime.o \
 	 inject_fault.o va_layout.o handle_exit.o \
 	 guest.o debug.o reset.o sys_regs.o stacktrace.o \
 	 vgic-sys-reg-v3.o fpsimd.o pkvm.o \
-	 arch_timer.o trng.o vmid.o emulate-nested.o nested.o \
+	 arch_timer.o trng.o vmid.o emulate-nested.o nested.o at.o \
 	 vgic/vgic.o vgic/vgic-init.o \
 	 vgic/vgic-irqfd.o vgic/vgic-v2.o \
 	 vgic/vgic-v3.o vgic/vgic-v4.o \
diff --git a/arch/arm64/kvm/at.c b/arch/arm64/kvm/at.c
new file mode 100644
index 0000000000000..eb0aa49e61f68
--- /dev/null
+++ b/arch/arm64/kvm/at.c
@@ -0,0 +1,197 @@
+// SPDX-License-Identifier: GPL-2.0-only
+/*
+ * Copyright (C) 2017 - Linaro Ltd
+ * Author: Jintack Lim <jintack.lim@linaro.org>
+ */
+
+#include <asm/kvm_hyp.h>
+#include <asm/kvm_mmu.h>
+
+struct mmu_config {
+	u64	ttbr0;
+	u64	ttbr1;
+	u64	tcr;
+	u64	mair;
+	u64	sctlr;
+	u64	vttbr;
+	u64	vtcr;
+	u64	hcr;
+};
+
+static void __mmu_config_save(struct mmu_config *config)
+{
+	config->ttbr0	= read_sysreg_el1(SYS_TTBR0);
+	config->ttbr1	= read_sysreg_el1(SYS_TTBR1);
+	config->tcr	= read_sysreg_el1(SYS_TCR);
+	config->mair	= read_sysreg_el1(SYS_MAIR);
+	config->sctlr	= read_sysreg_el1(SYS_SCTLR);
+	config->vttbr	= read_sysreg(vttbr_el2);
+	config->vtcr	= read_sysreg(vtcr_el2);
+	config->hcr	= read_sysreg(hcr_el2);
+}
+
+static void __mmu_config_restore(struct mmu_config *config)
+{
+	write_sysreg_el1(config->ttbr0,	SYS_TTBR0);
+	write_sysreg_el1(config->ttbr1,	SYS_TTBR1);
+	write_sysreg_el1(config->tcr,	SYS_TCR);
+	write_sysreg_el1(config->mair,	SYS_MAIR);
+	write_sysreg_el1(config->sctlr,	SYS_SCTLR);
+	write_sysreg(config->vttbr,	vttbr_el2);
+	write_sysreg(config->vtcr,	vtcr_el2);
+	/*
+	 * ARM errata 1165522 and 1530923 require the actual execution of the
+	 * above before we can switch to the EL1/EL0 translation regime used by
+	 * the guest.
+	 */
+	asm(ALTERNATIVE("nop", "isb", ARM64_WORKAROUND_SPECULATIVE_AT));
+
+	write_sysreg(config->hcr,	hcr_el2);
+
+	isb();
+}
+
+static bool check_at_pan(struct kvm_vcpu *vcpu, u64 vaddr, u64 *res)
+{
+	u64 par_e0;
+	bool fail;
+
+	/*
+	 * For PAN-involved AT operations, perform the same translation,
+	 * using EL0 this time. Twice. Much fun.
+	 */
+	fail = __kvm_at(OP_AT_S1E0R, vaddr);
+	if (fail)
+		return true;
+
+	par_e0 = read_sysreg_par();
+	if (!(par_e0 & SYS_PAR_EL1_F))
+		goto out;
+
+	fail = __kvm_at(OP_AT_S1E0W, vaddr);
+	if (fail)
+		return true;
+
+	par_e0 = read_sysreg_par();
+out:
+	*res = par_e0;
+	return false;
+}
+
+void __kvm_at_s1e01(struct kvm_vcpu *vcpu, u32 op, u64 vaddr)
+{
+	struct mmu_config config;
+	struct kvm_s2_mmu *mmu;
+	unsigned long flags;
+	bool fail;
+	u64 par;
+
+	write_lock(&vcpu->kvm->mmu_lock);
+
+	/*
+	 * We've trapped, so everything is live on the CPU. As we will
+	 * be switching contexts behind everybody's back, disable
+	 * interrupts...
+	 */
+	local_irq_save(flags);
+	__mmu_config_save(&config);
+
+	/*
+	 * If HCR_EL2.{E2H,TGE} == {1,1}, the MMU context is already
+	 * the right one (as we trapped from vEL2). We have done too
+	 * much work by saving the full MMU context, but who cares?
+	 */
+	if (vcpu_el2_e2h_is_set(vcpu) && vcpu_el2_tge_is_set(vcpu))
+		goto skip_mmu_switch;
+
+	/*
+	 * FIXME: Obtaining the S2 MMU for a L2 is horribly racy, and
+	 * we may not find it (recycled by another vcpu, for example).
+	 * See the other FIXME comment below about the need for a SW
+	 * PTW in this case.
+	 */
+	mmu = lookup_s2_mmu(vcpu);
+	if (WARN_ON(!mmu))
+		goto out;
+
+	write_sysreg_el1(vcpu_read_sys_reg(vcpu, TTBR0_EL1),	SYS_TTBR0);
+	write_sysreg_el1(vcpu_read_sys_reg(vcpu, TTBR1_EL1),	SYS_TTBR1);
+	write_sysreg_el1(vcpu_read_sys_reg(vcpu, TCR_EL1),	SYS_TCR);
+	write_sysreg_el1(vcpu_read_sys_reg(vcpu, MAIR_EL1),	SYS_MAIR);
+	write_sysreg_el1(vcpu_read_sys_reg(vcpu, SCTLR_EL1),	SYS_SCTLR);
+	__load_stage2(mmu, mmu->arch);
+
+skip_mmu_switch:
+	/* Clear TGE, enable S2 translation, we're rolling */
+	write_sysreg((config.hcr & ~HCR_TGE) | HCR_VM,	hcr_el2);
+	isb();
+
+	switch (op) {
+	case OP_AT_S1E1R:
+	case OP_AT_S1E1RP:
+		fail = __kvm_at(OP_AT_S1E1R, vaddr);
+		break;
+	case OP_AT_S1E1W:
+	case OP_AT_S1E1WP:
+		fail = __kvm_at(OP_AT_S1E1W, vaddr);
+		break;
+	case OP_AT_S1E0R:
+		fail = __kvm_at(OP_AT_S1E0R, vaddr);
+		break;
+	case OP_AT_S1E0W:
+		fail = __kvm_at(OP_AT_S1E0W, vaddr);
+		break;
+	default:
+		WARN_ON_ONCE(1);
+		fail = true;
+		break;
+	}
+
+	if (!fail)
+		par = read_sysreg(par_el1);
+	else
+		par = SYS_PAR_EL1_F;
+
+	vcpu_write_sys_reg(vcpu, par, PAR_EL1);
+
+	/*
+	 * Failed? let's leave the building now.
+	 *
+	 * FIXME: how about a failed translation because the shadow S2
+	 * wasn't populated? We may need to perform a SW PTW,
+	 * populating our shadow S2 and retry the instruction.
+	 */
+	if (par & SYS_PAR_EL1_F)
+		goto nopan;
+
+	/* No PAN? No problem. */
+	if (!(*vcpu_cpsr(vcpu) & PSR_PAN_BIT))
+		goto nopan;
+
+	switch (op) {
+	case OP_AT_S1E1RP:
+	case OP_AT_S1E1WP:
+		fail = check_at_pan(vcpu, vaddr, &par);
+		break;
+	default:
+		goto nopan;
+	}
+
+	/*
+	 * If the EL0 translation has succeeded, we need to pretend
+	 * the AT operation has failed, as the PAN setting forbids
+	 * such a translation.
+	 *
+	 * FIXME: we hardcode a Level-3 permission fault. We really
+	 * should return the real fault level.
+	 */
+	if (fail || !(par & SYS_PAR_EL1_F))
+		vcpu_write_sys_reg(vcpu, (0xf << 1) | SYS_PAR_EL1_F, PAR_EL1);
+
+nopan:
+	__mmu_config_restore(&config);
+out:
+	local_irq_restore(flags);
+
+	write_unlock(&vcpu->kvm->mmu_lock);
+}
-- 
2.39.2



^ permalink raw reply related	[flat|nested] 50+ messages in thread

* [PATCH 07/12] KVM: arm64: nv: Add basic emulation of AT S1E2{R,W}
  2024-06-25 13:34 [PATCH 00/12] KVM: arm64: nv: Add support for address translation instructions Marc Zyngier
                   ` (5 preceding siblings ...)
  2024-06-25 13:35 ` [PATCH 06/12] KVM: arm64: nv: Add basic emulation of AT S1E{0,1}{R,W}[P] Marc Zyngier
@ 2024-06-25 13:35 ` Marc Zyngier
  2024-06-25 13:35 ` [PATCH 08/12] KVM: arm64: nv: Add emulation of AT S12E{0,1}{R,W} Marc Zyngier
                   ` (4 subsequent siblings)
  11 siblings, 0 replies; 50+ messages in thread
From: Marc Zyngier @ 2024-06-25 13:35 UTC (permalink / raw)
  To: kvmarm, linux-arm-kernel, kvm
  Cc: James Morse, Suzuki K Poulose, Oliver Upton, Zenghui Yu,
	Joey Gouly

Similar to our AT S1E{0,1} emulation, we implement the AT S1E2
handling.

This emulation of course suffers from the same problems, but is
somewhat simpler due to the lack of PAN2 and the fact that we are
guaranteed to execute it from the correct context.

Co-developed-by: Jintack Lim <jintack.lim@linaro.org>
Signed-off-by: Jintack Lim <jintack.lim@linaro.org>
Signed-off-by: Marc Zyngier <maz@kernel.org>
---
 arch/arm64/include/asm/kvm_asm.h |  1 +
 arch/arm64/kvm/at.c              | 57 ++++++++++++++++++++++++++++++++
 2 files changed, 58 insertions(+)

diff --git a/arch/arm64/include/asm/kvm_asm.h b/arch/arm64/include/asm/kvm_asm.h
index 9b6c9f4f4d885..6ec0622969766 100644
--- a/arch/arm64/include/asm/kvm_asm.h
+++ b/arch/arm64/include/asm/kvm_asm.h
@@ -237,6 +237,7 @@ extern int __kvm_tlbi_s1e2(struct kvm_s2_mmu *mmu, u64 va, u64 sys_encoding);
 
 extern void __kvm_timer_set_cntvoff(u64 cntvoff);
 extern void __kvm_at_s1e01(struct kvm_vcpu *vcpu, u32 op, u64 vaddr);
+extern void __kvm_at_s1e2(struct kvm_vcpu *vcpu, u32 op, u64 vaddr);
 
 extern int __kvm_vcpu_run(struct kvm_vcpu *vcpu);
 
diff --git a/arch/arm64/kvm/at.c b/arch/arm64/kvm/at.c
index eb0aa49e61f68..147df5a9cc4e0 100644
--- a/arch/arm64/kvm/at.c
+++ b/arch/arm64/kvm/at.c
@@ -195,3 +195,60 @@ void __kvm_at_s1e01(struct kvm_vcpu *vcpu, u32 op, u64 vaddr)
 
 	write_unlock(&vcpu->kvm->mmu_lock);
 }
+
+void __kvm_at_s1e2(struct kvm_vcpu *vcpu, u32 op, u64 vaddr)
+{
+	struct kvm_s2_mmu *mmu;
+	unsigned long flags;
+	u64 val, hcr, par;
+	bool fail;
+
+	write_lock(&vcpu->kvm->mmu_lock);
+
+	mmu = &vcpu->kvm->arch.mmu;
+
+	/*
+	 * We've trapped, so everything is live on the CPU. As we will
+	 * be switching context behind everybody's back, disable
+	 * interrupts...
+	 */
+	local_irq_save(flags);
+
+	val = hcr = read_sysreg(hcr_el2);
+	val &= ~HCR_TGE;
+	val |= HCR_VM;
+
+	if (!vcpu_el2_e2h_is_set(vcpu))
+		val |= HCR_NV | HCR_NV1;
+
+	write_sysreg(val, hcr_el2);
+	isb();
+
+	switch (op) {
+	case OP_AT_S1E2R:
+		fail = __kvm_at(OP_AT_S1E1R, vaddr);
+		break;
+	case OP_AT_S1E2W:
+		fail = __kvm_at(OP_AT_S1E1W, vaddr);
+		break;
+	default:
+		WARN_ON_ONCE(1);
+		fail = true;
+	}
+
+	isb();
+
+	if (!fail)
+		par = read_sysreg_par();
+	else
+		par = SYS_PAR_EL1_F;
+
+	write_sysreg(hcr, hcr_el2);
+	isb();
+
+	local_irq_restore(flags);
+
+	write_unlock(&vcpu->kvm->mmu_lock);
+
+	vcpu_write_sys_reg(vcpu, par, PAR_EL1);
+}
-- 
2.39.2



^ permalink raw reply related	[flat|nested] 50+ messages in thread

* [PATCH 08/12] KVM: arm64: nv: Add emulation of AT S12E{0,1}{R,W}
  2024-06-25 13:34 [PATCH 00/12] KVM: arm64: nv: Add support for address translation instructions Marc Zyngier
                   ` (6 preceding siblings ...)
  2024-06-25 13:35 ` [PATCH 07/12] KVM: arm64: nv: Add basic emulation of AT S1E2{R,W} Marc Zyngier
@ 2024-06-25 13:35 ` Marc Zyngier
  2024-07-18 15:10   ` Alexandru Elisei
  2024-06-25 13:35 ` [PATCH 09/12] KVM: arm64: nv: Make ps_to_output_size() generally available Marc Zyngier
                   ` (3 subsequent siblings)
  11 siblings, 1 reply; 50+ messages in thread
From: Marc Zyngier @ 2024-06-25 13:35 UTC (permalink / raw)
  To: kvmarm, linux-arm-kernel, kvm
  Cc: James Morse, Suzuki K Poulose, Oliver Upton, Zenghui Yu,
	Joey Gouly

On the face of it, AT S12E{0,1}{R,W} is pretty simple. It is the
combination of AT S1E{0,1}{R,W}, followed by an extra S2 walk.

However, there is a great deal of complexity coming from combining
the S1 and S2 attributes to report something consistent in PAR_EL1.

This is an absolute minefield, and I have a splitting headache.
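
As a worked example of the non-FWB combining rules implemented below
(purely illustrative):

	S1 PAR.ATTR = 0xff   (Normal, Write-Back RaWa inner/outer)
	S2 MemAttr  = 0b0101 (Normal, Non-cacheable)

	s2_memattr_to_attr(0b0101)   -> MEMATTR(NC, NC) = 0x44
	combine_s1_s2_attr(0xf, 0x4) -> Non-cacheable for each half
	=> PAR_EL1.ATTR = 0x44

	If either side is Device, min() wins:
	Device-nGnRE (0x04) vs Normal Write-Back (0xff) -> 0x04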

Signed-off-by: Marc Zyngier <maz@kernel.org>
---
 arch/arm64/include/asm/kvm_asm.h |   1 +
 arch/arm64/kvm/at.c              | 242 +++++++++++++++++++++++++++++++
 2 files changed, 243 insertions(+)

diff --git a/arch/arm64/include/asm/kvm_asm.h b/arch/arm64/include/asm/kvm_asm.h
index 6ec0622969766..b36a3b6cc0116 100644
--- a/arch/arm64/include/asm/kvm_asm.h
+++ b/arch/arm64/include/asm/kvm_asm.h
@@ -238,6 +238,7 @@ extern int __kvm_tlbi_s1e2(struct kvm_s2_mmu *mmu, u64 va, u64 sys_encoding);
 extern void __kvm_timer_set_cntvoff(u64 cntvoff);
 extern void __kvm_at_s1e01(struct kvm_vcpu *vcpu, u32 op, u64 vaddr);
 extern void __kvm_at_s1e2(struct kvm_vcpu *vcpu, u32 op, u64 vaddr);
+extern void __kvm_at_s12(struct kvm_vcpu *vcpu, u32 op, u64 vaddr);
 
 extern int __kvm_vcpu_run(struct kvm_vcpu *vcpu);
 
diff --git a/arch/arm64/kvm/at.c b/arch/arm64/kvm/at.c
index 147df5a9cc4e0..71e3390b43b4c 100644
--- a/arch/arm64/kvm/at.c
+++ b/arch/arm64/kvm/at.c
@@ -51,6 +51,189 @@ static void __mmu_config_restore(struct mmu_config *config)
 	isb();
 }
 
+#define MEMATTR(ic, oc)		(MEMATTR_##oc << 4 | MEMATTR_##ic)
+#define MEMATTR_NC		0b0100
+#define MEMATTR_Wt		0b1000
+#define MEMATTR_Wb		0b1100
+#define MEMATTR_WbRaWa		0b1111
+
+#define MEMATTR_IS_DEVICE(m)	(((m) & GENMASK(7, 4)) == 0)
+
+static u8 s2_memattr_to_attr(u8 memattr)
+{
+	memattr &= 0b1111;
+
+	switch (memattr) {
+	case 0b0000:
+	case 0b0001:
+	case 0b0010:
+	case 0b0011:
+		return memattr << 2;
+	case 0b0100:
+		return MEMATTR(Wb, Wb);
+	case 0b0101:
+		return MEMATTR(NC, NC);
+	case 0b0110:
+		return MEMATTR(Wt, NC);
+	case 0b0111:
+		return MEMATTR(Wb, NC);
+	case 0b1000:
+		/* Reserved, assume NC */
+		return MEMATTR(NC, NC);
+	case 0b1001:
+		return MEMATTR(NC, Wt);
+	case 0b1010:
+		return MEMATTR(Wt, Wt);
+	case 0b1011:
+		return MEMATTR(Wb, Wt);
+	case 0b1100:
+		/* Reserved, assume NC */
+		return MEMATTR(NC, NC);
+	case 0b1101:
+		return MEMATTR(NC, Wb);
+	case 0b1110:
+		return MEMATTR(Wt, Wb);
+	case 0b1111:
+		return MEMATTR(Wb, Wb);
+	default:
+		unreachable();
+	}
+}
+
+static u8 combine_s1_s2_attr(u8 s1, u8 s2)
+{
+	bool transient;
+	u8 final = 0;
+
+	/* Upgrade transient s1 to non-transient to simplify things */
+	switch (s1) {
+	case 0b0001 ... 0b0011:	/* Normal, Write-Through Transient */
+		transient = true;
+		s1 = MEMATTR_Wt | (s1 & GENMASK(1,0));
+		break;
+	case 0b0101 ... 0b0111:	/* Normal, Write-Back Transient */
+		transient = true;
+		s1 = MEMATTR_Wb | (s1 & GENMASK(1,0));
+		break;
+	default:
+		transient = false;
+	}
+
+	/* S2CombineS1AttrHints() */
+	if ((s1 & GENMASK(3, 2)) == MEMATTR_NC ||
+	    (s2 & GENMASK(3, 2)) == MEMATTR_NC)
+		final = MEMATTR_NC;
+	else if ((s1 & GENMASK(3, 2)) == MEMATTR_Wt ||
+		 (s2 & GENMASK(3, 2)) == MEMATTR_Wt)
+		final = MEMATTR_Wt;
+	else
+		final = MEMATTR_Wb;
+
+	if (final != MEMATTR_NC) {
+		/* Inherit RaWa hints from S1 */
+		if (transient) {
+			switch (s1 & GENMASK(3, 2)) {
+			case MEMATTR_Wt:
+				final = 0;
+				break;
+			case MEMATTR_Wb:
+				final = MEMATTR_NC;
+				break;
+			}
+		}
+
+		final |= s1 & GENMASK(1, 0);
+	}
+
+	return final;
+}
+
+static u8 compute_sh(u8 attr, u64 desc)
+{
+	/* Any form of device, as well as NC has SH[1:0]=0b10 */
+	if (MEMATTR_IS_DEVICE(attr) || attr == MEMATTR(NC, NC))
+		return 0b10;
+
+	return FIELD_GET(PTE_SHARED, desc) == 0b11 ? 0b11 : 0b10;
+}
+
+static u64 compute_par_s12(struct kvm_vcpu *vcpu, u64 s1_par,
+			   struct kvm_s2_trans *tr)
+{
+	u8 s1_parattr, s2_memattr, final_attr;
+	u64 par;
+
+	/* If S2 has failed to translate, report the damage */
+	if (tr->esr) {
+		par = SYS_PAR_EL1_RES1;
+		par |= SYS_PAR_EL1_F;
+		par |= SYS_PAR_EL1_S;
+		par |= FIELD_PREP(SYS_PAR_EL1_FST, tr->esr);
+		return par;
+	}
+
+	s1_parattr = FIELD_GET(SYS_PAR_EL1_ATTR, s1_par);
+	s2_memattr = FIELD_GET(GENMASK(5, 2), tr->desc);
+
+	if (__vcpu_sys_reg(vcpu, HCR_EL2) & HCR_FWB) {
+		if (!kvm_has_feat(vcpu->kvm, ID_AA64PFR2_EL1, MTEPERM, IMP))
+			s2_memattr &= ~BIT(3);
+
+		/* Combination of R_VRJSW and R_RHWZM */
+		switch (s2_memattr) {
+		case 0b0101:
+			if (MEMATTR_IS_DEVICE(s1_parattr))
+				final_attr = s1_parattr;
+			else
+				final_attr = MEMATTR(NC, NC);
+			break;
+		case 0b0110:
+		case 0b1110:
+			final_attr = MEMATTR(WbRaWa, WbRaWa);
+			break;
+		case 0b0111:
+		case 0b1111:
+			/* Preserve S1 attribute */
+			final_attr = s1_parattr;
+			break;
+		case 0b0100:
+		case 0b1100:
+		case 0b1101:
+			/* Reserved, do something non-silly */
+			final_attr = s1_parattr;
+			break;
+		default:
+			/* MemAttr[2]=0, Device from S2 */
+			final_attr = (s2_memattr & GENMASK(1,0)) << 2;
+		}
+	} else {
+		/* Combination of R_HMNDG, R_TNHFM and R_GQFSF */
+		u8 s2_parattr = s2_memattr_to_attr(s2_memattr);
+
+		if (MEMATTR_IS_DEVICE(s1_parattr) ||
+		    MEMATTR_IS_DEVICE(s2_parattr)) {
+			final_attr = min(s1_parattr, s2_parattr);
+		} else {
+			/* At this stage, this is memory vs memory */
+			final_attr  = combine_s1_s2_attr(s1_parattr & 0xf,
+							 s2_parattr & 0xf);
+			final_attr |= combine_s1_s2_attr(s1_parattr >> 4,
+							 s2_parattr >> 4) << 4;
+		}
+	}
+
+	if ((__vcpu_sys_reg(vcpu, HCR_EL2) & HCR_CD) &&
+	    !MEMATTR_IS_DEVICE(final_attr))
+		final_attr = MEMATTR(NC, NC);
+
+	par  = FIELD_PREP(SYS_PAR_EL1_ATTR, final_attr);
+	par |= tr->output & GENMASK(47, 12);
+	par |= FIELD_PREP(SYS_PAR_EL1_SH,
+			  compute_sh(final_attr, tr->desc));
+
+	return par;
+}
+
 static bool check_at_pan(struct kvm_vcpu *vcpu, u64 vaddr, u64 *res)
 {
 	u64 par_e0;
@@ -252,3 +435,62 @@ void __kvm_at_s1e2(struct kvm_vcpu *vcpu, u32 op, u64 vaddr)
 
 	vcpu_write_sys_reg(vcpu, par, PAR_EL1);
 }
+
+void __kvm_at_s12(struct kvm_vcpu *vcpu, u32 op, u64 vaddr)
+{
+	struct kvm_s2_trans out = {};
+	u64 ipa, par;
+	bool write;
+	int ret;
+
+	/* Do the stage-1 translation */
+	switch (op) {
+	case OP_AT_S12E1R:
+		op = OP_AT_S1E1R;
+		write = false;
+		break;
+	case OP_AT_S12E1W:
+		op = OP_AT_S1E1W;
+		write = true;
+		break;
+	case OP_AT_S12E0R:
+		op = OP_AT_S1E0R;
+		write = false;
+		break;
+	case OP_AT_S12E0W:
+		op = OP_AT_S1E0W;
+		write = true;
+		break;
+	default:
+		WARN_ON_ONCE(1);
+		return;
+	}
+
+	__kvm_at_s1e01(vcpu, op, vaddr);
+	par = vcpu_read_sys_reg(vcpu, PAR_EL1);
+	if (par & SYS_PAR_EL1_F)
+		return;
+
+	/*
+	 * If we only have a single stage of translation (E2H=0 or
+	 * TGE=1), exit early. Same thing if {VM,DC}=={0,0}.
+	 */
+	if (!vcpu_el2_e2h_is_set(vcpu) || vcpu_el2_tge_is_set(vcpu) ||
+	    !(vcpu_read_sys_reg(vcpu, HCR_EL2) & (HCR_VM | HCR_DC)))
+		return;
+
+	/* Do the stage-2 translation */
+	ipa = (par & GENMASK_ULL(47, 12)) | (vaddr & GENMASK_ULL(11, 0));
+	out.esr = 0;
+	ret = kvm_walk_nested_s2(vcpu, ipa, &out);
+	if (ret < 0)
+		return;
+
+	/* Check the access permission */
+	if (!out.esr &&
+	    ((!write && !out.readable) || (write && !out.writable)))
+		out.esr = ESR_ELx_FSC_PERM | (out.level & 0x3);
+
+	par = compute_par_s12(vcpu, par, &out);
+	vcpu_write_sys_reg(vcpu, par, PAR_EL1);
+}
-- 
2.39.2



^ permalink raw reply related	[flat|nested] 50+ messages in thread

* [PATCH 09/12] KVM: arm64: nv: Make ps_to_output_size() generally available
  2024-06-25 13:34 [PATCH 00/12] KVM: arm64: nv: Add support for address translation instructions Marc Zyngier
                   ` (7 preceding siblings ...)
  2024-06-25 13:35 ` [PATCH 08/12] KVM: arm64: nv: Add emulation of AT S12E{0,1}{R,W} Marc Zyngier
@ 2024-06-25 13:35 ` Marc Zyngier
  2024-07-08 16:28 ` [PATCH 00/12] KVM: arm64: nv: Add support for address translation instructions Alexandru Elisei
                   ` (2 subsequent siblings)
  11 siblings, 0 replies; 50+ messages in thread
From: Marc Zyngier @ 2024-06-25 13:35 UTC (permalink / raw)
  To: kvmarm, linux-arm-kernel, kvm
  Cc: James Morse, Suzuki K Poulose, Oliver Upton, Zenghui Yu,
	Joey Gouly

Make this helper visible to at.c, as we are going to need it.

Signed-off-by: Marc Zyngier <maz@kernel.org>
---
 arch/arm64/include/asm/kvm_nested.h | 14 ++++++++++++++
 arch/arm64/kvm/nested.c             | 14 --------------
 2 files changed, 14 insertions(+), 14 deletions(-)

diff --git a/arch/arm64/include/asm/kvm_nested.h b/arch/arm64/include/asm/kvm_nested.h
index b2fe759964d83..c7adbddbab33a 100644
--- a/arch/arm64/include/asm/kvm_nested.h
+++ b/arch/arm64/include/asm/kvm_nested.h
@@ -205,4 +205,18 @@ static inline u64 kvm_encode_nested_level(struct kvm_s2_trans *trans)
 	return FIELD_PREP(KVM_NV_GUEST_MAP_SZ, trans->level);
 }
 
+static inline unsigned int ps_to_output_size(unsigned int ps)
+{
+	switch (ps) {
+	case 0: return 32;
+	case 1: return 36;
+	case 2: return 40;
+	case 3: return 42;
+	case 4: return 44;
+	case 5:
+	default:
+		return 48;
+	}
+}
+
 #endif /* __ARM64_KVM_NESTED_H */
diff --git a/arch/arm64/kvm/nested.c b/arch/arm64/kvm/nested.c
index 73544e0e64dcb..a77b3181cd65d 100644
--- a/arch/arm64/kvm/nested.c
+++ b/arch/arm64/kvm/nested.c
@@ -103,20 +103,6 @@ struct s2_walk_info {
 	bool	     be;
 };
 
-static unsigned int ps_to_output_size(unsigned int ps)
-{
-	switch (ps) {
-	case 0: return 32;
-	case 1: return 36;
-	case 2: return 40;
-	case 3: return 42;
-	case 4: return 44;
-	case 5:
-	default:
-		return 48;
-	}
-}
-
 static u32 compute_fsc(int level, u32 fsc)
 {
 	return fsc | (level & 0x3);
-- 
2.39.2



^ permalink raw reply related	[flat|nested] 50+ messages in thread

* Re: [PATCH 00/12] KVM: arm64: nv: Add support for address translation instructions
  2024-06-25 13:34 [PATCH 00/12] KVM: arm64: nv: Add support for address translation instructions Marc Zyngier
                   ` (8 preceding siblings ...)
  2024-06-25 13:35 ` [PATCH 09/12] KVM: arm64: nv: Make ps_to_output_size() generally available Marc Zyngier
@ 2024-07-08 16:28 ` Alexandru Elisei
  2024-07-08 17:00   ` Marc Zyngier
  2024-07-08 16:57 ` [PATCH 10/12] KVM: arm64: nv: Add SW walker for AT S1 emulation Marc Zyngier
  2024-07-31 10:05 ` [PATCH 00/12] KVM: arm64: nv: Add support for address translation instructions Alexandru Elisei
  11 siblings, 1 reply; 50+ messages in thread
From: Alexandru Elisei @ 2024-07-08 16:28 UTC (permalink / raw)
  To: Marc Zyngier
  Cc: kvmarm, linux-arm-kernel, kvm, James Morse, Suzuki K Poulose,
	Oliver Upton, Zenghui Yu, Joey Gouly

Hi Marc,

On Tue, Jun 25, 2024 at 02:34:59PM +0100, Marc Zyngier wrote:
> Another task that a hypervisor supporting NV on arm64 has to deal with
> is to emulate the AT instruction, because we multiplex all the S1
> translations on a single set of registers, and the guest S2 is never
> truly resident on the CPU.
> 
> So given that we lie about page tables, we also have to lie about
> translation instructions, hence the emulation. Things are made
> complicated by the fact that guest S1 page tables can be swapped out,
> and that our shadow S2 is likely to be incomplete. So while using AT
> to emulate AT is tempting (and useful), it is not always going to
> work, and we thus need a fallback in the shape of a SW S1 walker.
> 
> This series is built in 4 basic blocks:
> 
> - Add missing definition and basic reworking
> 
> - Dumb emulation of all relevant AT instructions using AT instructions
> 
> - Add a SW S1 walker that is using our S2 walker

I wanted to have a look at the S1 walker, and in my inbox I only have
patches #1 to #9 ("KVM: arm64: nv: Make ps_to_output_size() generally
available"). Checked on the kvm mailing list archive [1], same thing; a
google search for the string "KVM: arm64: nv: Add SW walker for AT S1
emulation" (quotes included) turns up the cover letter.

Am I looking in the wrong places?

[1] https://www.spinics.net/lists/kvm/msg351826.html

Thanks,
Alex

> 
> - Add FEAT_ATS1A support, which is almost trivial
> 
> This has been tested by comparing the output of a HW walker with the
> output of the SW one. Obviously, this isn't bullet proof, and I'm
> pretty sure there are some nasties in there.
> 
> In a departure from my usual habit, this series is on top of
> kvmarm/next, as it depends on the NV S2 shadow code.
> 
> Joey Gouly (1):
>   KVM: arm64: make kvm_at() take an OP_AT_*
> 
> Marc Zyngier (11):
>   arm64: Add missing APTable and TCR_ELx.HPD masks
>   arm64: Add PAR_EL1 field description
>   KVM: arm64: nv: Turn upper_attr for S2 walk into the full descriptor
>   KVM: arm64: nv: Honor absence of FEAT_PAN2
>   KVM: arm64: nv: Add basic emulation of AT S1E{0,1}{R,W}[P]
>   KVM: arm64: nv: Add basic emulation of AT S1E2{R,W}
>   KVM: arm64: nv: Add emulation of AT S12E{0,1}{R,W}
>   KVM: arm64: nv: Make ps_to_output_size() generally available
>   KVM: arm64: nv: Add SW walker for AT S1 emulation
>   KVM: arm64: nv: Plumb handling of AT S1* traps from EL2
>   KVM: arm64: nv: Add support for FEAT_ATS1A
> 
>  arch/arm64/include/asm/kvm_arm.h       |    1 +
>  arch/arm64/include/asm/kvm_asm.h       |    6 +-
>  arch/arm64/include/asm/kvm_nested.h    |   18 +-
>  arch/arm64/include/asm/pgtable-hwdef.h |    7 +
>  arch/arm64/include/asm/sysreg.h        |   19 +
>  arch/arm64/kvm/Makefile                |    2 +-
>  arch/arm64/kvm/at.c                    | 1007 ++++++++++++++++++++++++
>  arch/arm64/kvm/emulate-nested.c        |    2 +
>  arch/arm64/kvm/hyp/include/hyp/fault.h |    2 +-
>  arch/arm64/kvm/nested.c                |   26 +-
>  arch/arm64/kvm/sys_regs.c              |   60 ++
>  11 files changed, 1125 insertions(+), 25 deletions(-)
>  create mode 100644 arch/arm64/kvm/at.c
> 
> -- 
> 2.39.2
> 
> 


^ permalink raw reply	[flat|nested] 50+ messages in thread

* [PATCH 10/12] KVM: arm64: nv: Add SW walker for AT S1 emulation
  2024-06-25 13:34 [PATCH 00/12] KVM: arm64: nv: Add support for address translation instructions Marc Zyngier
                   ` (9 preceding siblings ...)
  2024-07-08 16:28 ` [PATCH 00/12] KVM: arm64: nv: Add support for address translation instructions Alexandru Elisei
@ 2024-07-08 16:57 ` Marc Zyngier
  2024-07-08 16:57   ` [PATCH 11/12] KVM: arm64: nv: Plumb handling of AT S1* traps from EL2 Marc Zyngier
                     ` (8 more replies)
  2024-07-31 10:05 ` [PATCH 00/12] KVM: arm64: nv: Add support for address translation instructions Alexandru Elisei
  11 siblings, 9 replies; 50+ messages in thread
From: Marc Zyngier @ 2024-07-08 16:57 UTC (permalink / raw)
  To: kvmarm, linux-arm-kernel, kvm
  Cc: James Morse, Suzuki K Poulose, Oliver Upton, Zenghui Yu,
	Joey Gouly

In order to plug the brokenness of our current AT implementation,
we need a SW walker that is going to... err.. walk the S1 tables
and tell us what it finds.

Of course, it builds on top of our S2 walker, and shares similar
concepts. The beauty of it is that since it uses kvm_read_guest(),
it is able to bring back pages that have been otherwise evicted.

This is then plugged into the two AT S1 emulation functions as
a "slow path" fallback. I'm not sure it is that slow, but hey.

Signed-off-by: Marc Zyngier <maz@kernel.org>
---
 arch/arm64/kvm/at.c | 538 ++++++++++++++++++++++++++++++++++++++++++--
 1 file changed, 520 insertions(+), 18 deletions(-)

diff --git a/arch/arm64/kvm/at.c b/arch/arm64/kvm/at.c
index 71e3390b43b4c..8452273cbff6d 100644
--- a/arch/arm64/kvm/at.c
+++ b/arch/arm64/kvm/at.c
@@ -4,9 +4,305 @@
  * Author: Jintack Lim <jintack.lim@linaro.org>
  */
 
+#include <linux/kvm_host.h>
+
+#include <asm/esr.h>
 #include <asm/kvm_hyp.h>
 #include <asm/kvm_mmu.h>
 
+struct s1_walk_info {
+	u64	     baddr;
+	unsigned int max_oa_bits;
+	unsigned int pgshift;
+	unsigned int txsz;
+	int 	     sl;
+	bool	     hpd;
+	bool	     be;
+	bool	     nvhe;
+	bool	     s2;
+};
+
+struct s1_walk_result {
+	union {
+		struct {
+			u64	desc;
+			u64	pa;
+			s8	level;
+			u8	APTable;
+			bool	UXNTable;
+			bool	PXNTable;
+		};
+		struct {
+			u8	fst;
+			bool	ptw;
+			bool	s2;
+		};
+	};
+	bool	failed;
+};
+
+static void fail_s1_walk(struct s1_walk_result *wr, u8 fst, bool ptw, bool s2)
+{
+	wr->fst		= fst;
+	wr->ptw		= ptw;
+	wr->s2		= s2;
+	wr->failed	= true;
+}
+
+#define S1_MMU_DISABLED		(-127)
+
+static int setup_s1_walk(struct kvm_vcpu *vcpu, struct s1_walk_info *wi,
+			 struct s1_walk_result *wr, const u64 va, const int el)
+{
+	u64 sctlr, tcr, tg, ps, ia_bits, ttbr;
+	unsigned int stride, x;
+	bool va55, tbi;
+
+	wi->nvhe = el == 2 && !vcpu_el2_e2h_is_set(vcpu);
+
+	va55 = va & BIT(55);
+
+	if (wi->nvhe && va55)
+		goto addrsz;
+
+	wi->s2 = el < 2 && (__vcpu_sys_reg(vcpu, HCR_EL2) & HCR_VM);
+
+	switch (el) {
+	case 1:
+		sctlr	= vcpu_read_sys_reg(vcpu, SCTLR_EL1);
+		tcr	= vcpu_read_sys_reg(vcpu, TCR_EL1);
+		ttbr	= (va55 ?
+			   vcpu_read_sys_reg(vcpu, TTBR1_EL1) :
+			   vcpu_read_sys_reg(vcpu, TTBR0_EL1));
+		break;
+	case 2:
+		sctlr	= vcpu_read_sys_reg(vcpu, SCTLR_EL2);
+		tcr	= vcpu_read_sys_reg(vcpu, TCR_EL2);
+		ttbr	= (va55 ?
+			   vcpu_read_sys_reg(vcpu, TTBR1_EL2) :
+			   vcpu_read_sys_reg(vcpu, TTBR0_EL2));
+		break;
+	default:
+		BUG();
+	}
+
+	/* Let's put the MMU disabled case aside immediately */
+	if (!(sctlr & SCTLR_ELx_M) ||
+	    (__vcpu_sys_reg(vcpu, HCR_EL2) & HCR_DC)) {
+		if (va >= BIT(kvm_get_pa_bits(vcpu->kvm)))
+			goto addrsz;
+
+		wr->level = S1_MMU_DISABLED;
+		wr->desc = va;
+		return 0;
+	}
+
+	wi->be = sctlr & SCTLR_ELx_EE;
+
+	wi->hpd  = kvm_has_feat(vcpu->kvm, ID_AA64MMFR1_EL1, HPDS, IMP);
+	wi->hpd &= (wi->nvhe ?
+		    FIELD_GET(TCR_EL2_HPD, tcr) :
+		    (va55 ?
+		     FIELD_GET(TCR_HPD1, tcr) :
+		     FIELD_GET(TCR_HPD0, tcr)));
+
+	tbi = (wi->nvhe ?
+	       FIELD_GET(TCR_EL2_TBI, tcr) :
+	       (va55 ?
+		FIELD_GET(TCR_TBI1, tcr) :
+		FIELD_GET(TCR_TBI0, tcr)));
+
+	if (!tbi && sign_extend64(va, 55) != (s64)va)
+		goto addrsz;
+
+	/* Someone was silly enough to encode TG0/TG1 differently */
+	if (va55) {
+		wi->txsz = FIELD_GET(TCR_T1SZ_MASK, tcr);
+		tg = FIELD_GET(TCR_TG1_MASK, tcr);
+
+		switch (tg << TCR_TG1_SHIFT) {
+		case TCR_TG1_4K:
+			wi->pgshift = 12;	 break;
+		case TCR_TG1_16K:
+			wi->pgshift = 14;	 break;
+		case TCR_TG1_64K:
+		default:	    /* IMPDEF: treat any other value as 64k */
+			wi->pgshift = 16;	 break;
+		}
+	} else {
+		wi->txsz = FIELD_GET(TCR_T0SZ_MASK, tcr);
+		tg = FIELD_GET(TCR_TG0_MASK, tcr);
+
+		switch (tg << TCR_TG0_SHIFT) {
+		case TCR_TG0_4K:
+			wi->pgshift = 12;	 break;
+		case TCR_TG0_16K:
+			wi->pgshift = 14;	 break;
+		case TCR_TG0_64K:
+		default:	    /* IMPDEF: treat any other value as 64k */
+			wi->pgshift = 16;	 break;
+		}
+	}
+
+	ia_bits = 64 - wi->txsz;
+
+	/* AArch64.S1StartLevel() */
+	stride = wi->pgshift - 3;
+	wi->sl = 3 - (((ia_bits - 1) - wi->pgshift) / stride);
+
+	/* Check for SL mandating LPA2 (which we don't support yet) */
+	switch (BIT(wi->pgshift)) {
+	case SZ_4K:
+		if (wi->sl == -1 &&
+		    !kvm_has_feat(vcpu->kvm, ID_AA64MMFR0_EL1, TGRAN4, 52_BIT))
+			goto addrsz;
+		break;
+	case SZ_16K:
+		if (wi->sl == 0 &&
+		    !kvm_has_feat(vcpu->kvm, ID_AA64MMFR0_EL1, TGRAN16, 52_BIT))
+			goto addrsz;
+		break;
+	}
+
+	ps = (wi->nvhe ?
+	      FIELD_GET(TCR_EL2_PS_MASK, tcr) : FIELD_GET(TCR_IPS_MASK, tcr));
+
+	wi->max_oa_bits = min(get_kvm_ipa_limit(), ps_to_output_size(ps));
+
+	/* Compute minimal alignment */
+	x = 3 + ia_bits - ((3 - wi->sl) * stride + wi->pgshift);
+
+	wi->baddr = ttbr & TTBRx_EL1_BADDR;
+	wi->baddr &= GENMASK_ULL(wi->max_oa_bits - 1, x);
+
+	return 0;
+
+addrsz:	/* Address Size Fault level 0 */
+	fail_s1_walk(wr, ESR_ELx_FSC_ADDRSZ, false, false);
+
+	return -EFAULT;
+}
+
+static int get_ia_size(struct s1_walk_info *wi)
+{
+	return 64 - wi->txsz;
+}
+
+static int walk_s1(struct kvm_vcpu *vcpu, struct s1_walk_info *wi,
+		   struct s1_walk_result *wr, u64 va)
+{
+	u64 va_top, va_bottom, baddr, desc;
+	int level, stride, ret;
+
+	level = wi->sl;
+	stride = wi->pgshift - 3;
+	baddr = wi->baddr;
+
+	va_top = get_ia_size(wi) - 1;
+
+	while (1) {
+		u64 index, ipa;
+
+		va_bottom = (3 - level) * stride + wi->pgshift;
+		index = (va & GENMASK_ULL(va_top, va_bottom)) >> (va_bottom - 3);
+
+		ipa = baddr | index;
+
+		if (wi->s2) {
+			struct kvm_s2_trans s2_trans = {};
+
+			ret = kvm_walk_nested_s2(vcpu, ipa, &s2_trans);
+			if (ret) {
+				fail_s1_walk(wr,
+					     (s2_trans.esr & ~ESR_ELx_FSC_LEVEL) | level,
+					     true, true);
+				return ret;
+			}
+
+			if (!kvm_s2_trans_readable(&s2_trans)) {
+				fail_s1_walk(wr, ESR_ELx_FSC_PERM | level,
+					     true, true);
+
+				return -EPERM;
+			}
+
+			ipa = kvm_s2_trans_output(&s2_trans);
+		}
+
+		ret = kvm_read_guest(vcpu->kvm, ipa, &desc, sizeof(desc));
+		if (ret) {
+			fail_s1_walk(wr, ESR_ELx_FSC_SEA_TTW(level),
+				     true, false);
+			return ret;
+		}
+
+		if (wi->be)
+			desc = be64_to_cpu((__force __be64)desc);
+		else
+			desc = le64_to_cpu((__force __le64)desc);
+
+		if (!(desc & 1) || ((desc & 3) == 1 && level == 3)) {
+			fail_s1_walk(wr, ESR_ELx_FSC_FAULT | level,
+				     true, false);
+			return -ENOENT;
+		}
+
+		/* We found a leaf, handle that */
+		if ((desc & 3) == 1 || level == 3)
+			break;
+
+		if (!wi->hpd) {
+			wr->APTable  |= FIELD_GET(PMD_TABLE_AP, desc);
+			wr->UXNTable |= FIELD_GET(PMD_TABLE_UXN, desc);
+			wr->PXNTable |= FIELD_GET(PMD_TABLE_PXN, desc);
+		}
+
+		baddr = desc & GENMASK_ULL(47, wi->pgshift);
+
+		/* Check for out-of-range OA */
+		if (wi->max_oa_bits < 48 &&
+		    (baddr & GENMASK_ULL(47, wi->max_oa_bits))) {
+			fail_s1_walk(wr, ESR_ELx_FSC_ADDRSZ | level,
+				     true, false);
+			return -EINVAL;
+		}
+
+		/* Prepare for next round */
+		va_top = va_bottom - 1;
+		level++;
+	}
+
+	/* Block mapping, check the validity of the level */
+	if (!(desc & BIT(1))) {
+		bool valid_block = false;
+
+		switch (BIT(wi->pgshift)) {
+		case SZ_4K:
+			valid_block = level == 1 || level == 2;
+			break;
+		case SZ_16K:
+		case SZ_64K:
+			valid_block = level == 2;
+			break;
+		}
+
+		if (!valid_block) {
+			fail_s1_walk(wr, ESR_ELx_FSC_FAULT | level,
+				     true, false);
+			return -EINVAL;
+		}
+	}
+
+	wr->failed = false;
+	wr->level = level;
+	wr->desc = desc;
+	wr->pa = desc & GENMASK(47, va_bottom);
+	if (va_bottom > 12)
+		wr->pa |= va & GENMASK_ULL(va_bottom - 1, 12);
+
+	return 0;
+}
+
 struct mmu_config {
 	u64	ttbr0;
 	u64	ttbr1;
@@ -234,6 +530,177 @@ static u64 compute_par_s12(struct kvm_vcpu *vcpu, u64 s1_par,
 	return par;
 }
 
+static u64 compute_par_s1(struct kvm_vcpu *vcpu, struct s1_walk_result *wr)
+{
+	u64 par;
+
+	if (wr->failed) {
+		par = SYS_PAR_EL1_RES1;
+		par |= SYS_PAR_EL1_F;
+		par |= FIELD_PREP(SYS_PAR_EL1_FST, wr->fst);
+		par |= wr->ptw ? SYS_PAR_EL1_PTW : 0;
+		par |= wr->s2 ? SYS_PAR_EL1_S : 0;
+	} else if (wr->level == S1_MMU_DISABLED) {
+		/* MMU off or HCR_EL2.DC == 1 */
+		par = wr->pa & GENMASK_ULL(47, 12);
+
+		if (!(__vcpu_sys_reg(vcpu, HCR_EL2) & HCR_DC)) {
+			par |= FIELD_PREP(SYS_PAR_EL1_ATTR, 0); /* nGnRnE */
+			par |= FIELD_PREP(SYS_PAR_EL1_SH, 0b10); /* OS */
+		} else {
+			par |= FIELD_PREP(SYS_PAR_EL1_ATTR,
+					  MEMATTR(WbRaWa, WbRaWa));
+			par |= FIELD_PREP(SYS_PAR_EL1_SH, 0b00); /* NS */
+		}
+	} else {
+		u64 mair, sctlr;
+		int el;
+		u8 sh;
+
+		el = (vcpu_el2_e2h_is_set(vcpu) &&
+		      vcpu_el2_tge_is_set(vcpu)) ? 2 : 1;
+
+		mair = ((el == 2) ?
+			vcpu_read_sys_reg(vcpu, MAIR_EL2) :
+			vcpu_read_sys_reg(vcpu, MAIR_EL1));
+
+		mair >>= FIELD_GET(PTE_ATTRINDX_MASK, wr->desc) * 8;
+		mair &= 0xff;
+
+		sctlr = ((el == 2) ?
+			vcpu_read_sys_reg(vcpu, SCTLR_EL2) :
+			vcpu_read_sys_reg(vcpu, SCTLR_EL1));
+
+		/* Force NC for memory if SCTLR_ELx.C is clear */
+		if (!(sctlr & SCTLR_EL1_C) && !MEMATTR_IS_DEVICE(mair))
+			mair = MEMATTR(NC, NC);
+
+		par  = FIELD_PREP(SYS_PAR_EL1_ATTR, mair);
+		par |= wr->pa & GENMASK_ULL(47, 12);
+
+		sh = compute_sh(mair, wr->desc);
+		par |= FIELD_PREP(SYS_PAR_EL1_SH, sh);
+	}
+
+	return par;
+}
+
+static u64 handle_at_slow(struct kvm_vcpu *vcpu, u32 op, u64 vaddr)
+{
+	bool perm_fail, ur, uw, ux, pr, pw, pan;
+	struct s1_walk_result wr = {};
+	struct s1_walk_info wi = {};
+	int ret, idx, el;
+
+	/*
+	 * We only get here from guest EL2, so the translation regime
+	 * AT applies to is solely defined by {E2H,TGE}.
+	 */
+	el = (vcpu_el2_e2h_is_set(vcpu) &&
+	      vcpu_el2_tge_is_set(vcpu)) ? 2 : 1;
+
+	ret = setup_s1_walk(vcpu, &wi, &wr, vaddr, el);
+	if (ret)
+		goto compute_par;
+
+	if (wr.level == S1_MMU_DISABLED)
+		goto compute_par;
+
+	idx = srcu_read_lock(&vcpu->kvm->srcu);
+
+	ret = walk_s1(vcpu, &wi, &wr, vaddr);
+
+	srcu_read_unlock(&vcpu->kvm->srcu, idx);
+
+	if (ret)
+		goto compute_par;
+
+	/* FIXME: revisit when adding indirect permission support */
+	if (kvm_has_feat(vcpu->kvm, ID_AA64MMFR1_EL1, PAN, PAN3) &&
+	    !wi.nvhe) {
+		u64 sctlr;
+
+		if (el == 1)
+			sctlr = vcpu_read_sys_reg(vcpu, SCTLR_EL1);
+		else
+			sctlr = vcpu_read_sys_reg(vcpu, SCTLR_EL2);
+
+		ux = (sctlr & SCTLR_EL1_EPAN) && !(wr.desc & PTE_UXN);
+	} else {
+		ux = false;
+	}
+
+	pw = !(wr.desc & PTE_RDONLY);
+
+	if (wi.nvhe) {
+		ur = uw = false;
+		pr = true;
+	} else {
+		if (wr.desc & PTE_USER) {
+			ur = pr = true;
+			uw = pw;
+		} else {
+			ur = uw = false;
+			pr = true;
+		}
+	}
+
+	/* Apply the Hierarchical Permission madness */
+	if (wi.nvhe) {
+		wr.APTable &= BIT(1);
+		wr.PXNTable = wr.UXNTable;
+	}
+
+	ur &= !(wr.APTable & BIT(0));
+	uw &= !(wr.APTable != 0);
+	ux &= !wr.UXNTable;
+
+	pw &= !(wr.APTable & BIT(1));
+
+	pan = *vcpu_cpsr(vcpu) & PSR_PAN_BIT;
+
+	perm_fail = false;
+
+	switch (op) {
+	case OP_AT_S1E1RP:
+		perm_fail |= pan && (ur || uw || ux);
+		fallthrough;
+	case OP_AT_S1E1R:
+	case OP_AT_S1E2R:
+		perm_fail |= !pr;
+		break;
+	case OP_AT_S1E1WP:
+		perm_fail |= pan && (ur || uw || ux);
+		fallthrough;
+	case OP_AT_S1E1W:
+	case OP_AT_S1E2W:
+		perm_fail |= !pw;
+		break;
+	case OP_AT_S1E0R:
+		perm_fail |= !ur;
+		break;
+	case OP_AT_S1E0W:
+		perm_fail |= !uw;
+		break;
+	default:
+		BUG();
+	}
+
+	if (perm_fail) {
+		struct s1_walk_result tmp;
+
+		tmp.failed = true;
+		tmp.fst = ESR_ELx_FSC_PERM | wr.level;
+		tmp.s2 = false;
+		tmp.ptw = false;
+
+		wr = tmp;
+	}
+
+compute_par:
+	return compute_par_s1(vcpu, &wr);
+}
+
 static bool check_at_pan(struct kvm_vcpu *vcpu, u64 vaddr, u64 *res)
 {
 	u64 par_e0;
@@ -266,9 +733,11 @@ void __kvm_at_s1e01(struct kvm_vcpu *vcpu, u32 op, u64 vaddr)
 	struct mmu_config config;
 	struct kvm_s2_mmu *mmu;
 	unsigned long flags;
-	bool fail;
+	bool fail, retry_slow;
 	u64 par;
 
+	retry_slow = false;
+
 	write_lock(&vcpu->kvm->mmu_lock);
 
 	/*
@@ -288,14 +757,15 @@ void __kvm_at_s1e01(struct kvm_vcpu *vcpu, u32 op, u64 vaddr)
 		goto skip_mmu_switch;
 
 	/*
-	 * FIXME: Obtaining the S2 MMU for a L2 is horribly racy, and
-	 * we may not find it (recycled by another vcpu, for example).
-	 * See the other FIXME comment below about the need for a SW
-	 * PTW in this case.
+	 * Obtaining the S2 MMU for a L2 is horribly racy, and we may not
+	 * find it (recycled by another vcpu, for example). When this
+	 * happens, use the SW (slow) path.
 	 */
 	mmu = lookup_s2_mmu(vcpu);
-	if (WARN_ON(!mmu))
+	if (!mmu) {
+		retry_slow = true;
 		goto out;
+	}
 
 	write_sysreg_el1(vcpu_read_sys_reg(vcpu, TTBR0_EL1),	SYS_TTBR0);
 	write_sysreg_el1(vcpu_read_sys_reg(vcpu, TTBR1_EL1),	SYS_TTBR1);
@@ -331,18 +801,17 @@ void __kvm_at_s1e01(struct kvm_vcpu *vcpu, u32 op, u64 vaddr)
 	}
 
 	if (!fail)
-		par = read_sysreg(par_el1);
+		par = read_sysreg_par();
 	else
 		par = SYS_PAR_EL1_F;
 
+	retry_slow = !fail;
+
 	vcpu_write_sys_reg(vcpu, par, PAR_EL1);
 
 	/*
-	 * Failed? let's leave the building now.
-	 *
-	 * FIXME: how about a failed translation because the shadow S2
-	 * wasn't populated? We may need to perform a SW PTW,
-	 * populating our shadow S2 and retry the instruction.
+	 * Failed? let's leave the building now, unless we retry on
+	 * the slow path.
 	 */
 	if (par & SYS_PAR_EL1_F)
 		goto nopan;
@@ -354,29 +823,58 @@ void __kvm_at_s1e01(struct kvm_vcpu *vcpu, u32 op, u64 vaddr)
 	switch (op) {
 	case OP_AT_S1E1RP:
 	case OP_AT_S1E1WP:
+		retry_slow = false;
 		fail = check_at_pan(vcpu, vaddr, &par);
 		break;
 	default:
 		goto nopan;
 	}
 
+	if (fail) {
+		vcpu_write_sys_reg(vcpu, SYS_PAR_EL1_F, PAR_EL1);
+		goto nopan;
+	}
+
 	/*
 	 * If the EL0 translation has succeeded, we need to pretend
 	 * the AT operation has failed, as the PAN setting forbids
 	 * such a translation.
-	 *
-	 * FIXME: we hardcode a Level-3 permission fault. We really
-	 * should return the real fault level.
 	 */
-	if (fail || !(par & SYS_PAR_EL1_F))
-		vcpu_write_sys_reg(vcpu, (0xf << 1) | SYS_PAR_EL1_F, PAR_EL1);
-
+	if (par & SYS_PAR_EL1_F) {
+		u8 fst = FIELD_GET(SYS_PAR_EL1_FST, par);
+
+		/*
+		 * If we get something other than a permission fault, we
+		 * need to retry, as we're likely to have missed in the PTs.
+		 */
+		if ((fst & ESR_ELx_FSC_TYPE) != ESR_ELx_FSC_PERM)
+			retry_slow = true;
+	} else {
+		/*
+		 * The EL0 access succeeded, but we don't have the full
+		 * syndrome information to synthesize the failure. Go slow.
+		 */
+		retry_slow = true;
+	}
 nopan:
 	__mmu_config_restore(&config);
 out:
 	local_irq_restore(flags);
 
 	write_unlock(&vcpu->kvm->mmu_lock);
+
+	/*
+	 * If retry_slow is true, then we either are missing shadow S2
+	 * entries, have paged out guest S1, or something is inconsistent.
+	 *
+	 * Either way, we need to walk the PTs by hand so that we can either
+	 * fault things back in, or record accurate fault information along
+	 * the way.
+	 */
+	if (retry_slow) {
+		par = handle_at_slow(vcpu, op, vaddr);
+		vcpu_write_sys_reg(vcpu, par, PAR_EL1);
+	}
 }
 
 void __kvm_at_s1e2(struct kvm_vcpu *vcpu, u32 op, u64 vaddr)
@@ -433,6 +931,10 @@ void __kvm_at_s1e2(struct kvm_vcpu *vcpu, u32 op, u64 vaddr)
 
 	write_unlock(&vcpu->kvm->mmu_lock);
 
+	/* We failed the translation, let's replay it in slow motion */
+	if (!fail && (par & SYS_PAR_EL1_F))
+		par = handle_at_slow(vcpu, op, vaddr);
+
 	vcpu_write_sys_reg(vcpu, par, PAR_EL1);
 }
 
-- 
2.39.2



^ permalink raw reply related	[flat|nested] 50+ messages in thread

* [PATCH 11/12] KVM: arm64: nv: Plumb handling of AT S1* traps from EL2
  2024-07-08 16:57 ` [PATCH 10/12] KVM: arm64: nv: Add SW walker for AT S1 emulation Marc Zyngier
@ 2024-07-08 16:57   ` Marc Zyngier
  2024-07-08 16:58   ` [PATCH 12/12] KVM: arm64: nv: Add support for FEAT_ATS1A Marc Zyngier
                     ` (7 subsequent siblings)
  8 siblings, 0 replies; 50+ messages in thread
From: Marc Zyngier @ 2024-07-08 16:57 UTC (permalink / raw)
  To: kvmarm, linux-arm-kernel, kvm
  Cc: James Morse, Suzuki K Poulose, Oliver Upton, Zenghui Yu,
	Joey Gouly

Hooray, we're done. Plug the AT traps into the system instruction
table, and let it rip.

Signed-off-by: Marc Zyngier <maz@kernel.org>
---
 arch/arm64/kvm/sys_regs.c | 45 +++++++++++++++++++++++++++++++++++++++
 1 file changed, 45 insertions(+)

diff --git a/arch/arm64/kvm/sys_regs.c b/arch/arm64/kvm/sys_regs.c
index 06c39f191b5ec..d8dadcb9b5e3f 100644
--- a/arch/arm64/kvm/sys_regs.c
+++ b/arch/arm64/kvm/sys_regs.c
@@ -2797,6 +2797,36 @@ static const struct sys_reg_desc sys_reg_descs[] = {
 	EL2_REG(SP_EL2, NULL, reset_unknown, 0),
 };
 
+static bool handle_at_s1e01(struct kvm_vcpu *vcpu, struct sys_reg_params *p,
+			    const struct sys_reg_desc *r)
+{
+	u32 op = sys_insn(p->Op0, p->Op1, p->CRn, p->CRm, p->Op2);
+
+	__kvm_at_s1e01(vcpu, op, p->regval);
+
+	return true;
+}
+
+static bool handle_at_s1e2(struct kvm_vcpu *vcpu, struct sys_reg_params *p,
+			   const struct sys_reg_desc *r)
+{
+	u32 op = sys_insn(p->Op0, p->Op1, p->CRn, p->CRm, p->Op2);
+
+	__kvm_at_s1e2(vcpu, op, p->regval);
+
+	return true;
+}
+
+static bool handle_at_s12(struct kvm_vcpu *vcpu, struct sys_reg_params *p,
+			  const struct sys_reg_desc *r)
+{
+	u32 op = sys_insn(p->Op0, p->Op1, p->CRn, p->CRm, p->Op2);
+
+	__kvm_at_s12(vcpu, op, p->regval);
+
+	return true;
+}
+
 static bool kvm_supported_tlbi_s12_op(struct kvm_vcpu *vpcu, u32 instr)
 {
 	struct kvm *kvm = vpcu->kvm;
@@ -3059,6 +3089,14 @@ static struct sys_reg_desc sys_insn_descs[] = {
 	{ SYS_DESC(SYS_DC_ISW), access_dcsw },
 	{ SYS_DESC(SYS_DC_IGSW), access_dcgsw },
 	{ SYS_DESC(SYS_DC_IGDSW), access_dcgsw },
+
+	SYS_INSN(AT_S1E1R, handle_at_s1e01),
+	SYS_INSN(AT_S1E1W, handle_at_s1e01),
+	SYS_INSN(AT_S1E0R, handle_at_s1e01),
+	SYS_INSN(AT_S1E0W, handle_at_s1e01),
+	SYS_INSN(AT_S1E1RP, handle_at_s1e01),
+	SYS_INSN(AT_S1E1WP, handle_at_s1e01),
+
 	{ SYS_DESC(SYS_DC_CSW), access_dcsw },
 	{ SYS_DESC(SYS_DC_CGSW), access_dcgsw },
 	{ SYS_DESC(SYS_DC_CGDSW), access_dcgsw },
@@ -3138,6 +3176,13 @@ static struct sys_reg_desc sys_insn_descs[] = {
 	SYS_INSN(TLBI_VALE1NXS, handle_tlbi_el1),
 	SYS_INSN(TLBI_VAALE1NXS, handle_tlbi_el1),
 
+	SYS_INSN(AT_S1E2R, handle_at_s1e2),
+	SYS_INSN(AT_S1E2W, handle_at_s1e2),
+	SYS_INSN(AT_S12E1R, handle_at_s12),
+	SYS_INSN(AT_S12E1W, handle_at_s12),
+	SYS_INSN(AT_S12E0R, handle_at_s12),
+	SYS_INSN(AT_S12E0W, handle_at_s12),
+
 	SYS_INSN(TLBI_IPAS2E1IS, handle_ipas2e1is),
 	SYS_INSN(TLBI_RIPAS2E1IS, handle_ripas2e1is),
 	SYS_INSN(TLBI_IPAS2LE1IS, handle_ipas2e1is),
-- 
2.39.2



^ permalink raw reply related	[flat|nested] 50+ messages in thread

* [PATCH 12/12] KVM: arm64: nv: Add support for FEAT_ATS1A
  2024-07-08 16:57 ` [PATCH 10/12] KVM: arm64: nv: Add SW walker for AT S1 emulation Marc Zyngier
  2024-07-08 16:57   ` [PATCH 11/12] KVM: arm64: nv: Plumb handling of AT S1* traps from EL2 Marc Zyngier
@ 2024-07-08 16:58   ` Marc Zyngier
  2024-07-10 15:12   ` [PATCH 10/12] KVM: arm64: nv: Add SW walker for AT S1 emulation Alexandru Elisei
                     ` (6 subsequent siblings)
  8 siblings, 0 replies; 50+ messages in thread
From: Marc Zyngier @ 2024-07-08 16:58 UTC (permalink / raw)
  To: kvmarm, linux-arm-kernel, kvm
  Cc: James Morse, Suzuki K Poulose, Oliver Upton, Zenghui Yu,
	Joey Gouly

Handling FEAT_ATS1A (which provides the AT S1E{1,2}A instructions)
is pretty easy, as it is just the usual AT without the permission
check.

This basically amounts to plumbing the instructions in the various
dispatch tables, and handling FEAT_ATS1A being disabled in the
ID registers.

Signed-off-by: Marc Zyngier <maz@kernel.org>
---
 arch/arm64/include/asm/sysreg.h |  1 +
 arch/arm64/kvm/at.c             |  9 +++++++++
 arch/arm64/kvm/emulate-nested.c |  2 ++
 arch/arm64/kvm/sys_regs.c       | 11 +++++++++++
 4 files changed, 23 insertions(+)

diff --git a/arch/arm64/include/asm/sysreg.h b/arch/arm64/include/asm/sysreg.h
index 15c073359c9e9..73fa79b5a51d1 100644
--- a/arch/arm64/include/asm/sysreg.h
+++ b/arch/arm64/include/asm/sysreg.h
@@ -670,6 +670,7 @@
 #define OP_AT_S12E1W	sys_insn(AT_Op0, 4, AT_CRn, 8, 5)
 #define OP_AT_S12E0R	sys_insn(AT_Op0, 4, AT_CRn, 8, 6)
 #define OP_AT_S12E0W	sys_insn(AT_Op0, 4, AT_CRn, 8, 7)
+#define OP_AT_S1E2A	sys_insn(AT_Op0, 4, AT_CRn, 9, 2)
 
 /* TLBI instructions */
 #define TLBI_Op0	1
diff --git a/arch/arm64/kvm/at.c b/arch/arm64/kvm/at.c
index 8452273cbff6d..1e1255d244712 100644
--- a/arch/arm64/kvm/at.c
+++ b/arch/arm64/kvm/at.c
@@ -682,6 +682,9 @@ static u64 handle_at_slow(struct kvm_vcpu *vcpu, u32 op, u64 vaddr)
 	case OP_AT_S1E0W:
 		perm_fail |= !uw;
 		break;
+	case OP_AT_S1E1A:
+	case OP_AT_S1E2A:
+		break;
 	default:
 		BUG();
 	}
@@ -794,6 +797,9 @@ void __kvm_at_s1e01(struct kvm_vcpu *vcpu, u32 op, u64 vaddr)
 	case OP_AT_S1E0W:
 		fail = __kvm_at(OP_AT_S1E0W, vaddr);
 		break;
+	case OP_AT_S1E1A:
+		fail = __kvm_at(OP_AT_S1E1A, vaddr);
+		break;
 	default:
 		WARN_ON_ONCE(1);
 		fail = true;
@@ -912,6 +918,9 @@ void __kvm_at_s1e2(struct kvm_vcpu *vcpu, u32 op, u64 vaddr)
 	case OP_AT_S1E2W:
 		fail = __kvm_at(OP_AT_S1E1W, vaddr);
 		break;
+	case OP_AT_S1E2A:
+		fail = __kvm_at(OP_AT_S1E1A, vaddr);
+		break;
 	default:
 		WARN_ON_ONCE(1);
 		fail = true;
diff --git a/arch/arm64/kvm/emulate-nested.c b/arch/arm64/kvm/emulate-nested.c
index 96b837fe51562..b5ac298f76705 100644
--- a/arch/arm64/kvm/emulate-nested.c
+++ b/arch/arm64/kvm/emulate-nested.c
@@ -774,6 +774,7 @@ static const struct encoding_to_trap_config encoding_to_cgt[] __initconst = {
 	SR_TRAP(OP_AT_S12E1W,		CGT_HCR_NV),
 	SR_TRAP(OP_AT_S12E0R,		CGT_HCR_NV),
 	SR_TRAP(OP_AT_S12E0W,		CGT_HCR_NV),
+	SR_TRAP(OP_AT_S1E2A,		CGT_HCR_NV),
 	SR_TRAP(OP_TLBI_IPAS2E1,	CGT_HCR_NV),
 	SR_TRAP(OP_TLBI_RIPAS2E1,	CGT_HCR_NV),
 	SR_TRAP(OP_TLBI_IPAS2LE1,	CGT_HCR_NV),
@@ -855,6 +856,7 @@ static const struct encoding_to_trap_config encoding_to_cgt[] __initconst = {
 	SR_TRAP(OP_AT_S1E0W, 		CGT_HCR_AT),
 	SR_TRAP(OP_AT_S1E1RP, 		CGT_HCR_AT),
 	SR_TRAP(OP_AT_S1E1WP, 		CGT_HCR_AT),
+	SR_TRAP(OP_AT_S1E1A,		CGT_HCR_AT),
 	SR_TRAP(SYS_ERXPFGF_EL1,	CGT_HCR_nFIEN),
 	SR_TRAP(SYS_ERXPFGCTL_EL1,	CGT_HCR_nFIEN),
 	SR_TRAP(SYS_ERXPFGCDN_EL1,	CGT_HCR_nFIEN),
diff --git a/arch/arm64/kvm/sys_regs.c b/arch/arm64/kvm/sys_regs.c
index d8dadcb9b5e3f..834893e461451 100644
--- a/arch/arm64/kvm/sys_regs.c
+++ b/arch/arm64/kvm/sys_regs.c
@@ -2812,6 +2812,13 @@ static bool handle_at_s1e2(struct kvm_vcpu *vcpu, struct sys_reg_params *p,
 {
 	u32 op = sys_insn(p->Op0, p->Op1, p->CRn, p->CRm, p->Op2);
 
+	/* There is no FGT associated with AT S1E2A :-( */
+	if (op == OP_AT_S1E2A &&
+	    !kvm_has_feat(vcpu->kvm, ID_AA64ISAR2_EL1, ATS1A, IMP)) {
+		kvm_inject_undefined(vcpu);
+		return false;
+	}
+
 	__kvm_at_s1e2(vcpu, op, p->regval);
 
 	return true;
@@ -3182,6 +3189,7 @@ static struct sys_reg_desc sys_insn_descs[] = {
 	SYS_INSN(AT_S12E1W, handle_at_s12),
 	SYS_INSN(AT_S12E0R, handle_at_s12),
 	SYS_INSN(AT_S12E0W, handle_at_s12),
+	SYS_INSN(AT_S1E2A, handle_at_s1e2),
 
 	SYS_INSN(TLBI_IPAS2E1IS, handle_ipas2e1is),
 	SYS_INSN(TLBI_RIPAS2E1IS, handle_ripas2e1is),
@@ -4630,6 +4638,9 @@ void kvm_calculate_traps(struct kvm_vcpu *vcpu)
 						HFGITR_EL2_TLBIRVAAE1OS	|
 						HFGITR_EL2_TLBIRVAE1OS);
 
+	if (!kvm_has_feat(kvm, ID_AA64ISAR2_EL1, ATS1A, IMP))
+		kvm->arch.fgu[HFGITR_GROUP] |= HFGITR_EL2_ATS1E1A;
+
 	if (!kvm_has_feat(kvm, ID_AA64MMFR1_EL1, PAN, PAN2))
 		kvm->arch.fgu[HFGITR_GROUP] |= (HFGITR_EL2_ATS1E1RP |
 						HFGITR_EL2_ATS1E1WP);
-- 
2.39.2



^ permalink raw reply related	[flat|nested] 50+ messages in thread

* Re: [PATCH 00/12] KVM: arm64: nv: Add support for address translation instructions
  2024-07-08 16:28 ` [PATCH 00/12] KVM: arm64: nv: Add support for address translation instructions Alexandru Elisei
@ 2024-07-08 17:00   ` Marc Zyngier
  0 siblings, 0 replies; 50+ messages in thread
From: Marc Zyngier @ 2024-07-08 17:00 UTC (permalink / raw)
  To: Alexandru Elisei
  Cc: kvmarm, linux-arm-kernel, kvm, James Morse, Suzuki K Poulose,
	Oliver Upton, Zenghui Yu, Joey Gouly

Hi Alex,

On Mon, 08 Jul 2024 17:28:11 +0100,
Alexandru Elisei <alexandru.elisei@arm.com> wrote:
> 
> Hi Marc,
> 
> On Tue, Jun 25, 2024 at 02:34:59PM +0100, Marc Zyngier wrote:
> > Another task that a hypervisor supporting NV on arm64 has to deal with
> > is to emulate the AT instruction, because we multiplex all the S1
> > translations on a single set of registers, and the guest S2 is never
> > truly resident on the CPU.
> > 
> > So given that we lie about page tables, we also have to lie about
> > translation instructions, hence the emulation. Things are made
> > complicated by the fact that guest S1 page tables can be swapped out,
> > and that our shadow S2 is likely to be incomplete. So while using AT
> > to emulate AT is tempting (and useful), it is not going to always
> > work, and we thus need a fallback in the shape of a SW S1 walker.
> > 
> > This series is built in 4 basic blocks:
> > 
> > - Add missing definition and basic reworking
> > 
> > - Dumb emulation of all relevant AT instructions using AT instructions
> > 
> > - Add a SW S1 walker that is using our S2 walker
> 
> I wanted to have a look at the S1 walker, and in my inbox I only have
> patches #1 to #9 ("KVM: arm64: nv: Make ps_to_output_size() generally
> available"). Checked on the kvm mailing list archive [1], same thing; a
> google search for the string "KVM: arm64: nv: Add SW walker for AT S1
> emulation" (quotes included) turns up the cover letter.
> 
> Am I looking in the wrong places?
> 
> [1] https://www.spinics.net/lists/kvm/msg351826.html

This is very odd. I probably have sent them by specifying 000*patch
instead of 00*patch, hence the truncation to 9 patches.

Let me try and send the delta. With a bit of luck, it won't make a
mess in the archive[1].

Thanks for the heads up,

	M.

[1] https://lore.kernel.org/all/20240625133508.259829-1-maz@kernel.or

-- 
Without deviation from the norm, progress is not possible.


^ permalink raw reply	[flat|nested] 50+ messages in thread

* Re: [PATCH 10/12] KVM: arm64: nv: Add SW walker for AT S1 emulation
  2024-07-08 16:57 ` [PATCH 10/12] KVM: arm64: nv: Add SW walker for AT S1 emulation Marc Zyngier
  2024-07-08 16:57   ` [PATCH 11/12] KVM: arm64: nv: Plumb handling of AT S1* traps from EL2 Marc Zyngier
  2024-07-08 16:58   ` [PATCH 12/12] KVM: arm64: nv: Add support for FEAT_ATS1A Marc Zyngier
@ 2024-07-10 15:12   ` Alexandru Elisei
  2024-07-11  8:05     ` Marc Zyngier
  2024-07-11 10:56   ` Alexandru Elisei
                     ` (5 subsequent siblings)
  8 siblings, 1 reply; 50+ messages in thread
From: Alexandru Elisei @ 2024-07-10 15:12 UTC (permalink / raw)
  To: Marc Zyngier
  Cc: kvmarm, linux-arm-kernel, kvm, James Morse, Suzuki K Poulose,
	Oliver Upton, Zenghui Yu, Joey Gouly

Hi Marc,

On Mon, Jul 08, 2024 at 05:57:58PM +0100, Marc Zyngier wrote:
> In order to plug the brokenness of our current AT implementation,
> we need a SW walker that is going to... err.. walk the S1 tables
> and tell us what it finds.
> 
> Of course, it builds on top of our S2 walker, and share similar
> concepts. The beauty of it is that since it uses kvm_read_guest(),
> it is able to bring back pages that have been otherwise evicted.
> 
> This is then plugged in the two AT S1 emulation functions as
> a "slow path" fallback. I'm not sure it is that slow, but hey.
> 
> Signed-off-by: Marc Zyngier <maz@kernel.org>
> [..]
> @@ -331,18 +801,17 @@ void __kvm_at_s1e01(struct kvm_vcpu *vcpu, u32 op, u64 vaddr)
>  	}
>  
>  	if (!fail)
> -		par = read_sysreg(par_el1);
> +		par = read_sysreg_par();
>  	else
>  		par = SYS_PAR_EL1_F;
>  
> +	retry_slow = !fail;
> +
>  	vcpu_write_sys_reg(vcpu, par, PAR_EL1);
>  
>  	/*
> -	 * Failed? let's leave the building now.
> -	 *
> -	 * FIXME: how about a failed translation because the shadow S2
> -	 * wasn't populated? We may need to perform a SW PTW,
> -	 * populating our shadow S2 and retry the instruction.
> +	 * Failed? let's leave the building now, unless we retry on
> +	 * the slow path.
>  	 */
>  	if (par & SYS_PAR_EL1_F)
>  		goto nopan;

This is what follows after the 'if' statement above, and before the 'switch'
below:

        /* No PAN? No problem. */
        if (!(*vcpu_cpsr(vcpu) & PSR_PAN_BIT))
                goto nopan;

When KVM is executing this statement, the following is true:

1. SYS_PAR_EL1_F is clear => the hardware translation table walk was successful.
2. retry_slow = true;

Then if the PAN bit is not set, the function jumps to the nopan label, and
performs a software translation table walk, even though the hardware walk
performed by AT was successful.

> @@ -354,29 +823,58 @@ void __kvm_at_s1e01(struct kvm_vcpu *vcpu, u32 op, u64 vaddr)
>  	switch (op) {
>  	case OP_AT_S1E1RP:
>  	case OP_AT_S1E1WP:
> +		retry_slow = false;
>  		fail = check_at_pan(vcpu, vaddr, &par);
>  		break;
>  	default:
>  		goto nopan;
>  	}
>  
> +	if (fail) {
> +		vcpu_write_sys_reg(vcpu, SYS_PAR_EL1_F, PAR_EL1);
> +		goto nopan;
> +	}
> +
>  	/*
>  	 * If the EL0 translation has succeeded, we need to pretend
>  	 * the AT operation has failed, as the PAN setting forbids
>  	 * such a translation.
> -	 *
> -	 * FIXME: we hardcode a Level-3 permission fault. We really
> -	 * should return the real fault level.
>  	 */
> -	if (fail || !(par & SYS_PAR_EL1_F))
> -		vcpu_write_sys_reg(vcpu, (0xf << 1) | SYS_PAR_EL1_F, PAR_EL1);
> -
> +	if (par & SYS_PAR_EL1_F) {
> +		u8 fst = FIELD_GET(SYS_PAR_EL1_FST, par);
> +
> +		/*
> +		 * If we get something other than a permission fault, we
> +		 * need to retry, as we're likely to have missed in the PTs.
> +		 */
> +		if ((fst & ESR_ELx_FSC_TYPE) != ESR_ELx_FSC_PERM)
> +			retry_slow = true;

Shouldn't VCPU's PAR_EL1 register be updated here? As far as I can tell, at this
point the VCPU PAR_EL1 register has the result from the successful walk
performed by AT S1E1R or AT S1E1W in the first 'switch' statement.

Thanks,
Alex

> +	} else {
> +		/*
> +		 * The EL0 access succeeded, but we don't have the full
> +		 * syndrome information to synthesize the failure. Go slow.
> +		 */
> +		retry_slow = true;
> +	}
>  nopan:
>  	__mmu_config_restore(&config);
>  out:
>  	local_irq_restore(flags);
>  
>  	write_unlock(&vcpu->kvm->mmu_lock);
> +
> +	/*
> +	 * If retry_slow is true, then we either are missing shadow S2
> +	 * entries, have paged out guest S1, or something is inconsistent.
> +	 *
> +	 * Either way, we need to walk the PTs by hand so that we can either
> +	 * fault things back in, or record accurate fault information along
> +	 * the way.
> +	 */
> +	if (retry_slow) {
> +		par = handle_at_slow(vcpu, op, vaddr);
> +		vcpu_write_sys_reg(vcpu, par, PAR_EL1);
> +	}
>  }


^ permalink raw reply	[flat|nested] 50+ messages in thread

* Re: [PATCH 10/12] KVM: arm64: nv: Add SW walker for AT S1 emulation
  2024-07-10 15:12   ` [PATCH 10/12] KVM: arm64: nv: Add SW walker for AT S1 emulation Alexandru Elisei
@ 2024-07-11  8:05     ` Marc Zyngier
  0 siblings, 0 replies; 50+ messages in thread
From: Marc Zyngier @ 2024-07-11  8:05 UTC (permalink / raw)
  To: Alexandru Elisei
  Cc: kvmarm, linux-arm-kernel, kvm, James Morse, Suzuki K Poulose,
	Oliver Upton, Zenghui Yu, Joey Gouly

On Wed, 10 Jul 2024 16:12:53 +0100,
Alexandru Elisei <alexandru.elisei@arm.com> wrote:
> 
> Hi Marc,
> 
> On Mon, Jul 08, 2024 at 05:57:58PM +0100, Marc Zyngier wrote:
> > In order to plug the brokenness of our current AT implementation,
> > we need a SW walker that is going to... err.. walk the S1 tables
> > and tell us what it finds.
> > 
> > Of course, it builds on top of our S2 walker, and share similar
> > concepts. The beauty of it is that since it uses kvm_read_guest(),
> > it is able to bring back pages that have been otherwise evicted.
> > 
> > This is then plugged in the two AT S1 emulation functions as
> > a "slow path" fallback. I'm not sure it is that slow, but hey.
> > 
> > Signed-off-by: Marc Zyngier <maz@kernel.org>
> > [..]
> > @@ -331,18 +801,17 @@ void __kvm_at_s1e01(struct kvm_vcpu *vcpu, u32 op, u64 vaddr)
> >  	}
> >  
> >  	if (!fail)
> > -		par = read_sysreg(par_el1);
> > +		par = read_sysreg_par();
> >  	else
> >  		par = SYS_PAR_EL1_F;
> >  
> > +	retry_slow = !fail;
> > +
> >  	vcpu_write_sys_reg(vcpu, par, PAR_EL1);
> >  
> >  	/*
> > -	 * Failed? let's leave the building now.
> > -	 *
> > -	 * FIXME: how about a failed translation because the shadow S2
> > -	 * wasn't populated? We may need to perform a SW PTW,
> > -	 * populating our shadow S2 and retry the instruction.
> > +	 * Failed? let's leave the building now, unless we retry on
> > +	 * the slow path.
> >  	 */
> >  	if (par & SYS_PAR_EL1_F)
> >  		goto nopan;
> 
> This is what follows after the 'if' statement above, and before the 'switch'
> below:
> 
>         /* No PAN? No problem. */
>         if (!(*vcpu_cpsr(vcpu) & PSR_PAN_BIT))
>                 goto nopan;
> 
> When KVM is executing this statement, the following is true:
> 
> 1. SYS_PAR_EL1_F is clear => the hardware translation table walk was successful.
> 2. retry_slow = true;
>
> Then if the PAN bit is not set, the function jumps to the nopan label, and
> performs a software translation table walk, even though the hardware walk
> performed by AT was successful.

Hmmm. Are you being polite and trying to avoid saying that this code
is broken and that I should look for a retirement home instead?
There, I've said it for you! ;-)

The more I stare at this code, the more I hate it. Trying to
interleave the replay condition with the many potential failure modes
of the HW walker feels completely wrong, and I feel that I'd better
split the whole thing in two:

void __kvm_at_s1e01(struct kvm_vcpu *vcpu, u32 op, u64 vaddr)
{
	__kvm_at_s1e01_hw(vcpu, op, vaddr);
	if (vcpu_read_sys_reg(vcpu, PAR_EL1) & SYS_PAR_EL1_F)
		__kvm_at_s1e01_sw(vcpu, op, vaddr);
}

and completely stop messing with things. This is AT S1 we're talking
about, not something that happens at any sort of high frequency. Apart
from Xen. But as Butch said: "Xen's dead, baby. Xen's dead."

> 
> > @@ -354,29 +823,58 @@ void __kvm_at_s1e01(struct kvm_vcpu *vcpu, u32 op, u64 vaddr)
> >  	switch (op) {
> >  	case OP_AT_S1E1RP:
> >  	case OP_AT_S1E1WP:
> > +		retry_slow = false;
> >  		fail = check_at_pan(vcpu, vaddr, &par);
> >  		break;
> >  	default:
> >  		goto nopan;
> >  	}
> >  
> > +	if (fail) {
> > +		vcpu_write_sys_reg(vcpu, SYS_PAR_EL1_F, PAR_EL1);
> > +		goto nopan;
> > +	}
> > +
> >  	/*
> >  	 * If the EL0 translation has succeeded, we need to pretend
> >  	 * the AT operation has failed, as the PAN setting forbids
> >  	 * such a translation.
> > -	 *
> > -	 * FIXME: we hardcode a Level-3 permission fault. We really
> > -	 * should return the real fault level.
> >  	 */
> > -	if (fail || !(par & SYS_PAR_EL1_F))
> > -		vcpu_write_sys_reg(vcpu, (0xf << 1) | SYS_PAR_EL1_F, PAR_EL1);
> > -
> > +	if (par & SYS_PAR_EL1_F) {
> > +		u8 fst = FIELD_GET(SYS_PAR_EL1_FST, par);
> > +
> > +		/*
> > +		 * If we get something other than a permission fault, we
> > +		 * need to retry, as we're likely to have missed in the PTs.
> > +		 */
> > +		if ((fst & ESR_ELx_FSC_TYPE) != ESR_ELx_FSC_PERM)
> > +			retry_slow = true;
> 
> Shouldn't VCPU's PAR_EL1 register be updated here? As far as I can tell, at this
> point the VCPU PAR_EL1 register has the result from the successful walk
> performed by AT S1E1R or AT S1E1W in the first 'switch' statement.

Yup, yet another sign that this flow is broken. I'll apply my last few
grey cells to it, and hopefully the next iteration will be a bit
better.

Thanks a lot for having a look!

	M.

-- 
Without deviation from the norm, progress is not possible.


^ permalink raw reply	[flat|nested] 50+ messages in thread

* Re: [PATCH 10/12] KVM: arm64: nv: Add SW walker for AT S1 emulation
  2024-07-08 16:57 ` [PATCH 10/12] KVM: arm64: nv: Add SW walker for AT S1 emulation Marc Zyngier
                     ` (2 preceding siblings ...)
  2024-07-10 15:12   ` [PATCH 10/12] KVM: arm64: nv: Add SW walker for AT S1 emulation Alexandru Elisei
@ 2024-07-11 10:56   ` Alexandru Elisei
  2024-07-11 12:16     ` Marc Zyngier
  2024-07-18 15:16   ` Alexandru Elisei
                     ` (4 subsequent siblings)
  8 siblings, 1 reply; 50+ messages in thread
From: Alexandru Elisei @ 2024-07-11 10:56 UTC (permalink / raw)
  To: Marc Zyngier
  Cc: kvmarm, linux-arm-kernel, kvm, James Morse, Suzuki K Poulose,
	Oliver Upton, Zenghui Yu, Joey Gouly

Hi,

On Mon, Jul 08, 2024 at 05:57:58PM +0100, Marc Zyngier wrote:
> In order to plug the brokenness of our current AT implementation,
> we need a SW walker that is going to... err.. walk the S1 tables
> and tell us what it finds.
> 
> Of course, it builds on top of our S2 walker, and share similar
> concepts. The beauty of it is that since it uses kvm_read_guest(),
> it is able to bring back pages that have been otherwise evicted.
> 
> This is then plugged in the two AT S1 emulation functions as
> a "slow path" fallback. I'm not sure it is that slow, but hey.
> [..]
>  	switch (op) {
>  	case OP_AT_S1E1RP:
>  	case OP_AT_S1E1WP:
> +		retry_slow = false;
>  		fail = check_at_pan(vcpu, vaddr, &par);
>  		break;
>  	default:
>  		goto nopan;
>  	}

For context, this is what check_at_pan() does:

static int check_at_pan(struct kvm_vcpu *vcpu, u64 vaddr, u64 *res)
{
        u64 par_e0;
        int error;

        /*
         * For PAN-involved AT operations, perform the same translation,
         * using EL0 this time. Twice. Much fun.
         */
        error = __kvm_at(OP_AT_S1E0R, vaddr);
        if (error)
                return error;

        par_e0 = read_sysreg_par();
        if (!(par_e0 & SYS_PAR_EL1_F))
                goto out;

        error = __kvm_at(OP_AT_S1E0W, vaddr);
        if (error)
                return error;

        par_e0 = read_sysreg_par();
out:
        *res = par_e0;
        return 0;
}

I'm having a hard time understanding why KVM is doing both AT S1E0R and AT S1E0W
regardless of the type of the access (read/write) in the PAN-aware AT
instruction. Would you mind elaborating on that?

> +	if (fail) {
> +		vcpu_write_sys_reg(vcpu, SYS_PAR_EL1_F, PAR_EL1);
> +		goto nopan;
> +	}
> [..]
> +	if (par & SYS_PAR_EL1_F) {
> +		u8 fst = FIELD_GET(SYS_PAR_EL1_FST, par);
> +
> +		/*
> +		 * If we get something other than a permission fault, we
> +		 * need to retry, as we're likely to have missed in the PTs.
> +		 */
> +		if ((fst & ESR_ELx_FSC_TYPE) != ESR_ELx_FSC_PERM)
> +			retry_slow = true;
> +	} else {
> +		/*
> +		 * The EL0 access succeeded, but we don't have the full
> +		 * syndrome information to synthesize the failure. Go slow.
> +		 */
> +		retry_slow = true;
> +	}

This is what PSTATE.PAN controls:

If the Effective value of PSTATE.PAN is 1, then a privileged data access from
any of the following Exception levels to a virtual memory address that is
accessible to data accesses at EL0 generates a stage 1 Permission fault:

- A privileged data access from EL1.
- If HCR_EL2.E2H is 1, then a privileged data access from EL2.

With that in mind, I am really struggling to understand the logic.

If AT S1E0{R,W} (from check_at_pan()) failed, doesn't that mean that the virtual
memory address is not accessible to EL0? Add that to the fact that the AT
S1E1{R,W} (from the beginning of __kvm_at_s1e01()) succeeded, doesn't that mean
that AT S1E1{R,W}P should succeed, and furthermore the PAR_EL1 value should be
the one KVM got from AT S1E1{R,W}?

Thanks,
Alex

>  nopan:
>  	__mmu_config_restore(&config);
>  out:
>  	local_irq_restore(flags);
>  
>  	write_unlock(&vcpu->kvm->mmu_lock);
> +
> +	/*
> +	 * If retry_slow is true, then we either are missing shadow S2
> +	 * entries, have paged out guest S1, or something is inconsistent.
> +	 *
> +	 * Either way, we need to walk the PTs by hand so that we can either
> +	 * fault things back in, or record accurate fault information along
> +	 * the way.
> +	 */
> +	if (retry_slow) {
> +		par = handle_at_slow(vcpu, op, vaddr);
> +		vcpu_write_sys_reg(vcpu, par, PAR_EL1);
> +	}
>  }
>  
>  void __kvm_at_s1e2(struct kvm_vcpu *vcpu, u32 op, u64 vaddr)
> @@ -433,6 +931,10 @@ void __kvm_at_s1e2(struct kvm_vcpu *vcpu, u32 op, u64 vaddr)
>  
>  	write_unlock(&vcpu->kvm->mmu_lock);
>  
> +	/* We failed the translation, let's replay it in slow motion */
> +	if (!fail && (par & SYS_PAR_EL1_F))
> +		par = handle_at_slow(vcpu, op, vaddr);
> +
>  	vcpu_write_sys_reg(vcpu, par, PAR_EL1);
>  }
>  
> -- 
> 2.39.2
> 
> 


^ permalink raw reply	[flat|nested] 50+ messages in thread

* Re: [PATCH 10/12] KVM: arm64: nv: Add SW walker for AT S1 emulation
  2024-07-11 10:56   ` Alexandru Elisei
@ 2024-07-11 12:16     ` Marc Zyngier
  2024-07-15 15:30       ` Alexandru Elisei
  0 siblings, 1 reply; 50+ messages in thread
From: Marc Zyngier @ 2024-07-11 12:16 UTC (permalink / raw)
  To: Alexandru Elisei
  Cc: kvmarm, linux-arm-kernel, kvm, James Morse, Suzuki K Poulose,
	Oliver Upton, Zenghui Yu, Joey Gouly

On Thu, 11 Jul 2024 11:56:13 +0100,
Alexandru Elisei <alexandru.elisei@arm.com> wrote:
> 
> Hi,
> 
> On Mon, Jul 08, 2024 at 05:57:58PM +0100, Marc Zyngier wrote:
> > In order to plug the brokenness of our current AT implementation,
> > we need a SW walker that is going to... err.. walk the S1 tables
> > and tell us what it finds.
> > 
> > Of course, it builds on top of our S2 walker, and share similar
> > concepts. The beauty of it is that since it uses kvm_read_guest(),
> > it is able to bring back pages that have been otherwise evicted.
> > 
> > This is then plugged in the two AT S1 emulation functions as
> > a "slow path" fallback. I'm not sure it is that slow, but hey.
> > [..]
> >  	switch (op) {
> >  	case OP_AT_S1E1RP:
> >  	case OP_AT_S1E1WP:
> > +		retry_slow = false;
> >  		fail = check_at_pan(vcpu, vaddr, &par);
> >  		break;
> >  	default:
> >  		goto nopan;
> >  	}
> 
> For context, this is what check_at_pan() does:
> 
> static int check_at_pan(struct kvm_vcpu *vcpu, u64 vaddr, u64 *res)
> {
>         u64 par_e0;
>         int error;
> 
>         /*
>          * For PAN-involved AT operations, perform the same translation,
>          * using EL0 this time. Twice. Much fun.
>          */
>         error = __kvm_at(OP_AT_S1E0R, vaddr);
>         if (error)
>                 return error;
> 
>         par_e0 = read_sysreg_par();
>         if (!(par_e0 & SYS_PAR_EL1_F))
>                 goto out;
> 
>         error = __kvm_at(OP_AT_S1E0W, vaddr);
>         if (error)
>                 return error;
> 
>         par_e0 = read_sysreg_par();
> out:
>         *res = par_e0;
>         return 0;
> }
> 
> I'm having a hard time understanding why KVM is doing both AT S1E0R and AT S1E0W
> regardless of the type of the access (read/write) in the PAN-aware AT
> instruction. Would you mind elaborating on that?

Because that's the very definition of an AT S1E1{W,R}P instruction
when PAN is set. If *any* EL0 permission is set, then the translation
must equally fail. Just like a load or a store from EL1 would fail if
any EL0 permission is set when PSTATE.PAN is set.

Since we cannot check for both permissions at once, we do it twice.
It is worth noting that we don't quite handle the PAN3 case correctly
(because we can't retrieve the *execution* property using AT). I'll
add that to the list of stuff to fix.

> 
> > +	if (fail) {
> > +		vcpu_write_sys_reg(vcpu, SYS_PAR_EL1_F, PAR_EL1);
> > +		goto nopan;
> > +	}
> > [..]
> > +	if (par & SYS_PAR_EL1_F) {
> > +		u8 fst = FIELD_GET(SYS_PAR_EL1_FST, par);
> > +
> > +		/*
> > +		 * If we get something other than a permission fault, we
> > +		 * need to retry, as we're likely to have missed in the PTs.
> > +		 */
> > +		if ((fst & ESR_ELx_FSC_TYPE) != ESR_ELx_FSC_PERM)
> > +			retry_slow = true;
> > +	} else {
> > +		/*
> > +		 * The EL0 access succeeded, but we don't have the full
> > +		 * syndrome information to synthesize the failure. Go slow.
> > +		 */
> > +		retry_slow = true;
> > +	}
> 
> This is what PSTATE.PAN controls:
> 
> If the Effective value of PSTATE.PAN is 1, then a privileged data access from
> any of the following Exception levels to a virtual memory address that is
> accessible to data accesses at EL0 generates a stage 1 Permission fault:
> 
> - A privileged data access from EL1.
> - If HCR_EL2.E2H is 1, then a privileged data access from EL2.
> 
> With that in mind, I am really struggling to understand the logic.

I don't quite see what you don't understand, you'll have to be more
precise. Are you worried about the page tables we're looking at, the
value of PSTATE.PAN, the permission fault, or something else?

It also doesn't help that you're looking at the patch that contains
the integration with the slow-path, which is pretty hard to read (I
have a reworked version that's a bit better). You probably want to
look at the "fast" path alone.

> 
> If AT S1E0{R,W} (from check_at_pan()) failed, doesn't that mean that the virtual
> memory address is not accessible to EL0? Add that to the fact that the AT
> S1E1{R,W} (from the beginning of __kvm_at_s1e01()) succeeded, doesn't that mean
> that AT S1E1{R,W}P should succeed, and furthermore the PAR_EL1 value should be
> the one KVM got from AT S1E1{R,W}?

There are plenty of ways for AT S1E0 to fail when AT S1E1 succeeded:

- no EL0 permission: that's the best case, and the PAR_EL1 obtained
  from the AT S1E1 is the correct one. That's what we return.

- The EL0 access failed, but for another reason than a permission
  fault. This contradicts the EL1 walk, and is a sure sign that
  someone is playing behind our back. We fail.

- exception from AT S1E0: something went wrong (again the guest
  playing with the PTs behind our back). We fail as well.
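
To make the intent concrete, here is the same decision tree written out
as a sketch (the helper and its signature are made up for illustration,
this is not the patch code):

static u64 resolve_at_pan(u64 par_e1, u64 par_e0, int at_err,
			  bool *need_sw_walk)
{
	*need_sw_walk = false;

	/* Exception from AT S1E0: the PTs changed under us, go slow */
	if (at_err) {
		*need_sw_walk = true;
		return SYS_PAR_EL1_F;
	}

	if (par_e0 & SYS_PAR_EL1_F) {
		u8 fst = FIELD_GET(SYS_PAR_EL1_FST, par_e0);

		/* No EL0 permission at all: the EL1 result stands */
		if ((fst & ESR_ELx_FSC_TYPE) == ESR_ELx_FSC_PERM)
			return par_e1;

		/* Any other EL0 fault contradicts the EL1 walk: go slow */
		*need_sw_walk = true;
		return SYS_PAR_EL1_F;
	}

	/*
	 * The EL0 access succeeded: PAN forbids the privileged access,
	 * but we lack the full syndrome, so let the SW walker build it.
	 */
	*need_sw_walk = true;
	return SYS_PAR_EL1_F;
}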

Do you at least agree with these as goals? If you do, what in
the implementation does not satisfy these goals? If you don't, what in
these goals seem improper to you?

Thanks,

	M.

-- 
Without deviation from the norm, progress is not possible.


^ permalink raw reply	[flat|nested] 50+ messages in thread

* Re: [PATCH 02/12] arm64: Add PAR_EL1 field description
  2024-06-25 13:35 ` [PATCH 02/12] arm64: Add PAR_EL1 field description Marc Zyngier
@ 2024-07-12  7:06   ` Anshuman Khandual
  2024-07-13  7:56     ` Marc Zyngier
  0 siblings, 1 reply; 50+ messages in thread
From: Anshuman Khandual @ 2024-07-12  7:06 UTC (permalink / raw)
  To: Marc Zyngier, kvmarm, linux-arm-kernel, kvm
  Cc: James Morse, Suzuki K Poulose, Oliver Upton, Zenghui Yu,
	Joey Gouly



On 6/25/24 19:05, Marc Zyngier wrote:
> As KVM is about to grow a full emulation for the AT instructions,
> add the layout of the PAR_EL1 register in its non-D128 configuration.

Right, there are two variants for PAR_EL1 i.e D128 and non-D128. Probably it makes
sense to define all these PAR_EL1 fields in arch/arm64/include/asm/sysreg.h, until
arch/arm64/tools/sysreg evolves to accommodate different bit field layouts for the
same register.

> 
> Note that the constants are a bit ugly, as the register has two
> layouts, based on the state of the F bit.

Just wondering if it would be better to add a 'VALID/INVALID' marker to
the field names to differentiate between when F = 0 and when F = 1?

s/SYS_PAR_EL1_FST/SYS_PAR_INVALID_FST_EL1
s/SYS_PAR_EL1_SH/SYS_PAR_VALID_SH_EL1

Or something similar.

> 
> Signed-off-by: Marc Zyngier <maz@kernel.org>
> ---
>  arch/arm64/include/asm/sysreg.h | 18 ++++++++++++++++++
>  1 file changed, 18 insertions(+)
> 
> diff --git a/arch/arm64/include/asm/sysreg.h b/arch/arm64/include/asm/sysreg.h
> index be41528194569..15c073359c9e9 100644
> --- a/arch/arm64/include/asm/sysreg.h
> +++ b/arch/arm64/include/asm/sysreg.h
> @@ -325,7 +325,25 @@
>  #define SYS_PAR_EL1			sys_reg(3, 0, 7, 4, 0)
>  
>  #define SYS_PAR_EL1_F			BIT(0)
> +/* When PAR_EL1.F == 1 */
>  #define SYS_PAR_EL1_FST			GENMASK(6, 1)
> +#define SYS_PAR_EL1_PTW			BIT(8)
> +#define SYS_PAR_EL1_S			BIT(9)
> +#define SYS_PAR_EL1_AssuredOnly		BIT(12)
> +#define SYS_PAR_EL1_TopLevel		BIT(13)
> +#define SYS_PAR_EL1_Overlay		BIT(14)
> +#define SYS_PAR_EL1_DirtyBit		BIT(15)
> +#define SYS_PAR_EL1_F1_IMPDEF		GENMASK_ULL(63, 48)
> +#define SYS_PAR_EL1_F1_RES0		(BIT(7) | BIT(10) | GENMASK_ULL(47, 16))
> +#define SYS_PAR_EL1_RES1		BIT(11)
> +/* When PAR_EL1.F == 0 */
> +#define SYS_PAR_EL1_SH			GENMASK_ULL(8, 7)
> +#define SYS_PAR_EL1_NS			BIT(9)
> +#define SYS_PAR_EL1_F0_IMPDEF		BIT(10)
> +#define SYS_PAR_EL1_NSE			BIT(11)
> +#define SYS_PAR_EL1_PA			GENMASK_ULL(51, 12)
> +#define SYS_PAR_EL1_ATTR		GENMASK_ULL(63, 56)
> +#define SYS_PAR_EL1_F0_RES0		(GENMASK_ULL(6, 1) | GENMASK_ULL(55, 52))
>  
>  /*** Statistical Profiling Extension ***/
>  #define PMSEVFR_EL1_RES0_IMP	\


^ permalink raw reply	[flat|nested] 50+ messages in thread

* Re: [PATCH 01/12] arm64: Add missing APTable and TCR_ELx.HPD masks
  2024-06-25 13:35 ` [PATCH 01/12] arm64: Add missing APTable and TCR_ELx.HPD masks Marc Zyngier
@ 2024-07-12  8:32   ` Anshuman Khandual
  2024-07-13  8:04     ` Marc Zyngier
  0 siblings, 1 reply; 50+ messages in thread
From: Anshuman Khandual @ 2024-07-12  8:32 UTC (permalink / raw)
  To: Marc Zyngier, kvmarm, linux-arm-kernel, kvm
  Cc: James Morse, Suzuki K Poulose, Oliver Upton, Zenghui Yu,
	Joey Gouly



On 6/25/24 19:05, Marc Zyngier wrote:
> Although Linux doesn't make use of hierarchical permissions (TFFT!),
> KVM needs to know where the various bits related to this feature
> live in the TCR_ELx registers as well as in the page tables.
> 
> Add the missing bits.
> 
> Signed-off-by: Marc Zyngier <maz@kernel.org>
> ---
>  arch/arm64/include/asm/kvm_arm.h       | 1 +
>  arch/arm64/include/asm/pgtable-hwdef.h | 7 +++++++
>  2 files changed, 8 insertions(+)
> 
> diff --git a/arch/arm64/include/asm/kvm_arm.h b/arch/arm64/include/asm/kvm_arm.h
> index b2adc2c6c82a5..c93ee1036cb09 100644
> --- a/arch/arm64/include/asm/kvm_arm.h
> +++ b/arch/arm64/include/asm/kvm_arm.h
> @@ -108,6 +108,7 @@
>  /* TCR_EL2 Registers bits */
>  #define TCR_EL2_DS		(1UL << 32)
>  #define TCR_EL2_RES1		((1U << 31) | (1 << 23))
> +#define TCR_EL2_HPD		(1 << 24)
>  #define TCR_EL2_TBI		(1 << 20)
>  #define TCR_EL2_PS_SHIFT	16
>  #define TCR_EL2_PS_MASK		(7 << TCR_EL2_PS_SHIFT)
> diff --git a/arch/arm64/include/asm/pgtable-hwdef.h b/arch/arm64/include/asm/pgtable-hwdef.h
> index 9943ff0af4c96..f75c9a7e6bd68 100644
> --- a/arch/arm64/include/asm/pgtable-hwdef.h
> +++ b/arch/arm64/include/asm/pgtable-hwdef.h
> @@ -146,6 +146,7 @@
>  #define PMD_SECT_UXN		(_AT(pmdval_t, 1) << 54)
>  #define PMD_TABLE_PXN		(_AT(pmdval_t, 1) << 59)
>  #define PMD_TABLE_UXN		(_AT(pmdval_t, 1) << 60)
> +#define PMD_TABLE_AP		(_AT(pmdval_t, 3) << 61)

APTable bits are also present in all table descriptors at each non-L3
level. Should not the corresponding macros, i.e. PUD_TABLE_AP,
P4D_TABLE_AP, and PGD_TABLE_AP, be added as well?

>  
>  /*
>   * AttrIndx[2:0] encoding (mapping attributes defined in the MAIR* registers).
> @@ -307,6 +308,12 @@
>  #define TCR_TCMA1		(UL(1) << 58)
>  #define TCR_DS			(UL(1) << 59)
>  
> +#define TCR_HPD0_SHIFT		41
> +#define TCR_HPD0		BIT(TCR_HPD0_SHIFT)
> +
> +#define TCR_HPD1_SHIFT		42
> +#define TCR_HPD1		BIT(TCR_HPD1_SHIFT)

Should not these new register fields follow the current ascending bit
order in the listing, i.e. get added after TCR_HD (bit 40)?

> +
>  /*
>   * TTBR.
>   */


^ permalink raw reply	[flat|nested] 50+ messages in thread

* Re: [PATCH 04/12] KVM: arm64: nv: Honor absence of FEAT_PAN2
  2024-06-25 13:35 ` [PATCH 04/12] KVM: arm64: nv: Honor absence of FEAT_PAN2 Marc Zyngier
@ 2024-07-12  8:40   ` Anshuman Khandual
  0 siblings, 0 replies; 50+ messages in thread
From: Anshuman Khandual @ 2024-07-12  8:40 UTC (permalink / raw)
  To: Marc Zyngier, kvmarm, linux-arm-kernel, kvm
  Cc: James Morse, Suzuki K Poulose, Oliver Upton, Zenghui Yu,
	Joey Gouly

On 6/25/24 19:05, Marc Zyngier wrote:
> If our guest has been configured without PAN2, make sure that
> AT S1E1{R,W}P will generate an UNDEF.
> 
> Signed-off-by: Marc Zyngier <maz@kernel.org>
> ---
>  arch/arm64/kvm/sys_regs.c | 4 ++++
>  1 file changed, 4 insertions(+)
> 
> diff --git a/arch/arm64/kvm/sys_regs.c b/arch/arm64/kvm/sys_regs.c
> index 832c6733db307..06c39f191b5ec 100644
> --- a/arch/arm64/kvm/sys_regs.c
> +++ b/arch/arm64/kvm/sys_regs.c
> @@ -4585,6 +4585,10 @@ void kvm_calculate_traps(struct kvm_vcpu *vcpu)
>  						HFGITR_EL2_TLBIRVAAE1OS	|
>  						HFGITR_EL2_TLBIRVAE1OS);
>  
> +	if (!kvm_has_feat(kvm, ID_AA64MMFR1_EL1, PAN, PAN2))
> +		kvm->arch.fgu[HFGITR_GROUP] |= (HFGITR_EL2_ATS1E1RP |
> +						HFGITR_EL2_ATS1E1WP);
> +
>  	if (!kvm_has_feat(kvm, ID_AA64MMFR3_EL1, S1PIE, IMP))
>  		kvm->arch.fgu[HFGxTR_GROUP] |= (HFGxTR_EL2_nPIRE0_EL1 |
>  						HFGxTR_EL2_nPIR_EL1);
As you had explained earlier about FGT UNDEF implementation, the above
code change makes sense.

FWIW

Reviewed-by: Anshuman Khandual <anshuman.khandual@arm.com>


^ permalink raw reply	[flat|nested] 50+ messages in thread

* Re: [PATCH 05/12] KVM: arm64: make kvm_at() take an OP_AT_*
  2024-06-25 13:35 ` [PATCH 05/12] KVM: arm64: make kvm_at() take an OP_AT_* Marc Zyngier
@ 2024-07-12  8:52   ` Anshuman Khandual
  0 siblings, 0 replies; 50+ messages in thread
From: Anshuman Khandual @ 2024-07-12  8:52 UTC (permalink / raw)
  To: Marc Zyngier, kvmarm, linux-arm-kernel, kvm
  Cc: James Morse, Suzuki K Poulose, Oliver Upton, Zenghui Yu,
	Joey Gouly



On 6/25/24 19:05, Marc Zyngier wrote:
> From: Joey Gouly <joey.gouly@arm.com>
> 
> To allow using newer instructions that current assemblers don't know about,
> replace the `at` instruction with the underlying SYS instruction.
> 
> Signed-off-by: Joey Gouly <joey.gouly@arm.com>
> Cc: Marc Zyngier <maz@kernel.org>
> Cc: Oliver Upton <oliver.upton@linux.dev>
> Cc: Catalin Marinas <catalin.marinas@arm.com>
> Cc: Will Deacon <will@kernel.org>
> Reviewed-by: Marc Zyngier <maz@kernel.org>
> Signed-off-by: Marc Zyngier <maz@kernel.org>
> ---
>  arch/arm64/include/asm/kvm_asm.h       | 3 ++-
>  arch/arm64/kvm/hyp/include/hyp/fault.h | 2 +-
>  2 files changed, 3 insertions(+), 2 deletions(-)
> 
> diff --git a/arch/arm64/include/asm/kvm_asm.h b/arch/arm64/include/asm/kvm_asm.h
> index 2181a11b9d925..25f49f5fc4a63 100644
> --- a/arch/arm64/include/asm/kvm_asm.h
> +++ b/arch/arm64/include/asm/kvm_asm.h
> @@ -10,6 +10,7 @@
>  #include <asm/hyp_image.h>
>  #include <asm/insn.h>
>  #include <asm/virt.h>
> +#include <asm/sysreg.h>
>  
>  #define ARM_EXIT_WITH_SERROR_BIT  31
>  #define ARM_EXCEPTION_CODE(x)	  ((x) & ~(1U << ARM_EXIT_WITH_SERROR_BIT))
> @@ -259,7 +260,7 @@ extern u64 __kvm_get_mdcr_el2(void);
>  	asm volatile(							\
>  	"	mrs	%1, spsr_el2\n"					\
>  	"	mrs	%2, elr_el2\n"					\
> -	"1:	at	"at_op", %3\n"					\
> +	"1:	" __msr_s(at_op, "%3") "\n"				\
>  	"	isb\n"							\
>  	"	b	9f\n"						\
>  	"2:	msr	spsr_el2, %1\n"					\
> diff --git a/arch/arm64/kvm/hyp/include/hyp/fault.h b/arch/arm64/kvm/hyp/include/hyp/fault.h
> index 9e13c1bc2ad54..487c06099d6fc 100644
> --- a/arch/arm64/kvm/hyp/include/hyp/fault.h
> +++ b/arch/arm64/kvm/hyp/include/hyp/fault.h
> @@ -27,7 +27,7 @@ static inline bool __translate_far_to_hpfar(u64 far, u64 *hpfar)
>  	 * saved the guest context yet, and we may return early...
>  	 */
>  	par = read_sysreg_par();
> -	if (!__kvm_at("s1e1r", far))
> +	if (!__kvm_at(OP_AT_S1E1R, far))
>  		tmp = read_sysreg_par();
>  	else
>  		tmp = SYS_PAR_EL1_F; /* back to the guest */

LGTM, FWIW

Reviewed-by: Anshuman Khandual <anshuman.khandual@arm.com>


^ permalink raw reply	[flat|nested] 50+ messages in thread

* Re: [PATCH 02/12] arm64: Add PAR_EL1 field description
  2024-07-12  7:06   ` Anshuman Khandual
@ 2024-07-13  7:56     ` Marc Zyngier
  0 siblings, 0 replies; 50+ messages in thread
From: Marc Zyngier @ 2024-07-13  7:56 UTC (permalink / raw)
  To: Anshuman Khandual
  Cc: kvmarm, linux-arm-kernel, kvm, James Morse, Suzuki K Poulose,
	Oliver Upton, Zenghui Yu, Joey Gouly

On Fri, 12 Jul 2024 08:06:31 +0100,
Anshuman Khandual <anshuman.khandual@arm.com> wrote:
> 
> 
> 
> On 6/25/24 19:05, Marc Zyngier wrote:
> > As KVM is about to grow a full emulation for the AT instructions,
> > add the layout of the PAR_EL1 register in its non-D128 configuration.
> 
> Right, there are two variants for PAR_EL1 i.e D128 and non-D128. Probably it makes
> sense to define all these PAR_EL1 fields in arch/arm64/include/asm/sysreg.h, until
> arch/arm64/tools/sysreg evolves to accommodate different bit field layouts for the
> same register.

This is really sorely needed, because we can't describe any of the
registers that change layout depending on another control bit. Take
for example any of the EL2 registers affected by HCR_EL2.E2H.

However, I have no interest in defining *any* D128 format. I take it
that whoever will eventually add D128 support to the kernel (and KVM)
will take care of that.

> 
> > 
> > Note that the constants are a bit ugly, as the register has two
> > layouts, based on the state of the F bit.
> 
> Just wondering if it would be better to append 'VALID/INVALID' suffix
> for the fields to differentiate between when F = 0 and when F = 1 ?
> 
> s/SYS_PAR_EL1_FST/SYS_PAR_INVALID_FST_EL1
> s/SYS_PAR_EL1_SH/SYS_PAR_VALID_SH_EL1
> 
> Or something similar.

I find it pretty horrible.

If anything, because "VALID/INVALID" doesn't say anything of *what* is
invalid. Also, there is no "VALID" definition in the register, and an
aborted translation does not make the register invalid, quite the
opposite -- it is full of crucial information.

Which is why I used the F0/F1 prefixes, making it clear (at least in
my view) that the description is tied to a particular value of the
PAR_EL1.F bit.

Finally, most of the bit layouts are unambiguous: a field of any given
name only exists in a given layout of the register. This means we can
safely have names that match the ARM ARM description without any
visual pollution.

The only ambiguities are with generic names such as RES0 and IMPDEF.
Given that we almost never use these bits for anything, I don't think
the use of a F-specific prefix is a problem.

But yeah, naming is hard.

	M.

-- 
Without deviation from the norm, progress is not possible.


^ permalink raw reply	[flat|nested] 50+ messages in thread

* Re: [PATCH 01/12] arm64: Add missing APTable and TCR_ELx.HPD masks
  2024-07-12  8:32   ` Anshuman Khandual
@ 2024-07-13  8:04     ` Marc Zyngier
  0 siblings, 0 replies; 50+ messages in thread
From: Marc Zyngier @ 2024-07-13  8:04 UTC (permalink / raw)
  To: Anshuman Khandual
  Cc: kvmarm, linux-arm-kernel, kvm, James Morse, Suzuki K Poulose,
	Oliver Upton, Zenghui Yu, Joey Gouly

On Fri, 12 Jul 2024 09:32:12 +0100,
Anshuman Khandual <anshuman.khandual@arm.com> wrote:
> 
> 
> 
> On 6/25/24 19:05, Marc Zyngier wrote:
> > Although Linux doesn't make use of hierarchical permissions (TFFT!),
> > KVM needs to know where the various bits related to this feature
> > live in the TCR_ELx registers as well as in the page tables.
> > 
> > Add the missing bits.
> > 
> > Signed-off-by: Marc Zyngier <maz@kernel.org>
> > ---
> >  arch/arm64/include/asm/kvm_arm.h       | 1 +
> >  arch/arm64/include/asm/pgtable-hwdef.h | 7 +++++++
> >  2 files changed, 8 insertions(+)
> > 
> > diff --git a/arch/arm64/include/asm/kvm_arm.h b/arch/arm64/include/asm/kvm_arm.h
> > index b2adc2c6c82a5..c93ee1036cb09 100644
> > --- a/arch/arm64/include/asm/kvm_arm.h
> > +++ b/arch/arm64/include/asm/kvm_arm.h
> > @@ -108,6 +108,7 @@
> >  /* TCR_EL2 Registers bits */
> >  #define TCR_EL2_DS		(1UL << 32)
> >  #define TCR_EL2_RES1		((1U << 31) | (1 << 23))
> > +#define TCR_EL2_HPD		(1 << 24)
> >  #define TCR_EL2_TBI		(1 << 20)
> >  #define TCR_EL2_PS_SHIFT	16
> >  #define TCR_EL2_PS_MASK		(7 << TCR_EL2_PS_SHIFT)
> > diff --git a/arch/arm64/include/asm/pgtable-hwdef.h b/arch/arm64/include/asm/pgtable-hwdef.h
> > index 9943ff0af4c96..f75c9a7e6bd68 100644
> > --- a/arch/arm64/include/asm/pgtable-hwdef.h
> > +++ b/arch/arm64/include/asm/pgtable-hwdef.h
> > @@ -146,6 +146,7 @@
> >  #define PMD_SECT_UXN		(_AT(pmdval_t, 1) << 54)
> >  #define PMD_TABLE_PXN		(_AT(pmdval_t, 1) << 59)
> >  #define PMD_TABLE_UXN		(_AT(pmdval_t, 1) << 60)
> > +#define PMD_TABLE_AP		(_AT(pmdval_t, 3) << 61)
> 
> APTable bits are also present in all table descriptors at each non-L3
> level. Should not the corresponding macros, i.e. PUD_TABLE_AP,
> P4D_TABLE_AP, and PGD_TABLE_AP, be added as well?

My problem with that is that it doesn't make much sense from an
architecture perspective: the architecture doesn't define any of these,
because these names make no sense there.

Maybe I should just drop the PMD prefix and write it as S1_TABLE_AP,
so that it can be reused if we ever need the P*D names.
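
i.e. something along the lines of (illustrative only):

/* Hierarchical AP bits in a S1 table descriptor, at any level */
#define S1_TABLE_AP		(_AT(pmdval_t, 3) << 61)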

> 
> >  
> >  /*
> >   * AttrIndx[2:0] encoding (mapping attributes defined in the MAIR* registers).
> > @@ -307,6 +308,12 @@
> >  #define TCR_TCMA1		(UL(1) << 58)
> >  #define TCR_DS			(UL(1) << 59)
> >  
> > +#define TCR_HPD0_SHIFT		41
> > +#define TCR_HPD0		BIT(TCR_HPD0_SHIFT)
> > +
> > +#define TCR_HPD1_SHIFT		42
> > +#define TCR_HPD1		BIT(TCR_HPD1_SHIFT)
> 
> Should not these new register fields follow the current ascending bit
> order in the listing, i.e. get added after TCR_HD (bit 40)?

Yup, I'll move them up.

Thanks,

	M.

-- 
Without deviation from the norm, progress is not possible.


^ permalink raw reply	[flat|nested] 50+ messages in thread

* Re: [PATCH 10/12] KVM: arm64: nv: Add SW walker for AT S1 emulation
  2024-07-11 12:16     ` Marc Zyngier
@ 2024-07-15 15:30       ` Alexandru Elisei
  2024-07-18 11:37         ` Marc Zyngier
  0 siblings, 1 reply; 50+ messages in thread
From: Alexandru Elisei @ 2024-07-15 15:30 UTC (permalink / raw)
  To: Marc Zyngier
  Cc: kvmarm, linux-arm-kernel, kvm, James Morse, Suzuki K Poulose,
	Oliver Upton, Zenghui Yu, Joey Gouly

Hi Marc,

On Thu, Jul 11, 2024 at 01:16:42PM +0100, Marc Zyngier wrote:
> On Thu, 11 Jul 2024 11:56:13 +0100,
> Alexandru Elisei <alexandru.elisei@arm.com> wrote:
> > 
> > Hi,
> > 
> > On Mon, Jul 08, 2024 at 05:57:58PM +0100, Marc Zyngier wrote:
> > > In order to plug the brokenness of our current AT implementation,
> > > we need a SW walker that is going to... err.. walk the S1 tables
> > > and tell us what it finds.
> > > 
> > > Of course, it builds on top of our S2 walker, and share similar
> > > concepts. The beauty of it is that since it uses kvm_read_guest(),
> > > it is able to bring back pages that have been otherwise evicted.
> > > 
> > > This is then plugged in the two AT S1 emulation functions as
> > > a "slow path" fallback. I'm not sure it is that slow, but hey.
> > > [..]
> > >  	switch (op) {
> > >  	case OP_AT_S1E1RP:
> > >  	case OP_AT_S1E1WP:
> > > +		retry_slow = false;
> > >  		fail = check_at_pan(vcpu, vaddr, &par);
> > >  		break;
> > >  	default:
> > >  		goto nopan;
> > >  	}
> > 
> > For context, this is what check_at_pan() does:
> > 
> > static int check_at_pan(struct kvm_vcpu *vcpu, u64 vaddr, u64 *res)
> > {
> >         u64 par_e0;
> >         int error;
> > 
> >         /*
> >          * For PAN-involved AT operations, perform the same translation,
> >          * using EL0 this time. Twice. Much fun.
> >          */
> >         error = __kvm_at(OP_AT_S1E0R, vaddr);
> >         if (error)
> >                 return error;
> > 
> >         par_e0 = read_sysreg_par();
> >         if (!(par_e0 & SYS_PAR_EL1_F))
> >                 goto out;
> > 
> >         error = __kvm_at(OP_AT_S1E0W, vaddr);
> >         if (error)
> >                 return error;
> > 
> >         par_e0 = read_sysreg_par();
> > out:
> >         *res = par_e0;
> >         return 0;
> > }
> > 
> > I'm having a hard time understanding why KVM is doing both AT S1E0R and AT S1E0W
> > regardless of the type of the access (read/write) in the PAN-aware AT
> > instruction. Would you mind elaborating on that?
> 
> Because that's the very definition of an AT S1E1{W,R}P instruction
> when PAN is set. If *any* EL0 permission is set, then the translation
> must equally fail. Just like a load or a store from EL1 would fail if
> any EL0 permission is set when PSTATE.PAN is set.
> 
> Since we cannot check for both permissions at once, we do it twice.
> It is worth noting that we don't quite handle the PAN3 case correctly
> (because we can't retrieve the *execution* property using AT). I'll
> add that to the list of stuff to fix.
> 
> > 
> > > +	if (fail) {
> > > +		vcpu_write_sys_reg(vcpu, SYS_PAR_EL1_F, PAR_EL1);
> > > +		goto nopan;
> > > +	}
> > > [..]
> > > +	if (par & SYS_PAR_EL1_F) {
> > > +		u8 fst = FIELD_GET(SYS_PAR_EL1_FST, par);
> > > +
> > > +		/*
> > > +		 * If we get something other than a permission fault, we
> > > +		 * need to retry, as we're likely to have missed in the PTs.
> > > +		 */
> > > +		if ((fst & ESR_ELx_FSC_TYPE) != ESR_ELx_FSC_PERM)
> > > +			retry_slow = true;
> > > +	} else {
> > > +		/*
> > > +		 * The EL0 access succeeded, but we don't have the full
> > > +		 * syndrome information to synthesize the failure. Go slow.
> > > +		 */
> > > +		retry_slow = true;
> > > +	}
> > 
> > This is what PSTATE.PAN controls:
> > 
> > If the Effective value of PSTATE.PAN is 1, then a privileged data access from
> > any of the following Exception levels to a virtual memory address that is
> > accessible to data accesses at EL0 generates a stage 1 Permission fault:
> > 
> > - A privileged data access from EL1.
> > - If HCR_EL2.E2H is 1, then a privileged data access from EL2.
> > 
> > With that in mind, I am really struggling to understand the logic.
> 
> I don't quite see what you don't understand, you'll have to be more
> precise. Are you worried about the page tables we're looking at, the
> value of PSTATE.PAN, the permission fault, or something else?
> 
> It also doesn't help that you're looking at the patch that contains
> the integration with the slow-path, which is pretty hard to read (I
> have a reworked version that's a bit better). You probably want to
> look at the "fast" path alone.

I was referring to checking both unprivileged read and write permissions.

And you are right, sorry, I managed to get myself terribly confused. For
completeness' sake, this matches AArch64.S1DirectBasePermissions(), where if PAN
&& (UnprivRead || UnprivWrite) then PrivRead = False and PrivWrite = False. So
you need to check that both UnprivRead and UnprivWrite are false for the PAN
variants of AT to succeed.
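
As a rough C rendering of that clause (purely for illustration, the
helper name is made up and this is not what the walker does verbatim):

static void apply_pan(bool pan, bool ur, bool uw, bool *pr, bool *pw)
{
	/*
	 * AArch64.S1DirectBasePermissions(): PSTATE.PAN strips the
	 * privileged permissions as soon as any EL0 data permission
	 * is present.
	 */
	if (pan && (ur || uw)) {
		*pr = false;
		*pw = false;
	}
}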

> 
> > 
> > If AT S1E0{R,W} (from check_at_pan()) failed, doesn't that mean that the virtual
> > memory address is not accessible to EL0? Add that to the fact that the AT
> > S1E1{R,W} (from the beginning of __kvm_at_s1e01()) succeeded, doesn't that mean
> > that AT S1E1{R,W}P should succeed, and furthermore the PAR_EL1 value should be
> > the one KVM got from AT S1E1{R,W}?
> 
> There are plenty of ways for AT S1E0 to fail when AT S1E1 succeeded:
> 
> - no EL0 permission: that's the best case, and the PAR_EL1 obtained
>   from the AT S1E1 is the correct one. That's what we return.

Yes, that is correct, the place where VCPUs PAR_EL1 register is set is far
enough from this code that I didn't make the connection.

> 
> - The EL0 access failed, but for another reason than a permission
>   fault. This contradicts the EL1 walk, and is a sure sign that
>   someone is playing behind our back. We fail.
> 
> - exception from AT S1E0: something went wrong (again the guest
>   playing with the PTs behind our back). We fail as well.
> 
> Do you at least agree with these as goals? If you do, what in
> the implementation does not satisfy these goals? If you don't, what in
> these goals seem improper to you?

I agree with the goals.

In this patch, if I'm reading the code right (and I'm starting to doubt
myself), when PAR_EL1.F is set and PAR_EL1 doesn't indicate a permission
fault, KVM falls back to walking the S1 tables:

        if (par & SYS_PAR_EL1_F) {
                u8 fst = FIELD_GET(SYS_PAR_EL1_FST, par);

                /*
                 * If we get something other than a permission fault, we
                 * need to retry, as we're likely to have missed in the PTs.
                 */
                if ((fst & ESR_ELx_FSC_TYPE) != ESR_ELx_FSC_PERM)
                        retry_slow = true;
	}

I suppose that's because KVM cannot distinguish between two very different
reasons for AT failing: 1, because of something being wrong with the stage 1
tables when the AT S1E0* instruction was executed and 2, because of missing
entries at stage 2, as per the comment. Is that correct?

Thanks,
Alex


^ permalink raw reply	[flat|nested] 50+ messages in thread

* Re: [PATCH 10/12] KVM: arm64: nv: Add SW walker for AT S1 emulation
  2024-07-15 15:30       ` Alexandru Elisei
@ 2024-07-18 11:37         ` Marc Zyngier
  0 siblings, 0 replies; 50+ messages in thread
From: Marc Zyngier @ 2024-07-18 11:37 UTC (permalink / raw)
  To: Alexandru Elisei
  Cc: kvmarm, linux-arm-kernel, kvm, James Morse, Suzuki K Poulose,
	Oliver Upton, Zenghui Yu, Joey Gouly

Hi Alex,

On Mon, 15 Jul 2024 16:30:19 +0100,
Alexandru Elisei <alexandru.elisei@arm.com> wrote:
> 
> In this patch, if I'm reading the code right (and I'm starting to doubt myself)
> if PAR_EL1.F is set and PAR_EL1 doesn't indicate a permissions fault, then KVM
> falls back to walking the S1 tables:
> 
>         if (par & SYS_PAR_EL1_F) {
>                 u8 fst = FIELD_GET(SYS_PAR_EL1_FST, par);
> 
>                 /*
>                  * If we get something other than a permission fault, we
>                  * need to retry, as we're likely to have missed in the PTs.
>                  */
>                 if ((fst & ESR_ELx_FSC_TYPE) != ESR_ELx_FSC_PERM)
>                         retry_slow = true;
> 	}
> 
> I suppose that's because KVM cannot distinguish between two very different
> reasons for AT failing: 1, because of something being wrong with the stage 1
> tables when the AT S1E0* instruction was executed and 2, because of missing
> entries at stage 2, as per the comment. Is that correct?

Exactly. It doesn't help that I'm using 3 AT instructions to implement
a single one, and that makes the window of opportunity for things to
go wrong rather large.

Now, I've been thinking about this some more, and I came to the
conclusion that we can actually implement the FEAT_PAN2 instructions
using the PAN2 instructions themselves, which would greatly simplify
the code. We just need to switch PSTATE.PAN so that it reflects the
guest's state around the AT instruction.

With that scheme, the process becomes slightly clearer (and applies to
all AT instructions except for FEAT_ATS1A):

- either we have a successful translation and all is good

- or we have a failure for permission fault: all is good as well, as
  this is simply a "normal" failure

- or we have a failure for any other reason, and we must fall back to
  a SW walk to work things out properly.
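
For the record, a very rough sketch of that PSTATE.PAN switching idea
(the helper name is made up, the unconditional restore assumes the host
normally runs with PAN set, and it obviously requires FEAT_PAN2 on the
host):

static bool at_s1e1p_fast(struct kvm_vcpu *vcpu, u32 op, u64 vaddr)
{
	bool fail;

	/* Mirror the guest's PSTATE.PAN around the AT instruction */
	if (*vcpu_cpsr(vcpu) & PSR_PAN_BIT)
		set_pstate_pan(1);
	else
		set_pstate_pan(0);

	switch (op) {
	case OP_AT_S1E1RP:
		fail = __kvm_at(OP_AT_S1E1RP, vaddr);
		break;
	case OP_AT_S1E1WP:
		fail = __kvm_at(OP_AT_S1E1WP, vaddr);
		break;
	default:
		fail = true;
	}

	/* Put the host's PAN setting back */
	set_pstate_pan(1);

	return fail;
}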

I'll try to capture this reasoning as a comment in the next version.

Thanks,

	M.

-- 
Without deviation from the norm, progress is not possible.


^ permalink raw reply	[flat|nested] 50+ messages in thread

* Re: [PATCH 08/12] KVM: arm64: nv: Add emulation of AT S12E{0,1}{R,W}
  2024-06-25 13:35 ` [PATCH 08/12] KVM: arm64: nv: Add emulation of AT S12E{0,1}{R,W} Marc Zyngier
@ 2024-07-18 15:10   ` Alexandru Elisei
  2024-07-20  9:49     ` Marc Zyngier
  0 siblings, 1 reply; 50+ messages in thread
From: Alexandru Elisei @ 2024-07-18 15:10 UTC (permalink / raw)
  To: Marc Zyngier
  Cc: kvmarm, linux-arm-kernel, kvm, James Morse, Suzuki K Poulose,
	Oliver Upton, Zenghui Yu, Joey Gouly

Hi,

On Tue, Jun 25, 2024 at 02:35:07PM +0100, Marc Zyngier wrote:
> On the face of it, AT S12E{0,1}{R,W} is pretty simple. It is the
> combination of AT S1E{0,1}{R,W}, followed by an extra S2 walk.
> 
> However, there is a great deal of complexity coming from combining
> the S1 and S2 attributes to report something consistent in PAR_EL1.
> 
> This is an absolute mine field, and I have a splitting headache.
> 
> [..]
> +static u8 compute_sh(u8 attr, u64 desc)
> +{
> +	/* Any form of device, as well as NC has SH[1:0]=0b10 */
> +	if (MEMATTR_IS_DEVICE(attr) || attr == MEMATTR(NC, NC))
> +		return 0b10;
> +
> +	return FIELD_GET(PTE_SHARED, desc) == 0b11 ? 0b11 : 0b10;

If shareability is 0b00 (non-shareable), the PAR_EL1.SH field will be 0b10
(outer-shareable), which seems to be contradicting PAREncodeShareability().

> +}
> +
> +static u64 compute_par_s12(struct kvm_vcpu *vcpu, u64 s1_par,
> +			   struct kvm_s2_trans *tr)
> +{
> +	u8 s1_parattr, s2_memattr, final_attr;
> +	u64 par;
> +
> +	/* If S2 has failed to translate, report the damage */
> +	if (tr->esr) {
> +		par = SYS_PAR_EL1_RES1;
> +		par |= SYS_PAR_EL1_F;
> +		par |= SYS_PAR_EL1_S;
> +		par |= FIELD_PREP(SYS_PAR_EL1_FST, tr->esr);
> +		return par;
> +	}
> +
> +	s1_parattr = FIELD_GET(SYS_PAR_EL1_ATTR, s1_par);
> +	s2_memattr = FIELD_GET(GENMASK(5, 2), tr->desc);
> +
> +	if (__vcpu_sys_reg(vcpu, HCR_EL2) & HCR_FWB) {
> +		if (!kvm_has_feat(vcpu->kvm, ID_AA64PFR2_EL1, MTEPERM, IMP))
> +			s2_memattr &= ~BIT(3);
> +
> +		/* Combination of R_VRJSW and R_RHWZM */
> +		switch (s2_memattr) {
> +		case 0b0101:
> +			if (MEMATTR_IS_DEVICE(s1_parattr))
> +				final_attr = s1_parattr;
> +			else
> +				final_attr = MEMATTR(NC, NC);
> +			break;
> +		case 0b0110:
> +		case 0b1110:
> +			final_attr = MEMATTR(WbRaWa, WbRaWa);
> +			break;
> +		case 0b0111:
> +		case 0b1111:
> +			/* Preserve S1 attribute */
> +			final_attr = s1_parattr;
> +			break;
> +		case 0b0100:
> +		case 0b1100:
> +		case 0b1101:
> +			/* Reserved, do something non-silly */
> +			final_attr = s1_parattr;
> +			break;
> +		default:
> +			/* MemAttr[2]=0, Device from S2 */
> +			final_attr = s2_memattr & GENMASK(1,0) << 2;
> +		}
> +	} else {
> +		/* Combination of R_HMNDG, R_TNHFM and R_GQFSF */
> +		u8 s2_parattr = s2_memattr_to_attr(s2_memattr);
> +
> +		if (MEMATTR_IS_DEVICE(s1_parattr) ||
> +		    MEMATTR_IS_DEVICE(s2_parattr)) {
> +			final_attr = min(s1_parattr, s2_parattr);
> +		} else {
> +			/* At this stage, this is memory vs memory */
> +			final_attr  = combine_s1_s2_attr(s1_parattr & 0xf,
> +							 s2_parattr & 0xf);
> +			final_attr |= combine_s1_s2_attr(s1_parattr >> 4,
> +							 s2_parattr >> 4) << 4;
> +		}
> +	}
> +
> +	if ((__vcpu_sys_reg(vcpu, HCR_EL2) & HCR_CD) &&
> +	    !MEMATTR_IS_DEVICE(final_attr))
> +		final_attr = MEMATTR(NC, NC);
> +
> +	par  = FIELD_PREP(SYS_PAR_EL1_ATTR, final_attr);
> +	par |= tr->output & GENMASK(47, 12);
> +	par |= FIELD_PREP(SYS_PAR_EL1_SH,
> +			  compute_sh(final_attr, tr->desc));
> +
> +	return par;
>

It seems that the code doesn't combine shareability attributes, as per rule
RGDTNP and S2CombineS1MemAttrs() or S2ApplyFWBMemAttrs(), which both end up
calling S2CombineS1Shareability().

Thanks,
Alex


^ permalink raw reply	[flat|nested] 50+ messages in thread

* Re: [PATCH 10/12] KVM: arm64: nv: Add SW walker for AT S1 emulation
  2024-07-08 16:57 ` [PATCH 10/12] KVM: arm64: nv: Add SW walker for AT S1 emulation Marc Zyngier
                     ` (3 preceding siblings ...)
  2024-07-11 10:56   ` Alexandru Elisei
@ 2024-07-18 15:16   ` Alexandru Elisei
  2024-07-20 13:49     ` Marc Zyngier
  2024-07-22 10:53   ` Alexandru Elisei
                     ` (3 subsequent siblings)
  8 siblings, 1 reply; 50+ messages in thread
From: Alexandru Elisei @ 2024-07-18 15:16 UTC (permalink / raw)
  To: Marc Zyngier
  Cc: kvmarm, linux-arm-kernel, kvm, James Morse, Suzuki K Poulose,
	Oliver Upton, Zenghui Yu, Joey Gouly

Hi,

Managed to have a look at AT handling for stage 1, I've been comparing it with
AArch64.AT().

On Mon, Jul 08, 2024 at 05:57:58PM +0100, Marc Zyngier wrote:
> In order to plug the brokenness of our current AT implementation,
> we need a SW walker that is going to... err.. walk the S1 tables
> and tell us what it finds.
> 
> Of course, it builds on top of our S2 walker, and share similar
> concepts. The beauty of it is that since it uses kvm_read_guest(),
> it is able to bring back pages that have been otherwise evicted.
> 
> This is then plugged in the two AT S1 emulation functions as
> a "slow path" fallback. I'm not sure it is that slow, but hey.
> 
> Signed-off-by: Marc Zyngier <maz@kernel.org>
> ---
>  arch/arm64/kvm/at.c | 538 ++++++++++++++++++++++++++++++++++++++++++--
>  1 file changed, 520 insertions(+), 18 deletions(-)
> 
> diff --git a/arch/arm64/kvm/at.c b/arch/arm64/kvm/at.c
> index 71e3390b43b4c..8452273cbff6d 100644
> --- a/arch/arm64/kvm/at.c
> +++ b/arch/arm64/kvm/at.c
> @@ -4,9 +4,305 @@
>   * Author: Jintack Lim <jintack.lim@linaro.org>
>   */
>  
> +#include <linux/kvm_host.h>
> +
> +#include <asm/esr.h>
>  #include <asm/kvm_hyp.h>
>  #include <asm/kvm_mmu.h>
>  
> +struct s1_walk_info {
> +	u64	     baddr;
> +	unsigned int max_oa_bits;
> +	unsigned int pgshift;
> +	unsigned int txsz;
> +	int 	     sl;
> +	bool	     hpd;
> +	bool	     be;
> +	bool	     nvhe;
> +	bool	     s2;
> +};
> +
> +struct s1_walk_result {
> +	union {
> +		struct {
> +			u64	desc;
> +			u64	pa;
> +			s8	level;
> +			u8	APTable;
> +			bool	UXNTable;
> +			bool	PXNTable;
> +		};
> +		struct {
> +			u8	fst;
> +			bool	ptw;
> +			bool	s2;
> +		};
> +	};
> +	bool	failed;
> +};
> +
> +static void fail_s1_walk(struct s1_walk_result *wr, u8 fst, bool ptw, bool s2)
> +{
> +	wr->fst		= fst;
> +	wr->ptw		= ptw;
> +	wr->s2		= s2;
> +	wr->failed	= true;
> +}
> +
> +#define S1_MMU_DISABLED		(-127)
> +
> +static int setup_s1_walk(struct kvm_vcpu *vcpu, struct s1_walk_info *wi,
> +			 struct s1_walk_result *wr, const u64 va, const int el)
> +{
> +	u64 sctlr, tcr, tg, ps, ia_bits, ttbr;
> +	unsigned int stride, x;
> +	bool va55, tbi;
> +
> +	wi->nvhe = el == 2 && !vcpu_el2_e2h_is_set(vcpu);
> +
> +	va55 = va & BIT(55);
> +
> +	if (wi->nvhe && va55)
> +		goto addrsz;
> +
> +	wi->s2 = el < 2 && (__vcpu_sys_reg(vcpu, HCR_EL2) & HCR_VM);
> +
> +	switch (el) {
> +	case 1:
> +		sctlr	= vcpu_read_sys_reg(vcpu, SCTLR_EL1);
> +		tcr	= vcpu_read_sys_reg(vcpu, TCR_EL1);
> +		ttbr	= (va55 ?
> +			   vcpu_read_sys_reg(vcpu, TTBR1_EL1) :
> +			   vcpu_read_sys_reg(vcpu, TTBR0_EL1));
> +		break;
> +	case 2:
> +		sctlr	= vcpu_read_sys_reg(vcpu, SCTLR_EL2);
> +		tcr	= vcpu_read_sys_reg(vcpu, TCR_EL2);
> +		ttbr	= (va55 ?
> +			   vcpu_read_sys_reg(vcpu, TTBR1_EL2) :
> +			   vcpu_read_sys_reg(vcpu, TTBR0_EL2));
> +		break;
> +	default:
> +		BUG();
> +	}
> +
> +	/* Let's put the MMU disabled case aside immediately */
> +	if (!(sctlr & SCTLR_ELx_M) ||
> +	    (__vcpu_sys_reg(vcpu, HCR_EL2) & HCR_DC)) {
> +		if (va >= BIT(kvm_get_pa_bits(vcpu->kvm)))
> +			goto addrsz;
> +
> +		wr->level = S1_MMU_DISABLED;
> +		wr->desc = va;
> +		return 0;
> +	}
> +
> +	wi->be = sctlr & SCTLR_ELx_EE;
> +
> +	wi->hpd  = kvm_has_feat(vcpu->kvm, ID_AA64MMFR1_EL1, HPDS, IMP);
> +	wi->hpd &= (wi->nvhe ?
> +		    FIELD_GET(TCR_EL2_HPD, tcr) :
> +		    (va55 ?
> +		     FIELD_GET(TCR_HPD1, tcr) :
> +		     FIELD_GET(TCR_HPD0, tcr)));
> +
> +	tbi = (wi->nvhe ?
> +	       FIELD_GET(TCR_EL2_TBI, tcr) :
> +	       (va55 ?
> +		FIELD_GET(TCR_TBI1, tcr) :
> +		FIELD_GET(TCR_TBI0, tcr)));
> +
> +	if (!tbi && sign_extend64(va, 55) != (s64)va)
> +		goto addrsz;
> +
> +	/* Someone was silly enough to encode TG0/TG1 differently */
> +	if (va55) {
> +		wi->txsz = FIELD_GET(TCR_T1SZ_MASK, tcr);
> +		tg = FIELD_GET(TCR_TG1_MASK, tcr);
> +
> +		switch (tg << TCR_TG1_SHIFT) {
> +		case TCR_TG1_4K:
> +			wi->pgshift = 12;	 break;
> +		case TCR_TG1_16K:
> +			wi->pgshift = 14;	 break;
> +		case TCR_TG1_64K:
> +		default:	    /* IMPDEF: treat any other value as 64k */
> +			wi->pgshift = 16;	 break;
> +		}
> +	} else {
> +		wi->txsz = FIELD_GET(TCR_T0SZ_MASK, tcr);
> +		tg = FIELD_GET(TCR_TG0_MASK, tcr);
> +
> +		switch (tg << TCR_TG0_SHIFT) {
> +		case TCR_TG0_4K:
> +			wi->pgshift = 12;	 break;
> +		case TCR_TG0_16K:
> +			wi->pgshift = 14;	 break;
> +		case TCR_TG0_64K:
> +		default:	    /* IMPDEF: treat any other value as 64k */
> +			wi->pgshift = 16;	 break;
> +		}
> +	}
> +
> +	ia_bits = 64 - wi->txsz;

get_ia_size()?

> +
> +	/* AArch64.S1StartLevel() */
> +	stride = wi->pgshift - 3;
> +	wi->sl = 3 - (((ia_bits - 1) - wi->pgshift) / stride);
> +
> +	/* Check for SL mandating LPA2 (which we don't support yet) */
> +	switch (BIT(wi->pgshift)) {
> +	case SZ_4K:
> +		if (wi->sl == -1 &&
> +		    !kvm_has_feat(vcpu->kvm, ID_AA64MMFR0_EL1, TGRAN4, 52_BIT))
> +			goto addrsz;
> +		break;
> +	case SZ_16K:
> +		if (wi->sl == 0 &&
> +		    !kvm_has_feat(vcpu->kvm, ID_AA64MMFR0_EL1, TGRAN16, 52_BIT))
> +			goto addrsz;
> +		break;
> +	}
> +
> +	ps = (wi->nvhe ?
> +	      FIELD_GET(TCR_EL2_PS_MASK, tcr) : FIELD_GET(TCR_IPS_MASK, tcr));
> +
> +	wi->max_oa_bits = min(get_kvm_ipa_limit(), ps_to_output_size(ps));
> +
> +	/* Compute minimal alignment */
> +	x = 3 + ia_bits - ((3 - wi->sl) * stride + wi->pgshift);
> +
> +	wi->baddr = ttbr & TTBRx_EL1_BADDR;
> +	wi->baddr &= GENMASK_ULL(wi->max_oa_bits - 1, x);
> +
> +	return 0;
> +
> +addrsz:	/* Address Size Fault level 0 */
> +	fail_s1_walk(wr, ESR_ELx_FSC_ADDRSZ, false, false);
> +
> +	return -EFAULT;
> +}

The function seems to be missing checks for:

- valid TxSZ
- VA is not larger than the maximum input size, as defined by TxSZ
- EPD{0,1}
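Roughly, and ignoring FEAT_TTST/FEAT_LPA2, I'd expect them to look something
like the sketch below (the transfault label is new, the TCR_EPD{0,1}_MASK
definitions are the ones from pgtable-hwdef.h, and the VA check assumes the
top byte has already been canonicalised when TBI applies):

	/* TxSZ out of range (no FEAT_TTST/LPA2 handling here) */
	if (wi->txsz > 39 || wi->txsz < 16)
		goto transfault;

	/* The VA must fit within the configured input size */
	ia_bits = 64 - wi->txsz;
	if ((va55 && va < GENMASK_ULL(63, ia_bits)) ||
	    (!va55 && va > GENMASK_ULL(ia_bits - 1, 0)))
		goto transfault;

	/* EPD{0,1}: walks from the corresponding TTBR are disabled */
	if (!wi->nvhe &&
	    (tcr & (va55 ? TCR_EPD1_MASK : TCR_EPD0_MASK)))
		goto transfault;

	[...]

transfault:	/* Translation fault, level 0 */
	fail_s1_walk(wr, ESR_ELx_FSC_FAULT, false, false);
	return -EFAULT;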

> +
> +static int get_ia_size(struct s1_walk_info *wi)
> +{
> +	return 64 - wi->txsz;
> +}

This looks a lot like get_ia_size() from nested.c.

> +
> +static int walk_s1(struct kvm_vcpu *vcpu, struct s1_walk_info *wi,
> +		   struct s1_walk_result *wr, u64 va)
> +{
> +	u64 va_top, va_bottom, baddr, desc;
> +	int level, stride, ret;
> +
> +	level = wi->sl;
> +	stride = wi->pgshift - 3;
> +	baddr = wi->baddr;

AArch64.S1Walk() also checks that baddr is not larger than the OA size.
check_output_size() from nested.c looks almost like what you want here.

> +
> +	va_top = get_ia_size(wi) - 1;
> +
> +	while (1) {
> +		u64 index, ipa;
> +
> +		va_bottom = (3 - level) * stride + wi->pgshift;
> +		index = (va & GENMASK_ULL(va_top, va_bottom)) >> (va_bottom - 3);
> +
> +		ipa = baddr | index;
> +
> +		if (wi->s2) {
> +			struct kvm_s2_trans s2_trans = {};
> +
> +			ret = kvm_walk_nested_s2(vcpu, ipa, &s2_trans);
> +			if (ret) {
> +				fail_s1_walk(wr,
> +					     (s2_trans.esr & ~ESR_ELx_FSC_LEVEL) | level,
> +					     true, true);
> +				return ret;
> +			}
> +
> +			if (!kvm_s2_trans_readable(&s2_trans)) {
> +				fail_s1_walk(wr, ESR_ELx_FSC_PERM | level,
> +					     true, true);
> +
> +				return -EPERM;
> +			}
> +
> +			ipa = kvm_s2_trans_output(&s2_trans);
> +		}
> +
> +		ret = kvm_read_guest(vcpu->kvm, ipa, &desc, sizeof(desc));
> +		if (ret) {
> +			fail_s1_walk(wr, ESR_ELx_FSC_SEA_TTW(level),
> +				     true, false);
> +			return ret;
> +		}
> +
> +		if (wi->be)
> +			desc = be64_to_cpu((__force __be64)desc);
> +		else
> +			desc = le64_to_cpu((__force __le64)desc);
> +
> +		if (!(desc & 1) || ((desc & 3) == 1 && level == 3)) {
> +			fail_s1_walk(wr, ESR_ELx_FSC_FAULT | level,
> +				     true, false);
> +			return -ENOENT;
> +		}
> +
> +		/* We found a leaf, handle that */
> +		if ((desc & 3) == 1 || level == 3)
> +			break;
> +
> +		if (!wi->hpd) {
> +			wr->APTable  |= FIELD_GET(PMD_TABLE_AP, desc);
> +			wr->UXNTable |= FIELD_GET(PMD_TABLE_UXN, desc);
> +			wr->PXNTable |= FIELD_GET(PMD_TABLE_PXN, desc);
> +		}
> +
> +		baddr = GENMASK_ULL(47, wi->pgshift);

Where is baddr updated with the value read from the descriptor? Am I missing
something obvious here?

> +
> +		/* Check for out-of-range OA */
> +		if (wi->max_oa_bits < 48 &&
> +		    (baddr & GENMASK_ULL(47, wi->max_oa_bits))) {
> +			fail_s1_walk(wr, ESR_ELx_FSC_ADDRSZ | level,
> +				     true, false);
> +			return -EINVAL;
> +		}

This looks very much like check_output_size() from nested.c.

> +
> +		/* Prepare for next round */
> +		va_top = va_bottom - 1;
> +		level++;
> +	}
> +
> +	/* Block mapping, check the validity of the level */
> +	if (!(desc & BIT(1))) {
> +		bool valid_block = false;
> +
> +		switch (BIT(wi->pgshift)) {
> +		case SZ_4K:
> +			valid_block = level == 1 || level == 2;
> +			break;
> +		case SZ_16K:
> +		case SZ_64K:
> +			valid_block = level == 2;
> +			break;
> +		}
> +
> +		if (!valid_block) {
> +			fail_s1_walk(wr, ESR_ELx_FSC_FAULT | level,
> +				     true, false);
> +			return -EINVAL;
> +		}
> +	}

Matches AArch64.BlockDescSupported(), with the caveat that the walker currently
doesn't support 52 bit PAs.

> +
> +	wr->failed = false;
> +	wr->level = level;
> +	wr->desc = desc;
> +	wr->pa = desc & GENMASK(47, va_bottom);

No output size check for final PA.

> +	if (va_bottom > 12)
> +		wr->pa |= va & GENMASK_ULL(va_bottom - 1, 12);
> +
> +	return 0;
> +}
> +
>  struct mmu_config {
>  	u64	ttbr0;
>  	u64	ttbr1;
> @@ -234,6 +530,177 @@ static u64 compute_par_s12(struct kvm_vcpu *vcpu, u64 s1_par,
>  	return par;
>  }
>  
> +static u64 compute_par_s1(struct kvm_vcpu *vcpu, struct s1_walk_result *wr)
> +{
> +	u64 par;
> +
> +	if (wr->failed) {
> +		par = SYS_PAR_EL1_RES1;
> +		par |= SYS_PAR_EL1_F;
> +		par |= FIELD_PREP(SYS_PAR_EL1_FST, wr->fst);
> +		par |= wr->ptw ? SYS_PAR_EL1_PTW : 0;
> +		par |= wr->s2 ? SYS_PAR_EL1_S : 0;
> +	} else if (wr->level == S1_MMU_DISABLED) {
> +		/* MMU off or HCR_EL2.DC == 1 */
> +		par = wr->pa & GENMASK_ULL(47, 12);

That's interesting, setup_s1_walk() sets wr->desc = va and leaves wr->pa
unchanged (it's 0 from initialization in handle_at_slow()).

> +
> +		if (!(__vcpu_sys_reg(vcpu, HCR_EL2) & HCR_DC)) {
> +			par |= FIELD_PREP(SYS_PAR_EL1_ATTR, 0); /* nGnRnE */
> +			par |= FIELD_PREP(SYS_PAR_EL1_SH, 0b10); /* OS */
> +		} else {
> +			par |= FIELD_PREP(SYS_PAR_EL1_ATTR,
> +					  MEMATTR(WbRaWa, WbRaWa));
> +			par |= FIELD_PREP(SYS_PAR_EL1_SH, 0b00); /* NS */
> +		}

This matches AArch64.S1DisabledOutput().

> +	} else {
> +		u64 mair, sctlr;
> +		int el;
> +		u8 sh;
> +
> +		el = (vcpu_el2_e2h_is_set(vcpu) &&
> +		      vcpu_el2_tge_is_set(vcpu)) ? 2 : 1;
> +
> +		mair = ((el == 2) ?
> +			vcpu_read_sys_reg(vcpu, MAIR_EL2) :
> +			vcpu_read_sys_reg(vcpu, MAIR_EL1));
> +
> +		mair >>= FIELD_GET(PTE_ATTRINDX_MASK, wr->desc) * 8;
> +		mair &= 0xff;
> +
> +		sctlr = ((el == 2) ?
> +			vcpu_read_sys_reg(vcpu, SCTLR_EL2) :
> +			vcpu_read_sys_reg(vcpu, SCTLR_EL1));
> +
> +		/* Force NC for memory if SCTLR_ELx.C is clear */
> +		if (!(sctlr & SCTLR_EL1_C) && !MEMATTR_IS_DEVICE(mair))
> +			mair = MEMATTR(NC, NC);

This matches the compute memory attributes part of AArch64.S1Translate().

> +
> +		par  = FIELD_PREP(SYS_PAR_EL1_ATTR, mair);
> +		par |= wr->pa & GENMASK_ULL(47, 12);
> +
> +		sh = compute_sh(mair, wr->desc);
> +		par |= FIELD_PREP(SYS_PAR_EL1_SH, sh);
> +	}
> +
> +	return par;
> +}
> +
> +static u64 handle_at_slow(struct kvm_vcpu *vcpu, u32 op, u64 vaddr)
> +{
> +	bool perm_fail, ur, uw, ux, pr, pw, pan;
> +	struct s1_walk_result wr = {};
> +	struct s1_walk_info wi = {};
> +	int ret, idx, el;
> +
> +	/*
> +	 * We only get here from guest EL2, so the translation regime
> +	 * AT applies to is solely defined by {E2H,TGE}.
> +	 */
> +	el = (vcpu_el2_e2h_is_set(vcpu) &&
> +	      vcpu_el2_tge_is_set(vcpu)) ? 2 : 1;
> +
> +	ret = setup_s1_walk(vcpu, &wi, &wr, vaddr, el);
> +	if (ret)
> +		goto compute_par;
> +
> +	if (wr.level == S1_MMU_DISABLED)
> +		goto compute_par;
> +
> +	idx = srcu_read_lock(&vcpu->kvm->srcu);
> +
> +	ret = walk_s1(vcpu, &wi, &wr, vaddr);
> +
> +	srcu_read_unlock(&vcpu->kvm->srcu, idx);
> +
> +	if (ret)
> +		goto compute_par;
> +
> +	/* FIXME: revisit when adding indirect permission support */
> +	if (kvm_has_feat(vcpu->kvm, ID_AA64MMFR1_EL1, PAN, PAN3) &&
> +	    !wi.nvhe) {

Just FYI, the 'if' statement fits on one line without going over the old 80
character limit.

> +		u64 sctlr;
> +
> +		if (el == 1)
> +			sctlr = vcpu_read_sys_reg(vcpu, SCTLR_EL1);
> +		else
> +			sctlr = vcpu_read_sys_reg(vcpu, SCTLR_EL2);
> +
> +		ux = (sctlr & SCTLR_EL1_EPAN) && !(wr.desc & PTE_UXN);

I don't understand this. UnprivExecute is true for the memory location if and
only if **SCTLR_ELx.EPAN** && !UXN?

> +	} else {
> +		ux = false;
> +	}
> +
> +	pw = !(wr.desc & PTE_RDONLY);
> +
> +	if (wi.nvhe) {
> +		ur = uw = false;
> +		pr = true;
> +	} else {
> +		if (wr.desc & PTE_USER) {
> +			ur = pr = true;
> +			uw = pw;
> +		} else {
> +			ur = uw = false;
> +			pr = true;
> +		}
> +	}
> +
> +	/* Apply the Hierarchical Permission madness */
> +	if (wi.nvhe) {
> +		wr.APTable &= BIT(1);
> +		wr.PXNTable = wr.UXNTable;
> +	}
> +
> +	ur &= !(wr.APTable & BIT(0));
> +	uw &= !(wr.APTable != 0);
> +	ux &= !wr.UXNTable;
> +
> +	pw &= !(wr.APTable & BIT(1));

Would it make sense here to compute the resulting permissions like in
AArch64.S1DirectBasePermissions()? I.e., look at the AP bits first, have
a switch statement for all 4 values (also makes it very easy to cross-reference
with Table D8-60), then apply hierarchical permissions/pan/epan. I do admit
that I have a very selfish reason to propose this - it makes reviewing easier.

> +
> +	pan = *vcpu_cpsr(vcpu) & PSR_PAN_BIT;
> +
> +	perm_fail = false;
> +
> +	switch (op) {
> +	case OP_AT_S1E1RP:
> +		perm_fail |= pan && (ur || uw || ux);

I had a very hard time understanding what the code is trying to do here.  How
about rewriting it to something like the pseudocode below:

  // ux = !(desc and UXN) and !UXNTable
  perm_fail |= pan && (ur || uw || ((sctlr & SCTLR_EL1_EPAN) && ux));

... which maps more closely to AArch64.S1DirectBasePermissions().

Thanks,
Alex

> +		fallthrough;
> +	case OP_AT_S1E1R:
> +	case OP_AT_S1E2R:
> +		perm_fail |= !pr;
> +		break;
> +	case OP_AT_S1E1WP:
> +		perm_fail |= pan && (ur || uw || ux);
> +		fallthrough;
> +	case OP_AT_S1E1W:
> +	case OP_AT_S1E2W:
> +		perm_fail |= !pw;
> +		break;
> +	case OP_AT_S1E0R:
> +		perm_fail |= !ur;
> +		break;
> +	case OP_AT_S1E0W:
> +		perm_fail |= !uw;
> +		break;
> +	default:
> +		BUG();
> +	}
> +
> +	if (perm_fail) {
> +		struct s1_walk_result tmp;
> +
> +		tmp.failed = true;
> +		tmp.fst = ESR_ELx_FSC_PERM | wr.level;
> +		tmp.s2 = false;
> +		tmp.ptw = false;
> +
> +		wr = tmp;
> +	}
> +
> +compute_par:
> +	return compute_par_s1(vcpu, &wr);
> +}
> [..]


^ permalink raw reply	[flat|nested] 50+ messages in thread

* Re: [PATCH 08/12] KVM: arm64: nv: Add emulation of AT S12E{0,1}{R,W}
  2024-07-18 15:10   ` Alexandru Elisei
@ 2024-07-20  9:49     ` Marc Zyngier
  2024-07-22 10:33       ` Alexandru Elisei
  0 siblings, 1 reply; 50+ messages in thread
From: Marc Zyngier @ 2024-07-20  9:49 UTC (permalink / raw)
  To: Alexandru Elisei
  Cc: kvmarm, linux-arm-kernel, kvm, James Morse, Suzuki K Poulose,
	Oliver Upton, Zenghui Yu, Joey Gouly

On Thu, 18 Jul 2024 16:10:20 +0100,
Alexandru Elisei <alexandru.elisei@arm.com> wrote:
> 
> Hi,
> 
> On Tue, Jun 25, 2024 at 02:35:07PM +0100, Marc Zyngier wrote:
> > On the face of it, AT S12E{0,1}{R,W} is pretty simple. It is the
> > combination of AT S1E{0,1}{R,W}, followed by an extra S2 walk.
> > 
> > However, there is a great deal of complexity coming from combining
> > the S1 and S2 attributes to report something consistent in PAR_EL1.
> > 
> > This is an absolute mine field, and I have a splitting headache.
> > 
> > [..]
> > +static u8 compute_sh(u8 attr, u64 desc)
> > +{
> > +	/* Any form of device, as well as NC has SH[1:0]=0b10 */
> > +	if (MEMATTR_IS_DEVICE(attr) || attr == MEMATTR(NC, NC))
> > +		return 0b10;
> > +
> > +	return FIELD_GET(PTE_SHARED, desc) == 0b11 ? 0b11 : 0b10;
> 
> If shareability is 0b00 (non-shareable), the PAR_EL1.SH field will be 0b10
> (outer-shareable), which seems to be contradicting PAREncodeShareability().

Yup, well caught.

> > +	par |= FIELD_PREP(SYS_PAR_EL1_SH,
> > +			  compute_sh(final_attr, tr->desc));
> > +
> > +	return par;
> >
> 
> It seems that the code doesn't combine shareability attributes, as per rule
> RGDTNP and S2CombineS1MemAttrs() or S2ApplyFWBMemAttrs(), which both end up
> calling S2CombineS1Shareability().

That as well. See below what I'm stashing on top.

Thanks,

	M.

diff --git a/arch/arm64/kvm/at.c b/arch/arm64/kvm/at.c
index e66c97fc1fd3..28c4344d1c34 100644
--- a/arch/arm64/kvm/at.c
+++ b/arch/arm64/kvm/at.c
@@ -459,13 +459,34 @@ static u8 combine_s1_s2_attr(u8 s1, u8 s2)
 	return final;
 }
 
+#define ATTR_NSH	0b00
+#define ATTR_RSV	0b01
+#define ATTR_OSH	0b10
+#define ATTR_ISH	0b11
+
 static u8 compute_sh(u8 attr, u64 desc)
 {
+	u8 sh;
+
 	/* Any form of device, as well as NC has SH[1:0]=0b10 */
 	if (MEMATTR_IS_DEVICE(attr) || attr == MEMATTR(NC, NC))
-		return 0b10;
+		return ATTR_OSH;
+
+	sh = FIELD_GET(PTE_SHARED, desc);
+	if (sh == ATTR_RSV)		/* Reserved, mapped to NSH */
+		sh = ATTR_NSH;
+
+	return sh;
+}
+
+static u8 combine_sh(u8 s1_sh, u8 s2_sh)
+{
+	if (s1_sh == ATTR_OSH || s2_sh == ATTR_OSH)
+		return ATTR_OSH;
+	if (s1_sh == ATTR_ISH || s2_sh == ATTR_ISH)
+		return ATTR_ISH;
 
-	return FIELD_GET(PTE_SHARED, desc) == 0b11 ? 0b11 : 0b10;
+	return ATTR_NSH;
 }
 
 static u64 compute_par_s12(struct kvm_vcpu *vcpu, u64 s1_par,
@@ -540,7 +561,8 @@ static u64 compute_par_s12(struct kvm_vcpu *vcpu, u64 s1_par,
 	par  = FIELD_PREP(SYS_PAR_EL1_ATTR, final_attr);
 	par |= tr->output & GENMASK(47, 12);
 	par |= FIELD_PREP(SYS_PAR_EL1_SH,
-			  compute_sh(final_attr, tr->desc));
+			  combine_sh(FIELD_GET(SYS_PAR_EL1_SH, s1_par),
+				     compute_sh(final_attr, tr->desc)));
 
 	return par;
 }

-- 
Without deviation from the norm, progress is not possible.


^ permalink raw reply related	[flat|nested] 50+ messages in thread

* Re: [PATCH 10/12] KVM: arm64: nv: Add SW walker for AT S1 emulation
  2024-07-18 15:16   ` Alexandru Elisei
@ 2024-07-20 13:49     ` Marc Zyngier
  0 siblings, 0 replies; 50+ messages in thread
From: Marc Zyngier @ 2024-07-20 13:49 UTC (permalink / raw)
  To: Alexandru Elisei
  Cc: kvmarm, linux-arm-kernel, kvm, James Morse, Suzuki K Poulose,
	Oliver Upton, Zenghui Yu, Joey Gouly

On Thu, 18 Jul 2024 16:16:19 +0100,
Alexandru Elisei <alexandru.elisei@arm.com> wrote:
> 
> Hi,
> 
> Managed to have a look at AT handling for stage 1, I've been comparing it with
> AArch64.AT().
> 
> On Mon, Jul 08, 2024 at 05:57:58PM +0100, Marc Zyngier wrote:
> > In order to plug the brokenness of our current AT implementation,
> > we need a SW walker that is going to... err.. walk the S1 tables
> > and tell us what it finds.
> > 
> > Of course, it builds on top of our S2 walker, and share similar
> > concepts. The beauty of it is that since it uses kvm_read_guest(),
> > it is able to bring back pages that have been otherwise evicted.
> > 
> > This is then plugged in the two AT S1 emulation functions as
> > a "slow path" fallback. I'm not sure it is that slow, but hey.
> > 
> > Signed-off-by: Marc Zyngier <maz@kernel.org>
> > ---
> >  arch/arm64/kvm/at.c | 538 ++++++++++++++++++++++++++++++++++++++++++--
> >  1 file changed, 520 insertions(+), 18 deletions(-)
> > 
> > diff --git a/arch/arm64/kvm/at.c b/arch/arm64/kvm/at.c
> > index 71e3390b43b4c..8452273cbff6d 100644
> > --- a/arch/arm64/kvm/at.c
> > +++ b/arch/arm64/kvm/at.c
> > @@ -4,9 +4,305 @@
> >   * Author: Jintack Lim <jintack.lim@linaro.org>
> >   */
> >  
> > +#include <linux/kvm_host.h>
> > +
> > +#include <asm/esr.h>
> >  #include <asm/kvm_hyp.h>
> >  #include <asm/kvm_mmu.h>
> >  
> > +struct s1_walk_info {
> > +	u64	     baddr;
> > +	unsigned int max_oa_bits;
> > +	unsigned int pgshift;
> > +	unsigned int txsz;
> > +	int 	     sl;
> > +	bool	     hpd;
> > +	bool	     be;
> > +	bool	     nvhe;
> > +	bool	     s2;
> > +};
> > +
> > +struct s1_walk_result {
> > +	union {
> > +		struct {
> > +			u64	desc;
> > +			u64	pa;
> > +			s8	level;
> > +			u8	APTable;
> > +			bool	UXNTable;
> > +			bool	PXNTable;
> > +		};
> > +		struct {
> > +			u8	fst;
> > +			bool	ptw;
> > +			bool	s2;
> > +		};
> > +	};
> > +	bool	failed;
> > +};
> > +
> > +static void fail_s1_walk(struct s1_walk_result *wr, u8 fst, bool ptw, bool s2)
> > +{
> > +	wr->fst		= fst;
> > +	wr->ptw		= ptw;
> > +	wr->s2		= s2;
> > +	wr->failed	= true;
> > +}
> > +
> > +#define S1_MMU_DISABLED		(-127)
> > +
> > +static int setup_s1_walk(struct kvm_vcpu *vcpu, struct s1_walk_info *wi,
> > +			 struct s1_walk_result *wr, const u64 va, const int el)
> > +{
> > +	u64 sctlr, tcr, tg, ps, ia_bits, ttbr;
> > +	unsigned int stride, x;
> > +	bool va55, tbi;
> > +
> > +	wi->nvhe = el == 2 && !vcpu_el2_e2h_is_set(vcpu);
> > +
> > +	va55 = va & BIT(55);
> > +
> > +	if (wi->nvhe && va55)
> > +		goto addrsz;
> > +
> > +	wi->s2 = el < 2 && (__vcpu_sys_reg(vcpu, HCR_EL2) & HCR_VM);
> > +
> > +	switch (el) {
> > +	case 1:
> > +		sctlr	= vcpu_read_sys_reg(vcpu, SCTLR_EL1);
> > +		tcr	= vcpu_read_sys_reg(vcpu, TCR_EL1);
> > +		ttbr	= (va55 ?
> > +			   vcpu_read_sys_reg(vcpu, TTBR1_EL1) :
> > +			   vcpu_read_sys_reg(vcpu, TTBR0_EL1));
> > +		break;
> > +	case 2:
> > +		sctlr	= vcpu_read_sys_reg(vcpu, SCTLR_EL2);
> > +		tcr	= vcpu_read_sys_reg(vcpu, TCR_EL2);
> > +		ttbr	= (va55 ?
> > +			   vcpu_read_sys_reg(vcpu, TTBR1_EL2) :
> > +			   vcpu_read_sys_reg(vcpu, TTBR0_EL2));
> > +		break;
> > +	default:
> > +		BUG();
> > +	}
> > +
> > +	/* Let's put the MMU disabled case aside immediately */
> > +	if (!(sctlr & SCTLR_ELx_M) ||
> > +	    (__vcpu_sys_reg(vcpu, HCR_EL2) & HCR_DC)) {
> > +		if (va >= BIT(kvm_get_pa_bits(vcpu->kvm)))
> > +			goto addrsz;
> > +
> > +		wr->level = S1_MMU_DISABLED;
> > +		wr->desc = va;
> > +		return 0;
> > +	}
> > +
> > +	wi->be = sctlr & SCTLR_ELx_EE;
> > +
> > +	wi->hpd  = kvm_has_feat(vcpu->kvm, ID_AA64MMFR1_EL1, HPDS, IMP);
> > +	wi->hpd &= (wi->nvhe ?
> > +		    FIELD_GET(TCR_EL2_HPD, tcr) :
> > +		    (va55 ?
> > +		     FIELD_GET(TCR_HPD1, tcr) :
> > +		     FIELD_GET(TCR_HPD0, tcr)));
> > +
> > +	tbi = (wi->nvhe ?
> > +	       FIELD_GET(TCR_EL2_TBI, tcr) :
> > +	       (va55 ?
> > +		FIELD_GET(TCR_TBI1, tcr) :
> > +		FIELD_GET(TCR_TBI0, tcr)));
> > +
> > +	if (!tbi && sign_extend64(va, 55) != (s64)va)
> > +		goto addrsz;
> > +
> > +	/* Someone was silly enough to encode TG0/TG1 differently */
> > +	if (va55) {
> > +		wi->txsz = FIELD_GET(TCR_T1SZ_MASK, tcr);
> > +		tg = FIELD_GET(TCR_TG1_MASK, tcr);
> > +
> > +		switch (tg << TCR_TG1_SHIFT) {
> > +		case TCR_TG1_4K:
> > +			wi->pgshift = 12;	 break;
> > +		case TCR_TG1_16K:
> > +			wi->pgshift = 14;	 break;
> > +		case TCR_TG1_64K:
> > +		default:	    /* IMPDEF: treat any other value as 64k */
> > +			wi->pgshift = 16;	 break;
> > +		}
> > +	} else {
> > +		wi->txsz = FIELD_GET(TCR_T0SZ_MASK, tcr);
> > +		tg = FIELD_GET(TCR_TG0_MASK, tcr);
> > +
> > +		switch (tg << TCR_TG0_SHIFT) {
> > +		case TCR_TG0_4K:
> > +			wi->pgshift = 12;	 break;
> > +		case TCR_TG0_16K:
> > +			wi->pgshift = 14;	 break;
> > +		case TCR_TG0_64K:
> > +		default:	    /* IMPDEF: treat any other value as 64k */
> > +			wi->pgshift = 16;	 break;
> > +		}
> > +	}
> > +
> > +	ia_bits = 64 - wi->txsz;
> 
> get_ia_size()?

Yeah, fair enough. I wasn't sold on using any helper while the
walk_info struct is incomplete, but that doesn't change much.

> 
> > +
> > +	/* AArch64.S1StartLevel() */
> > +	stride = wi->pgshift - 3;
> > +	wi->sl = 3 - (((ia_bits - 1) - wi->pgshift) / stride);
> > +
> > +	/* Check for SL mandating LPA2 (which we don't support yet) */
> > +	switch (BIT(wi->pgshift)) {
> > +	case SZ_4K:
> > +		if (wi->sl == -1 &&
> > +		    !kvm_has_feat(vcpu->kvm, ID_AA64MMFR0_EL1, TGRAN4, 52_BIT))
> > +			goto addrsz;
> > +		break;
> > +	case SZ_16K:
> > +		if (wi->sl == 0 &&
> > +		    !kvm_has_feat(vcpu->kvm, ID_AA64MMFR0_EL1, TGRAN16, 52_BIT))
> > +			goto addrsz;
> > +		break;
> > +	}
> > +
> > +	ps = (wi->nvhe ?
> > +	      FIELD_GET(TCR_EL2_PS_MASK, tcr) : FIELD_GET(TCR_IPS_MASK, tcr));
> > +
> > +	wi->max_oa_bits = min(get_kvm_ipa_limit(), ps_to_output_size(ps));
> > +
> > +	/* Compute minimal alignment */
> > +	x = 3 + ia_bits - ((3 - wi->sl) * stride + wi->pgshift);
> > +
> > +	wi->baddr = ttbr & TTBRx_EL1_BADDR;
> > +	wi->baddr &= GENMASK_ULL(wi->max_oa_bits - 1, x);
> > +
> > +	return 0;
> > +
> > +addrsz:	/* Address Size Fault level 0 */
> > +	fail_s1_walk(wr, ESR_ELx_FSC_ADDRSZ, false, false);
> > +
> > +	return -EFAULT;
> > +}
> 
> The function seems to be missing checks for:
> 
> - valid TxSZ
> - VA is not larger than the maximum input size, as defined by TxSZ
> - EPD{0,1}

Yup, all fixed now, with E0PD{0,1} as an added bonus. The number of
ways a translation can fail is amazing.
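For the E0PD part, it is only relevant to the unprivileged flavours, so it
ends up as something like the sketch below (as_el0 being a placeholder for
however the AT S1E0* case gets tracked, and transfault being whatever label
reports a level 0 Translation fault):

	if (wi->as_el0 &&	/* placeholder: AT S1E0* operation */
	    kvm_has_feat(vcpu->kvm, ID_AA64MMFR2_EL1, E0PD, IMP) &&
	    (tcr & (va55 ? TCR_E0PD1 : TCR_E0PD0)))
		goto transfault;	/* Translation fault, level 0 */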

> 
> > +
> > +static int get_ia_size(struct s1_walk_info *wi)
> > +{
> > +	return 64 - wi->txsz;
> > +}
> 
> This looks a lot like get_ia_size() from nested.c.

Indeed. Except that the *type* is different. And I really like the
fact that they are separate for now. I may end up merging some of the
attributes at some point though.

> 
> > +
> > +static int walk_s1(struct kvm_vcpu *vcpu, struct s1_walk_info *wi,
> > +		   struct s1_walk_result *wr, u64 va)
> > +{
> > +	u64 va_top, va_bottom, baddr, desc;
> > +	int level, stride, ret;
> > +
> > +	level = wi->sl;
> > +	stride = wi->pgshift - 3;
> > +	baddr = wi->baddr;
> 
> AArch64.S1Walk() also checks that baddr is not larger than the OA size.
> check_output_size() from nested.c looks almost like what you want
> here.

At this stage, it is too late, as wi->baddr has already been sanitised
by the setup phase. I'll add a check over there.

> 
> > +
> > +	va_top = get_ia_size(wi) - 1;
> > +
> > +	while (1) {
> > +		u64 index, ipa;
> > +
> > +		va_bottom = (3 - level) * stride + wi->pgshift;
> > +		index = (va & GENMASK_ULL(va_top, va_bottom)) >> (va_bottom - 3);
> > +
> > +		ipa = baddr | index;
> > +
> > +		if (wi->s2) {
> > +			struct kvm_s2_trans s2_trans = {};
> > +
> > +			ret = kvm_walk_nested_s2(vcpu, ipa, &s2_trans);
> > +			if (ret) {
> > +				fail_s1_walk(wr,
> > +					     (s2_trans.esr & ~ESR_ELx_FSC_LEVEL) | level,
> > +					     true, true);
> > +				return ret;
> > +			}
> > +
> > +			if (!kvm_s2_trans_readable(&s2_trans)) {
> > +				fail_s1_walk(wr, ESR_ELx_FSC_PERM | level,
> > +					     true, true);
> > +
> > +				return -EPERM;
> > +			}
> > +
> > +			ipa = kvm_s2_trans_output(&s2_trans);
> > +		}
> > +
> > +		ret = kvm_read_guest(vcpu->kvm, ipa, &desc, sizeof(desc));
> > +		if (ret) {
> > +			fail_s1_walk(wr, ESR_ELx_FSC_SEA_TTW(level),
> > +				     true, false);
> > +			return ret;
> > +		}
> > +
> > +		if (wi->be)
> > +			desc = be64_to_cpu((__force __be64)desc);
> > +		else
> > +			desc = le64_to_cpu((__force __le64)desc);
> > +
> > +		if (!(desc & 1) || ((desc & 3) == 1 && level == 3)) {
> > +			fail_s1_walk(wr, ESR_ELx_FSC_FAULT | level,
> > +				     true, false);
> > +			return -ENOENT;
> > +		}
> > +
> > +		/* We found a leaf, handle that */
> > +		if ((desc & 3) == 1 || level == 3)
> > +			break;
> > +
> > +		if (!wi->hpd) {
> > +			wr->APTable  |= FIELD_GET(PMD_TABLE_AP, desc);
> > +			wr->UXNTable |= FIELD_GET(PMD_TABLE_UXN, desc);
> > +			wr->PXNTable |= FIELD_GET(PMD_TABLE_PXN, desc);
> > +		}
> > +
> > +		baddr = GENMASK_ULL(47, wi->pgshift);
> 
> Where is baddr updated with the value read from the descriptor? Am I missing
> something obvious here?

Huh. Something has gone very wrong, and I have no idea how.
This should read:

		baddr = desc & GENMASK_ULL(47, wi->pgshift);

because otherwise nothing makes sense. I must have done a last minute
cleanup and somehow broken it. Time to retest everything!

> 
> > +
> > +		/* Check for out-of-range OA */
> > +		if (wi->max_oa_bits < 48 &&
> > +		    (baddr & GENMASK_ULL(47, wi->max_oa_bits))) {
> > +			fail_s1_walk(wr, ESR_ELx_FSC_ADDRSZ | level,
> > +				     true, false);
> > +			return -EINVAL;
> > +		}
> 
> This looks very much like check_output_size() from nested.c.

Yup. I'll fold that into a helper -- still separate from the S2
version though.
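Something as small as this would probably do as the S1 flavour, reusing
wi->max_oa_bits (name and placement are just a sketch):

static bool check_output_size(u64 addr, struct s1_walk_info *wi)
{
	/* Anything above the configured OA size is an Address Size fault */
	return wi->max_oa_bits < 48 &&
	       (addr & GENMASK_ULL(47, wi->max_oa_bits));
}

with callers in setup_s1_walk() for the TTBR baddr, and in walk_s1() for
each table descriptor as well as the final PA.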

> 
> > +
> > +		/* Prepare for next round */
> > +		va_top = va_bottom - 1;
> > +		level++;
> > +	}
> > +
> > +	/* Block mapping, check the validity of the level */
> > +	if (!(desc & BIT(1))) {
> > +		bool valid_block = false;
> > +
> > +		switch (BIT(wi->pgshift)) {
> > +		case SZ_4K:
> > +			valid_block = level == 1 || level == 2;
> > +			break;
> > +		case SZ_16K:
> > +		case SZ_64K:
> > +			valid_block = level == 2;
> > +			break;
> > +		}
> > +
> > +		if (!valid_block) {
> > +			fail_s1_walk(wr, ESR_ELx_FSC_FAULT | level,
> > +				     true, false);
> > +			return -EINVAL;
> > +		}
> > +	}
> 
> Matches AArch64.BlockDescSupported(), with the caveat that the walker currently
> doesn't support 52 bit PAs.
> 
> > +
> > +	wr->failed = false;
> > +	wr->level = level;
> > +	wr->desc = desc;
> > +	wr->pa = desc & GENMASK(47, va_bottom);
> 
> No output size check for final PA.

Now fixed.

> 
> > +	if (va_bottom > 12)
> > +		wr->pa |= va & GENMASK_ULL(va_bottom - 1, 12);
> > +
> > +	return 0;
> > +}
> > +
> >  struct mmu_config {
> >  	u64	ttbr0;
> >  	u64	ttbr1;
> > @@ -234,6 +530,177 @@ static u64 compute_par_s12(struct kvm_vcpu *vcpu, u64 s1_par,
> >  	return par;
> >  }
> >  
> > +static u64 compute_par_s1(struct kvm_vcpu *vcpu, struct s1_walk_result *wr)
> > +{
> > +	u64 par;
> > +
> > +	if (wr->failed) {
> > +		par = SYS_PAR_EL1_RES1;
> > +		par |= SYS_PAR_EL1_F;
> > +		par |= FIELD_PREP(SYS_PAR_EL1_FST, wr->fst);
> > +		par |= wr->ptw ? SYS_PAR_EL1_PTW : 0;
> > +		par |= wr->s2 ? SYS_PAR_EL1_S : 0;
> > +	} else if (wr->level == S1_MMU_DISABLED) {
> > +		/* MMU off or HCR_EL2.DC == 1 */
> > +		par = wr->pa & GENMASK_ULL(47, 12);
> 
> That's interesting, setup_s1_walk() sets wr->desc = va and leaves wr->pa
> unchanged (it's 0 from initialization in handle_at_slow()).

If by "interesting" you mean broken, then I agree!

> 
> > +
> > +		if (!(__vcpu_sys_reg(vcpu, HCR_EL2) & HCR_DC)) {
> > +			par |= FIELD_PREP(SYS_PAR_EL1_ATTR, 0); /* nGnRnE */
> > +			par |= FIELD_PREP(SYS_PAR_EL1_SH, 0b10); /* OS */
> > +		} else {
> > +			par |= FIELD_PREP(SYS_PAR_EL1_ATTR,
> > +					  MEMATTR(WbRaWa, WbRaWa));
> > +			par |= FIELD_PREP(SYS_PAR_EL1_SH, 0b00); /* NS */
> > +		}
> 
> This matches AArch64.S1DisabledOutput().
> 
> > +	} else {
> > +		u64 mair, sctlr;
> > +		int el;
> > +		u8 sh;
> > +
> > +		el = (vcpu_el2_e2h_is_set(vcpu) &&
> > +		      vcpu_el2_tge_is_set(vcpu)) ? 2 : 1;
> > +
> > +		mair = ((el == 2) ?
> > +			vcpu_read_sys_reg(vcpu, MAIR_EL2) :
> > +			vcpu_read_sys_reg(vcpu, MAIR_EL1));
> > +
> > +		mair >>= FIELD_GET(PTE_ATTRINDX_MASK, wr->desc) * 8;
> > +		mair &= 0xff;
> > +
> > +		sctlr = ((el == 2) ?
> > +			vcpu_read_sys_reg(vcpu, SCTLR_EL2) :
> > +			vcpu_read_sys_reg(vcpu, SCTLR_EL1));
> > +
> > +		/* Force NC for memory if SCTLR_ELx.C is clear */
> > +		if (!(sctlr & SCTLR_EL1_C) && !MEMATTR_IS_DEVICE(mair))
> > +			mair = MEMATTR(NC, NC);
> 
> This matches the compute memory attributes part of AArch64.S1Translate().
> 
> > +
> > +		par  = FIELD_PREP(SYS_PAR_EL1_ATTR, mair);
> > +		par |= wr->pa & GENMASK_ULL(47, 12);
> > +
> > +		sh = compute_sh(mair, wr->desc);
> > +		par |= FIELD_PREP(SYS_PAR_EL1_SH, sh);
> > +	}
> > +
> > +	return par;
> > +}
> > +
> > +static u64 handle_at_slow(struct kvm_vcpu *vcpu, u32 op, u64 vaddr)
> > +{
> > +	bool perm_fail, ur, uw, ux, pr, pw, pan;
> > +	struct s1_walk_result wr = {};
> > +	struct s1_walk_info wi = {};
> > +	int ret, idx, el;
> > +
> > +	/*
> > +	 * We only get here from guest EL2, so the translation regime
> > +	 * AT applies to is solely defined by {E2H,TGE}.
> > +	 */
> > +	el = (vcpu_el2_e2h_is_set(vcpu) &&
> > +	      vcpu_el2_tge_is_set(vcpu)) ? 2 : 1;
> > +
> > +	ret = setup_s1_walk(vcpu, &wi, &wr, vaddr, el);
> > +	if (ret)
> > +		goto compute_par;
> > +
> > +	if (wr.level == S1_MMU_DISABLED)
> > +		goto compute_par;
> > +
> > +	idx = srcu_read_lock(&vcpu->kvm->srcu);
> > +
> > +	ret = walk_s1(vcpu, &wi, &wr, vaddr);
> > +
> > +	srcu_read_unlock(&vcpu->kvm->srcu, idx);
> > +
> > +	if (ret)
> > +		goto compute_par;
> > +
> > +	/* FIXME: revisit when adding indirect permission support */
> > +	if (kvm_has_feat(vcpu->kvm, ID_AA64MMFR1_EL1, PAN, PAN3) &&
> > +	    !wi.nvhe) {
> 
> Just FYI, the 'if' statement fits on one line without going over the old 80
> character limit.

All that code has now been reworked.

> 
> > +		u64 sctlr;
> > +
> > +		if (el == 1)
> > +			sctlr = vcpu_read_sys_reg(vcpu, SCTLR_EL1);
> > +		else
> > +			sctlr = vcpu_read_sys_reg(vcpu, SCTLR_EL2);
> > +
> > +		ux = (sctlr & SCTLR_EL1_EPAN) && !(wr.desc & PTE_UXN);
> 
> I don't understand this. UnprivExecute is true for the memory location if and
> only if **SCTLR_ELx.EPAN** && !UXN?

Well, it is the only case we actually care about. And from what I read
below, you *have* understood it.

> 
> > +	} else {
> > +		ux = false;
> > +	}
> > +
> > +	pw = !(wr.desc & PTE_RDONLY);
> > +
> > +	if (wi.nvhe) {
> > +		ur = uw = false;
> > +		pr = true;
> > +	} else {
> > +		if (wr.desc & PTE_USER) {
> > +			ur = pr = true;
> > +			uw = pw;
> > +		} else {
> > +			ur = uw = false;
> > +			pr = true;
> > +		}
> > +	}
> > +
> > +	/* Apply the Hierarchical Permission madness */
> > +	if (wi.nvhe) {
> > +		wr.APTable &= BIT(1);
> > +		wr.PXNTable = wr.UXNTable;
> > +	}
> > +
> > +	ur &= !(wr.APTable & BIT(0));
> > +	uw &= !(wr.APTable != 0);
> > +	ux &= !wr.UXNTable;
> > +
> > +	pw &= !(wr.APTable & BIT(1));
> 
> Would it make sense here to compute the resulting permissions like in
> AArch64.S1DirectBasePermissions()? I.e., look at the AP bits first, have
> a switch statement for all 4 values (also makes it very easy to cross-reference
> with Table D8-60), then apply hierarchical permissions/pan/epan. I do admit
> that I have a very selfish reason to propose this - it makes reviewing easier.
>

Fair enough. I usually try to distance myself from the pseudocode and
implement what I understand, but I appreciate this is just hard to
read. It definitely results in something larger, but it probably
doesn't matter much.
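As a sketch of that shape, the direct permissions would come out of a switch
on AP[2:1] (PTE_RDONLY:PTE_USER in pgtable terms), with the hierarchical,
PAN and EPAN adjustments layered on top afterwards -- and the EL2&!E2H
regime still needing its special-casing:

	switch (FIELD_GET(PTE_USER | PTE_RDONLY, wr.desc)) {
	case 0b00:	/* AP[2:1] == 0b00: priv RW, unpriv none */
		pr = pw = true;
		ur = uw = false;
		break;
	case 0b01:	/* AP[2:1] == 0b01: RW at both ELs */
		pr = pw = ur = uw = true;
		break;
	case 0b10:	/* AP[2:1] == 0b10: priv RO, unpriv none */
		pr = true;
		pw = ur = uw = false;
		break;
	case 0b11:	/* AP[2:1] == 0b11: RO at both ELs */
		pr = ur = true;
		pw = uw = false;
		break;
	}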

> > +
> > +	pan = *vcpu_cpsr(vcpu) & PSR_PAN_BIT;
> > +
> > +	perm_fail = false;
> > +
> > +	switch (op) {
> > +	case OP_AT_S1E1RP:
> > +		perm_fail |= pan && (ur || uw || ux);
> 
> I had a very hard time understanding what the code is trying to do here.  How
> about rewriting it to something like the pseudocode below:
> 
>   // ux = !(desc and UXN) and !UXNTable
>   perm_fail |= pan && (ur || uw || ((sctlr & SCTLR_EL1_EPAN) && ux));
> 
> ... which maps more closely to AArch64.S1DirectBasePermissions().

Yup, I got there by virtue of adopting the same flow as the
pseudocode.

Thanks a lot for the thorough review, much appreciated.

	M.

-- 
Without deviation from the norm, progress is not possible.


^ permalink raw reply	[flat|nested] 50+ messages in thread

* Re: [PATCH 08/12] KVM: arm64: nv: Add emulation of AT S12E{0,1}{R,W}
  2024-07-20  9:49     ` Marc Zyngier
@ 2024-07-22 10:33       ` Alexandru Elisei
  0 siblings, 0 replies; 50+ messages in thread
From: Alexandru Elisei @ 2024-07-22 10:33 UTC (permalink / raw)
  To: Marc Zyngier
  Cc: kvmarm, linux-arm-kernel, kvm, James Morse, Suzuki K Poulose,
	Oliver Upton, Zenghui Yu, Joey Gouly

Hi,

On Sat, Jul 20, 2024 at 10:49:29AM +0100, Marc Zyngier wrote:
> On Thu, 18 Jul 2024 16:10:20 +0100,
> Alexandru Elisei <alexandru.elisei@arm.com> wrote:
> > 
> > Hi,
> > 
> > On Tue, Jun 25, 2024 at 02:35:07PM +0100, Marc Zyngier wrote:
> > > On the face of it, AT S12E{0,1}{R,W} is pretty simple. It is the
> > > combination of AT S1E{0,1}{R,W}, followed by an extra S2 walk.
> > > 
> > > However, there is a great deal of complexity coming from combining
> > > the S1 and S2 attributes to report something consistent in PAR_EL1.
> > > 
> > > This is an absolute mine field, and I have a splitting headache.
> > > 
> > > [..]
> > > +static u8 compute_sh(u8 attr, u64 desc)
> > > +{
> > > +	/* Any form of device, as well as NC has SH[1:0]=0b10 */
> > > +	if (MEMATTR_IS_DEVICE(attr) || attr == MEMATTR(NC, NC))
> > > +		return 0b10;
> > > +
> > > +	return FIELD_GET(PTE_SHARED, desc) == 0b11 ? 0b11 : 0b10;
> > 
> > If shareability is 0b00 (non-shareable), the PAR_EL1.SH field will be 0b10
> > (outer-shareable), which seems to be contradicting PAREncodeShareability().
> 
> Yup, well caught.
> 
> > > +	par |= FIELD_PREP(SYS_PAR_EL1_SH,
> > > +			  compute_sh(final_attr, tr->desc));
> > > +
> > > +	return par;
> > >
> > 
> > It seems that the code doesn't combine shareability attributes, as per rule
> > RGDTNP and S2CombineS1MemAttrs() or S2ApplyFWBMemAttrs(), which both end up
> > calling S2CombineS1Shareability().
> 
> That as well. See below what I'm stashing on top.
> 
> Thanks,
> 
> 	M.
> 
> diff --git a/arch/arm64/kvm/at.c b/arch/arm64/kvm/at.c
> index e66c97fc1fd3..28c4344d1c34 100644
> --- a/arch/arm64/kvm/at.c
> +++ b/arch/arm64/kvm/at.c
> @@ -459,13 +459,34 @@ static u8 combine_s1_s2_attr(u8 s1, u8 s2)
>  	return final;
>  }
>  
> +#define ATTR_NSH	0b00
> +#define ATTR_RSV	0b01
> +#define ATTR_OSH	0b10
> +#define ATTR_ISH	0b11

Matches Table D8-89 from DDI 0487K.a.

> +
>  static u8 compute_sh(u8 attr, u64 desc)
>  {
> +	u8 sh;
> +
>  	/* Any form of device, as well as NC has SH[1:0]=0b10 */
>  	if (MEMATTR_IS_DEVICE(attr) || attr == MEMATTR(NC, NC))
> -		return 0b10;
> +		return ATTR_OSH;
> +
> +	sh = FIELD_GET(PTE_SHARED, desc);
> +	if (sh == ATTR_RSV)		/* Reserved, mapped to NSH */
> +		sh = ATTR_NSH;
> +
> +	return sh;
> +}

Matches PAREncodeShareability().

> +
> +static u8 combine_sh(u8 s1_sh, u8 s2_sh)
> +{
> +	if (s1_sh == ATTR_OSH || s2_sh == ATTR_OSH)
> +		return ATTR_OSH;
> +	if (s1_sh == ATTR_ISH || s2_sh == ATTR_ISH)
> +		return ATTR_ISH;
>  
> -	return FIELD_GET(PTE_SHARED, desc) == 0b11 ? 0b11 : 0b10;
> +	return ATTR_NSH;
>  }

Matches S2CombineS1Shareability().

>  
>  static u64 compute_par_s12(struct kvm_vcpu *vcpu, u64 s1_par,
> @@ -540,7 +561,8 @@ static u64 compute_par_s12(struct kvm_vcpu *vcpu, u64 s1_par,
>  	par  = FIELD_PREP(SYS_PAR_EL1_ATTR, final_attr);
>  	par |= tr->output & GENMASK(47, 12);
>  	par |= FIELD_PREP(SYS_PAR_EL1_SH,
> -			  compute_sh(final_attr, tr->desc));
> +			  combine_sh(FIELD_GET(SYS_PAR_EL1_SH, s1_par),
> +				     compute_sh(final_attr, tr->desc)));

Looks good.

Thanks,
Alex

>  
>  	return par;
>  }
> 
> -- 
> Without deviation from the norm, progress is not possible.


^ permalink raw reply	[flat|nested] 50+ messages in thread

* Re: [PATCH 10/12] KVM: arm64: nv: Add SW walker for AT S1 emulation
  2024-07-08 16:57 ` [PATCH 10/12] KVM: arm64: nv: Add SW walker for AT S1 emulation Marc Zyngier
                     ` (4 preceding siblings ...)
  2024-07-18 15:16   ` Alexandru Elisei
@ 2024-07-22 10:53   ` Alexandru Elisei
  2024-07-22 15:25     ` Marc Zyngier
  2024-07-25 14:16   ` Alexandru Elisei
                     ` (2 subsequent siblings)
  8 siblings, 1 reply; 50+ messages in thread
From: Alexandru Elisei @ 2024-07-22 10:53 UTC (permalink / raw)
  To: Marc Zyngier
  Cc: kvmarm, linux-arm-kernel, kvm, James Morse, Suzuki K Poulose,
	Oliver Upton, Zenghui Yu, Joey Gouly

Hi Marc,

I would like to use the S1 walker for KVM SPE, and I was planning to move it to
a separate file, where it would be shared between nested KVM and SPE. I think
this is also good for NV, since the walker would get more testing.

Do you think moving it to a shared location is a good approach? Or do you have
something else in mind?

Also, do you know when you'll be able to send an updated version of this
series? I'm asking because I want to decide between using this code (with fixes
on top) and waiting for the next iteration. Please don't feel that you need to
send the next iteration too soon.

And please CC me on the series, so I don't miss it by mistake :)

Thanks,
Alex

On Mon, Jul 08, 2024 at 05:57:58PM +0100, Marc Zyngier wrote:
> In order to plug the brokenness of our current AT implementation,
> we need a SW walker that is going to... err.. walk the S1 tables
> and tell us what it finds.
> 
> Of course, it builds on top of our S2 walker, and share similar
> concepts. The beauty of it is that since it uses kvm_read_guest(),
> it is able to bring back pages that have been otherwise evicted.
> 
> This is then plugged in the two AT S1 emulation functions as
> a "slow path" fallback. I'm not sure it is that slow, but hey.
> 
> Signed-off-by: Marc Zyngier <maz@kernel.org>
> ---
>  arch/arm64/kvm/at.c | 538 ++++++++++++++++++++++++++++++++++++++++++--
>  1 file changed, 520 insertions(+), 18 deletions(-)
> 
> diff --git a/arch/arm64/kvm/at.c b/arch/arm64/kvm/at.c
> index 71e3390b43b4c..8452273cbff6d 100644
> --- a/arch/arm64/kvm/at.c
> +++ b/arch/arm64/kvm/at.c
> @@ -4,9 +4,305 @@
>   * Author: Jintack Lim <jintack.lim@linaro.org>
>   */
>  
> +#include <linux/kvm_host.h>
> +
> +#include <asm/esr.h>
>  #include <asm/kvm_hyp.h>
>  #include <asm/kvm_mmu.h>
>  
> +struct s1_walk_info {
> +	u64	     baddr;
> +	unsigned int max_oa_bits;
> +	unsigned int pgshift;
> +	unsigned int txsz;
> +	int 	     sl;
> +	bool	     hpd;
> +	bool	     be;
> +	bool	     nvhe;
> +	bool	     s2;
> +};
> +
> +struct s1_walk_result {
> +	union {
> +		struct {
> +			u64	desc;
> +			u64	pa;
> +			s8	level;
> +			u8	APTable;
> +			bool	UXNTable;
> +			bool	PXNTable;
> +		};
> +		struct {
> +			u8	fst;
> +			bool	ptw;
> +			bool	s2;
> +		};
> +	};
> +	bool	failed;
> +};
> +
> +static void fail_s1_walk(struct s1_walk_result *wr, u8 fst, bool ptw, bool s2)
> +{
> +	wr->fst		= fst;
> +	wr->ptw		= ptw;
> +	wr->s2		= s2;
> +	wr->failed	= true;
> +}
> +
> +#define S1_MMU_DISABLED		(-127)
> +
> +static int setup_s1_walk(struct kvm_vcpu *vcpu, struct s1_walk_info *wi,
> +			 struct s1_walk_result *wr, const u64 va, const int el)
> +{
> +	u64 sctlr, tcr, tg, ps, ia_bits, ttbr;
> +	unsigned int stride, x;
> +	bool va55, tbi;
> +
> +	wi->nvhe = el == 2 && !vcpu_el2_e2h_is_set(vcpu);
> +
> +	va55 = va & BIT(55);
> +
> +	if (wi->nvhe && va55)
> +		goto addrsz;
> +
> +	wi->s2 = el < 2 && (__vcpu_sys_reg(vcpu, HCR_EL2) & HCR_VM);
> +
> +	switch (el) {
> +	case 1:
> +		sctlr	= vcpu_read_sys_reg(vcpu, SCTLR_EL1);
> +		tcr	= vcpu_read_sys_reg(vcpu, TCR_EL1);
> +		ttbr	= (va55 ?
> +			   vcpu_read_sys_reg(vcpu, TTBR1_EL1) :
> +			   vcpu_read_sys_reg(vcpu, TTBR0_EL1));
> +		break;
> +	case 2:
> +		sctlr	= vcpu_read_sys_reg(vcpu, SCTLR_EL2);
> +		tcr	= vcpu_read_sys_reg(vcpu, TCR_EL2);
> +		ttbr	= (va55 ?
> +			   vcpu_read_sys_reg(vcpu, TTBR1_EL2) :
> +			   vcpu_read_sys_reg(vcpu, TTBR0_EL2));
> +		break;
> +	default:
> +		BUG();
> +	}
> +
> +	/* Let's put the MMU disabled case aside immediately */
> +	if (!(sctlr & SCTLR_ELx_M) ||
> +	    (__vcpu_sys_reg(vcpu, HCR_EL2) & HCR_DC)) {
> +		if (va >= BIT(kvm_get_pa_bits(vcpu->kvm)))
> +			goto addrsz;
> +
> +		wr->level = S1_MMU_DISABLED;
> +		wr->desc = va;
> +		return 0;
> +	}
> +
> +	wi->be = sctlr & SCTLR_ELx_EE;
> +
> +	wi->hpd  = kvm_has_feat(vcpu->kvm, ID_AA64MMFR1_EL1, HPDS, IMP);
> +	wi->hpd &= (wi->nvhe ?
> +		    FIELD_GET(TCR_EL2_HPD, tcr) :
> +		    (va55 ?
> +		     FIELD_GET(TCR_HPD1, tcr) :
> +		     FIELD_GET(TCR_HPD0, tcr)));
> +
> +	tbi = (wi->nvhe ?
> +	       FIELD_GET(TCR_EL2_TBI, tcr) :
> +	       (va55 ?
> +		FIELD_GET(TCR_TBI1, tcr) :
> +		FIELD_GET(TCR_TBI0, tcr)));
> +
> +	if (!tbi && sign_extend64(va, 55) != (s64)va)
> +		goto addrsz;
> +
> +	/* Someone was silly enough to encode TG0/TG1 differently */
> +	if (va55) {
> +		wi->txsz = FIELD_GET(TCR_T1SZ_MASK, tcr);
> +		tg = FIELD_GET(TCR_TG1_MASK, tcr);
> +
> +		switch (tg << TCR_TG1_SHIFT) {
> +		case TCR_TG1_4K:
> +			wi->pgshift = 12;	 break;
> +		case TCR_TG1_16K:
> +			wi->pgshift = 14;	 break;
> +		case TCR_TG1_64K:
> +		default:	    /* IMPDEF: treat any other value as 64k */
> +			wi->pgshift = 16;	 break;
> +		}
> +	} else {
> +		wi->txsz = FIELD_GET(TCR_T0SZ_MASK, tcr);
> +		tg = FIELD_GET(TCR_TG0_MASK, tcr);
> +
> +		switch (tg << TCR_TG0_SHIFT) {
> +		case TCR_TG0_4K:
> +			wi->pgshift = 12;	 break;
> +		case TCR_TG0_16K:
> +			wi->pgshift = 14;	 break;
> +		case TCR_TG0_64K:
> +		default:	    /* IMPDEF: treat any other value as 64k */
> +			wi->pgshift = 16;	 break;
> +		}
> +	}
> +
> +	ia_bits = 64 - wi->txsz;
> +
> +	/* AArch64.S1StartLevel() */
> +	stride = wi->pgshift - 3;
> +	wi->sl = 3 - (((ia_bits - 1) - wi->pgshift) / stride);
> +
> +	/* Check for SL mandating LPA2 (which we don't support yet) */
> +	switch (BIT(wi->pgshift)) {
> +	case SZ_4K:
> +		if (wi->sl == -1 &&
> +		    !kvm_has_feat(vcpu->kvm, ID_AA64MMFR0_EL1, TGRAN4, 52_BIT))
> +			goto addrsz;
> +		break;
> +	case SZ_16K:
> +		if (wi->sl == 0 &&
> +		    !kvm_has_feat(vcpu->kvm, ID_AA64MMFR0_EL1, TGRAN16, 52_BIT))
> +			goto addrsz;
> +		break;
> +	}
> +
> +	ps = (wi->nvhe ?
> +	      FIELD_GET(TCR_EL2_PS_MASK, tcr) : FIELD_GET(TCR_IPS_MASK, tcr));
> +
> +	wi->max_oa_bits = min(get_kvm_ipa_limit(), ps_to_output_size(ps));
> +
> +	/* Compute minimal alignment */
> +	x = 3 + ia_bits - ((3 - wi->sl) * stride + wi->pgshift);
> +
> +	wi->baddr = ttbr & TTBRx_EL1_BADDR;
> +	wi->baddr &= GENMASK_ULL(wi->max_oa_bits - 1, x);
> +
> +	return 0;
> +
> +addrsz:	/* Address Size Fault level 0 */
> +	fail_s1_walk(wr, ESR_ELx_FSC_ADDRSZ, false, false);
> +
> +	return -EFAULT;
> +}
> +
> +static int get_ia_size(struct s1_walk_info *wi)
> +{
> +	return 64 - wi->txsz;
> +}
> +
> +static int walk_s1(struct kvm_vcpu *vcpu, struct s1_walk_info *wi,
> +		   struct s1_walk_result *wr, u64 va)
> +{
> +	u64 va_top, va_bottom, baddr, desc;
> +	int level, stride, ret;
> +
> +	level = wi->sl;
> +	stride = wi->pgshift - 3;
> +	baddr = wi->baddr;
> +
> +	va_top = get_ia_size(wi) - 1;
> +
> +	while (1) {
> +		u64 index, ipa;
> +
> +		va_bottom = (3 - level) * stride + wi->pgshift;
> +		index = (va & GENMASK_ULL(va_top, va_bottom)) >> (va_bottom - 3);
> +
> +		ipa = baddr | index;
> +
> +		if (wi->s2) {
> +			struct kvm_s2_trans s2_trans = {};
> +
> +			ret = kvm_walk_nested_s2(vcpu, ipa, &s2_trans);
> +			if (ret) {
> +				fail_s1_walk(wr,
> +					     (s2_trans.esr & ~ESR_ELx_FSC_LEVEL) | level,
> +					     true, true);
> +				return ret;
> +			}
> +
> +			if (!kvm_s2_trans_readable(&s2_trans)) {
> +				fail_s1_walk(wr, ESR_ELx_FSC_PERM | level,
> +					     true, true);
> +
> +				return -EPERM;
> +			}
> +
> +			ipa = kvm_s2_trans_output(&s2_trans);
> +		}
> +
> +		ret = kvm_read_guest(vcpu->kvm, ipa, &desc, sizeof(desc));
> +		if (ret) {
> +			fail_s1_walk(wr, ESR_ELx_FSC_SEA_TTW(level),
> +				     true, false);
> +			return ret;
> +		}
> +
> +		if (wi->be)
> +			desc = be64_to_cpu((__force __be64)desc);
> +		else
> +			desc = le64_to_cpu((__force __le64)desc);
> +
> +		if (!(desc & 1) || ((desc & 3) == 1 && level == 3)) {
> +			fail_s1_walk(wr, ESR_ELx_FSC_FAULT | level,
> +				     true, false);
> +			return -ENOENT;
> +		}
> +
> +		/* We found a leaf, handle that */
> +		if ((desc & 3) == 1 || level == 3)
> +			break;
> +
> +		if (!wi->hpd) {
> +			wr->APTable  |= FIELD_GET(PMD_TABLE_AP, desc);
> +			wr->UXNTable |= FIELD_GET(PMD_TABLE_UXN, desc);
> +			wr->PXNTable |= FIELD_GET(PMD_TABLE_PXN, desc);
> +		}
> +
> +		baddr = GENMASK_ULL(47, wi->pgshift);
> +
> +		/* Check for out-of-range OA */
> +		if (wi->max_oa_bits < 48 &&
> +		    (baddr & GENMASK_ULL(47, wi->max_oa_bits))) {
> +			fail_s1_walk(wr, ESR_ELx_FSC_ADDRSZ | level,
> +				     true, false);
> +			return -EINVAL;
> +		}
> +
> +		/* Prepare for next round */
> +		va_top = va_bottom - 1;
> +		level++;
> +	}
> +
> +	/* Block mapping, check the validity of the level */
> +	if (!(desc & BIT(1))) {
> +		bool valid_block = false;
> +
> +		switch (BIT(wi->pgshift)) {
> +		case SZ_4K:
> +			valid_block = level == 1 || level == 2;
> +			break;
> +		case SZ_16K:
> +		case SZ_64K:
> +			valid_block = level == 2;
> +			break;
> +		}
> +
> +		if (!valid_block) {
> +			fail_s1_walk(wr, ESR_ELx_FSC_FAULT | level,
> +				     true, false);
> +			return -EINVAL;
> +		}
> +	}
> +
> +	wr->failed = false;
> +	wr->level = level;
> +	wr->desc = desc;
> +	wr->pa = desc & GENMASK(47, va_bottom);
> +	if (va_bottom > 12)
> +		wr->pa |= va & GENMASK_ULL(va_bottom - 1, 12);
> +
> +	return 0;
> +}
> +
>  struct mmu_config {
>  	u64	ttbr0;
>  	u64	ttbr1;
> @@ -234,6 +530,177 @@ static u64 compute_par_s12(struct kvm_vcpu *vcpu, u64 s1_par,
>  	return par;
>  }
>  
> +static u64 compute_par_s1(struct kvm_vcpu *vcpu, struct s1_walk_result *wr)
> +{
> +	u64 par;
> +
> +	if (wr->failed) {
> +		par = SYS_PAR_EL1_RES1;
> +		par |= SYS_PAR_EL1_F;
> +		par |= FIELD_PREP(SYS_PAR_EL1_FST, wr->fst);
> +		par |= wr->ptw ? SYS_PAR_EL1_PTW : 0;
> +		par |= wr->s2 ? SYS_PAR_EL1_S : 0;
> +	} else if (wr->level == S1_MMU_DISABLED) {
> +		/* MMU off or HCR_EL2.DC == 1 */
> +		par = wr->pa & GENMASK_ULL(47, 12);
> +
> +		if (!(__vcpu_sys_reg(vcpu, HCR_EL2) & HCR_DC)) {
> +			par |= FIELD_PREP(SYS_PAR_EL1_ATTR, 0); /* nGnRnE */
> +			par |= FIELD_PREP(SYS_PAR_EL1_SH, 0b10); /* OS */
> +		} else {
> +			par |= FIELD_PREP(SYS_PAR_EL1_ATTR,
> +					  MEMATTR(WbRaWa, WbRaWa));
> +			par |= FIELD_PREP(SYS_PAR_EL1_SH, 0b00); /* NS */
> +		}
> +	} else {
> +		u64 mair, sctlr;
> +		int el;
> +		u8 sh;
> +
> +		el = (vcpu_el2_e2h_is_set(vcpu) &&
> +		      vcpu_el2_tge_is_set(vcpu)) ? 2 : 1;
> +
> +		mair = ((el == 2) ?
> +			vcpu_read_sys_reg(vcpu, MAIR_EL2) :
> +			vcpu_read_sys_reg(vcpu, MAIR_EL1));
> +
> +		mair >>= FIELD_GET(PTE_ATTRINDX_MASK, wr->desc) * 8;
> +		mair &= 0xff;
> +
> +		sctlr = ((el == 2) ?
> +			vcpu_read_sys_reg(vcpu, SCTLR_EL2) :
> +			vcpu_read_sys_reg(vcpu, SCTLR_EL1));
> +
> +		/* Force NC for memory if SCTLR_ELx.C is clear */
> +		if (!(sctlr & SCTLR_EL1_C) && !MEMATTR_IS_DEVICE(mair))
> +			mair = MEMATTR(NC, NC);
> +
> +		par  = FIELD_PREP(SYS_PAR_EL1_ATTR, mair);
> +		par |= wr->pa & GENMASK_ULL(47, 12);
> +
> +		sh = compute_sh(mair, wr->desc);
> +		par |= FIELD_PREP(SYS_PAR_EL1_SH, sh);
> +	}
> +
> +	return par;
> +}
> +
> +static u64 handle_at_slow(struct kvm_vcpu *vcpu, u32 op, u64 vaddr)
> +{
> +	bool perm_fail, ur, uw, ux, pr, pw, pan;
> +	struct s1_walk_result wr = {};
> +	struct s1_walk_info wi = {};
> +	int ret, idx, el;
> +
> +	/*
> +	 * We only get here from guest EL2, so the translation regime
> +	 * AT applies to is solely defined by {E2H,TGE}.
> +	 */
> +	el = (vcpu_el2_e2h_is_set(vcpu) &&
> +	      vcpu_el2_tge_is_set(vcpu)) ? 2 : 1;
> +
> +	ret = setup_s1_walk(vcpu, &wi, &wr, vaddr, el);
> +	if (ret)
> +		goto compute_par;
> +
> +	if (wr.level == S1_MMU_DISABLED)
> +		goto compute_par;
> +
> +	idx = srcu_read_lock(&vcpu->kvm->srcu);
> +
> +	ret = walk_s1(vcpu, &wi, &wr, vaddr);
> +
> +	srcu_read_unlock(&vcpu->kvm->srcu, idx);
> +
> +	if (ret)
> +		goto compute_par;
> +
> +	/* FIXME: revisit when adding indirect permission support */
> +	if (kvm_has_feat(vcpu->kvm, ID_AA64MMFR1_EL1, PAN, PAN3) &&
> +	    !wi.nvhe) {
> +		u64 sctlr;
> +
> +		if (el == 1)
> +			sctlr = vcpu_read_sys_reg(vcpu, SCTLR_EL1);
> +		else
> +			sctlr = vcpu_read_sys_reg(vcpu, SCTLR_EL2);
> +
> +		ux = (sctlr & SCTLR_EL1_EPAN) && !(wr.desc & PTE_UXN);
> +	} else {
> +		ux = false;
> +	}
> +
> +	pw = !(wr.desc & PTE_RDONLY);
> +
> +	if (wi.nvhe) {
> +		ur = uw = false;
> +		pr = true;
> +	} else {
> +		if (wr.desc & PTE_USER) {
> +			ur = pr = true;
> +			uw = pw;
> +		} else {
> +			ur = uw = false;
> +			pr = true;
> +		}
> +	}
> +
> +	/* Apply the Hierarchical Permission madness */
> +	if (wi.nvhe) {
> +		wr.APTable &= BIT(1);
> +		wr.PXNTable = wr.UXNTable;
> +	}
> +
> +	ur &= !(wr.APTable & BIT(0));
> +	uw &= !(wr.APTable != 0);
> +	ux &= !wr.UXNTable;
> +
> +	pw &= !(wr.APTable & BIT(1));
> +
> +	pan = *vcpu_cpsr(vcpu) & PSR_PAN_BIT;
> +
> +	perm_fail = false;
> +
> +	switch (op) {
> +	case OP_AT_S1E1RP:
> +		perm_fail |= pan && (ur || uw || ux);
> +		fallthrough;
> +	case OP_AT_S1E1R:
> +	case OP_AT_S1E2R:
> +		perm_fail |= !pr;
> +		break;
> +	case OP_AT_S1E1WP:
> +		perm_fail |= pan && (ur || uw || ux);
> +		fallthrough;
> +	case OP_AT_S1E1W:
> +	case OP_AT_S1E2W:
> +		perm_fail |= !pw;
> +		break;
> +	case OP_AT_S1E0R:
> +		perm_fail |= !ur;
> +		break;
> +	case OP_AT_S1E0W:
> +		perm_fail |= !uw;
> +		break;
> +	default:
> +		BUG();
> +	}
> +
> +	if (perm_fail) {
> +		struct s1_walk_result tmp;
> +
> +		tmp.failed = true;
> +		tmp.fst = ESR_ELx_FSC_PERM | wr.level;
> +		tmp.s2 = false;
> +		tmp.ptw = false;
> +
> +		wr = tmp;
> +	}
> +
> +compute_par:
> +	return compute_par_s1(vcpu, &wr);
> +}
> +
>  static bool check_at_pan(struct kvm_vcpu *vcpu, u64 vaddr, u64 *res)
>  {
>  	u64 par_e0;
> @@ -266,9 +733,11 @@ void __kvm_at_s1e01(struct kvm_vcpu *vcpu, u32 op, u64 vaddr)
>  	struct mmu_config config;
>  	struct kvm_s2_mmu *mmu;
>  	unsigned long flags;
> -	bool fail;
> +	bool fail, retry_slow;
>  	u64 par;
>  
> +	retry_slow = false;
> +
>  	write_lock(&vcpu->kvm->mmu_lock);
>  
>  	/*
> @@ -288,14 +757,15 @@ void __kvm_at_s1e01(struct kvm_vcpu *vcpu, u32 op, u64 vaddr)
>  		goto skip_mmu_switch;
>  
>  	/*
> -	 * FIXME: Obtaining the S2 MMU for a L2 is horribly racy, and
> -	 * we may not find it (recycled by another vcpu, for example).
> -	 * See the other FIXME comment below about the need for a SW
> -	 * PTW in this case.
> +	 * Obtaining the S2 MMU for a L2 is horribly racy, and we may not
> +	 * find it (recycled by another vcpu, for example). When this
> +	 * happens, use the SW (slow) path.
>  	 */
>  	mmu = lookup_s2_mmu(vcpu);
> -	if (WARN_ON(!mmu))
> +	if (!mmu) {
> +		retry_slow = true;
>  		goto out;
> +	}
>  
>  	write_sysreg_el1(vcpu_read_sys_reg(vcpu, TTBR0_EL1),	SYS_TTBR0);
>  	write_sysreg_el1(vcpu_read_sys_reg(vcpu, TTBR1_EL1),	SYS_TTBR1);
> @@ -331,18 +801,17 @@ void __kvm_at_s1e01(struct kvm_vcpu *vcpu, u32 op, u64 vaddr)
>  	}
>  
>  	if (!fail)
> -		par = read_sysreg(par_el1);
> +		par = read_sysreg_par();
>  	else
>  		par = SYS_PAR_EL1_F;
>  
> +	retry_slow = !fail;
> +
>  	vcpu_write_sys_reg(vcpu, par, PAR_EL1);
>  
>  	/*
> -	 * Failed? let's leave the building now.
> -	 *
> -	 * FIXME: how about a failed translation because the shadow S2
> -	 * wasn't populated? We may need to perform a SW PTW,
> -	 * populating our shadow S2 and retry the instruction.
> +	 * Failed? let's leave the building now, unless we retry on
> +	 * the slow path.
>  	 */
>  	if (par & SYS_PAR_EL1_F)
>  		goto nopan;
> @@ -354,29 +823,58 @@ void __kvm_at_s1e01(struct kvm_vcpu *vcpu, u32 op, u64 vaddr)
>  	switch (op) {
>  	case OP_AT_S1E1RP:
>  	case OP_AT_S1E1WP:
> +		retry_slow = false;
>  		fail = check_at_pan(vcpu, vaddr, &par);
>  		break;
>  	default:
>  		goto nopan;
>  	}
>  
> +	if (fail) {
> +		vcpu_write_sys_reg(vcpu, SYS_PAR_EL1_F, PAR_EL1);
> +		goto nopan;
> +	}
> +
>  	/*
>  	 * If the EL0 translation has succeeded, we need to pretend
>  	 * the AT operation has failed, as the PAN setting forbids
>  	 * such a translation.
> -	 *
> -	 * FIXME: we hardcode a Level-3 permission fault. We really
> -	 * should return the real fault level.
>  	 */
> -	if (fail || !(par & SYS_PAR_EL1_F))
> -		vcpu_write_sys_reg(vcpu, (0xf << 1) | SYS_PAR_EL1_F, PAR_EL1);
> -
> +	if (par & SYS_PAR_EL1_F) {
> +		u8 fst = FIELD_GET(SYS_PAR_EL1_FST, par);
> +
> +		/*
> +		 * If we get something other than a permission fault, we
> +		 * need to retry, as we're likely to have missed in the PTs.
> +		 */
> +		if ((fst & ESR_ELx_FSC_TYPE) != ESR_ELx_FSC_PERM)
> +			retry_slow = true;
> +	} else {
> +		/*
> +		 * The EL0 access succeeded, but we don't have the full
> +		 * syndrome information to synthesize the failure. Go slow.
> +		 */
> +		retry_slow = true;
> +	}
>  nopan:
>  	__mmu_config_restore(&config);
>  out:
>  	local_irq_restore(flags);
>  
>  	write_unlock(&vcpu->kvm->mmu_lock);
> +
> +	/*
> +	 * If retry_slow is true, then we either are missing shadow S2
> +	 * entries, have paged out guest S1, or something is inconsistent.
> +	 *
> +	 * Either way, we need to walk the PTs by hand so that we can either
> +	 * fault things back in, or record accurate fault information along
> +	 * the way.
> +	 */
> +	if (retry_slow) {
> +		par = handle_at_slow(vcpu, op, vaddr);
> +		vcpu_write_sys_reg(vcpu, par, PAR_EL1);
> +	}
>  }
>  
>  void __kvm_at_s1e2(struct kvm_vcpu *vcpu, u32 op, u64 vaddr)
> @@ -433,6 +931,10 @@ void __kvm_at_s1e2(struct kvm_vcpu *vcpu, u32 op, u64 vaddr)
>  
>  	write_unlock(&vcpu->kvm->mmu_lock);
>  
> +	/* We failed the translation, let's replay it in slow motion */
> +	if (!fail && (par & SYS_PAR_EL1_F))
> +		par = handle_at_slow(vcpu, op, vaddr);
> +
>  	vcpu_write_sys_reg(vcpu, par, PAR_EL1);
>  }
>  
> -- 
> 2.39.2
> 
> 


^ permalink raw reply	[flat|nested] 50+ messages in thread

* Re: [PATCH 10/12] KVM: arm64: nv: Add SW walker for AT S1 emulation
  2024-07-22 10:53   ` Alexandru Elisei
@ 2024-07-22 15:25     ` Marc Zyngier
  2024-07-23  8:57       ` Alexandru Elisei
  0 siblings, 1 reply; 50+ messages in thread
From: Marc Zyngier @ 2024-07-22 15:25 UTC (permalink / raw)
  To: Alexandru Elisei
  Cc: kvmarm, linux-arm-kernel, kvm, James Morse, Suzuki K Poulose,
	Oliver Upton, Zenghui Yu, Joey Gouly

Hi Alex,

On Mon, 22 Jul 2024 11:53:13 +0100,
Alexandru Elisei <alexandru.elisei@arm.com> wrote:
> 
> Hi Marc,
> 
> I would like to use the S1 walker for KVM SPE, and I was planning to move it to
> a separate file, where it would be shared between nested KVM and SPE. I think
> this is also good for NV, since the walker would get more testing.
> 
> Do you think moving it to a shared location is a good approach? Or do you have
> something else in mind?

I'm definitely open to moving it somewhere else if that helps, though
the location doesn't matter much, TBH, and it is the boundary of the
interface I'm more interested in. It may need some work though, as the
current design is solely written with AT in mind.

> Also, do you know where you'll be able to send an updated version of this
> series? I'm asking because I want to decide between using this code (with fixes
> on top) or wait for the next iteration. Please don't feel that you need to send
> the next iteration too soon.

The current state of the branch is at [1], which I plan to send once
-rc1 is out. Note that this isn't a stable branch, so things can
change without any warning!

> And please CC me on the series, so I don't miss it by mistake :)

Of course!

Thanks,

	M.

[1] https://git.kernel.org/pub/scm/linux/kernel/git/maz/arm-platforms.git/log/?h=kvm-arm64/nv-at-pan-WIP

-- 
Without deviation from the norm, progress is not possible.


^ permalink raw reply	[flat|nested] 50+ messages in thread

* Re: [PATCH 10/12] KVM: arm64: nv: Add SW walker for AT S1 emulation
  2024-07-22 15:25     ` Marc Zyngier
@ 2024-07-23  8:57       ` Alexandru Elisei
  0 siblings, 0 replies; 50+ messages in thread
From: Alexandru Elisei @ 2024-07-23  8:57 UTC (permalink / raw)
  To: Marc Zyngier
  Cc: kvmarm, linux-arm-kernel, kvm, James Morse, Suzuki K Poulose,
	Oliver Upton, Zenghui Yu, Joey Gouly

Hi Marc,

On Mon, Jul 22, 2024 at 04:25:00PM +0100, Marc Zyngier wrote:
> Hi Alex,
> 
> On Mon, 22 Jul 2024 11:53:13 +0100,
> Alexandru Elisei <alexandru.elisei@arm.com> wrote:
> > 
> > Hi Marc,
> > 
> > I would like to use the S1 walker for KVM SPE, and I was planning to move it to
> > a separate file, where it would be shared between nested KVM and SPE. I think
> > this is also good for NV, since the walker would get more testing.
> > 
> > Do you think moving it to a shared location is a good approach? Or do you have
> > something else in mind?
> 
> I'm definitely open to moving it somewhere else if that helps, though
> the location doesn't matter much, TBH, and it is the boundary of the
> interface I'm more interested in. It may need some work though, as the
> current design is solely written with AT in mind.

Looks that way to me too.

> 
> > Also, do you know where you'll be able to send an updated version of this
> > series? I'm asking because I want to decide between using this code (with fixes
> > on top) or wait for the next iteration. Please don't feel that you need to send
> > the next iteration too soon.
> 
> The current state of the branch is at [1], which I plan to send once
> -rc1 is out. Note that this isn't a stable branch, so things can
> change without any warning!
> 
> > And please CC me on the series, so I don't miss it by mistake :)
> 
> Of course!

Sounds great, thanks!

Alex

> 
> Thanks,
> 
> 	M.
> 
> [1] https://git.kernel.org/pub/scm/linux/kernel/git/maz/arm-platforms.git/log/?h=kvm-arm64/nv-at-pan-WIP
> 
> -- 
> Without deviation from the norm, progress is not possible.
> 


^ permalink raw reply	[flat|nested] 50+ messages in thread

* Re: [PATCH 10/12] KVM: arm64: nv: Add SW walker for AT S1 emulation
  2024-07-08 16:57 ` [PATCH 10/12] KVM: arm64: nv: Add SW walker for AT S1 emulation Marc Zyngier
                     ` (5 preceding siblings ...)
  2024-07-22 10:53   ` Alexandru Elisei
@ 2024-07-25 14:16   ` Alexandru Elisei
  2024-07-25 14:30     ` Marc Zyngier
  2024-07-29 15:26   ` Alexandru Elisei
  2024-07-31 14:33   ` Alexandru Elisei
  8 siblings, 1 reply; 50+ messages in thread
From: Alexandru Elisei @ 2024-07-25 14:16 UTC (permalink / raw)
  To: Marc Zyngier
  Cc: kvmarm, linux-arm-kernel, kvm, James Morse, Suzuki K Poulose,
	Oliver Upton, Zenghui Yu, Joey Gouly

Hi Marc,

On Mon, Jul 08, 2024 at 05:57:58PM +0100, Marc Zyngier wrote:
> In order to plug the brokenness of our current AT implementation,
> we need a SW walker that is going to... err.. walk the S1 tables
> and tell us what it finds.
> 
> Of course, it builds on top of our S2 walker, and shares similar
> concepts. The beauty of it is that since it uses kvm_read_guest(),
> it is able to bring back pages that have been otherwise evicted.
> 
> This is then plugged in the two AT S1 emulation functions as
> a "slow path" fallback. I'm not sure it is that slow, but hey.
> 
> Signed-off-by: Marc Zyngier <maz@kernel.org>
> [..]
> +static u64 handle_at_slow(struct kvm_vcpu *vcpu, u32 op, u64 vaddr)
> +{
> +	bool perm_fail, ur, uw, ux, pr, pw, pan;
> +	struct s1_walk_result wr = {};
> +	struct s1_walk_info wi = {};
> +	int ret, idx, el;
> +
> +	/*
> +	 * We only get here from guest EL2, so the translation regime
> +	 * AT applies to is solely defined by {E2H,TGE}.
> +	 */
> +	el = (vcpu_el2_e2h_is_set(vcpu) &&
> +	      vcpu_el2_tge_is_set(vcpu)) ? 2 : 1;
> +
> +	ret = setup_s1_walk(vcpu, &wi, &wr, vaddr, el);
> +	if (ret)
> +		goto compute_par;
> +
> +	if (wr.level == S1_MMU_DISABLED)
> +		goto compute_par;
> +
> +	idx = srcu_read_lock(&vcpu->kvm->srcu);
> +
> +	ret = walk_s1(vcpu, &wi, &wr, vaddr);
> +
> +	srcu_read_unlock(&vcpu->kvm->srcu, idx);
> +
> +	if (ret)
> +		goto compute_par;
> +
> +	/* FIXME: revisit when adding indirect permission support */
> +	if (kvm_has_feat(vcpu->kvm, ID_AA64MMFR1_EL1, PAN, PAN3) &&
> +	    !wi.nvhe) {
> +		u64 sctlr;
> +
> +		if (el == 1)
> +			sctlr = vcpu_read_sys_reg(vcpu, SCTLR_EL1);
> +		else
> +			sctlr = vcpu_read_sys_reg(vcpu, SCTLR_EL2);
> +
> +		ux = (sctlr & SCTLR_EL1_EPAN) && !(wr.desc & PTE_UXN);
> +	} else {
> +		ux = false;
> +	}
> +
> +	pw = !(wr.desc & PTE_RDONLY);
> +
> +	if (wi.nvhe) {
> +		ur = uw = false;
> +		pr = true;
> +	} else {
> +		if (wr.desc & PTE_USER) {
> +			ur = pr = true;
> +			uw = pw;
> +		} else {
> +			ur = uw = false;
> +			pr = true;
> +		}
> +	}
> +
> +	/* Apply the Hierarchical Permission madness */
> +	if (wi.nvhe) {
> +		wr.APTable &= BIT(1);
> +		wr.PXNTable = wr.UXNTable;
> +	}
> +
> +	ur &= !(wr.APTable & BIT(0));
> +	uw &= !(wr.APTable != 0);
> +	ux &= !wr.UXNTable;
> +
> +	pw &= !(wr.APTable & BIT(1));
> +
> +	pan = *vcpu_cpsr(vcpu) & PSR_PAN_BIT;
> +
> +	perm_fail = false;
> +
> +	switch (op) {
> +	case OP_AT_S1E1RP:
> +		perm_fail |= pan && (ur || uw || ux);
> +		fallthrough;
> +	case OP_AT_S1E1R:
> +	case OP_AT_S1E2R:
> +		perm_fail |= !pr;
> +		break;
> +	case OP_AT_S1E1WP:
> +		perm_fail |= pan && (ur || uw || ux);
> +		fallthrough;
> +	case OP_AT_S1E1W:
> +	case OP_AT_S1E2W:
> +		perm_fail |= !pw;
> +		break;
> +	case OP_AT_S1E0R:
> +		perm_fail |= !ur;
> +		break;
> +	case OP_AT_S1E0W:
> +		perm_fail |= !uw;
> +		break;
> +	default:
> +		BUG();
> +	}
> +
> +	if (perm_fail) {
> +		struct s1_walk_result tmp;

I was wondering if you would consider initializing 'tmp' to the empty struct
here. That makes it consistent with the initialization of 'wr' in the !perm_fail
case and I think it will make the code more robust wrt changes to
compute_par_s1() and what fields it accesses.

Thanks,
Alex

> +
> +		tmp.failed = true;
> +		tmp.fst = ESR_ELx_FSC_PERM | wr.level;
> +		tmp.s2 = false;
> +		tmp.ptw = false;
> +
> +		wr = tmp;
> +	}
> +
> +compute_par:
> +	return compute_par_s1(vcpu, &wr);
> +}


^ permalink raw reply	[flat|nested] 50+ messages in thread

* Re: [PATCH 10/12] KVM: arm64: nv: Add SW walker for AT S1 emulation
  2024-07-25 14:16   ` Alexandru Elisei
@ 2024-07-25 14:30     ` Marc Zyngier
  2024-07-25 15:13       ` Alexandru Elisei
  0 siblings, 1 reply; 50+ messages in thread
From: Marc Zyngier @ 2024-07-25 14:30 UTC (permalink / raw)
  To: Alexandru Elisei
  Cc: kvmarm, linux-arm-kernel, kvm, James Morse, Suzuki K Poulose,
	Oliver Upton, Zenghui Yu, Joey Gouly

On Thu, 25 Jul 2024 15:16:12 +0100,
Alexandru Elisei <alexandru.elisei@arm.com> wrote:
> 
> Hi Marc,
> 
> On Mon, Jul 08, 2024 at 05:57:58PM +0100, Marc Zyngier wrote:
> > +	if (perm_fail) {
> > +		struct s1_walk_result tmp;
> 
> I was wondering if you would consider initializing 'tmp' to the empty struct
> here. That makes it consistent with the initialization of 'wr' in the !perm_fail
> case and I think it will make the code more robust wrt changes to
> compute_par_s1() and what fields it accesses.

I think there is a slightly better way, with something like this:

diff --git a/arch/arm64/kvm/at.c b/arch/arm64/kvm/at.c
index b02d8dbffd209..36fa2801ab4ef 100644
--- a/arch/arm64/kvm/at.c
+++ b/arch/arm64/kvm/at.c
@@ -803,12 +803,12 @@ static u64 handle_at_slow(struct kvm_vcpu *vcpu, u32 op, u64 vaddr)
 	}
 
 	if (perm_fail) {
-		struct s1_walk_result tmp;
-
-		tmp.failed = true;
-		tmp.fst = ESR_ELx_FSC_PERM | wr.level;
-		tmp.s2 = false;
-		tmp.ptw = false;
+		struct s1_walk_result tmp = (struct s1_walk_result){
+			.failed	= true,
+			.fst	= ESR_ELx_FSC_PERM | wr.level,
+			.s2	= false,
+			.ptw	= false,
+		};
 
 		wr = tmp;
 	}

Thoughts?

	M.

-- 
Without deviation from the norm, progress is not possible.


^ permalink raw reply related	[flat|nested] 50+ messages in thread

* Re: [PATCH 10/12] KVM: arm64: nv: Add SW walker for AT S1 emulation
  2024-07-25 14:30     ` Marc Zyngier
@ 2024-07-25 15:13       ` Alexandru Elisei
  2024-07-25 15:33         ` Marc Zyngier
  0 siblings, 1 reply; 50+ messages in thread
From: Alexandru Elisei @ 2024-07-25 15:13 UTC (permalink / raw)
  To: Marc Zyngier
  Cc: kvmarm, linux-arm-kernel, kvm, James Morse, Suzuki K Poulose,
	Oliver Upton, Zenghui Yu, Joey Gouly

Hi,

On Thu, Jul 25, 2024 at 03:30:00PM +0100, Marc Zyngier wrote:
> On Thu, 25 Jul 2024 15:16:12 +0100,
> Alexandru Elisei <alexandru.elisei@arm.com> wrote:
> > 
> > Hi Marc,
> > 
> > On Mon, Jul 08, 2024 at 05:57:58PM +0100, Marc Zyngier wrote:
> > > +	if (perm_fail) {
> > > +		struct s1_walk_result tmp;
> > 
> > I was wondering if you would consider initializing 'tmp' to the empty struct
> > here. That makes it consistent with the initialization of 'wr' in the !perm_fail
> > case and I think it will make the code more robust wrt changes to
> > compute_par_s1() and what fields it accesses.
> 
> I think there is a slightly better way, with something like this:
> 
> diff --git a/arch/arm64/kvm/at.c b/arch/arm64/kvm/at.c
> index b02d8dbffd209..36fa2801ab4ef 100644
> --- a/arch/arm64/kvm/at.c
> +++ b/arch/arm64/kvm/at.c
> @@ -803,12 +803,12 @@ static u64 handle_at_slow(struct kvm_vcpu *vcpu, u32 op, u64 vaddr)
>  	}
>  
>  	if (perm_fail) {
> -		struct s1_walk_result tmp;
> -
> -		tmp.failed = true;
> -		tmp.fst = ESR_ELx_FSC_PERM | wr.level;
> -		tmp.s2 = false;
> -		tmp.ptw = false;
> +		struct s1_walk_result tmp = (struct s1_walk_result){
> +			.failed	= true,
> +			.fst	= ESR_ELx_FSC_PERM | wr.level,
> +			.s2	= false,
> +			.ptw	= false,
> +		};
>  
>  		wr = tmp;
>  	}
> 
> Thoughts?

How about (diff against your kvm-arm64/nv-at-pan-WIP branch, in case something
looks off):

diff --git a/arch/arm64/kvm/at.c b/arch/arm64/kvm/at.c
index b02d8dbffd20..74ebe3223a13 100644
--- a/arch/arm64/kvm/at.c
+++ b/arch/arm64/kvm/at.c
@@ -802,16 +802,8 @@ static u64 handle_at_slow(struct kvm_vcpu *vcpu, u32 op, u64 vaddr)
                BUG();
        }

-       if (perm_fail) {
-               struct s1_walk_result tmp;
-
-               tmp.failed = true;
-               tmp.fst = ESR_ELx_FSC_PERM | wr.level;
-               tmp.s2 = false;
-               tmp.ptw = false;
-
-               wr = tmp;
-       }
+       if (perm_fail)
+               fail_s1_walk(&wr, ESR_ELx_FSC_PERM | wr.level, false, false);

 compute_par:
        return compute_par_s1(vcpu, &wr);

Thanks,
Alex


^ permalink raw reply related	[flat|nested] 50+ messages in thread

* Re: [PATCH 10/12] KVM: arm64: nv: Add SW walker for AT S1 emulation
  2024-07-25 15:13       ` Alexandru Elisei
@ 2024-07-25 15:33         ` Marc Zyngier
  0 siblings, 0 replies; 50+ messages in thread
From: Marc Zyngier @ 2024-07-25 15:33 UTC (permalink / raw)
  To: Alexandru Elisei
  Cc: kvmarm, linux-arm-kernel, kvm, James Morse, Suzuki K Poulose,
	Oliver Upton, Zenghui Yu, Joey Gouly

On Thu, 25 Jul 2024 16:13:25 +0100,
Alexandru Elisei <alexandru.elisei@arm.com> wrote:
> 
> Hi,
> 
> On Thu, Jul 25, 2024 at 03:30:00PM +0100, Marc Zyngier wrote:
> > On Thu, 25 Jul 2024 15:16:12 +0100,
> > Alexandru Elisei <alexandru.elisei@arm.com> wrote:
> > > 
> > > Hi Marc,
> > > 
> > > On Mon, Jul 08, 2024 at 05:57:58PM +0100, Marc Zyngier wrote:
> > > > +	if (perm_fail) {
> > > > +		struct s1_walk_result tmp;
> > > 
> > > I was wondering if you would consider initializing 'tmp' to the empty struct
> > > here. That makes it consistent with the initialization of 'wr' in the !perm_fail
> > > case and I think it will make the code more robust wrt changes to
> > > compute_par_s1() and what fields it accesses.
> > 
> > I think there is a slightly better way, with something like this:
> > 
> > diff --git a/arch/arm64/kvm/at.c b/arch/arm64/kvm/at.c
> > index b02d8dbffd209..36fa2801ab4ef 100644
> > --- a/arch/arm64/kvm/at.c
> > +++ b/arch/arm64/kvm/at.c
> > @@ -803,12 +803,12 @@ static u64 handle_at_slow(struct kvm_vcpu *vcpu, u32 op, u64 vaddr)
> >  	}
> >  
> >  	if (perm_fail) {
> > -		struct s1_walk_result tmp;
> > -
> > -		tmp.failed = true;
> > -		tmp.fst = ESR_ELx_FSC_PERM | wr.level;
> > -		tmp.s2 = false;
> > -		tmp.ptw = false;
> > +		struct s1_walk_result tmp = (struct s1_walk_result){
> > +			.failed	= true,
> > +			.fst	= ESR_ELx_FSC_PERM | wr.level,
> > +			.s2	= false,
> > +			.ptw	= false,
> > +		};
> >  
> >  		wr = tmp;
> >  	}
> > 
> > Thoughts?
> 
> How about (diff against your kvm-arm64/nv-at-pan-WIP branch, in case something
> looks off):
> 
> diff --git a/arch/arm64/kvm/at.c b/arch/arm64/kvm/at.c
> index b02d8dbffd20..74ebe3223a13 100644
> --- a/arch/arm64/kvm/at.c
> +++ b/arch/arm64/kvm/at.c
> @@ -802,16 +802,8 @@ static u64 handle_at_slow(struct kvm_vcpu *vcpu, u32 op, u64 vaddr)
>                 BUG();
>         }
> 
> -       if (perm_fail) {
> -               struct s1_walk_result tmp;
> -
> -               tmp.failed = true;
> -               tmp.fst = ESR_ELx_FSC_PERM | wr.level;
> -               tmp.s2 = false;
> -               tmp.ptw = false;
> -
> -               wr = tmp;
> -       }
> +       if (perm_fail)
> +               fail_s1_walk(&wr, ESR_ELx_FSC_PERM | wr.level, false, false);
> 
>  compute_par:
>         return compute_par_s1(vcpu, &wr);
> 

Ah, much nicer indeed!

Thanks,

	M.

-- 
Without deviation from the norm, progress is not possible.


^ permalink raw reply	[flat|nested] 50+ messages in thread

* Re: [PATCH 10/12] KVM: arm64: nv: Add SW walker for AT S1 emulation
  2024-07-08 16:57 ` [PATCH 10/12] KVM: arm64: nv: Add SW walker for AT S1 emulation Marc Zyngier
                     ` (6 preceding siblings ...)
  2024-07-25 14:16   ` Alexandru Elisei
@ 2024-07-29 15:26   ` Alexandru Elisei
  2024-07-31  8:55     ` Marc Zyngier
  2024-07-31 14:33   ` Alexandru Elisei
  8 siblings, 1 reply; 50+ messages in thread
From: Alexandru Elisei @ 2024-07-29 15:26 UTC (permalink / raw)
  To: Marc Zyngier
  Cc: kvmarm, linux-arm-kernel, kvm, James Morse, Suzuki K Poulose,
	Oliver Upton, Zenghui Yu, Joey Gouly

Hi Marc,

On Mon, Jul 08, 2024 at 05:57:58PM +0100, Marc Zyngier wrote:
> In order to plug the brokenness of our current AT implementation,
> we need a SW walker that is going to... err.. walk the S1 tables
> and tell us what it finds.
> 
> Of course, it builds on top of our S2 walker, and shares similar
> concepts. The beauty of it is that since it uses kvm_read_guest(),
> it is able to bring back pages that have been otherwise evicted.
> 
> This is then plugged in the two AT S1 emulation functions as
> a "slow path" fallback. I'm not sure it is that slow, but hey.
> 
> Signed-off-by: Marc Zyngier <maz@kernel.org>
> ---
>  arch/arm64/kvm/at.c | 538 ++++++++++++++++++++++++++++++++++++++++++--
>  1 file changed, 520 insertions(+), 18 deletions(-)
> 
> diff --git a/arch/arm64/kvm/at.c b/arch/arm64/kvm/at.c
> index 71e3390b43b4c..8452273cbff6d 100644
> --- a/arch/arm64/kvm/at.c
> +++ b/arch/arm64/kvm/at.c
> @@ -4,9 +4,305 @@
>   * Author: Jintack Lim <jintack.lim@linaro.org>
>   */
>  
> +#include <linux/kvm_host.h>
> +
> +#include <asm/esr.h>
>  #include <asm/kvm_hyp.h>
>  #include <asm/kvm_mmu.h>
>  
> +struct s1_walk_info {
> +	u64	     baddr;
> +	unsigned int max_oa_bits;
> +	unsigned int pgshift;
> +	unsigned int txsz;
> +	int 	     sl;
> +	bool	     hpd;
> +	bool	     be;
> +	bool	     nvhe;
> +	bool	     s2;
> +};
> +
> +struct s1_walk_result {
> +	union {
> +		struct {
> +			u64	desc;
> +			u64	pa;
> +			s8	level;
> +			u8	APTable;
> +			bool	UXNTable;
> +			bool	PXNTable;
> +		};
> +		struct {
> +			u8	fst;
> +			bool	ptw;
> +			bool	s2;
> +		};
> +	};
> +	bool	failed;
> +};
> +
> +static void fail_s1_walk(struct s1_walk_result *wr, u8 fst, bool ptw, bool s2)
> +{
> +	wr->fst		= fst;
> +	wr->ptw		= ptw;
> +	wr->s2		= s2;
> +	wr->failed	= true;
> +}
> +
> +#define S1_MMU_DISABLED		(-127)
> +
> +static int setup_s1_walk(struct kvm_vcpu *vcpu, struct s1_walk_info *wi,
> +			 struct s1_walk_result *wr, const u64 va, const int el)
> +{
> +	u64 sctlr, tcr, tg, ps, ia_bits, ttbr;
> +	unsigned int stride, x;
> +	bool va55, tbi;
> +
> +	wi->nvhe = el == 2 && !vcpu_el2_e2h_is_set(vcpu);

Where 'el' is computed in handle_at_slow() as:

	/*
	 * We only get here from guest EL2, so the translation regime
	 * AT applies to is solely defined by {E2H,TGE}.
	 */
	el = (vcpu_el2_e2h_is_set(vcpu) &&
	      vcpu_el2_tge_is_set(vcpu)) ? 2 : 1;

I think 'nvhe' will always be false ('el' is 2 only when E2H is set).

I'm curious about what 'el' represents. The translation regime for the AT
instruction?

Thanks,
Alex


^ permalink raw reply	[flat|nested] 50+ messages in thread

* Re: [PATCH 10/12] KVM: arm64: nv: Add SW walker for AT S1 emulation
  2024-07-29 15:26   ` Alexandru Elisei
@ 2024-07-31  8:55     ` Marc Zyngier
  2024-07-31  9:53       ` Alexandru Elisei
  0 siblings, 1 reply; 50+ messages in thread
From: Marc Zyngier @ 2024-07-31  8:55 UTC (permalink / raw)
  To: Alexandru Elisei
  Cc: kvmarm, linux-arm-kernel, kvm, James Morse, Suzuki K Poulose,
	Oliver Upton, Zenghui Yu, Joey Gouly

On Mon, 29 Jul 2024 16:26:00 +0100,
Alexandru Elisei <alexandru.elisei@arm.com> wrote:
> 
> Hi Marc,
> 
> On Mon, Jul 08, 2024 at 05:57:58PM +0100, Marc Zyngier wrote:
> > In order to plug the brokenness of our current AT implementation,
> > we need a SW walker that is going to... err.. walk the S1 tables
> > and tell us what it finds.
> > 
> > Of course, it builds on top of our S2 walker, and shares similar
> > concepts. The beauty of it is that since it uses kvm_read_guest(),
> > it is able to bring back pages that have been otherwise evicted.
> > 
> > This is then plugged in the two AT S1 emulation functions as
> > a "slow path" fallback. I'm not sure it is that slow, but hey.
> > 
> > Signed-off-by: Marc Zyngier <maz@kernel.org>
> > ---
> >  arch/arm64/kvm/at.c | 538 ++++++++++++++++++++++++++++++++++++++++++--
> >  1 file changed, 520 insertions(+), 18 deletions(-)
> > 
> > diff --git a/arch/arm64/kvm/at.c b/arch/arm64/kvm/at.c
> > index 71e3390b43b4c..8452273cbff6d 100644
> > --- a/arch/arm64/kvm/at.c
> > +++ b/arch/arm64/kvm/at.c
> > @@ -4,9 +4,305 @@
> >   * Author: Jintack Lim <jintack.lim@linaro.org>
> >   */
> >  
> > +#include <linux/kvm_host.h>
> > +
> > +#include <asm/esr.h>
> >  #include <asm/kvm_hyp.h>
> >  #include <asm/kvm_mmu.h>
> >  
> > +struct s1_walk_info {
> > +	u64	     baddr;
> > +	unsigned int max_oa_bits;
> > +	unsigned int pgshift;
> > +	unsigned int txsz;
> > +	int 	     sl;
> > +	bool	     hpd;
> > +	bool	     be;
> > +	bool	     nvhe;
> > +	bool	     s2;
> > +};
> > +
> > +struct s1_walk_result {
> > +	union {
> > +		struct {
> > +			u64	desc;
> > +			u64	pa;
> > +			s8	level;
> > +			u8	APTable;
> > +			bool	UXNTable;
> > +			bool	PXNTable;
> > +		};
> > +		struct {
> > +			u8	fst;
> > +			bool	ptw;
> > +			bool	s2;
> > +		};
> > +	};
> > +	bool	failed;
> > +};
> > +
> > +static void fail_s1_walk(struct s1_walk_result *wr, u8 fst, bool ptw, bool s2)
> > +{
> > +	wr->fst		= fst;
> > +	wr->ptw		= ptw;
> > +	wr->s2		= s2;
> > +	wr->failed	= true;
> > +}
> > +
> > +#define S1_MMU_DISABLED		(-127)
> > +
> > +static int setup_s1_walk(struct kvm_vcpu *vcpu, struct s1_walk_info *wi,
> > +			 struct s1_walk_result *wr, const u64 va, const int el)
> > +{
> > +	u64 sctlr, tcr, tg, ps, ia_bits, ttbr;
> > +	unsigned int stride, x;
> > +	bool va55, tbi;
> > +
> > +	wi->nvhe = el == 2 && !vcpu_el2_e2h_is_set(vcpu);
> 
> Where 'el' is computed in handle_at_slow() as:
> 
> 	/*
> 	 * We only get here from guest EL2, so the translation regime
> 	 * AT applies to is solely defined by {E2H,TGE}.
> 	 */
> 	el = (vcpu_el2_e2h_is_set(vcpu) &&
> 	      vcpu_el2_tge_is_set(vcpu)) ? 2 : 1;
> 
> I think 'nvhe' will always be false ('el' is 2 only when E2H is
> set).

Yeah, there is a number of problems here. el should depend on both the
instruction (some are EL2-specific) and the HCR control bits. I'll
tackle that now.

> I'm curious about what 'el' represents. The translation regime for the AT
> instruction?

Exactly that.

Thanks,

	M.

-- 
Without deviation from the norm, progress is not possible.


^ permalink raw reply	[flat|nested] 50+ messages in thread

* Re: [PATCH 10/12] KVM: arm64: nv: Add SW walker for AT S1 emulation
  2024-07-31  8:55     ` Marc Zyngier
@ 2024-07-31  9:53       ` Alexandru Elisei
  2024-07-31 10:18         ` Marc Zyngier
  0 siblings, 1 reply; 50+ messages in thread
From: Alexandru Elisei @ 2024-07-31  9:53 UTC (permalink / raw)
  To: Marc Zyngier
  Cc: kvmarm, linux-arm-kernel, kvm, James Morse, Suzuki K Poulose,
	Oliver Upton, Zenghui Yu, Joey Gouly

Hi,

On Wed, Jul 31, 2024 at 09:55:28AM +0100, Marc Zyngier wrote:
> On Mon, 29 Jul 2024 16:26:00 +0100,
> Alexandru Elisei <alexandru.elisei@arm.com> wrote:
> > 
> > Hi Marc,
> > 
> > On Mon, Jul 08, 2024 at 05:57:58PM +0100, Marc Zyngier wrote:
> > > In order to plug the brokenness of our current AT implementation,
> > > we need a SW walker that is going to... err.. walk the S1 tables
> > > and tell us what it finds.
> > > 
> > > Of course, it builds on top of our S2 walker, and shares similar
> > > concepts. The beauty of it is that since it uses kvm_read_guest(),
> > > it is able to bring back pages that have been otherwise evicted.
> > > 
> > > This is then plugged in the two AT S1 emulation functions as
> > > a "slow path" fallback. I'm not sure it is that slow, but hey.
> > > 
> > > Signed-off-by: Marc Zyngier <maz@kernel.org>
> > > ---
> > >  arch/arm64/kvm/at.c | 538 ++++++++++++++++++++++++++++++++++++++++++--
> > >  1 file changed, 520 insertions(+), 18 deletions(-)
> > > 
> > > diff --git a/arch/arm64/kvm/at.c b/arch/arm64/kvm/at.c
> > > index 71e3390b43b4c..8452273cbff6d 100644
> > > --- a/arch/arm64/kvm/at.c
> > > +++ b/arch/arm64/kvm/at.c
> > > @@ -4,9 +4,305 @@
> > >   * Author: Jintack Lim <jintack.lim@linaro.org>
> > >   */
> > >  
> > > +#include <linux/kvm_host.h>
> > > +
> > > +#include <asm/esr.h>
> > >  #include <asm/kvm_hyp.h>
> > >  #include <asm/kvm_mmu.h>
> > >  
> > > +struct s1_walk_info {
> > > +	u64	     baddr;
> > > +	unsigned int max_oa_bits;
> > > +	unsigned int pgshift;
> > > +	unsigned int txsz;
> > > +	int 	     sl;
> > > +	bool	     hpd;
> > > +	bool	     be;
> > > +	bool	     nvhe;
> > > +	bool	     s2;
> > > +};
> > > +
> > > +struct s1_walk_result {
> > > +	union {
> > > +		struct {
> > > +			u64	desc;
> > > +			u64	pa;
> > > +			s8	level;
> > > +			u8	APTable;
> > > +			bool	UXNTable;
> > > +			bool	PXNTable;
> > > +		};
> > > +		struct {
> > > +			u8	fst;
> > > +			bool	ptw;
> > > +			bool	s2;
> > > +		};
> > > +	};
> > > +	bool	failed;
> > > +};
> > > +
> > > +static void fail_s1_walk(struct s1_walk_result *wr, u8 fst, bool ptw, bool s2)
> > > +{
> > > +	wr->fst		= fst;
> > > +	wr->ptw		= ptw;
> > > +	wr->s2		= s2;
> > > +	wr->failed	= true;
> > > +}
> > > +
> > > +#define S1_MMU_DISABLED		(-127)
> > > +
> > > +static int setup_s1_walk(struct kvm_vcpu *vcpu, struct s1_walk_info *wi,
> > > +			 struct s1_walk_result *wr, const u64 va, const int el)
> > > +{
> > > +	u64 sctlr, tcr, tg, ps, ia_bits, ttbr;
> > > +	unsigned int stride, x;
> > > +	bool va55, tbi;
> > > +
> > > +	wi->nvhe = el == 2 && !vcpu_el2_e2h_is_set(vcpu);
> > 
> > Where 'el' is computed in handle_at_slow() as:
> > 
> > 	/*
> > 	 * We only get here from guest EL2, so the translation regime
> > 	 * AT applies to is solely defined by {E2H,TGE}.
> > 	 */
> > 	el = (vcpu_el2_e2h_is_set(vcpu) &&
> > 	      vcpu_el2_tge_is_set(vcpu)) ? 2 : 1;
> > 
> > I think 'nvhe' will always be false ('el' is 2 only when E2H is
> > set).
> 
> Yeah, there is a number of problems here. el should depend on both the
> instruction (some are EL2-specific) and the HCR control bits. I'll
> tackle that now.

Yeah, also noticed that how sctlr, tcr and ttbr are chosen in setup_s1_walk()
doesn't look quite right for the nvhe case.

> 
> > I'm curious about what 'el' represents. The translation regime for the AT
> > instruction?
> 
> Exactly that.

Might I make a suggestion here? I was thinking about dropping the (el, wi->nvhe*)
tuple to represent the translation regime and have a wi->regime (or similar) to
unambiguously encode the regime. The value can be an enum with three values to
represent the three possible regimes (REGIME_EL10, REGIME_EL2, REGIME_EL20).
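
Purely as an illustration of what I have in mind (names invented here, and
it also tries to fold in your point about the EL2-specific instructions;
a sketch, not a patch):

enum trans_regime {
	TR_EL10,	/* EL1&0, possibly with a guest S2 */
	TR_EL20,	/* EL2&0, E2H=1 */
	TR_EL2,		/* EL2, E2H=0 */
};

static enum trans_regime compute_translation_regime(struct kvm_vcpu *vcpu, u32 op)
{
	switch (op) {
	case OP_AT_S1E2R:
	case OP_AT_S1E2W:
		/* The *E2* instructions always target the EL2 regime */
		return vcpu_el2_e2h_is_set(vcpu) ? TR_EL20 : TR_EL2;
	default:
		/* Everything else depends on {E2H,TGE} */
		return (vcpu_el2_e2h_is_set(vcpu) &&
			vcpu_el2_tge_is_set(vcpu)) ? TR_EL20 : TR_EL10;
	}
}

The nvhe/single_range tests would then roughly become "wi->regime == TR_EL2",
and the EL1 vs EL2 register selection would key off "wi->regime == TR_EL10".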

Just a thought though, feel free to ignore at your leisure.

*wi->single_range on the kvm-arm64/nv-at-pan-WIP branch.

Thanks,
Alex


^ permalink raw reply	[flat|nested] 50+ messages in thread

* Re: [PATCH 00/12] KVM: arm64: nv: Add support for address translation instructions
  2024-06-25 13:34 [PATCH 00/12] KVM: arm64: nv: Add support for address translation instructions Marc Zyngier
                   ` (10 preceding siblings ...)
  2024-07-08 16:57 ` [PATCH 10/12] KVM: arm64: nv: Add SW walker for AT S1 emulation Marc Zyngier
@ 2024-07-31 10:05 ` Alexandru Elisei
  2024-07-31 11:02   ` Marc Zyngier
  11 siblings, 1 reply; 50+ messages in thread
From: Alexandru Elisei @ 2024-07-31 10:05 UTC (permalink / raw)
  To: Marc Zyngier
  Cc: kvmarm, linux-arm-kernel, kvm, James Morse, Suzuki K Poulose,
	Oliver Upton, Zenghui Yu, Joey Gouly

Hi Marc,

On Tue, Jun 25, 2024 at 02:34:59PM +0100, Marc Zyngier wrote:
> Another task that a hypervisor supporting NV on arm64 has to deal with
> is to emulate the AT instruction, because we multiplex all the S1
> translations on a single set of registers, and the guest S2 is never
> truly resident on the CPU.

I'm unfamiliar with the state of NV support in KVM, but I thought I would have a
look at when AT trapping is enabled. As far as I can tell, it's only enabled in
vhe/switch.c::__activate_traps() -> compute_hcr() if is_hyp_ctxt(vcpu). Found
this by grep'ing for HCR_AT.

Assuming the above is correct, I am curious about the following:

- The above paragraph mentions guest's stage 2 (and the code takes that into
  consideration), yet when is_hyp_ctxt() is true it is likely that the guest
  stage 2 is not enabled. Are you planning to enable the AT trap based on
  virtual HCR_EL2.VM being set in a later series?

- A guest might also set the HCR_EL2.AT bit in the virtual HCR_EL2 register. I
  suppose I have the same question, injecting the exception back into the guest
  is going to be handled in another series?

Thanks,
Alex

> 
> So given that we lie about page tables, we also have to lie about
> translation instructions, hence the emulation. Things are made
> complicated by the fact that guest S1 page tables can be swapped out,
> and that our shadow S2 is likely to be incomplete. So while using AT
> to emulate AT is tempting (and useful), it is not going to always
> work, and we thus need a fallback in the shape of a SW S1 walker.
> 
> This series is built in 4 basic blocks:
> 
> - Add missing definition and basic reworking
> 
> - Dumb emulation of all relevant AT instructions using AT instructions
> 
> - Add a SW S1 walker that is using our S2 walker
> 
> - Add FEAT_ATS1A support, which is almost trivial
> 
> This has been tested by comparing the output of a HW walker with the
> output of the SW one. Obviously, this isn't bullet proof, and I'm
> pretty sure there are some nasties in there.
> 
> In a departure from my usual habit, this series is on top of
> kvmarm/next, as it depends on the NV S2 shadow code.
> 
> Joey Gouly (1):
>   KVM: arm64: make kvm_at() take an OP_AT_*
> 
> Marc Zyngier (11):
>   arm64: Add missing APTable and TCR_ELx.HPD masks
>   arm64: Add PAR_EL1 field description
>   KVM: arm64: nv: Turn upper_attr for S2 walk into the full descriptor
>   KVM: arm64: nv: Honor absence of FEAT_PAN2
>   KVM: arm64: nv: Add basic emulation of AT S1E{0,1}{R,W}[P]
>   KVM: arm64: nv: Add basic emulation of AT S1E2{R,W}
>   KVM: arm64: nv: Add emulation of AT S12E{0,1}{R,W}
>   KVM: arm64: nv: Make ps_to_output_size() generally available
>   KVM: arm64: nv: Add SW walker for AT S1 emulation
>   KVM: arm64: nv: Plumb handling of AT S1* traps from EL2
>   KVM: arm64: nv: Add support for FEAT_ATS1A
> 
>  arch/arm64/include/asm/kvm_arm.h       |    1 +
>  arch/arm64/include/asm/kvm_asm.h       |    6 +-
>  arch/arm64/include/asm/kvm_nested.h    |   18 +-
>  arch/arm64/include/asm/pgtable-hwdef.h |    7 +
>  arch/arm64/include/asm/sysreg.h        |   19 +
>  arch/arm64/kvm/Makefile                |    2 +-
>  arch/arm64/kvm/at.c                    | 1007 ++++++++++++++++++++++++
>  arch/arm64/kvm/emulate-nested.c        |    2 +
>  arch/arm64/kvm/hyp/include/hyp/fault.h |    2 +-
>  arch/arm64/kvm/nested.c                |   26 +-
>  arch/arm64/kvm/sys_regs.c              |   60 ++
>  11 files changed, 1125 insertions(+), 25 deletions(-)
>  create mode 100644 arch/arm64/kvm/at.c
> 
> -- 
> 2.39.2
> 
> 


^ permalink raw reply	[flat|nested] 50+ messages in thread

* Re: [PATCH 10/12] KVM: arm64: nv: Add SW walker for AT S1 emulation
  2024-07-31  9:53       ` Alexandru Elisei
@ 2024-07-31 10:18         ` Marc Zyngier
  2024-07-31 10:28           ` Alexandru Elisei
  0 siblings, 1 reply; 50+ messages in thread
From: Marc Zyngier @ 2024-07-31 10:18 UTC (permalink / raw)
  To: Alexandru Elisei
  Cc: kvmarm, linux-arm-kernel, kvm, James Morse, Suzuki K Poulose,
	Oliver Upton, Zenghui Yu, Joey Gouly

On Wed, 31 Jul 2024 10:53:14 +0100,
Alexandru Elisei <alexandru.elisei@arm.com> wrote:
> 
> Hi,
> 
> On Wed, Jul 31, 2024 at 09:55:28AM +0100, Marc Zyngier wrote:
> > On Mon, 29 Jul 2024 16:26:00 +0100,
> > Alexandru Elisei <alexandru.elisei@arm.com> wrote:
> > > 
> > > Hi Marc,
> > > 
> > > On Mon, Jul 08, 2024 at 05:57:58PM +0100, Marc Zyngier wrote:
> > > > In order to plug the brokenness of our current AT implementation,
> > > > we need a SW walker that is going to... err.. walk the S1 tables
> > > > and tell us what it finds.
> > > > 
> > > > Of course, it builds on top of our S2 walker, and shares similar
> > > > concepts. The beauty of it is that since it uses kvm_read_guest(),
> > > > it is able to bring back pages that have been otherwise evicted.
> > > > 
> > > > This is then plugged in the two AT S1 emulation functions as
> > > > a "slow path" fallback. I'm not sure it is that slow, but hey.
> > > > 
> > > > Signed-off-by: Marc Zyngier <maz@kernel.org>
> > > > ---
> > > >  arch/arm64/kvm/at.c | 538 ++++++++++++++++++++++++++++++++++++++++++--
> > > >  1 file changed, 520 insertions(+), 18 deletions(-)
> > > > 
> > > > diff --git a/arch/arm64/kvm/at.c b/arch/arm64/kvm/at.c
> > > > index 71e3390b43b4c..8452273cbff6d 100644
> > > > --- a/arch/arm64/kvm/at.c
> > > > +++ b/arch/arm64/kvm/at.c
> > > > @@ -4,9 +4,305 @@
> > > >   * Author: Jintack Lim <jintack.lim@linaro.org>
> > > >   */
> > > >  
> > > > +#include <linux/kvm_host.h>
> > > > +
> > > > +#include <asm/esr.h>
> > > >  #include <asm/kvm_hyp.h>
> > > >  #include <asm/kvm_mmu.h>
> > > >  
> > > > +struct s1_walk_info {
> > > > +	u64	     baddr;
> > > > +	unsigned int max_oa_bits;
> > > > +	unsigned int pgshift;
> > > > +	unsigned int txsz;
> > > > +	int 	     sl;
> > > > +	bool	     hpd;
> > > > +	bool	     be;
> > > > +	bool	     nvhe;
> > > > +	bool	     s2;
> > > > +};
> > > > +
> > > > +struct s1_walk_result {
> > > > +	union {
> > > > +		struct {
> > > > +			u64	desc;
> > > > +			u64	pa;
> > > > +			s8	level;
> > > > +			u8	APTable;
> > > > +			bool	UXNTable;
> > > > +			bool	PXNTable;
> > > > +		};
> > > > +		struct {
> > > > +			u8	fst;
> > > > +			bool	ptw;
> > > > +			bool	s2;
> > > > +		};
> > > > +	};
> > > > +	bool	failed;
> > > > +};
> > > > +
> > > > +static void fail_s1_walk(struct s1_walk_result *wr, u8 fst, bool ptw, bool s2)
> > > > +{
> > > > +	wr->fst		= fst;
> > > > +	wr->ptw		= ptw;
> > > > +	wr->s2		= s2;
> > > > +	wr->failed	= true;
> > > > +}
> > > > +
> > > > +#define S1_MMU_DISABLED		(-127)
> > > > +
> > > > +static int setup_s1_walk(struct kvm_vcpu *vcpu, struct s1_walk_info *wi,
> > > > +			 struct s1_walk_result *wr, const u64 va, const int el)
> > > > +{
> > > > +	u64 sctlr, tcr, tg, ps, ia_bits, ttbr;
> > > > +	unsigned int stride, x;
> > > > +	bool va55, tbi;
> > > > +
> > > > +	wi->nvhe = el == 2 && !vcpu_el2_e2h_is_set(vcpu);
> > > 
> > > Where 'el' is computed in handle_at_slow() as:
> > > 
> > > 	/*
> > > 	 * We only get here from guest EL2, so the translation regime
> > > 	 * AT applies to is solely defined by {E2H,TGE}.
> > > 	 */
> > > 	el = (vcpu_el2_e2h_is_set(vcpu) &&
> > > 	      vcpu_el2_tge_is_set(vcpu)) ? 2 : 1;
> > > 
> > > I think 'nvhe' will always be false ('el' is 2 only when E2H is
> > > set).
> > 
> > Yeah, there is a number of problems here. el should depend on both the
> > instruction (some are EL2-specific) and the HCR control bits. I'll
> > tackle that now.
> 
> Yeah, also noticed that how sctlr, tcr and ttbr are chosen in setup_s1_walk()
> doesn't look quite right for the nvhe case.

Are you sure? Assuming the 'el' value is correct (and I think I fixed
that on my local branch), they seem correct to me (we check for va55
early in the function to avoid an later issue).

Can you point out what exactly fails in that logic?

>
> > 
> > > I'm curious about what 'el' represents. The translation regime for the AT
> > > instruction?
> > 
> > Exactly that.
> 
> Might I make a suggestion here? I was thinking about dropping the (el, wi->nvhe*)
> tuple to represent the translation regime and have a wi->regime (or similar) to
> unambiguously encode the regime. The value can be an enum with three values to
> represent the three possible regimes (REGIME_EL10, REGIME_EL2, REGIME_EL20).

I've been thinking of that, but I'm wondering whether that just
results in pretty awful code in the end, because we go from 2 cases
(el==1 or el==2) to 3. But most of the time, we don't care about the
E2H=0 case, because we can handle it just like E2H=1.

I'll give it a go and see what it looks like.

Thanks,

	M.

-- 
Without deviation from the norm, progress is not possible.


^ permalink raw reply	[flat|nested] 50+ messages in thread

* Re: [PATCH 10/12] KVM: arm64: nv: Add SW walker for AT S1 emulation
  2024-07-31 10:18         ` Marc Zyngier
@ 2024-07-31 10:28           ` Alexandru Elisei
  0 siblings, 0 replies; 50+ messages in thread
From: Alexandru Elisei @ 2024-07-31 10:28 UTC (permalink / raw)
  To: Marc Zyngier
  Cc: kvmarm, linux-arm-kernel, kvm, James Morse, Suzuki K Poulose,
	Oliver Upton, Zenghui Yu, Joey Gouly

Hi,

On Wed, Jul 31, 2024 at 11:18:06AM +0100, Marc Zyngier wrote:
> On Wed, 31 Jul 2024 10:53:14 +0100,
> Alexandru Elisei <alexandru.elisei@arm.com> wrote:
> > 
> > Hi,
> > 
> > On Wed, Jul 31, 2024 at 09:55:28AM +0100, Marc Zyngier wrote:
> > > On Mon, 29 Jul 2024 16:26:00 +0100,
> > > Alexandru Elisei <alexandru.elisei@arm.com> wrote:
> > > > 
> > > > Hi Marc,
> > > > 
> > > > On Mon, Jul 08, 2024 at 05:57:58PM +0100, Marc Zyngier wrote:
> > > > > In order to plug the brokenness of our current AT implementation,
> > > > > we need a SW walker that is going to... err.. walk the S1 tables
> > > > > and tell us what it finds.
> > > > > 
> > > > > Of course, it builds on top of our S2 walker, and shares similar
> > > > > concepts. The beauty of it is that since it uses kvm_read_guest(),
> > > > > it is able to bring back pages that have been otherwise evicted.
> > > > > 
> > > > > This is then plugged in the two AT S1 emulation functions as
> > > > > a "slow path" fallback. I'm not sure it is that slow, but hey.
> > > > > 
> > > > > Signed-off-by: Marc Zyngier <maz@kernel.org>
> > > > > ---
> > > > >  arch/arm64/kvm/at.c | 538 ++++++++++++++++++++++++++++++++++++++++++--
> > > > >  1 file changed, 520 insertions(+), 18 deletions(-)
> > > > > 
> > > > > diff --git a/arch/arm64/kvm/at.c b/arch/arm64/kvm/at.c
> > > > > index 71e3390b43b4c..8452273cbff6d 100644
> > > > > --- a/arch/arm64/kvm/at.c
> > > > > +++ b/arch/arm64/kvm/at.c
> > > > > @@ -4,9 +4,305 @@
> > > > >   * Author: Jintack Lim <jintack.lim@linaro.org>
> > > > >   */
> > > > >  
> > > > > +#include <linux/kvm_host.h>
> > > > > +
> > > > > +#include <asm/esr.h>
> > > > >  #include <asm/kvm_hyp.h>
> > > > >  #include <asm/kvm_mmu.h>
> > > > >  
> > > > > +struct s1_walk_info {
> > > > > +	u64	     baddr;
> > > > > +	unsigned int max_oa_bits;
> > > > > +	unsigned int pgshift;
> > > > > +	unsigned int txsz;
> > > > > +	int 	     sl;
> > > > > +	bool	     hpd;
> > > > > +	bool	     be;
> > > > > +	bool	     nvhe;
> > > > > +	bool	     s2;
> > > > > +};
> > > > > +
> > > > > +struct s1_walk_result {
> > > > > +	union {
> > > > > +		struct {
> > > > > +			u64	desc;
> > > > > +			u64	pa;
> > > > > +			s8	level;
> > > > > +			u8	APTable;
> > > > > +			bool	UXNTable;
> > > > > +			bool	PXNTable;
> > > > > +		};
> > > > > +		struct {
> > > > > +			u8	fst;
> > > > > +			bool	ptw;
> > > > > +			bool	s2;
> > > > > +		};
> > > > > +	};
> > > > > +	bool	failed;
> > > > > +};
> > > > > +
> > > > > +static void fail_s1_walk(struct s1_walk_result *wr, u8 fst, bool ptw, bool s2)
> > > > > +{
> > > > > +	wr->fst		= fst;
> > > > > +	wr->ptw		= ptw;
> > > > > +	wr->s2		= s2;
> > > > > +	wr->failed	= true;
> > > > > +}
> > > > > +
> > > > > +#define S1_MMU_DISABLED		(-127)
> > > > > +
> > > > > +static int setup_s1_walk(struct kvm_vcpu *vcpu, struct s1_walk_info *wi,
> > > > > +			 struct s1_walk_result *wr, const u64 va, const int el)
> > > > > +{
> > > > > +	u64 sctlr, tcr, tg, ps, ia_bits, ttbr;
> > > > > +	unsigned int stride, x;
> > > > > +	bool va55, tbi;
> > > > > +
> > > > > +	wi->nvhe = el == 2 && !vcpu_el2_e2h_is_set(vcpu);
> > > > 
> > > > Where 'el' is computed in handle_at_slow() as:
> > > > 
> > > > 	/*
> > > > 	 * We only get here from guest EL2, so the translation regime
> > > > 	 * AT applies to is solely defined by {E2H,TGE}.
> > > > 	 */
> > > > 	el = (vcpu_el2_e2h_is_set(vcpu) &&
> > > > 	      vcpu_el2_tge_is_set(vcpu)) ? 2 : 1;
> > > > 
> > > > I think 'nvhe' will always be false ('el' is 2 only when E2H is
> > > > set).
> > > 
> > > Yeah, there is a number of problems here. el should depend on both the
> > > instruction (some are EL2-specific) and the HCR control bits. I'll
> > > tackle that now.
> > 
> > Yeah, also noticed that how sctlr, tcr and ttbr are chosen in setup_s1_walk()
> > doesn't look quite right for the nvhe case.
> 
> Are you sure? Assuming the 'el' value is correct (and I think I fixed
> that on my local branch), they seem correct to me (we check for va55
> early in the function to avoid an later issue).
> 
> Can you point out what exactly fails in that logic?

I was trying to say that another consequence of el being 1 in the nvhe case was
that sctlr, tcr and ttbr were read from the EL1 variants of the registers,
instead of EL2. Sorry if that wasn't clear.

Thanks,
Alex

> 
> >
> > > 
> > > > I'm curious about what 'el' represents. The translation regime for the AT
> > > > instruction?
> > > 
> > > Exactly that.
> > 
> > Might I make a suggestion here? I was thinking about dropping the (el, wi->nvhe*)
> > tuple to represent the translation regime and have a wi->regime (or similar) to
> > unambiguously encode the regime. The value can be an enum with three values to
> > represent the three possible regimes (REGIME_EL10, REGIME_EL2, REGIME_EL20).
> 
> I've been thinking of that, but I'm wondering whether that just
> results in pretty awful code in the end, because we go from 2 cases
> (el==1 or el==2) to 3. But most of the time, we don't care about the
> E2H=0 case, because we can handle it just like E2H=1.
> 
> I'll give it a go and see what it looks like.
> 
> Thanks,
> 
> 	M.
> 
> -- 
> Without deviation from the norm, progress is not possible.


^ permalink raw reply	[flat|nested] 50+ messages in thread

* Re: [PATCH 00/12] KVM: arm64: nv: Add support for address translation instructions
  2024-07-31 10:05 ` [PATCH 00/12] KVM: arm64: nv: Add support for address translation instructions Alexandru Elisei
@ 2024-07-31 11:02   ` Marc Zyngier
  2024-07-31 14:19     ` Alexandru Elisei
  0 siblings, 1 reply; 50+ messages in thread
From: Marc Zyngier @ 2024-07-31 11:02 UTC (permalink / raw)
  To: Alexandru Elisei
  Cc: kvmarm, linux-arm-kernel, kvm, James Morse, Suzuki K Poulose,
	Oliver Upton, Zenghui Yu, Joey Gouly

On Wed, 31 Jul 2024 11:05:05 +0100,
Alexandru Elisei <alexandru.elisei@arm.com> wrote:
> 
> Hi Marc,
> 
> On Tue, Jun 25, 2024 at 02:34:59PM +0100, Marc Zyngier wrote:
> > Another task that a hypervisor supporting NV on arm64 has to deal with
> > is to emulate the AT instruction, because we multiplex all the S1
> > translations on a single set of registers, and the guest S2 is never
> > truly resident on the CPU.
> 
> I'm unfamiliar with the state of NV support in KVM, but I thought I would have a
> look at when AT trapping is enabled. As far as I can tell, it's only enabled in
> vhe/switch.c::__activate_traps() -> compute_hcr() if is_hyp_ctxt(vcpu). Found
> this by grep'ing for HCR_AT.
> 
> Assuming the above is correct, I am curious about the following:
> 
> - The above paragraph mentions guest's stage 2 (and the code takes that into
>   consideration), yet when is_hyp_ctxt() is true it is likely that the guest
>   stage 2 is not enabled. Are you planning to enable the AT trap based on
>   virtual HCR_EL2.VM being set in a later series?

I don't understand what you are referring to. AT traps and the guest's
HCR_EL2.VM are totally orthogonal, and are (or at least should be)
treated independently.

But more importantly, there are a bunch of cases where you have no
> other choice but trap, and that's what I allude to when I say "because
> we multiplex all the S1 translations on a single set of registers".

If I'm running the EL2 part of the guest, and that guest executes an
AT S1E1R while HCR_EL2.{E2H,TGE}={1,0}, it refers to the guest's EL1&0
translation regime. I can't let the guest execute it, because it would
walk its view of the EL2&0 regime. So we need to trap, evaluate what
the guest is trying to do, and do the walk in the correct context (by
using the instructions or the SW walk).

> 
> - A guest might also set the HCR_EL2.AT bit in the virtual HCR_EL2 register. I
>   suppose I have the same question, injecting the exception back into the guest
>   is going to be handled in another series?

This is already handled. The guest's HCR_EL2 is always folded into the
runtime configuration, and the resulting trap handled through the
existing trap routing infrastructure (see d0fc0a2519a6d, which added
the triaging of most traps resulting from HCR_EL2).
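
Roughly speaking, the idea is something like this (hand-wavy illustration
only, the real thing is spread between the switch code and the trap
routing tables):

	/*
	 * On a trapped AT while running L2: if the L1 guest asked for
	 * the trap itself (vHCR_EL2.AT set), forward the exception to
	 * virtual EL2 instead of emulating the walk here.
	 */
	if (!is_hyp_ctxt(vcpu) &&
	    (__vcpu_sys_reg(vcpu, HCR_EL2) & HCR_AT)) {
		kvm_inject_nested_sync(vcpu, kvm_vcpu_get_esr(vcpu));
		return;
	}

	/* otherwise, emulate the AT on behalf of the guest */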

Thanks,

	M.

-- 
Without deviation from the norm, progress is not possible.


^ permalink raw reply	[flat|nested] 50+ messages in thread

* Re: [PATCH 00/12] KVM: arm64: nv: Add support for address translation instructions
  2024-07-31 11:02   ` Marc Zyngier
@ 2024-07-31 14:19     ` Alexandru Elisei
  0 siblings, 0 replies; 50+ messages in thread
From: Alexandru Elisei @ 2024-07-31 14:19 UTC (permalink / raw)
  To: Marc Zyngier
  Cc: kvmarm, linux-arm-kernel, kvm, James Morse, Suzuki K Poulose,
	Oliver Upton, Zenghui Yu, Joey Gouly

Hi Marc,

On Wed, Jul 31, 2024 at 12:02:24PM +0100, Marc Zyngier wrote:
> On Wed, 31 Jul 2024 11:05:05 +0100,
> Alexandru Elisei <alexandru.elisei@arm.com> wrote:
> > 
> > Hi Marc,
> > 
> > On Tue, Jun 25, 2024 at 02:34:59PM +0100, Marc Zyngier wrote:
> > > Another task that a hypervisor supporting NV on arm64 has to deal with
> > > is to emulate the AT instruction, because we multiplex all the S1
> > > translations on a single set of registers, and the guest S2 is never
> > > truly resident on the CPU.
> > 
> > I'm unfamiliar with the state of NV support in KVM, but I thought I would have a
> > look at when AT trapping is enabled. As far as I can tell, it's only enabled in
> > vhe/switch.c::__activate_traps() -> compute_hcr() if is_hyp_ctxt(vcpu). Found
> > this by grep'ing for HCR_AT.
> > 
> > Assuming the above is correct, I am curious about the following:
> > 
> > - The above paragraph mentions guest's stage 2 (and the code takes that into
> >   consideration), yet when is_hyp_ctxt() is true it is likely that the guest
> >   stage 2 is not enabled. Are you planning to enable the AT trap based on
> >   virtual HCR_EL2.VM being set in a later series?
> 
> I don't understand what you are referring to. AT traps and the guest's
> HCR_EL2.VM are totally orthogonal, and are (or at least should be)
> treated independently.

I was referring to what happens when a guest is running at EL1 with virtual
stage 2 enabled and that guest performs an AT instruction. If the stage 1
translation tables are not mapped at virtual stage 2, then KVM should inject a
data abort in the guest hypervisor.

But after thinking about it some more, I guess that's not something that needs
AT trapping: if the stage 1 tables are not mapped in the physical stage 2
(because the level 1 hypervisor unmapped them from the virtual stage 2), then
KVM will get a data abort, and then inject that back into the guest hypervisor.

And as far as I can tell, KVM tracks IPAs becoming unmapped from virtual stage 2
by trapping TLBIs.

So everything looks correct to me, sorry for the noise.

> 
> But more importantly, there are a bunch of cases where you have no
> > other choice but trap, and that's what I allude to when I say "because
> > we multiplex all the S1 translations on a single set of registers".
> 
> If I'm running the EL2 part of the guest, and that guest executes an
> AT S1E1R while HCR_EL2.{E2H,TGE}={1,0}, it refers to the guest's EL1&0
> translation regime. I can't let the guest execute it, because it would
> walk its view of the EL2&0 regime. So we need to trap, evaluate what
> the guest is trying to do, and do the walk in the correct context (by
> using the instructions or the SW walk).

Yes, that looks correct to me.

> 
> > 
> > - A guest might also set the HCR_EL2.AT bit in the virtual HCR_EL2 register. I
> >   suppose I have the same question, injecting the exception back into the guest
> >   is going to be handled in another series?
> 
> This is already handled. The guest's HCR_EL2 is always folded into the
> runtime configuration, and the resulting trap handled through the
> existing trap routing infrastructure (see d0fc0a2519a6d, which added
> the triaging of most traps resulting from HCR_EL2).

That explains it then, thanks for digging out the commit id!

Alex


^ permalink raw reply	[flat|nested] 50+ messages in thread

* Re: [PATCH 10/12] KVM: arm64: nv: Add SW walker for AT S1 emulation
  2024-07-08 16:57 ` [PATCH 10/12] KVM: arm64: nv: Add SW walker for AT S1 emulation Marc Zyngier
                     ` (7 preceding siblings ...)
  2024-07-29 15:26   ` Alexandru Elisei
@ 2024-07-31 14:33   ` Alexandru Elisei
  2024-07-31 15:43     ` Marc Zyngier
  8 siblings, 1 reply; 50+ messages in thread
From: Alexandru Elisei @ 2024-07-31 14:33 UTC (permalink / raw)
  To: Marc Zyngier
  Cc: kvmarm, linux-arm-kernel, kvm, James Morse, Suzuki K Poulose,
	Oliver Upton, Zenghui Yu, Joey Gouly

Hi Marc,

On Mon, Jul 08, 2024 at 05:57:58PM +0100, Marc Zyngier wrote:
> In order to plug the brokenness of our current AT implementation,
> we need a SW walker that is going to... err.. walk the S1 tables
> and tell us what it finds.
> 
> Of course, it builds on top of our S2 walker, and shares similar
> concepts. The beauty of it is that since it uses kvm_read_guest(),
> it is able to bring back pages that have been otherwise evicted.
> 
> This is then plugged in the two AT S1 emulation functions as
> a "slow path" fallback. I'm not sure it is that slow, but hey.
> 
> Signed-off-by: Marc Zyngier <maz@kernel.org>
> ---
>  arch/arm64/kvm/at.c | 538 ++++++++++++++++++++++++++++++++++++++++++--
>  1 file changed, 520 insertions(+), 18 deletions(-)
> 
> diff --git a/arch/arm64/kvm/at.c b/arch/arm64/kvm/at.c
> index 71e3390b43b4c..8452273cbff6d 100644
> --- a/arch/arm64/kvm/at.c
> +++ b/arch/arm64/kvm/at.c
> @@ -4,9 +4,305 @@
>   * Author: Jintack Lim <jintack.lim@linaro.org>
>   */
>  
> +#include <linux/kvm_host.h>
> +
> +#include <asm/esr.h>
>  #include <asm/kvm_hyp.h>
>  #include <asm/kvm_mmu.h>
>  
> +struct s1_walk_info {
> +	u64	     baddr;
> +	unsigned int max_oa_bits;
> +	unsigned int pgshift;
> +	unsigned int txsz;
> +	int 	     sl;
> +	bool	     hpd;
> +	bool	     be;
> +	bool	     nvhe;
> +	bool	     s2;
> +};
> +
> +struct s1_walk_result {
> +	union {
> +		struct {
> +			u64	desc;
> +			u64	pa;
> +			s8	level;
> +			u8	APTable;
> +			bool	UXNTable;
> +			bool	PXNTable;
> +		};
> +		struct {
> +			u8	fst;
> +			bool	ptw;
> +			bool	s2;
> +		};
> +	};
> +	bool	failed;
> +};
> +
> +static void fail_s1_walk(struct s1_walk_result *wr, u8 fst, bool ptw, bool s2)
> +{
> +	wr->fst		= fst;
> +	wr->ptw		= ptw;
> +	wr->s2		= s2;
> +	wr->failed	= true;
> +}
> +
> +#define S1_MMU_DISABLED		(-127)
> +
> +static int setup_s1_walk(struct kvm_vcpu *vcpu, struct s1_walk_info *wi,
> +			 struct s1_walk_result *wr, const u64 va, const int el)
> +{
> +	u64 sctlr, tcr, tg, ps, ia_bits, ttbr;
> +	unsigned int stride, x;
> +	bool va55, tbi;
> +
> +	wi->nvhe = el == 2 && !vcpu_el2_e2h_is_set(vcpu);
> +
> +	va55 = va & BIT(55);
> +
> +	if (wi->nvhe && va55)
> +		goto addrsz;
> +
> +	wi->s2 = el < 2 && (__vcpu_sys_reg(vcpu, HCR_EL2) & HCR_VM);
> +
> +	switch (el) {
> +	case 1:
> +		sctlr	= vcpu_read_sys_reg(vcpu, SCTLR_EL1);
> +		tcr	= vcpu_read_sys_reg(vcpu, TCR_EL1);
> +		ttbr	= (va55 ?
> +			   vcpu_read_sys_reg(vcpu, TTBR1_EL1) :
> +			   vcpu_read_sys_reg(vcpu, TTBR0_EL1));
> +		break;
> +	case 2:
> +		sctlr	= vcpu_read_sys_reg(vcpu, SCTLR_EL2);
> +		tcr	= vcpu_read_sys_reg(vcpu, TCR_EL2);
> +		ttbr	= (va55 ?
> +			   vcpu_read_sys_reg(vcpu, TTBR1_EL2) :
> +			   vcpu_read_sys_reg(vcpu, TTBR0_EL2));
> +		break;
> +	default:
> +		BUG();
> +	}
> +
> +	/* Let's put the MMU disabled case aside immediately */
> +	if (!(sctlr & SCTLR_ELx_M) ||
> +	    (__vcpu_sys_reg(vcpu, HCR_EL2) & HCR_DC)) {
> +		if (va >= BIT(kvm_get_pa_bits(vcpu->kvm)))

As far as I can tell, if TBI, the pseudocode ignores bits 63:56 when checking
for out-of-bounds VA for the MMU disabled case (above) and the MMU enabled case
(below). That also matches the description of TBIx bits in the TCR_ELx
registers.
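
Something along these lines is what I have in mind (only a sketch of the
lower/TTBR0 half, names invented, and obviously untested):

	/* Ignore the tag bits for the range check when TBI is in effect */
	u64 cva = tbi ? (va & GENMASK_ULL(55, 0)) : va;

	if (cva >= BIT_ULL(kvm_get_pa_bits(vcpu->kvm)))
		goto addrsz;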

Thanks,
Alex

> +			goto addrsz;
> +
> +		wr->level = S1_MMU_DISABLED;
> +		wr->desc = va;
> +		return 0;
> +	}
> +
> +	wi->be = sctlr & SCTLR_ELx_EE;
> +
> +	wi->hpd  = kvm_has_feat(vcpu->kvm, ID_AA64MMFR1_EL1, HPDS, IMP);
> +	wi->hpd &= (wi->nvhe ?
> +		    FIELD_GET(TCR_EL2_HPD, tcr) :
> +		    (va55 ?
> +		     FIELD_GET(TCR_HPD1, tcr) :
> +		     FIELD_GET(TCR_HPD0, tcr)));
> +
> +	tbi = (wi->nvhe ?
> +	       FIELD_GET(TCR_EL2_TBI, tcr) :
> +	       (va55 ?
> +		FIELD_GET(TCR_TBI1, tcr) :
> +		FIELD_GET(TCR_TBI0, tcr)));
> +
> +	if (!tbi && sign_extend64(va, 55) != (s64)va)
> +		goto addrsz;
> +
> +	/* Someone was silly enough to encode TG0/TG1 differently */
> +	if (va55) {
> +		wi->txsz = FIELD_GET(TCR_T1SZ_MASK, tcr);
> +		tg = FIELD_GET(TCR_TG1_MASK, tcr);
> +
> +		switch (tg << TCR_TG1_SHIFT) {
> +		case TCR_TG1_4K:
> +			wi->pgshift = 12;	 break;
> +		case TCR_TG1_16K:
> +			wi->pgshift = 14;	 break;
> +		case TCR_TG1_64K:
> +		default:	    /* IMPDEF: treat any other value as 64k */
> +			wi->pgshift = 16;	 break;
> +		}
> +	} else {
> +		wi->txsz = FIELD_GET(TCR_T0SZ_MASK, tcr);
> +		tg = FIELD_GET(TCR_TG0_MASK, tcr);
> +
> +		switch (tg << TCR_TG0_SHIFT) {
> +		case TCR_TG0_4K:
> +			wi->pgshift = 12;	 break;
> +		case TCR_TG0_16K:
> +			wi->pgshift = 14;	 break;
> +		case TCR_TG0_64K:
> +		default:	    /* IMPDEF: treat any other value as 64k */
> +			wi->pgshift = 16;	 break;
> +		}
> +	}
> +
> +	ia_bits = 64 - wi->txsz;
> +
> +	/* AArch64.S1StartLevel() */
> +	stride = wi->pgshift - 3;
> +	wi->sl = 3 - (((ia_bits - 1) - wi->pgshift) / stride);
> +
> +	/* Check for SL mandating LPA2 (which we don't support yet) */
> +	switch (BIT(wi->pgshift)) {
> +	case SZ_4K:
> +		if (wi->sl == -1 &&
> +		    !kvm_has_feat(vcpu->kvm, ID_AA64MMFR0_EL1, TGRAN4, 52_BIT))
> +			goto addrsz;
> +		break;
> +	case SZ_16K:
> +		if (wi->sl == 0 &&
> +		    !kvm_has_feat(vcpu->kvm, ID_AA64MMFR0_EL1, TGRAN16, 52_BIT))
> +			goto addrsz;
> +		break;
> +	}
> +
> +	ps = (wi->nvhe ?
> +	      FIELD_GET(TCR_EL2_PS_MASK, tcr) : FIELD_GET(TCR_IPS_MASK, tcr));
> +
> +	wi->max_oa_bits = min(get_kvm_ipa_limit(), ps_to_output_size(ps));
> +
> +	/* Compute minimal alignment */
> +	x = 3 + ia_bits - ((3 - wi->sl) * stride + wi->pgshift);
> +
> +	wi->baddr = ttbr & TTBRx_EL1_BADDR;
> +	wi->baddr &= GENMASK_ULL(wi->max_oa_bits - 1, x);
> +
> +	return 0;
> +
> +addrsz:	/* Address Size Fault level 0 */
> +	fail_s1_walk(wr, ESR_ELx_FSC_ADDRSZ, false, false);
> +
> +	return -EFAULT;
> +}
> +
> +static int get_ia_size(struct s1_walk_info *wi)
> +{
> +	return 64 - wi->txsz;
> +}
> +
> +static int walk_s1(struct kvm_vcpu *vcpu, struct s1_walk_info *wi,
> +		   struct s1_walk_result *wr, u64 va)
> +{
> +	u64 va_top, va_bottom, baddr, desc;
> +	int level, stride, ret;
> +
> +	level = wi->sl;
> +	stride = wi->pgshift - 3;
> +	baddr = wi->baddr;
> +
> +	va_top = get_ia_size(wi) - 1;
> +
> +	while (1) {
> +		u64 index, ipa;
> +
> +		va_bottom = (3 - level) * stride + wi->pgshift;
> +		index = (va & GENMASK_ULL(va_top, va_bottom)) >> (va_bottom - 3);
> +
> +		ipa = baddr | index;
> +
> +		if (wi->s2) {
> +			struct kvm_s2_trans s2_trans = {};
> +
> +			ret = kvm_walk_nested_s2(vcpu, ipa, &s2_trans);
> +			if (ret) {
> +				fail_s1_walk(wr,
> +					     (s2_trans.esr & ~ESR_ELx_FSC_LEVEL) | level,
> +					     true, true);
> +				return ret;
> +			}
> +
> +			if (!kvm_s2_trans_readable(&s2_trans)) {
> +				fail_s1_walk(wr, ESR_ELx_FSC_PERM | level,
> +					     true, true);
> +
> +				return -EPERM;
> +			}
> +
> +			ipa = kvm_s2_trans_output(&s2_trans);
> +		}
> +
> +		ret = kvm_read_guest(vcpu->kvm, ipa, &desc, sizeof(desc));
> +		if (ret) {
> +			fail_s1_walk(wr, ESR_ELx_FSC_SEA_TTW(level),
> +				     true, false);
> +			return ret;
> +		}
> +
> +		if (wi->be)
> +			desc = be64_to_cpu((__force __be64)desc);
> +		else
> +			desc = le64_to_cpu((__force __le64)desc);
> +
> +		if (!(desc & 1) || ((desc & 3) == 1 && level == 3)) {
> +			fail_s1_walk(wr, ESR_ELx_FSC_FAULT | level,
> +				     true, false);
> +			return -ENOENT;
> +		}
> +
> +		/* We found a leaf, handle that */
> +		if ((desc & 3) == 1 || level == 3)
> +			break;
> +
> +		if (!wi->hpd) {
> +			wr->APTable  |= FIELD_GET(PMD_TABLE_AP, desc);
> +			wr->UXNTable |= FIELD_GET(PMD_TABLE_UXN, desc);
> +			wr->PXNTable |= FIELD_GET(PMD_TABLE_PXN, desc);
> +		}
> +
> +		baddr = GENMASK_ULL(47, wi->pgshift);
> +
> +		/* Check for out-of-range OA */
> +		if (wi->max_oa_bits < 48 &&
> +		    (baddr & GENMASK_ULL(47, wi->max_oa_bits))) {
> +			fail_s1_walk(wr, ESR_ELx_FSC_ADDRSZ | level,
> +				     true, false);
> +			return -EINVAL;
> +		}
> +
> +		/* Prepare for next round */
> +		va_top = va_bottom - 1;
> +		level++;
> +	}
> +
> +	/* Block mapping, check the validity of the level */
> +	if (!(desc & BIT(1))) {
> +		bool valid_block = false;
> +
> +		switch (BIT(wi->pgshift)) {
> +		case SZ_4K:
> +			valid_block = level == 1 || level == 2;
> +			break;
> +		case SZ_16K:
> +		case SZ_64K:
> +			valid_block = level == 2;
> +			break;
> +		}
> +
> +		if (!valid_block) {
> +			fail_s1_walk(wr, ESR_ELx_FSC_FAULT | level,
> +				     true, false);
> +			return -EINVAL;
> +		}
> +	}
> +
> +	wr->failed = false;
> +	wr->level = level;
> +	wr->desc = desc;
> +	wr->pa = desc & GENMASK(47, va_bottom);
> +	if (va_bottom > 12)
> +		wr->pa |= va & GENMASK_ULL(va_bottom - 1, 12);
> +
> +	return 0;
> +}
> +
>  struct mmu_config {
>  	u64	ttbr0;
>  	u64	ttbr1;
> @@ -234,6 +530,177 @@ static u64 compute_par_s12(struct kvm_vcpu *vcpu, u64 s1_par,
>  	return par;
>  }
>  
> +static u64 compute_par_s1(struct kvm_vcpu *vcpu, struct s1_walk_result *wr)
> +{
> +	u64 par;
> +
> +	if (wr->failed) {
> +		par = SYS_PAR_EL1_RES1;
> +		par |= SYS_PAR_EL1_F;
> +		par |= FIELD_PREP(SYS_PAR_EL1_FST, wr->fst);
> +		par |= wr->ptw ? SYS_PAR_EL1_PTW : 0;
> +		par |= wr->s2 ? SYS_PAR_EL1_S : 0;
> +	} else if (wr->level == S1_MMU_DISABLED) {
> +		/* MMU off or HCR_EL2.DC == 1 */
> +		par = wr->pa & GENMASK_ULL(47, 12);
> +
> +		if (!(__vcpu_sys_reg(vcpu, HCR_EL2) & HCR_DC)) {
> +			par |= FIELD_PREP(SYS_PAR_EL1_ATTR, 0); /* nGnRnE */
> +			par |= FIELD_PREP(SYS_PAR_EL1_SH, 0b10); /* OS */
> +		} else {
> +			par |= FIELD_PREP(SYS_PAR_EL1_ATTR,
> +					  MEMATTR(WbRaWa, WbRaWa));
> +			par |= FIELD_PREP(SYS_PAR_EL1_SH, 0b00); /* NS */
> +		}
> +	} else {
> +		u64 mair, sctlr;
> +		int el;
> +		u8 sh;
> +
> +		el = (vcpu_el2_e2h_is_set(vcpu) &&
> +		      vcpu_el2_tge_is_set(vcpu)) ? 2 : 1;
> +
> +		mair = ((el == 2) ?
> +			vcpu_read_sys_reg(vcpu, MAIR_EL2) :
> +			vcpu_read_sys_reg(vcpu, MAIR_EL1));
> +
> +		mair >>= FIELD_GET(PTE_ATTRINDX_MASK, wr->desc) * 8;
> +		mair &= 0xff;
> +
> +		sctlr = ((el == 2) ?
> +			vcpu_read_sys_reg(vcpu, SCTLR_EL2) :
> +			vcpu_read_sys_reg(vcpu, SCTLR_EL1));
> +
> +		/* Force NC for memory if SCTLR_ELx.C is clear */
> +		if (!(sctlr & SCTLR_EL1_C) && !MEMATTR_IS_DEVICE(mair))
> +			mair = MEMATTR(NC, NC);
> +
> +		par  = FIELD_PREP(SYS_PAR_EL1_ATTR, mair);
> +		par |= wr->pa & GENMASK_ULL(47, 12);
> +
> +		sh = compute_sh(mair, wr->desc);
> +		par |= FIELD_PREP(SYS_PAR_EL1_SH, sh);
> +	}
> +
> +	return par;
> +}
> +
> +static u64 handle_at_slow(struct kvm_vcpu *vcpu, u32 op, u64 vaddr)
> +{
> +	bool perm_fail, ur, uw, ux, pr, pw, pan;
> +	struct s1_walk_result wr = {};
> +	struct s1_walk_info wi = {};
> +	int ret, idx, el;
> +
> +	/*
> +	 * We only get here from guest EL2, so the translation regime
> +	 * AT applies to is solely defined by {E2H,TGE}.
> +	 */
> +	el = (vcpu_el2_e2h_is_set(vcpu) &&
> +	      vcpu_el2_tge_is_set(vcpu)) ? 2 : 1;
> +
> +	ret = setup_s1_walk(vcpu, &wi, &wr, vaddr, el);
> +	if (ret)
> +		goto compute_par;
> +
> +	if (wr.level == S1_MMU_DISABLED)
> +		goto compute_par;
> +
> +	idx = srcu_read_lock(&vcpu->kvm->srcu);
> +
> +	ret = walk_s1(vcpu, &wi, &wr, vaddr);
> +
> +	srcu_read_unlock(&vcpu->kvm->srcu, idx);
> +
> +	if (ret)
> +		goto compute_par;
> +
> +	/* FIXME: revisit when adding indirect permission support */
> +	if (kvm_has_feat(vcpu->kvm, ID_AA64MMFR1_EL1, PAN, PAN3) &&
> +	    !wi.nvhe) {
> +		u64 sctlr;
> +
> +		if (el == 1)
> +			sctlr = vcpu_read_sys_reg(vcpu, SCTLR_EL1);
> +		else
> +			sctlr = vcpu_read_sys_reg(vcpu, SCTLR_EL2);
> +
> +		ux = (sctlr & SCTLR_EL1_EPAN) && !(wr.desc & PTE_UXN);
> +	} else {
> +		ux = false;
> +	}
> +
> +	pw = !(wr.desc & PTE_RDONLY);
> +
> +	if (wi.nvhe) {
> +		ur = uw = false;
> +		pr = true;
> +	} else {
> +		if (wr.desc & PTE_USER) {
> +			ur = pr = true;
> +			uw = pw;
> +		} else {
> +			ur = uw = false;
> +			pr = true;
> +		}
> +	}
> +
> +	/* Apply the Hierarchical Permission madness */
> +	if (wi.nvhe) {
> +		wr.APTable &= BIT(1);
> +		wr.PXNTable = wr.UXNTable;
> +	}
> +
> +	ur &= !(wr.APTable & BIT(0));
> +	uw &= !(wr.APTable != 0);
> +	ux &= !wr.UXNTable;
> +
> +	pw &= !(wr.APTable & BIT(1));
> +
> +	pan = *vcpu_cpsr(vcpu) & PSR_PAN_BIT;
> +
> +	perm_fail = false;
> +
> +	switch (op) {
> +	case OP_AT_S1E1RP:
> +		perm_fail |= pan && (ur || uw || ux);
> +		fallthrough;
> +	case OP_AT_S1E1R:
> +	case OP_AT_S1E2R:
> +		perm_fail |= !pr;
> +		break;
> +	case OP_AT_S1E1WP:
> +		perm_fail |= pan && (ur || uw || ux);
> +		fallthrough;
> +	case OP_AT_S1E1W:
> +	case OP_AT_S1E2W:
> +		perm_fail |= !pw;
> +		break;
> +	case OP_AT_S1E0R:
> +		perm_fail |= !ur;
> +		break;
> +	case OP_AT_S1E0W:
> +		perm_fail |= !uw;
> +		break;
> +	default:
> +		BUG();
> +	}
> +
> +	if (perm_fail) {
> +		struct s1_walk_result tmp;
> +
> +		tmp.failed = true;
> +		tmp.fst = ESR_ELx_FSC_PERM | wr.level;
> +		tmp.s2 = false;
> +		tmp.ptw = false;
> +
> +		wr = tmp;
> +	}
> +
> +compute_par:
> +	return compute_par_s1(vcpu, &wr);
> +}
> +
>  static bool check_at_pan(struct kvm_vcpu *vcpu, u64 vaddr, u64 *res)
>  {
>  	u64 par_e0;
> @@ -266,9 +733,11 @@ void __kvm_at_s1e01(struct kvm_vcpu *vcpu, u32 op, u64 vaddr)
>  	struct mmu_config config;
>  	struct kvm_s2_mmu *mmu;
>  	unsigned long flags;
> -	bool fail;
> +	bool fail, retry_slow;
>  	u64 par;
>  
> +	retry_slow = false;
> +
>  	write_lock(&vcpu->kvm->mmu_lock);
>  
>  	/*
> @@ -288,14 +757,15 @@ void __kvm_at_s1e01(struct kvm_vcpu *vcpu, u32 op, u64 vaddr)
>  		goto skip_mmu_switch;
>  
>  	/*
> -	 * FIXME: Obtaining the S2 MMU for a L2 is horribly racy, and
> -	 * we may not find it (recycled by another vcpu, for example).
> -	 * See the other FIXME comment below about the need for a SW
> -	 * PTW in this case.
> +	 * Obtaining the S2 MMU for a L2 is horribly racy, and we may not
> +	 * find it (recycled by another vcpu, for example). When this
> +	 * happens, use the SW (slow) path.
>  	 */
>  	mmu = lookup_s2_mmu(vcpu);
> -	if (WARN_ON(!mmu))
> +	if (!mmu) {
> +		retry_slow = true;
>  		goto out;
> +	}
>  
>  	write_sysreg_el1(vcpu_read_sys_reg(vcpu, TTBR0_EL1),	SYS_TTBR0);
>  	write_sysreg_el1(vcpu_read_sys_reg(vcpu, TTBR1_EL1),	SYS_TTBR1);
> @@ -331,18 +801,17 @@ void __kvm_at_s1e01(struct kvm_vcpu *vcpu, u32 op, u64 vaddr)
>  	}
>  
>  	if (!fail)
> -		par = read_sysreg(par_el1);
> +		par = read_sysreg_par();
>  	else
>  		par = SYS_PAR_EL1_F;
>  
> +	retry_slow = !fail;
> +
>  	vcpu_write_sys_reg(vcpu, par, PAR_EL1);
>  
>  	/*
> -	 * Failed? let's leave the building now.
> -	 *
> -	 * FIXME: how about a failed translation because the shadow S2
> -	 * wasn't populated? We may need to perform a SW PTW,
> -	 * populating our shadow S2 and retry the instruction.
> +	 * Failed? let's leave the building now, unless we retry on
> +	 * the slow path.
>  	 */
>  	if (par & SYS_PAR_EL1_F)
>  		goto nopan;
> @@ -354,29 +823,58 @@ void __kvm_at_s1e01(struct kvm_vcpu *vcpu, u32 op, u64 vaddr)
>  	switch (op) {
>  	case OP_AT_S1E1RP:
>  	case OP_AT_S1E1WP:
> +		retry_slow = false;
>  		fail = check_at_pan(vcpu, vaddr, &par);
>  		break;
>  	default:
>  		goto nopan;
>  	}
>  
> +	if (fail) {
> +		vcpu_write_sys_reg(vcpu, SYS_PAR_EL1_F, PAR_EL1);
> +		goto nopan;
> +	}
> +
>  	/*
>  	 * If the EL0 translation has succeeded, we need to pretend
>  	 * the AT operation has failed, as the PAN setting forbids
>  	 * such a translation.
> -	 *
> -	 * FIXME: we hardcode a Level-3 permission fault. We really
> -	 * should return the real fault level.
>  	 */
> -	if (fail || !(par & SYS_PAR_EL1_F))
> -		vcpu_write_sys_reg(vcpu, (0xf << 1) | SYS_PAR_EL1_F, PAR_EL1);
> -
> +	if (par & SYS_PAR_EL1_F) {
> +		u8 fst = FIELD_GET(SYS_PAR_EL1_FST, par);
> +
> +		/*
> +		 * If we get something other than a permission fault, we
> +		 * need to retry, as we're likely to have missed in the PTs.
> +		 */
> +		if ((fst & ESR_ELx_FSC_TYPE) != ESR_ELx_FSC_PERM)
> +			retry_slow = true;
> +	} else {
> +		/*
> +		 * The EL0 access succeeded, but we don't have the full
> +		 * syndrome information to synthesize the failure. Go slow.
> +		 */
> +		retry_slow = true;
> +	}
>  nopan:
>  	__mmu_config_restore(&config);
>  out:
>  	local_irq_restore(flags);
>  
>  	write_unlock(&vcpu->kvm->mmu_lock);
> +
> +	/*
> +	 * If retry_slow is true, then we either are missing shadow S2
> +	 * entries, have paged out guest S1, or something is inconsistent.
> +	 *
> +	 * Either way, we need to walk the PTs by hand so that we can either
> +	 * fault things back in, or record accurate fault information along
> +	 * the way.
> +	 */
> +	if (retry_slow) {
> +		par = handle_at_slow(vcpu, op, vaddr);
> +		vcpu_write_sys_reg(vcpu, par, PAR_EL1);
> +	}
>  }
>  
>  void __kvm_at_s1e2(struct kvm_vcpu *vcpu, u32 op, u64 vaddr)
> @@ -433,6 +931,10 @@ void __kvm_at_s1e2(struct kvm_vcpu *vcpu, u32 op, u64 vaddr)
>  
>  	write_unlock(&vcpu->kvm->mmu_lock);
>  
> +	/* We failed the translation, let's replay it in slow motion */
> +	if (!fail && (par & SYS_PAR_EL1_F))
> +		par = handle_at_slow(vcpu, op, vaddr);
> +
>  	vcpu_write_sys_reg(vcpu, par, PAR_EL1);
>  }
>  
> -- 
> 2.39.2
> 
> 


^ permalink raw reply	[flat|nested] 50+ messages in thread

* Re: [PATCH 10/12] KVM: arm64: nv: Add SW walker for AT S1 emulation
  2024-07-31 14:33   ` Alexandru Elisei
@ 2024-07-31 15:43     ` Marc Zyngier
  2024-07-31 16:05       ` Alexandru Elisei
  0 siblings, 1 reply; 50+ messages in thread
From: Marc Zyngier @ 2024-07-31 15:43 UTC (permalink / raw)
  To: Alexandru Elisei
  Cc: kvmarm, linux-arm-kernel, kvm, James Morse, Suzuki K Poulose,
	Oliver Upton, Zenghui Yu, Joey Gouly

On Wed, 31 Jul 2024 15:33:25 +0100,
Alexandru Elisei <alexandru.elisei@arm.com> wrote:
> 
> Hi Marc,
> 
> On Mon, Jul 08, 2024 at 05:57:58PM +0100, Marc Zyngier wrote:
> > In order to plug the brokenness of our current AT implementation,
> > we need a SW walker that is going to... err.. walk the S1 tables
> > and tell us what it finds.
> > 
> > Of course, it builds on top of our S2 walker, and shares similar
> > concepts. The beauty of it is that since it uses kvm_read_guest(),
> > it is able to bring back pages that have been otherwise evicted.
> > 
> > This is then plugged in the two AT S1 emulation functions as
> > a "slow path" fallback. I'm not sure it is that slow, but hey.
> > 
> > Signed-off-by: Marc Zyngier <maz@kernel.org>
> > ---
> >  arch/arm64/kvm/at.c | 538 ++++++++++++++++++++++++++++++++++++++++++--
> >  1 file changed, 520 insertions(+), 18 deletions(-)
> > 
> > diff --git a/arch/arm64/kvm/at.c b/arch/arm64/kvm/at.c
> > index 71e3390b43b4c..8452273cbff6d 100644
> > --- a/arch/arm64/kvm/at.c
> > +++ b/arch/arm64/kvm/at.c
> > @@ -4,9 +4,305 @@
> >   * Author: Jintack Lim <jintack.lim@linaro.org>
> >   */
> >  
> > +#include <linux/kvm_host.h>
> > +
> > +#include <asm/esr.h>
> >  #include <asm/kvm_hyp.h>
> >  #include <asm/kvm_mmu.h>
> >  
> > +struct s1_walk_info {
> > +	u64	     baddr;
> > +	unsigned int max_oa_bits;
> > +	unsigned int pgshift;
> > +	unsigned int txsz;
> > +	int 	     sl;
> > +	bool	     hpd;
> > +	bool	     be;
> > +	bool	     nvhe;
> > +	bool	     s2;
> > +};
> > +
> > +struct s1_walk_result {
> > +	union {
> > +		struct {
> > +			u64	desc;
> > +			u64	pa;
> > +			s8	level;
> > +			u8	APTable;
> > +			bool	UXNTable;
> > +			bool	PXNTable;
> > +		};
> > +		struct {
> > +			u8	fst;
> > +			bool	ptw;
> > +			bool	s2;
> > +		};
> > +	};
> > +	bool	failed;
> > +};
> > +
> > +static void fail_s1_walk(struct s1_walk_result *wr, u8 fst, bool ptw, bool s2)
> > +{
> > +	wr->fst		= fst;
> > +	wr->ptw		= ptw;
> > +	wr->s2		= s2;
> > +	wr->failed	= true;
> > +}
> > +
> > +#define S1_MMU_DISABLED		(-127)
> > +
> > +static int setup_s1_walk(struct kvm_vcpu *vcpu, struct s1_walk_info *wi,
> > +			 struct s1_walk_result *wr, const u64 va, const int el)
> > +{
> > +	u64 sctlr, tcr, tg, ps, ia_bits, ttbr;
> > +	unsigned int stride, x;
> > +	bool va55, tbi;
> > +
> > +	wi->nvhe = el == 2 && !vcpu_el2_e2h_is_set(vcpu);
> > +
> > +	va55 = va & BIT(55);
> > +
> > +	if (wi->nvhe && va55)
> > +		goto addrsz;
> > +
> > +	wi->s2 = el < 2 && (__vcpu_sys_reg(vcpu, HCR_EL2) & HCR_VM);
> > +
> > +	switch (el) {
> > +	case 1:
> > +		sctlr	= vcpu_read_sys_reg(vcpu, SCTLR_EL1);
> > +		tcr	= vcpu_read_sys_reg(vcpu, TCR_EL1);
> > +		ttbr	= (va55 ?
> > +			   vcpu_read_sys_reg(vcpu, TTBR1_EL1) :
> > +			   vcpu_read_sys_reg(vcpu, TTBR0_EL1));
> > +		break;
> > +	case 2:
> > +		sctlr	= vcpu_read_sys_reg(vcpu, SCTLR_EL2);
> > +		tcr	= vcpu_read_sys_reg(vcpu, TCR_EL2);
> > +		ttbr	= (va55 ?
> > +			   vcpu_read_sys_reg(vcpu, TTBR1_EL2) :
> > +			   vcpu_read_sys_reg(vcpu, TTBR0_EL2));
> > +		break;
> > +	default:
> > +		BUG();
> > +	}
> > +
> > +	/* Let's put the MMU disabled case aside immediately */
> > +	if (!(sctlr & SCTLR_ELx_M) ||
> > +	    (__vcpu_sys_reg(vcpu, HCR_EL2) & HCR_DC)) {
> > +		if (va >= BIT(kvm_get_pa_bits(vcpu->kvm)))
> 
> As far as I can tell, if TBI, the pseudocode ignores bits 63:56 when checking
> for out-of-bounds VA for the MMU disabled case (above) and the MMU enabled case
> (below). That also matches the description of TBIx bits in the TCR_ELx
> registers.

Right. Then the check needs to be hoisted up and the VA sanitised
before we compare it to anything.
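
Roughly something like this, hoisted before the MMU-disabled early-out
(completely untested, only reusing the names this version of the patch
already has, and it also means dropping the const from the va parameter
-- the posted fix may well end up looking different):

	tbi = (wi->nvhe ?
	       FIELD_GET(TCR_EL2_TBI, tcr) :
	       (va55 ?
		FIELD_GET(TCR_TBI1, tcr) :
		FIELD_GET(TCR_TBI0, tcr)));

	if (!tbi && (u64)sign_extend64(va, 55) != va)
		goto addrsz;

	/* From here on, only bits [55:0] of the VA matter */
	va = (u64)sign_extend64(va, 55);

so that the out-of-range checks, MMU on or off, only ever see a
sanitised VA.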

Thanks for all your review comments, but I am going to ask you to stop
here. You are reviewing a pretty old code base, and although I'm sure
you look at what is in my tree, I'd really like to post a new version
for everyone to enjoy.

I'll stash that last change on top and post the result.

	M.

-- 
Without deviation from the norm, progress is not possible.


^ permalink raw reply	[flat|nested] 50+ messages in thread

* Re: [PATCH 10/12] KVM: arm64: nv: Add SW walker for AT S1 emulation
  2024-07-31 15:43     ` Marc Zyngier
@ 2024-07-31 16:05       ` Alexandru Elisei
  0 siblings, 0 replies; 50+ messages in thread
From: Alexandru Elisei @ 2024-07-31 16:05 UTC (permalink / raw)
  To: Marc Zyngier
  Cc: kvmarm, linux-arm-kernel, kvm, James Morse, Suzuki K Poulose,
	Oliver Upton, Zenghui Yu, Joey Gouly

Hi Marc,

On Wed, Jul 31, 2024 at 04:43:16PM +0100, Marc Zyngier wrote:
> On Wed, 31 Jul 2024 15:33:25 +0100,
> Alexandru Elisei <alexandru.elisei@arm.com> wrote:
> > 
> > Hi Marc,
> > 
> > On Mon, Jul 08, 2024 at 05:57:58PM +0100, Marc Zyngier wrote:
> > > In order to plug the brokenness of our current AT implementation,
> > > we need a SW walker that is going to... err.. walk the S1 tables
> > > and tell us what it finds.
> > > 
> > > Of course, it builds on top of our S2 walker, and shares similar
> > > concepts. The beauty of it is that since it uses kvm_read_guest(),
> > > it is able to bring back pages that have been otherwise evicted.
> > > 
> > > This is then plugged in the two AT S1 emulation functions as
> > > a "slow path" fallback. I'm not sure it is that slow, but hey.
> > > 
> > > Signed-off-by: Marc Zyngier <maz@kernel.org>
> > > ---
> > >  arch/arm64/kvm/at.c | 538 ++++++++++++++++++++++++++++++++++++++++++--
> > >  1 file changed, 520 insertions(+), 18 deletions(-)
> > > 
> > > diff --git a/arch/arm64/kvm/at.c b/arch/arm64/kvm/at.c
> > > index 71e3390b43b4c..8452273cbff6d 100644
> > > --- a/arch/arm64/kvm/at.c
> > > +++ b/arch/arm64/kvm/at.c
> > > @@ -4,9 +4,305 @@
> > >   * Author: Jintack Lim <jintack.lim@linaro.org>
> > >   */
> > >  
> > > +#include <linux/kvm_host.h>
> > > +
> > > +#include <asm/esr.h>
> > >  #include <asm/kvm_hyp.h>
> > >  #include <asm/kvm_mmu.h>
> > >  
> > > +struct s1_walk_info {
> > > +	u64	     baddr;
> > > +	unsigned int max_oa_bits;
> > > +	unsigned int pgshift;
> > > +	unsigned int txsz;
> > > +	int 	     sl;
> > > +	bool	     hpd;
> > > +	bool	     be;
> > > +	bool	     nvhe;
> > > +	bool	     s2;
> > > +};
> > > +
> > > +struct s1_walk_result {
> > > +	union {
> > > +		struct {
> > > +			u64	desc;
> > > +			u64	pa;
> > > +			s8	level;
> > > +			u8	APTable;
> > > +			bool	UXNTable;
> > > +			bool	PXNTable;
> > > +		};
> > > +		struct {
> > > +			u8	fst;
> > > +			bool	ptw;
> > > +			bool	s2;
> > > +		};
> > > +	};
> > > +	bool	failed;
> > > +};
> > > +
> > > +static void fail_s1_walk(struct s1_walk_result *wr, u8 fst, bool ptw, bool s2)
> > > +{
> > > +	wr->fst		= fst;
> > > +	wr->ptw		= ptw;
> > > +	wr->s2		= s2;
> > > +	wr->failed	= true;
> > > +}
> > > +
> > > +#define S1_MMU_DISABLED		(-127)
> > > +
> > > +static int setup_s1_walk(struct kvm_vcpu *vcpu, struct s1_walk_info *wi,
> > > +			 struct s1_walk_result *wr, const u64 va, const int el)
> > > +{
> > > +	u64 sctlr, tcr, tg, ps, ia_bits, ttbr;
> > > +	unsigned int stride, x;
> > > +	bool va55, tbi;
> > > +
> > > +	wi->nvhe = el == 2 && !vcpu_el2_e2h_is_set(vcpu);
> > > +
> > > +	va55 = va & BIT(55);
> > > +
> > > +	if (wi->nvhe && va55)
> > > +		goto addrsz;
> > > +
> > > +	wi->s2 = el < 2 && (__vcpu_sys_reg(vcpu, HCR_EL2) & HCR_VM);
> > > +
> > > +	switch (el) {
> > > +	case 1:
> > > +		sctlr	= vcpu_read_sys_reg(vcpu, SCTLR_EL1);
> > > +		tcr	= vcpu_read_sys_reg(vcpu, TCR_EL1);
> > > +		ttbr	= (va55 ?
> > > +			   vcpu_read_sys_reg(vcpu, TTBR1_EL1) :
> > > +			   vcpu_read_sys_reg(vcpu, TTBR0_EL1));
> > > +		break;
> > > +	case 2:
> > > +		sctlr	= vcpu_read_sys_reg(vcpu, SCTLR_EL2);
> > > +		tcr	= vcpu_read_sys_reg(vcpu, TCR_EL2);
> > > +		ttbr	= (va55 ?
> > > +			   vcpu_read_sys_reg(vcpu, TTBR1_EL2) :
> > > +			   vcpu_read_sys_reg(vcpu, TTBR0_EL2));
> > > +		break;
> > > +	default:
> > > +		BUG();
> > > +	}
> > > +
> > > +	/* Let's put the MMU disabled case aside immediately */
> > > +	if (!(sctlr & SCTLR_ELx_M) ||
> > > +	    (__vcpu_sys_reg(vcpu, HCR_EL2) & HCR_DC)) {
> > > +		if (va >= BIT(kvm_get_pa_bits(vcpu->kvm)))
> > 
> > As far as I can tell, if TBI, the pseudocode ignores bits 63:56 when checking
> > for out-of-bounds VA for the MMU disabled case (above) and the MMU enabled case
> > (below). That also matches the description of TBIx bits in the TCR_ELx
> > registers.
> 
> Right. Then the check needs to be hoisted up and the VA sanitised
> before we compare it to anything.
> 
> Thanks for all your review comments, but I am going to ask you to stop
> here. You are reviewing a pretty old code base, and although I'm sure
> you look at what is in my tree, I'd really like to post a new version
> for everyone to enjoy.

Got it.

Thanks,
Alex


^ permalink raw reply	[flat|nested] 50+ messages in thread

Thread overview: 50+ messages
2024-06-25 13:34 [PATCH 00/12] KVM: arm64: nv: Add support for address translation instructions Marc Zyngier
2024-06-25 13:35 ` [PATCH 01/12] arm64: Add missing APTable and TCR_ELx.HPD masks Marc Zyngier
2024-07-12  8:32   ` Anshuman Khandual
2024-07-13  8:04     ` Marc Zyngier
2024-06-25 13:35 ` [PATCH 02/12] arm64: Add PAR_EL1 field description Marc Zyngier
2024-07-12  7:06   ` Anshuman Khandual
2024-07-13  7:56     ` Marc Zyngier
2024-06-25 13:35 ` [PATCH 03/12] KVM: arm64: nv: Turn upper_attr for S2 walk into the full descriptor Marc Zyngier
2024-06-25 13:35 ` [PATCH 04/12] KVM: arm64: nv: Honor absence of FEAT_PAN2 Marc Zyngier
2024-07-12  8:40   ` Anshuman Khandual
2024-06-25 13:35 ` [PATCH 05/12] KVM: arm64: make kvm_at() take an OP_AT_* Marc Zyngier
2024-07-12  8:52   ` Anshuman Khandual
2024-06-25 13:35 ` [PATCH 06/12] KVM: arm64: nv: Add basic emulation of AT S1E{0,1}{R,W}[P] Marc Zyngier
2024-06-25 13:35 ` [PATCH 07/12] KVM: arm64: nv: Add basic emulation of AT S1E2{R,W} Marc Zyngier
2024-06-25 13:35 ` [PATCH 08/12] KVM: arm64: nv: Add emulation of AT S12E{0,1}{R,W} Marc Zyngier
2024-07-18 15:10   ` Alexandru Elisei
2024-07-20  9:49     ` Marc Zyngier
2024-07-22 10:33       ` Alexandru Elisei
2024-06-25 13:35 ` [PATCH 09/12] KVM: arm64: nv: Make ps_to_output_size() generally available Marc Zyngier
2024-07-08 16:28 ` [PATCH 00/12] KVM: arm64: nv: Add support for address translation instructions Alexandru Elisei
2024-07-08 17:00   ` Marc Zyngier
2024-07-08 16:57 ` [PATCH 10/12] KVM: arm64: nv: Add SW walker for AT S1 emulation Marc Zyngier
2024-07-08 16:57   ` [PATCH 11/12] KVM: arm64: nv: Plumb handling of AT S1* traps from EL2 Marc Zyngier
2024-07-08 16:58   ` [PATCH 12/12] KVM: arm64: nv: Add support for FEAT_ATS1A Marc Zyngier
2024-07-10 15:12   ` [PATCH 10/12] KVM: arm64: nv: Add SW walker for AT S1 emulation Alexandru Elisei
2024-07-11  8:05     ` Marc Zyngier
2024-07-11 10:56   ` Alexandru Elisei
2024-07-11 12:16     ` Marc Zyngier
2024-07-15 15:30       ` Alexandru Elisei
2024-07-18 11:37         ` Marc Zyngier
2024-07-18 15:16   ` Alexandru Elisei
2024-07-20 13:49     ` Marc Zyngier
2024-07-22 10:53   ` Alexandru Elisei
2024-07-22 15:25     ` Marc Zyngier
2024-07-23  8:57       ` Alexandru Elisei
2024-07-25 14:16   ` Alexandru Elisei
2024-07-25 14:30     ` Marc Zyngier
2024-07-25 15:13       ` Alexandru Elisei
2024-07-25 15:33         ` Marc Zyngier
2024-07-29 15:26   ` Alexandru Elisei
2024-07-31  8:55     ` Marc Zyngier
2024-07-31  9:53       ` Alexandru Elisei
2024-07-31 10:18         ` Marc Zyngier
2024-07-31 10:28           ` Alexandru Elisei
2024-07-31 14:33   ` Alexandru Elisei
2024-07-31 15:43     ` Marc Zyngier
2024-07-31 16:05       ` Alexandru Elisei
2024-07-31 10:05 ` [PATCH 00/12] KVM: arm64: nv: Add support for address translation instructions Alexandru Elisei
2024-07-31 11:02   ` Marc Zyngier
2024-07-31 14:19     ` Alexandru Elisei
