linuxppc-dev.lists.ozlabs.org archive mirror
* Re: [PATCH v2 02/33] KVM: PPC: Book3S HV: Remove left-over code in XICS-on-XIVE emulation
       [not found] ` <1538127963-15645-3-git-send-email-paulus@ozlabs.org>
@ 2018-10-02  4:49   ` David Gibson
  0 siblings, 0 replies; 8+ messages in thread
From: David Gibson @ 2018-10-02  4:49 UTC (permalink / raw)
  To: Paul Mackerras; +Cc: linuxppc-dev, kvm-ppc, kvm

On Fri, Sep 28, 2018 at 07:45:32PM +1000, Paul Mackerras wrote:
> This removes code that clears the external interrupt pending bit in
> the pending_exceptions bitmap.  This is left over from an earlier
> iteration of the code where this bit was set when an escalation
> interrupt arrived in order to wake the vcpu from cede.  Currently
> we set the vcpu->arch.irq_pending flag instead for this purpose.
> Therefore there is no need to do anything with the pending_exceptions
> bitmap.
> 
> Signed-off-by: Paul Mackerras <paulus@ozlabs.org>

Reviewed-by: David Gibson <david@gibson.dropbear.id.au>
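
For readers coming to this later: the replacement mechanism the commit
message refers to is the XIVE escalation interrupt handler, which sets
the irq_pending flag directly.  Paraphrasing book3s_xive.c (a sketch,
not the literal code):

	static irqreturn_t xive_esc_irq(int irq, void *data)
	{
		struct kvm_vcpu *vcpu = data;

		vcpu->arch.irq_pending = 1;	/* flag used for wakeup now */
		smp_mb();
		if (vcpu->arch.ceded)
			kvmppc_fast_vcpu_kick(vcpu);
		return IRQ_HANDLED;
	}

Nothing sets BOOK3S_IRQPRIO_EXTERNAL in pending_exceptions for this
purpose any more, so the bit-clearing removed below is indeed dead.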

> ---
>  arch/powerpc/kvm/book3s_xive_template.c | 8 --------
>  1 file changed, 8 deletions(-)
> 
> diff --git a/arch/powerpc/kvm/book3s_xive_template.c b/arch/powerpc/kvm/book3s_xive_template.c
> index 203ea65..033363d 100644
> --- a/arch/powerpc/kvm/book3s_xive_template.c
> +++ b/arch/powerpc/kvm/book3s_xive_template.c
> @@ -280,14 +280,6 @@ X_STATIC unsigned long GLUE(X_PFX,h_xirr)(struct kvm_vcpu *vcpu)
>  	/* First collect pending bits from HW */
>  	GLUE(X_PFX,ack_pending)(xc);
>  
> -	/*
> -	 * Cleanup the old-style bits if needed (they may have been
> -	 * set by pull or an escalation interrupts).
> -	 */
> -	if (test_bit(BOOK3S_IRQPRIO_EXTERNAL, &vcpu->arch.pending_exceptions))
> -		clear_bit(BOOK3S_IRQPRIO_EXTERNAL,
> -			  &vcpu->arch.pending_exceptions);
> -
>  	pr_devel(" new pending=0x%02x hw_cppr=%d cppr=%d\n",
>  		 xc->pending, xc->hw_cppr, xc->cppr);
>  

-- 
David Gibson			| I'll have my music baroque, and my code
david AT gibson.dropbear.id.au	| minimalist, thank you.  NOT _the_ _other_
				| _way_ _around_!
http://www.ozlabs.org/~dgibson


* Re: [PATCH v2 06/33] KVM: PPC: Book3S: Rework TM save/restore code and make it C-callable
       [not found] ` <1538127963-15645-7-git-send-email-paulus@ozlabs.org>
@ 2018-10-02  5:15   ` David Gibson
  0 siblings, 0 replies; 8+ messages in thread
From: David Gibson @ 2018-10-02  5:15 UTC (permalink / raw)
  To: Paul Mackerras; +Cc: linuxppc-dev, kvm-ppc, kvm

On Fri, Sep 28, 2018 at 07:45:36PM +1000, Paul Mackerras wrote:
> This adds a parameter to __kvmppc_save_tm and __kvmppc_restore_tm
> which allows the caller to indicate whether it wants the nonvolatile
> register state to be preserved across the call, as required by the C
> calling conventions.  This parameter being non-zero also causes the
> MSR bits that enable TM, FP, VMX and VSX to be preserved.  The
> condition register and DSCR are now always preserved.
> 
> With this, kvmppc_save_tm_hv and kvmppc_restore_tm_hv can be called
> from C code provided the 3rd parameter is non-zero.  So that these
> functions can be called from modules, they now include code to set
> the TOC pointer (r2) on entry, as they can call other built-in C
> functions which will assume the TOC to have been set.
> 
> Also, the fake suspend code in kvmppc_save_tm_hv is modified here to
> assume that treclaim in fake-suspend state does not modify any registers,
> which is the case on POWER9.  This enables the code to be simplified
> quite a bit.
> 
> _kvmppc_save_tm_pr and _kvmppc_restore_tm_pr become much simpler with
> this change, since they now only need to save and restore TAR and pass
> 1 for the 3rd argument to __kvmppc_{save,restore}_tm.
> 
> Signed-off-by: Paul Mackerras <paulus@ozlabs.org>

Reviewed-by: David Gibson <david@gibson.dropbear.id.au>
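
For future readers, here is roughly how the C-callable interface gets
used later in the series (a sketch of the call sites, not the literal
patch):

	/* third argument (preserve_nv) is true, so the C caller's
	 * non-volatile GPRs survive the call */
	if (unlikely(vcpu->arch.shregs.msr & MSR_TS_MASK))
		kvmppc_save_tm_hv(vcpu, vcpu->arch.shregs.msr, true);

The assembly entry/exit paths keep passing 0 in r5 and accept the
clobbering, as the updated comments in the diff below note.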

> ---
>  arch/powerpc/include/asm/asm-prototypes.h |  10 ++
>  arch/powerpc/kvm/book3s_hv_rmhandlers.S   |  49 +++---
>  arch/powerpc/kvm/tm.S                     | 250 ++++++++++++++++--------------
>  3 files changed, 169 insertions(+), 140 deletions(-)
> 
> diff --git a/arch/powerpc/include/asm/asm-prototypes.h b/arch/powerpc/include/asm/asm-prototypes.h
> index 024e8fc..0c1a2b0 100644
> --- a/arch/powerpc/include/asm/asm-prototypes.h
> +++ b/arch/powerpc/include/asm/asm-prototypes.h
> @@ -150,6 +150,16 @@ extern s32 patch__memset_nocache, patch__memcpy_nocache;
>  
>  extern long flush_count_cache;
>  
> +#ifdef CONFIG_PPC_TRANSACTIONAL_MEM
> +void kvmppc_save_tm_hv(struct kvm_vcpu *vcpu, u64 msr, bool preserve_nv);
> +void kvmppc_restore_tm_hv(struct kvm_vcpu *vcpu, u64 msr, bool preserve_nv);
> +#else
> +static inline void kvmppc_save_tm_hv(struct kvm_vcpu *vcpu, u64 msr,
> +				     bool preserve_nv) { }
> +static inline void kvmppc_restore_tm_hv(struct kvm_vcpu *vcpu, u64 msr,
> +					bool preserve_nv) { }
> +#endif /* CONFIG_PPC_TRANSACTIONAL_MEM */
> +
>  void kvmhv_save_host_pmu(void);
>  void kvmhv_load_host_pmu(void);
>  void kvmhv_save_guest_pmu(struct kvm_vcpu *vcpu, bool pmu_in_use);
> diff --git a/arch/powerpc/kvm/book3s_hv_rmhandlers.S b/arch/powerpc/kvm/book3s_hv_rmhandlers.S
> index 772740d..67a847f 100644
> --- a/arch/powerpc/kvm/book3s_hv_rmhandlers.S
> +++ b/arch/powerpc/kvm/book3s_hv_rmhandlers.S
> @@ -759,11 +759,13 @@ BEGIN_FTR_SECTION
>  	b	91f
>  END_FTR_SECTION(CPU_FTR_TM | CPU_FTR_P9_TM_HV_ASSIST, 0)
>  	/*
> -	 * NOTE THAT THIS TRASHES ALL NON-VOLATILE REGISTERS INCLUDING CR
> +	 * NOTE THAT THIS TRASHES ALL NON-VOLATILE REGISTERS (but not CR)
>  	 */
>  	mr      r3, r4
>  	ld      r4, VCPU_MSR(r3)
> +	li	r5, 0			/* don't preserve non-vol regs */
>  	bl	kvmppc_restore_tm_hv
> +	nop
>  	ld	r4, HSTATE_KVM_VCPU(r13)
>  91:
>  #endif
> @@ -1603,11 +1605,13 @@ BEGIN_FTR_SECTION
>  	b	91f
>  END_FTR_SECTION(CPU_FTR_TM | CPU_FTR_P9_TM_HV_ASSIST, 0)
>  	/*
> -	 * NOTE THAT THIS TRASHES ALL NON-VOLATILE REGISTERS INCLUDING CR
> +	 * NOTE THAT THIS TRASHES ALL NON-VOLATILE REGISTERS (but not CR)
>  	 */
>  	mr      r3, r9
>  	ld      r4, VCPU_MSR(r3)
> +	li	r5, 0			/* don't preserve non-vol regs */
>  	bl	kvmppc_save_tm_hv
> +	nop
>  	ld	r9, HSTATE_KVM_VCPU(r13)
>  91:
>  #endif
> @@ -2486,11 +2490,13 @@ BEGIN_FTR_SECTION
>  	b	91f
>  END_FTR_SECTION(CPU_FTR_TM | CPU_FTR_P9_TM_HV_ASSIST, 0)
>  	/*
> -	 * NOTE THAT THIS TRASHES ALL NON-VOLATILE REGISTERS INCLUDING CR
> +	 * NOTE THAT THIS TRASHES ALL NON-VOLATILE REGISTERS (but not CR)
>  	 */
>  	ld	r3, HSTATE_KVM_VCPU(r13)
>  	ld      r4, VCPU_MSR(r3)
> +	li	r5, 0			/* don't preserve non-vol regs */
>  	bl	kvmppc_save_tm_hv
> +	nop
>  91:
>  #endif
>  
> @@ -2606,11 +2612,13 @@ BEGIN_FTR_SECTION
>  	b	91f
>  END_FTR_SECTION(CPU_FTR_TM | CPU_FTR_P9_TM_HV_ASSIST, 0)
>  	/*
> -	 * NOTE THAT THIS TRASHES ALL NON-VOLATILE REGISTERS INCLUDING CR
> +	 * NOTE THAT THIS TRASHES ALL NON-VOLATILE REGISTERS (but not CR)
>  	 */
>  	mr      r3, r4
>  	ld      r4, VCPU_MSR(r3)
> +	li	r5, 0			/* don't preserve non-vol regs */
>  	bl	kvmppc_restore_tm_hv
> +	nop
>  	ld	r4, HSTATE_KVM_VCPU(r13)
>  91:
>  #endif
> @@ -2943,10 +2951,12 @@ END_FTR_SECTION_IFSET(CPU_FTR_ALTIVEC)
>   * Save transactional state and TM-related registers.
>   * Called with r3 pointing to the vcpu struct and r4 containing
>   * the guest MSR value.
> - * This can modify all checkpointed registers, but
> + * r5 is non-zero iff non-volatile register state needs to be maintained.
> + * If r5 == 0, this can modify all checkpointed registers, but
>   * restores r1 and r2 before exit.
>   */
> -kvmppc_save_tm_hv:
> +_GLOBAL_TOC(kvmppc_save_tm_hv)
> +EXPORT_SYMBOL_GPL(kvmppc_save_tm_hv)
>  	/* See if we need to handle fake suspend mode */
>  BEGIN_FTR_SECTION
>  	b	__kvmppc_save_tm
> @@ -2974,12 +2984,6 @@ BEGIN_FTR_SECTION
>  END_FTR_SECTION_IFSET(CPU_FTR_P9_TM_XER_SO_BUG)
>  	nop
>  
> -	std	r1, HSTATE_HOST_R1(r13)
> -
> -	/* Clear the MSR RI since r1, r13 may be foobar. */
> -	li	r5, 0
> -	mtmsrd	r5, 1
> -
>  	/* We have to treclaim here because that's the only way to do S->N */
>  	li	r3, TM_CAUSE_KVM_RESCHED
>  	TRECLAIM(R3)
> @@ -2988,22 +2992,13 @@ END_FTR_SECTION_IFSET(CPU_FTR_P9_TM_XER_SO_BUG)
>  	 * We were in fake suspend, so we are not going to save the
>  	 * register state as the guest checkpointed state (since
>  	 * we already have it), therefore we can now use any volatile GPR.
> +	 * In fact treclaim in fake suspend state doesn't modify
> +	 * any registers.
>  	 */
> -	/* Reload PACA pointer, stack pointer and TOC. */
> -	GET_PACA(r13)
> -	ld	r1, HSTATE_HOST_R1(r13)
> -	ld	r2, PACATOC(r13)
>  
> -	/* Set MSR RI now we have r1 and r13 back. */
> -	li	r5, MSR_RI
> -	mtmsrd	r5, 1
> -
> -	HMT_MEDIUM
> -	ld	r6, HSTATE_DSCR(r13)
> -	mtspr	SPRN_DSCR, r6
> -BEGIN_FTR_SECTION_NESTED(96)
> +BEGIN_FTR_SECTION
>  	bl	pnv_power9_force_smt4_release
> -END_FTR_SECTION_NESTED(CPU_FTR_P9_TM_XER_SO_BUG, CPU_FTR_P9_TM_XER_SO_BUG, 96)
> +END_FTR_SECTION_IFSET(CPU_FTR_P9_TM_XER_SO_BUG)
>  	nop
>  
>  4:
> @@ -3029,10 +3024,12 @@ END_FTR_SECTION_NESTED(CPU_FTR_P9_TM_XER_SO_BUG, CPU_FTR_P9_TM_XER_SO_BUG, 96)
>   * Restore transactional state and TM-related registers.
>   * Called with r3 pointing to the vcpu struct
>   * and r4 containing the guest MSR value.
> + * r5 is non-zero iff non-volatile register state needs to be maintained.
>   * This potentially modifies all checkpointed registers.
>   * It restores r1 and r2 from the PACA.
>   */
> -kvmppc_restore_tm_hv:
> +_GLOBAL_TOC(kvmppc_restore_tm_hv)
> +EXPORT_SYMBOL_GPL(kvmppc_restore_tm_hv)
>  	/*
>  	 * If we are doing TM emulation for the guest on a POWER9 DD2,
>  	 * then we don't actually do a trechkpt -- we either set up
> diff --git a/arch/powerpc/kvm/tm.S b/arch/powerpc/kvm/tm.S
> index 90e330f..0531a14 100644
> --- a/arch/powerpc/kvm/tm.S
> +++ b/arch/powerpc/kvm/tm.S
> @@ -28,17 +28,25 @@
>   * Save transactional state and TM-related registers.
>   * Called with:
>   * - r3 pointing to the vcpu struct
> - * - r4 points to the MSR with current TS bits:
> + * - r4 containing the MSR with current TS bits:
>   * 	(For HV KVM, it is VCPU_MSR ; For PR KVM, it is host MSR).
> - * This can modify all checkpointed registers, but
> - * restores r1, r2 before exit.
> + * - r5 containing a flag indicating that non-volatile registers
> + *	must be preserved.
> + * If r5 == 0, this can modify all checkpointed registers, but
> + * restores r1, r2 before exit.  If r5 != 0, this restores the
> + * MSR TM/FP/VEC/VSX bits to their state on entry.
>   */
>  _GLOBAL(__kvmppc_save_tm)
>  	mflr	r0
>  	std	r0, PPC_LR_STKOFF(r1)
> +	stdu    r1, -SWITCH_FRAME_SIZE(r1)
> +
> +	mr	r9, r3
> +	cmpdi	cr7, r5, 0
>  
>  	/* Turn on TM. */
>  	mfmsr	r8
> +	mr	r10, r8
>  	li	r0, 1
>  	rldimi	r8, r0, MSR_TM_LG, 63-MSR_TM_LG
>  	ori     r8, r8, MSR_FP
> @@ -51,6 +59,27 @@ _GLOBAL(__kvmppc_save_tm)
>  	std	r1, HSTATE_SCRATCH2(r13)
>  	std	r3, HSTATE_SCRATCH1(r13)
>  
> +	/* Save CR on the stack - even if r5 == 0 we need to get cr7 back. */
> +	mfcr	r6
> +	SAVE_GPR(6, r1)
> +
> +	/* Save DSCR so we can restore it to avoid running with user value */
> +	mfspr	r7, SPRN_DSCR
> +	SAVE_GPR(7, r1)
> +
> +	/*
> +	 * We are going to do treclaim., which will modify all checkpointed
> +	 * registers.  Save the non-volatile registers on the stack if
> +	 * preservation of non-volatile state has been requested.
> +	 */
> +	beq	cr7, 3f
> +	SAVE_NVGPRS(r1)
> +
> +	/* MSR[TS] will be 0 (non-transactional) once we do treclaim. */
> +	li	r0, 0
> +	rldimi	r10, r0, MSR_TS_S_LG, 63 - MSR_TS_T_LG
> +	SAVE_GPR(10, r1)	/* final MSR value */
> +3:
>  #ifdef CONFIG_KVM_BOOK3S_HV_POSSIBLE
>  BEGIN_FTR_SECTION
>  	/* Emulation of the treclaim instruction needs TEXASR before treclaim */
> @@ -74,22 +103,25 @@ END_FTR_SECTION_IFSET(CPU_FTR_P9_TM_HV_ASSIST)
>  	std	r9, PACATMSCRATCH(r13)
>  	ld	r9, HSTATE_SCRATCH1(r13)
>  
> -	/* Get a few more GPRs free. */
> -	std	r29, VCPU_GPRS_TM(29)(r9)
> -	std	r30, VCPU_GPRS_TM(30)(r9)
> -	std	r31, VCPU_GPRS_TM(31)(r9)
> -
> -	/* Save away PPR and DSCR soon so don't run with user values. */
> -	mfspr	r31, SPRN_PPR
> +	/* Save away PPR soon so we don't run with user value. */
> +	std	r0, VCPU_GPRS_TM(0)(r9)
> +	mfspr	r0, SPRN_PPR
>  	HMT_MEDIUM
> -	mfspr	r30, SPRN_DSCR
> -#ifdef CONFIG_KVM_BOOK3S_HV_POSSIBLE
> -	ld	r29, HSTATE_DSCR(r13)
> -	mtspr	SPRN_DSCR, r29
> -#endif
>  
> -	/* Save all but r9, r13 & r29-r31 */
> -	reg = 0
> +	/* Reload stack pointer. */
> +	std	r1, VCPU_GPRS_TM(1)(r9)
> +	ld	r1, HSTATE_SCRATCH2(r13)
> +
> +	/* Set MSR RI now we have r1 and r13 back. */
> +	std	r2, VCPU_GPRS_TM(2)(r9)
> +	li	r2, MSR_RI
> +	mtmsrd	r2, 1
> +
> +	/* Reload TOC pointer. */
> +	ld	r2, PACATOC(r13)
> +
> +	/* Save all but r0-r2, r9 & r13 */
> +	reg = 3
>  	.rept	29
>  	.if (reg != 9) && (reg != 13)
>  	std	reg, VCPU_GPRS_TM(reg)(r9)
> @@ -103,33 +135,29 @@ END_FTR_SECTION_IFSET(CPU_FTR_P9_TM_HV_ASSIST)
>  	ld	r4, PACATMSCRATCH(r13)
>  	std	r4, VCPU_GPRS_TM(9)(r9)
>  
> -	/* Reload stack pointer and TOC. */
> -	ld	r1, HSTATE_SCRATCH2(r13)
> -	ld	r2, PACATOC(r13)
> -
> -	/* Set MSR RI now we have r1 and r13 back. */
> -	li	r5, MSR_RI
> -	mtmsrd	r5, 1
> +	/* Restore host DSCR and CR values, after saving guest values */
> +	mfcr	r6
> +	mfspr	r7, SPRN_DSCR
> +	stw	r6, VCPU_CR_TM(r9)
> +	std	r7, VCPU_DSCR_TM(r9)
> +	REST_GPR(6, r1)
> +	REST_GPR(7, r1)
> +	mtcr	r6
> +	mtspr	SPRN_DSCR, r7
>  
> -	/* Save away checkpinted SPRs. */
> -	std	r31, VCPU_PPR_TM(r9)
> -	std	r30, VCPU_DSCR_TM(r9)
> +	/* Save away checkpointed SPRs. */
> +	std	r0, VCPU_PPR_TM(r9)
>  	mflr	r5
> -	mfcr	r6
>  	mfctr	r7
>  	mfspr	r8, SPRN_AMR
>  	mfspr	r10, SPRN_TAR
>  	mfxer	r11
>  	std	r5, VCPU_LR_TM(r9)
> -	stw	r6, VCPU_CR_TM(r9)
>  	std	r7, VCPU_CTR_TM(r9)
>  	std	r8, VCPU_AMR_TM(r9)
>  	std	r10, VCPU_TAR_TM(r9)
>  	std	r11, VCPU_XER_TM(r9)
>  
> -	/* Restore r12 as trap number. */
> -	lwz	r12, VCPU_TRAP(r9)
> -
>  	/* Save FP/VSX. */
>  	addi	r3, r9, VCPU_FPRS_TM
>  	bl	store_fp_state
> @@ -137,6 +165,11 @@ END_FTR_SECTION_IFSET(CPU_FTR_P9_TM_HV_ASSIST)
>  	bl	store_vr_state
>  	mfspr	r6, SPRN_VRSAVE
>  	stw	r6, VCPU_VRSAVE_TM(r9)
> +
> +	/* Restore non-volatile registers if requested to */
> +	beq	cr7, 1f
> +	REST_NVGPRS(r1)
> +	REST_GPR(10, r1)
>  1:
>  	/*
>  	 * We need to save these SPRs after the treclaim so that the software
> @@ -146,12 +179,16 @@ END_FTR_SECTION_IFSET(CPU_FTR_P9_TM_HV_ASSIST)
>  	 */
>  	mfspr	r7, SPRN_TEXASR
>  	std	r7, VCPU_TEXASR(r9)
> -11:
>  	mfspr	r5, SPRN_TFHAR
>  	mfspr	r6, SPRN_TFIAR
>  	std	r5, VCPU_TFHAR(r9)
>  	std	r6, VCPU_TFIAR(r9)
>  
> +	/* Restore MSR state if requested */
> +	beq	cr7, 2f
> +	mtmsrd	r10, 0
> +2:
> +	addi	r1, r1, SWITCH_FRAME_SIZE
>  	ld	r0, PPC_LR_STKOFF(r1)
>  	mtlr	r0
>  	blr
> @@ -161,49 +198,22 @@ END_FTR_SECTION_IFSET(CPU_FTR_P9_TM_HV_ASSIST)
>   * be invoked from C function by PR KVM only.
>   */
>  _GLOBAL(_kvmppc_save_tm_pr)
> -	mflr	r5
> -	std	r5, PPC_LR_STKOFF(r1)
> -	stdu    r1, -SWITCH_FRAME_SIZE(r1)
> -	SAVE_NVGPRS(r1)
> -
> -	/* save MSR since TM/math bits might be impacted
> -	 * by __kvmppc_save_tm().
> -	 */
> -	mfmsr	r5
> -	SAVE_GPR(5, r1)
> -
> -	/* also save DSCR/CR/TAR so that it can be recovered later */
> -	mfspr   r6, SPRN_DSCR
> -	SAVE_GPR(6, r1)
> -
> -	mfcr    r7
> -	stw     r7, _CCR(r1)
> +	mflr	r0
> +	std	r0, PPC_LR_STKOFF(r1)
> +	stdu    r1, -PPC_MIN_STKFRM(r1)
>  
>  	mfspr   r8, SPRN_TAR
> -	SAVE_GPR(8, r1)
> +	std	r8, PPC_MIN_STKFRM-8(r1)
>  
> +	li	r5, 1		/* preserve non-volatile registers */
>  	bl	__kvmppc_save_tm
>  
> -	REST_GPR(8, r1)
> +	ld	r8, PPC_MIN_STKFRM-8(r1)
>  	mtspr   SPRN_TAR, r8
>  
> -	ld      r7, _CCR(r1)
> -	mtcr	r7
> -
> -	REST_GPR(6, r1)
> -	mtspr   SPRN_DSCR, r6
> -
> -	/* need preserve current MSR's MSR_TS bits */
> -	REST_GPR(5, r1)
> -	mfmsr   r6
> -	rldicl  r6, r6, 64 - MSR_TS_S_LG, 62
> -	rldimi  r5, r6, MSR_TS_S_LG, 63 - MSR_TS_T_LG
> -	mtmsrd  r5
> -
> -	REST_NVGPRS(r1)
> -	addi    r1, r1, SWITCH_FRAME_SIZE
> -	ld	r5, PPC_LR_STKOFF(r1)
> -	mtlr	r5
> +	addi    r1, r1, PPC_MIN_STKFRM
> +	ld	r0, PPC_LR_STKOFF(r1)
> +	mtlr	r0
>  	blr
>  
>  EXPORT_SYMBOL_GPL(_kvmppc_save_tm_pr);
> @@ -215,15 +225,21 @@ EXPORT_SYMBOL_GPL(_kvmppc_save_tm_pr);
>   *  - r4 is the guest MSR with desired TS bits:
>   * 	For HV KVM, it is VCPU_MSR
>   * 	For PR KVM, it is provided by caller
> - * This potentially modifies all checkpointed registers.
> - * It restores r1, r2 from the PACA.
> + * - r5 containing a flag indicating that non-volatile registers
> + *	must be preserved.
> + * If r5 == 0, this potentially modifies all checkpointed registers, but
> + * restores r1, r2 from the PACA before exit.
> + * If r5 != 0, this restores the MSR TM/FP/VEC/VSX bits to their state on entry.
>   */
>  _GLOBAL(__kvmppc_restore_tm)
>  	mflr	r0
>  	std	r0, PPC_LR_STKOFF(r1)
>  
> +	cmpdi	cr7, r5, 0
> +
>  	/* Turn on TM/FP/VSX/VMX so we can restore them. */
>  	mfmsr	r5
> +	mr	r10, r5
>  	li	r6, MSR_TM >> 32
>  	sldi	r6, r6, 32
>  	or	r5, r5, r6
> @@ -244,8 +260,7 @@ _GLOBAL(__kvmppc_restore_tm)
>  
>  	mr	r5, r4
>  	rldicl. r5, r5, 64 - MSR_TS_S_LG, 62
> -	beqlr		/* TM not active in guest */
> -	std	r1, HSTATE_SCRATCH2(r13)
> +	beq	9f		/* TM not active in guest */
>  
>  	/* Make sure the failure summary is set, otherwise we'll program check
>  	 * when we trechkpt.  It's possible that this might have been not set
> @@ -256,6 +271,26 @@ _GLOBAL(__kvmppc_restore_tm)
>  	mtspr	SPRN_TEXASR, r7
>  
>  	/*
> +	 * Make a stack frame and save non-volatile registers if requested.
> +	 */
> +	stdu	r1, -SWITCH_FRAME_SIZE(r1)
> +	std	r1, HSTATE_SCRATCH2(r13)
> +
> +	mfcr	r6
> +	mfspr	r7, SPRN_DSCR
> +	SAVE_GPR(2, r1)
> +	SAVE_GPR(6, r1)
> +	SAVE_GPR(7, r1)
> +
> +	beq	cr7, 4f
> +	SAVE_NVGPRS(r1)
> +
> +	/* MSR[TS] will be 1 (suspended) once we do trechkpt */
> +	li	r0, 1
> +	rldimi	r10, r0, MSR_TS_S_LG, 63 - MSR_TS_T_LG
> +	SAVE_GPR(10, r1)	/* final MSR value */
> +4:
> +	/*
>  	 * We need to load up the checkpointed state for the guest.
>  	 * We need to do this early as it will blow away any GPRs, VSRs and
>  	 * some SPRs.
> @@ -291,8 +326,6 @@ _GLOBAL(__kvmppc_restore_tm)
>  	ld	r29, VCPU_DSCR_TM(r3)
>  	ld	r30, VCPU_PPR_TM(r3)
>  
> -	std	r2, PACATMSCRATCH(r13) /* Save TOC */
> -
>  	/* Clear the MSR RI since r1, r13 are all going to be foobar. */
>  	li	r5, 0
>  	mtmsrd	r5, 1
> @@ -318,18 +351,31 @@ _GLOBAL(__kvmppc_restore_tm)
>  	/* Now let's get back the state we need. */
>  	HMT_MEDIUM
>  	GET_PACA(r13)
> -#ifdef CONFIG_KVM_BOOK3S_HV_POSSIBLE
> -	ld	r29, HSTATE_DSCR(r13)
> -	mtspr	SPRN_DSCR, r29
> -#endif
>  	ld	r1, HSTATE_SCRATCH2(r13)
> -	ld	r2, PACATMSCRATCH(r13)
> +	REST_GPR(7, r1)
> +	mtspr	SPRN_DSCR, r7
>  
>  	/* Set the MSR RI since we have our registers back. */
>  	li	r5, MSR_RI
>  	mtmsrd	r5, 1
> +
> +	/* Restore TOC pointer and CR */
> +	REST_GPR(2, r1)
> +	REST_GPR(6, r1)
> +	mtcr	r6
> +
> +	/* Restore non-volatile registers if requested to. */
> +	beq	cr7, 5f
> +	REST_GPR(10, r1)
> +	REST_NVGPRS(r1)
> +
> +5:	addi	r1, r1, SWITCH_FRAME_SIZE
>  	ld	r0, PPC_LR_STKOFF(r1)
>  	mtlr	r0
> +
> +9:	/* Restore MSR bits if requested */
> +	beqlr	cr7
> +	mtmsrd	r10, 0
>  	blr
>  
>  /*
> @@ -337,47 +383,23 @@ _GLOBAL(__kvmppc_restore_tm)
>   * can be invoked from C function by PR KVM only.
>   */
>  _GLOBAL(_kvmppc_restore_tm_pr)
> -	mflr	r5
> -	std	r5, PPC_LR_STKOFF(r1)
> -	stdu    r1, -SWITCH_FRAME_SIZE(r1)
> -	SAVE_NVGPRS(r1)
> -
> -	/* save MSR to avoid TM/math bits change */
> -	mfmsr	r5
> -	SAVE_GPR(5, r1)
> -
> -	/* also save DSCR/CR/TAR so that it can be recovered later */
> -	mfspr   r6, SPRN_DSCR
> -	SAVE_GPR(6, r1)
> -
> -	mfcr    r7
> -	stw     r7, _CCR(r1)
> +	mflr	r0
> +	std	r0, PPC_LR_STKOFF(r1)
> +	stdu    r1, -PPC_MIN_STKFRM(r1)
>  
> +	/* save TAR so that it can be recovered later */
>  	mfspr   r8, SPRN_TAR
> -	SAVE_GPR(8, r1)
> +	std	r8, PPC_MIN_STKFRM-8(r1)
>  
> +	li	r5, 1
>  	bl	__kvmppc_restore_tm
>  
> -	REST_GPR(8, r1)
> +	ld	r8, PPC_MIN_STKFRM-8(r1)
>  	mtspr   SPRN_TAR, r8
>  
> -	ld      r7, _CCR(r1)
> -	mtcr	r7
> -
> -	REST_GPR(6, r1)
> -	mtspr   SPRN_DSCR, r6
> -
> -	/* need preserve current MSR's MSR_TS bits */
> -	REST_GPR(5, r1)
> -	mfmsr   r6
> -	rldicl  r6, r6, 64 - MSR_TS_S_LG, 62
> -	rldimi  r5, r6, MSR_TS_S_LG, 63 - MSR_TS_T_LG
> -	mtmsrd  r5
> -
> -	REST_NVGPRS(r1)
> -	addi    r1, r1, SWITCH_FRAME_SIZE
> -	ld	r5, PPC_LR_STKOFF(r1)
> -	mtlr	r5
> +	addi    r1, r1, PPC_MIN_STKFRM
> +	ld	r0, PPC_LR_STKOFF(r1)
> +	mtlr	r0
>  	blr
>  
>  EXPORT_SYMBOL_GPL(_kvmppc_restore_tm_pr);

-- 
David Gibson			| I'll have my music baroque, and my code
david AT gibson.dropbear.id.au	| minimalist, thank you.  NOT _the_ _other_
				| _way_ _around_!
http://www.ozlabs.org/~dgibson


* Re: [PATCH v2 18/33] KVM: PPC: Book3S HV: Framework and hcall stubs for nested virtualization
       [not found] ` <1538127963-15645-19-git-send-email-paulus@ozlabs.org>
@ 2018-10-02  6:01   ` David Gibson
  2018-10-02  7:48     ` Paul Mackerras
  0 siblings, 1 reply; 8+ messages in thread
From: David Gibson @ 2018-10-02  6:01 UTC (permalink / raw)
  To: Paul Mackerras; +Cc: linuxppc-dev, kvm-ppc, kvm

On Fri, Sep 28, 2018 at 07:45:48PM +1000, Paul Mackerras wrote:
> This starts the process of adding the code to support nested HV-style
> virtualization.  It defines a new H_SET_PARTITION_TABLE hypercall which
> a nested hypervisor can use to set the base address and size of a
> partition table in its memory (analogous to the PTCR register).
> On the host (level 0 hypervisor) side, the H_SET_PARTITION_TABLE
> hypercall from the guest is handled by code that saves the virtual
> PTCR value for the guest.
> 
> This also adds code for creating and destroying nested guests and for
> reading the partition table entry for a nested guest from L1 memory.
> Each nested guest has its own shadow LPID value, different in general
> from the LPID value used by the nested hypervisor to refer to it.  The
> shadow LPID value is allocated at nested guest creation time.
> 
> Nested hypervisor functionality is only available for a radix guest,
> which therefore means a radix host on a POWER9 (or later) processor.
> 
> Signed-off-by: Paul Mackerras <paulus@ozlabs.org>

Reviewed-by: David Gibson <david@gibson.dropbear.id.au>

I've made a number of comments below, but they're all pretty minor
things.  They might be worth including if we have to respin for
whatever reason, or as follow-up improvements, but I don't think we
need to hold this up for them.


[snip]
> @@ -287,6 +288,7 @@ struct kvm_arch {
>  	u8 radix;
>  	u8 fwnmi_enabled;
>  	bool threads_indep;
> +	bool nested_enable;
>  	pgd_t *pgtable;
>  	u64 process_table;
>  	struct dentry *debugfs_dir;
> @@ -312,6 +314,9 @@ struct kvm_arch {
>  #endif
>  	struct kvmppc_ops *kvm_ops;
>  #ifdef CONFIG_KVM_BOOK3S_HV_POSSIBLE
> +	u64 l1_ptcr;
> +	int max_nested_lpid;
> +	struct kvm_nested_guest *nested_guests[KVM_MAX_NESTED_GUESTS];

This array could be quite large.  As a followup, would it be worth
dynamically allocating it, so it can be skipped for L1s with no
nesting allowed, and/or dynamically resized as the L1 adds/removes L2s?
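
Something along these lines, perhaps (an untested sketch with invented
field placement; locking against concurrent creators elided):

	/* in struct kvm_arch: */
	struct kvm_nested_guest **nested_guests;  /* NULL until first L2 */

	/* on the first kvmhv_get_nested() with create == true: */
	if (!kvm->arch.nested_guests) {
		kvm->arch.nested_guests =
			kcalloc(KVM_MAX_NESTED_GUESTS,
				sizeof(struct kvm_nested_guest *),
				GFP_KERNEL);
		if (!kvm->arch.nested_guests)
			return NULL;
	}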

>  	/* This array can grow quite large, keep it at the end */
>  	struct kvmppc_vcore *vcores[KVM_MAX_VCORES];
>  #endif

[snip]
> diff --git a/arch/powerpc/kvm/book3s_hv_nested.c b/arch/powerpc/kvm/book3s_hv_nested.c
> new file mode 100644
> index 0000000..5341052
> --- /dev/null
> +++ b/arch/powerpc/kvm/book3s_hv_nested.c
> @@ -0,0 +1,283 @@
> +// SPDX-License-Identifier: GPL-2.0
> +/*
> + * Copyright IBM Corporation, 2018
> + * Authors Suraj Jitindar Singh <sjitindarsingh@gmail.com>
> + *	   Paul Mackerras <paulus@ozlabs.org>
> + *
> + * Description: KVM functions specific to running nested KVM-HV guests
> + * on Book3S processors (specifically POWER9 and later).
> + */
> +
> +#include <linux/kernel.h>
> +#include <linux/kvm_host.h>
> +
> +#include <asm/kvm_ppc.h>
> +#include <asm/mmu.h>
> +#include <asm/pgtable.h>
> +#include <asm/pgalloc.h>
> +
> +static struct patb_entry *pseries_partition_tb;
> +
> +static void kvmhv_update_ptbl_cache(struct kvm_nested_guest *gp);
> +
> +/* Only called when we're not in hypervisor mode */

This comment isn't strictly accurate: the function is called, but
exits trivially.

> +bool kvmhv_nested_init(void)
> +{
> +	long int ptb_order;
> +	unsigned long ptcr;
> +	long rc;
> +
> +	if (!kvmhv_on_pseries())
> +		return true;
> +	if (!radix_enabled())
> +		return false;
> +
> +	/* find log base 2 of KVMPPC_NR_LPIDS, rounding up */
> +	ptb_order = __ilog2(KVMPPC_NR_LPIDS - 1) + 1;
> +	if (ptb_order < 8)
> +		ptb_order = 8;
> +	pseries_partition_tb = kmalloc(sizeof(struct patb_entry) << ptb_order,
> +				       GFP_KERNEL);
> +	if (!pseries_partition_tb) {
> +		pr_err("kvm-hv: failed to allocate nested partition table\n");
> +		return false;

Since this can fail in several different ways, it seems like returning
an errno rather than a bool would make sense.
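
i.e. something like this (a sketch only; whatever errnos seem apt):

	int kvmhv_nested_init(void)
	{
		long rc;

		if (!kvmhv_on_pseries())
			return 0;
		if (!radix_enabled())
			return -ENODEV;
		/* ... table allocation and PTCR setup as in the patch ... */
		if (!pseries_partition_tb)
			return -ENOMEM;
		rc = plpar_hcall_norets(H_SET_PARTITION_TABLE, ptcr);
		if (rc != H_SUCCESS)
			return -ENODEV;
		return 0;
	}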

> +	}
> +
> +	ptcr = __pa(pseries_partition_tb) | (ptb_order - 8);
> +	rc = plpar_hcall_norets(H_SET_PARTITION_TABLE, ptcr);
> +	if (rc != H_SUCCESS) {
> +		pr_err("kvm-hv: Parent hypervisor does not support nesting (rc=%ld)\n",
> +		       rc);
> +		kfree(pseries_partition_tb);
> +		pseries_partition_tb = NULL;
> +		return false;
> +	}
> +
> +	return true;
> +}
> +
> +void kvmhv_nested_exit(void)
> +{
> +	if (kvmhv_on_pseries() && pseries_partition_tb) {

First clause is redundant there, isn't it, since pseries_partition_tb
can only be set if we're on pseries?

> +		plpar_hcall_norets(H_SET_PARTITION_TABLE, 0);
> +		kfree(pseries_partition_tb);
> +		pseries_partition_tb = NULL;
> +	}
> +}
> +
> +void kvmhv_set_ptbl_entry(unsigned int lpid, u64 dw0, u64 dw1)
> +{
> +	if (cpu_has_feature(CPU_FTR_HVMODE)) {
> +		mmu_partition_table_set_entry(lpid, dw0, dw1);
> +	} else {
> +		pseries_partition_tb[lpid].patb0 = cpu_to_be64(dw0);
> +		pseries_partition_tb[lpid].patb1 = cpu_to_be64(dw1);
> +		/* this will be emulated, L0 will do the necessary barriers */
> +		asm volatile(PPC_TLBIE_5(%0, %1, 2, 0, 1) : :
> +			     "r" (TLBIEL_INVAL_SET_LPID), "r" (lpid));

I think in this version you were using a paravirt TLB flush, instead
of emulation?

> +	}
> +}
> +
> +static void kvmhv_set_nested_ptbl(struct kvm_nested_guest *gp)
> +{
> +	unsigned long dw0;
> +
> +	dw0 = PATB_HR | radix__get_tree_size() |
> +		__pa(gp->shadow_pgtable) | RADIX_PGD_INDEX_SIZE;
> +	kvmhv_set_ptbl_entry(gp->shadow_lpid, dw0, gp->process_table);
> +}
> +
> +void kvmhv_vm_nested_init(struct kvm *kvm)
> +{
> +	kvm->arch.max_nested_lpid = -1;
> +}
> +
> +/*
> + * Handle the H_SET_PARTITION_TABLE hcall.
> + * r4 = guest real address of partition table + log_2(size) - 12
> + * (formatted as for the PTCR).
> + */
> +long kvmhv_set_partition_table(struct kvm_vcpu *vcpu)
> +{
> +	struct kvm *kvm = vcpu->kvm;
> +	unsigned long ptcr = kvmppc_get_gpr(vcpu, 4);
> +
> +	kvm->arch.l1_ptcr = ptcr;

I don't think it's actually dangerous, since we validate the L1
addresses when we read from the table, but it would probably be better
for debugging a guest if this hcall failed when the PTCR didn't make
sense (out-of-bounds order, or not within the L1 memory size).
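
Maybe something like this (an untested sketch, assuming
kvm_is_visible_gfn() can be used here under kvm->srcu):

	long kvmhv_set_partition_table(struct kvm_vcpu *vcpu)
	{
		struct kvm *kvm = vcpu->kvm;
		unsigned long ptcr = kvmppc_get_gpr(vcpu, 4);
		long ret = H_SUCCESS;
		int srcu_idx;

		srcu_idx = srcu_read_lock(&kvm->srcu);
		/* bound the table size; check the base is inside L1 memory */
		if ((ptcr & PRTS_MASK) > 12 - 8 ||
		    !kvm_is_visible_gfn(kvm, (ptcr & PRTB_MASK) >> PAGE_SHIFT))
			ret = H_PARAMETER;
		srcu_read_unlock(&kvm->srcu, srcu_idx);
		if (ret == H_SUCCESS)
			kvm->arch.l1_ptcr = ptcr;
		return ret;
	}

(The 12 - 8 would cap the table at 4096 entries, which I believe is
what the hardware supports, if I have the PRTS encoding right.)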

> +	return H_SUCCESS;
> +}

[snip]
> +/*
> + * Free up any resources allocated for a nested guest.
> + */
> +static void kvmhv_release_nested(struct kvm_nested_guest *gp)
> +{
> +	kvmhv_set_ptbl_entry(gp->shadow_lpid, 0, 0);
> +	kvmppc_free_lpid(gp->shadow_lpid);
> +	if (gp->shadow_pgtable)
> +		pgd_free(gp->l1_host->mm, gp->shadow_pgtable);
> +	kfree(gp);
> +}
> +
> +static void kvmhv_remove_nested(struct kvm_nested_guest *gp)
> +{
> +	struct kvm *kvm = gp->l1_host;
> +	int lpid = gp->l1_lpid;
> +	long ref;
> +
> +	spin_lock(&kvm->mmu_lock);
> +	if (gp == kvm->arch.nested_guests[lpid]) {

This is to protect against a race with another remove, since kvm
and lpid are read before you take the lock.  Is that right?

> +		kvm->arch.nested_guests[lpid] = NULL;
> +		if (lpid == kvm->arch.max_nested_lpid) {
> +			while (--lpid >= 0 && !kvm->arch.nested_guests[lpid])
> +				;
> +			kvm->arch.max_nested_lpid = lpid;
> +		}
> +		--gp->refcnt;
> +	}
> +	ref = gp->refcnt;
> +	spin_unlock(&kvm->mmu_lock);
> +	if (ref == 0)
> +		kvmhv_release_nested(gp);
> +}

[snip]
> +struct kvm_nested_guest *kvmhv_get_nested(struct kvm *kvm, int l1_lpid,
> +					  bool create)
> +{
> +	struct kvm_nested_guest *gp, *newgp;
> +
> +	if (l1_lpid >= KVM_MAX_NESTED_GUESTS ||
> +	    l1_lpid >= (1ul << ((kvm->arch.l1_ptcr & PRTS_MASK) + 12 - 4)))
> +		return NULL;
> +
> +	spin_lock(&kvm->mmu_lock);
> +	gp = kvm->arch.nested_guests[l1_lpid];
> +	if (gp)
> +		++gp->refcnt;
> +	spin_unlock(&kvm->mmu_lock);
> +
> +	if (gp || !create)
> +		return gp;
> +
> +	newgp = kvmhv_alloc_nested(kvm, l1_lpid);
> +	if (!newgp)
> +		return NULL;
> +	spin_lock(&kvm->mmu_lock);
> +	if (kvm->arch.nested_guests[l1_lpid]) {
> +		/* someone else beat us to it */

Should we print a message in this case?  It's no skin off the host's
nose, but wouldn't this mean the guest is concurrently trying to start
two guests with the same lpid?  That seems like a dubious thing for it
to be doing.

> +		gp = kvm->arch.nested_guests[l1_lpid];
> +	} else {
> +		kvm->arch.nested_guests[l1_lpid] = newgp;
> +		++newgp->refcnt;
> +		gp = newgp;
> +		newgp = NULL;
> +		if (l1_lpid > kvm->arch.max_nested_lpid)
> +			kvm->arch.max_nested_lpid = l1_lpid;
> +	}
> +	++gp->refcnt;
> +	spin_unlock(&kvm->mmu_lock);
> +
> +	if (newgp)
> +		kvmhv_release_nested(newgp);
> +
> +	return gp;
> +}
> +
> +void kvmhv_put_nested(struct kvm_nested_guest *gp)
> +{
> +	struct kvm *kvm = gp->l1_host;
> +	long ref;
> +
> +	spin_lock(&kvm->mmu_lock);
> +	ref = --gp->refcnt;
> +	spin_unlock(&kvm->mmu_lock);
> +	if (ref == 0)
> +		kvmhv_release_nested(gp);
> +}

-- 
David Gibson			| I'll have my music baroque, and my code
david AT gibson.dropbear.id.au	| minimalist, thank you.  NOT _the_ _other_
				| _way_ _around_!
http://www.ozlabs.org/~dgibson


* Re: [PATCH v2 19/33] KVM: PPC: Book3S HV: Nested guest entry via hypercall
       [not found] ` <1538127963-15645-20-git-send-email-paulus@ozlabs.org>
@ 2018-10-02  7:00   ` David Gibson
  2018-10-02  8:00     ` Paul Mackerras
  0 siblings, 1 reply; 8+ messages in thread
From: David Gibson @ 2018-10-02  7:00 UTC (permalink / raw)
  To: Paul Mackerras; +Cc: linuxppc-dev, kvm-ppc, kvm

On Fri, Sep 28, 2018 at 07:45:49PM +1000, Paul Mackerras wrote:
> This adds a new hypercall, H_ENTER_NESTED, which is used by a nested
> hypervisor to enter one of its nested guests.  The hypercall supplies
> register values in two structs.  Those values are copied by the level 0
> (L0) hypervisor (the one which is running in hypervisor mode) into the
> vcpu struct of the L1 guest, and then the guest is run until an
> interrupt or error occurs which needs to be reported to L1 via the
> hypercall return value.
> 
> Currently this assumes that the L0 and L1 hypervisors are the same
> endianness, and the structs passed as arguments are in native
> endianness.  If they are of different endianness, the version number
> check will fail and the hcall will be rejected.
> 
> Nested hypervisors do not support indep_threads_mode=N, so this adds
> code to print a warning message if the administrator has set
> indep_threads_mode=N, and treat it as Y.
> 
> Signed-off-by: Paul Mackerras <paulus@ozlabs.org>

[snip]
> +/* Register state for entering a nested guest with H_ENTER_NESTED */
> +struct hv_guest_state {
> +	u64 version;		/* version of this structure layout */
> +	u32 lpid;
> +	u32 vcpu_token;
> +	/* These registers are hypervisor privileged (at least for writing) */
> +	u64 lpcr;
> +	u64 pcr;
> +	u64 amor;
> +	u64 dpdes;
> +	u64 hfscr;
> +	s64 tb_offset;
> +	u64 dawr0;
> +	u64 dawrx0;
> +	u64 ciabr;
> +	u64 hdec_expiry;
> +	u64 purr;
> +	u64 spurr;
> +	u64 ic;
> +	u64 vtb;
> +	u64 hdar;
> +	u64 hdsisr;
> +	u64 heir;
> +	u64 asdr;
> +	/* These are OS privileged but need to be set late in guest entry */
> +	u64 srr0;
> +	u64 srr1;
> +	u64 sprg[4];
> +	u64 pidr;
> +	u64 cfar;
> +	u64 ppr;
> +};

I'm guessing the implication here is that most supervisor privileged
registers need to be set by the L1 to the L2 values, before making the
H_ENTER_NESTED call.  Is that right?
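
(Partly answering my own question from reading ahead in the series:
the L1 side fills in the struct and makes the hcall roughly like this;
paraphrased, not the literal patch:)

	struct hv_guest_state hvregs;

	kvmhv_save_hv_regs(vcpu, &hvregs);	/* HV-privileged values */
	hvregs.lpcr = lpcr;
	hvregs.version = HV_GUEST_STATE_VERSION;
	hvregs.lpid = vcpu->kvm->arch.lpid;
	hvregs.vcpu_token = vcpu->vcpu_id;
	hvregs.hdec_expiry = time_limit;
	trap = plpar_hcall_norets(H_ENTER_NESTED, __pa(&hvregs),
				  __pa(&vcpu->arch.regs));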

[snip]
> +static int kvmppc_handle_nested_exit(struct kvm_vcpu *vcpu)
> +{
> +	int r;
> +	int srcu_idx;
> +
> +	vcpu->stat.sum_exits++;
> +
> +	/*
> +	 * This can happen if an interrupt occurs in the last stages
> +	 * of guest entry or the first stages of guest exit (i.e. after
> +	 * setting paca->kvm_hstate.in_guest to KVM_GUEST_MODE_GUEST_HV
> +	 * and before setting it to KVM_GUEST_MODE_HOST_HV).
> +	 * That can happen due to a bug, or due to a machine check
> +	 * occurring at just the wrong time.
> +	 */
> +	if (vcpu->arch.shregs.msr & MSR_HV) {
> +		pr_emerg("KVM trap in HV mode while nested!\n");
> +		pr_emerg("trap=0x%x | pc=0x%lx | msr=0x%llx\n",
> +			 vcpu->arch.trap, kvmppc_get_pc(vcpu),
> +			 vcpu->arch.shregs.msr);
> +		kvmppc_dump_regs(vcpu);
> +		return RESUME_HOST;

To make sure I'm understanding right here, RESUME_HOST will
effectively mean resume the L0, and RESUME_GUEST (without additional
processing) will mean resume the L2, right?

> +	}
> +	switch (vcpu->arch.trap) {
> +	/* We're good on these - the host merely wanted to get our attention */
> +	case BOOK3S_INTERRUPT_HV_DECREMENTER:
> +		vcpu->stat.dec_exits++;
> +		r = RESUME_GUEST;
> +		break;
> +	case BOOK3S_INTERRUPT_EXTERNAL:
> +		vcpu->stat.ext_intr_exits++;
> +		r = RESUME_HOST;
> +		break;
> +	case BOOK3S_INTERRUPT_H_DOORBELL:
> +	case BOOK3S_INTERRUPT_H_VIRT:
> +		vcpu->stat.ext_intr_exits++;
> +		r = RESUME_GUEST;
> +		break;
> +	/* SR/HMI/PMI are HV interrupts that host has handled. Resume guest.*/
> +	case BOOK3S_INTERRUPT_HMI:
> +	case BOOK3S_INTERRUPT_PERFMON:
> +	case BOOK3S_INTERRUPT_SYSTEM_RESET:
> +		r = RESUME_GUEST;
> +		break;
> +	case BOOK3S_INTERRUPT_MACHINE_CHECK:
> +		/* Pass the machine check to the L1 guest */
> +		r = RESUME_HOST;
> +		/* Print the MCE event to host console. */
> +		machine_check_print_event_info(&vcpu->arch.mce_evt, false);
> +		break;
> +	/*
> +	 * We get these next two if the guest accesses a page which it thinks
> +	 * it has mapped but which is not actually present, either because
> +	 * it is for an emulated I/O device or because the corresponding
> +	 * host page has been paged out.
> +	 */
> +	case BOOK3S_INTERRUPT_H_DATA_STORAGE:
> +		srcu_idx = srcu_read_lock(&vcpu->kvm->srcu);
> +		r = kvmhv_nested_page_fault(vcpu);
> +		srcu_read_unlock(&vcpu->kvm->srcu, srcu_idx);
> +		break;
> +	case BOOK3S_INTERRUPT_H_INST_STORAGE:
> +		vcpu->arch.fault_dar = kvmppc_get_pc(vcpu);
> +		vcpu->arch.fault_dsisr = kvmppc_get_msr(vcpu) &
> +					 DSISR_SRR1_MATCH_64S;
> +		if (vcpu->arch.shregs.msr & HSRR1_HISI_WRITE)
> +			vcpu->arch.fault_dsisr |= DSISR_ISSTORE;
> +		srcu_idx = srcu_read_lock(&vcpu->kvm->srcu);
> +		r = kvmhv_nested_page_fault(vcpu);
> +		srcu_read_unlock(&vcpu->kvm->srcu, srcu_idx);
> +		break;
> +
> +#ifdef CONFIG_PPC_TRANSACTIONAL_MEM
> +	case BOOK3S_INTERRUPT_HV_SOFTPATCH:
> +		/*
> +		 * This occurs for various TM-related instructions that
> +		 * we need to emulate on POWER9 DD2.2.  We have already
> +		 * handled the cases where the guest was in real-suspend
> +		 * mode and was transitioning to transactional state.
> +		 */
> +		r = kvmhv_p9_tm_emulation(vcpu);
> +		break;
> +#endif
> +
> +	case BOOK3S_INTERRUPT_HV_RM_HARD:
> +		vcpu->arch.trap = 0;
> +		r = RESUME_GUEST;
> +		if (!xive_enabled())
> +			kvmppc_xics_rm_complete(vcpu, 0);
> +		break;
> +	default:
> +		r = RESUME_HOST;
> +		break;
> +	}
> +
> +	return r;
> +}
> +
>  static int kvm_arch_vcpu_ioctl_get_sregs_hv(struct kvm_vcpu *vcpu,
>  					    struct kvm_sregs *sregs)
>  {
> @@ -3098,7 +3203,8 @@ static noinline void kvmppc_run_core(struct kvmppc_vcore *vc)
>  /*
>   * Load up hypervisor-mode registers on P9.
>   */
> -static int kvmhv_load_hv_regs_and_go(struct kvm_vcpu *vcpu, u64 time_limit)
> +static int kvmhv_load_hv_regs_and_go(struct kvm_vcpu *vcpu, u64 time_limit,
> +				     unsigned long lpcr)

I don't understand this change: you've added a parameter, but there's
no change to the body of the function, so you're not actually using
the new parameter.

[snip]
> +/*
> + * This never fails for a radix guest, as none of the operations it does
> + * for a radix guest can fail or have a way to report failure.
> + * kvmhv_run_single_vcpu() relies on this fact.
> + */
>  static int kvmhv_setup_mmu(struct kvm_vcpu *vcpu)
>  {
>  	int r = 0;
> @@ -3684,12 +3814,15 @@ static int kvmppc_run_vcpu(struct kvm_run *kvm_run, struct kvm_vcpu *vcpu)
>  	return vcpu->arch.ret;
>  }
>  
> -static int kvmppc_run_single_vcpu(struct kvm_run *kvm_run,
> -				  struct kvm_vcpu *vcpu, u64 time_limit)
> +int kvmhv_run_single_vcpu(struct kvm_run *kvm_run,
> +			  struct kvm_vcpu *vcpu, u64 time_limit,
> +			  unsigned long lpcr)

IIRC this is the streamlined entry path introduced earlier in the
series, yes?  Making the name change where it was introduced would
avoid the extra churn here.

It'd be nice to have something here to make it more obvious that this
path is only for radix guests, but I'm not really sure how to
accomplish that.

>  {
>  	int trap, r, pcpu, pcpu0;
>  	int srcu_idx;
>  	struct kvmppc_vcore *vc;
> +	struct kvm_nested_guest *nested = vcpu->arch.nested;
> +	unsigned long lpid;
>  
>  	trace_kvmppc_run_vcpu_enter(vcpu);
>  
> @@ -3710,16 +3843,8 @@ static int kvmppc_run_single_vcpu(struct kvm_run *kvm_run,
>  	vc->runner = vcpu;
>  
>  	/* See if the MMU is ready to go */
> -	if (!vcpu->kvm->arch.mmu_ready) {
> -		r = kvmhv_setup_mmu(vcpu);
> -		if (r) {
> -			kvm_run->exit_reason = KVM_EXIT_FAIL_ENTRY;
> -			kvm_run->fail_entry.
> -				hardware_entry_failure_reason = 0;
> -			vcpu->arch.ret = r;
> -			goto out;
> -		}
> -	}
> +	if (!vcpu->kvm->arch.mmu_ready)
> +		kvmhv_setup_mmu(vcpu);
>  
>  	if (need_resched())
>  		cond_resched();
> @@ -3741,7 +3866,22 @@ static int kvmppc_run_single_vcpu(struct kvm_run *kvm_run,
>  	if (lazy_irq_pending() || need_resched() || !vcpu->kvm->arch.mmu_ready)
>  		goto out;
>  
> -	kvmppc_core_prepare_to_enter(vcpu);
> +	if (!nested) {
> +		kvmppc_core_prepare_to_enter(vcpu);
> +		if (vcpu->arch.doorbell_request) {
> +			vc->dpdes = 1;
> +			smp_wmb();
> +			vcpu->arch.doorbell_request = 0;
> +		}
> +		if (test_bit(BOOK3S_IRQPRIO_EXTERNAL,
> +			     &vcpu->arch.pending_exceptions))
> +			lpcr |= LPCR_MER;
> +	} else if (vcpu->arch.pending_exceptions ||
> +		   vcpu->arch.doorbell_request ||
> +		   xive_interrupt_pending(vcpu)) {
> +		vcpu->arch.ret = RESUME_HOST;
> +		goto out;
> +	}
>  
-- 
David Gibson			| I'll have my music baroque, and my code
david AT gibson.dropbear.id.au	| minimalist, thank you.  NOT _the_ _other_
				| _way_ _around_!
http://www.ozlabs.org/~dgibson


* Re: [PATCH v2 20/33] KVM: PPC: Book3S HV: Use XICS hypercalls when running as a nested hypervisor
       [not found] ` <1538127963-15645-21-git-send-email-paulus@ozlabs.org>
@ 2018-10-02  7:02   ` David Gibson
  0 siblings, 0 replies; 8+ messages in thread
From: David Gibson @ 2018-10-02  7:02 UTC (permalink / raw)
  To: Paul Mackerras; +Cc: linuxppc-dev, kvm-ppc, kvm

On Fri, Sep 28, 2018 at 07:45:50PM +1000, Paul Mackerras wrote:
> This adds code to call the H_IPI and H_EOI hypercalls when we are
> running as a nested hypervisor (i.e. without the CPU_FTR_HVMODE cpu
> feature) and we would otherwise access the XICS interrupt controller
> directly or via an OPAL call.
> 
> Signed-off-by: Paul Mackerras <paulus@ozlabs.org>

Reviewed-by: David Gibson <david@gibson.dropbear.id.au>

> ---
>  arch/powerpc/kvm/book3s_hv.c         |  7 +++++-
>  arch/powerpc/kvm/book3s_hv_builtin.c | 44 +++++++++++++++++++++++++++++-------
>  arch/powerpc/kvm/book3s_hv_rm_xics.c |  8 +++++++
>  3 files changed, 50 insertions(+), 9 deletions(-)
> 
> diff --git a/arch/powerpc/kvm/book3s_hv.c b/arch/powerpc/kvm/book3s_hv.c
> index 60adf47..8d2f91f 100644
> --- a/arch/powerpc/kvm/book3s_hv.c
> +++ b/arch/powerpc/kvm/book3s_hv.c
> @@ -173,6 +173,10 @@ static bool kvmppc_ipi_thread(int cpu)
>  {
>  	unsigned long msg = PPC_DBELL_TYPE(PPC_DBELL_SERVER);
>  
> +	/* If we're a nested hypervisor, fall back to ordinary IPIs for now */
> +	if (kvmhv_on_pseries())
> +		return false;
> +
>  	/* On POWER9 we can use msgsnd to IPI any cpu */
>  	if (cpu_has_feature(CPU_FTR_ARCH_300)) {
>  		msg |= get_hard_smp_processor_id(cpu);
> @@ -5164,7 +5168,8 @@ static int kvmppc_book3s_init_hv(void)
>  	 * indirectly, via OPAL.
>  	 */
>  #ifdef CONFIG_SMP
> -	if (!xive_enabled() && !local_paca->kvm_hstate.xics_phys) {
> +	if (!xive_enabled() && !kvmhv_on_pseries() &&
> +	    !local_paca->kvm_hstate.xics_phys) {
>  		struct device_node *np;
>  
>  		np = of_find_compatible_node(NULL, NULL, "ibm,opal-intc");
> diff --git a/arch/powerpc/kvm/book3s_hv_builtin.c b/arch/powerpc/kvm/book3s_hv_builtin.c
> index ccfea5b..a71e2fc 100644
> --- a/arch/powerpc/kvm/book3s_hv_builtin.c
> +++ b/arch/powerpc/kvm/book3s_hv_builtin.c
> @@ -231,6 +231,15 @@ void kvmhv_rm_send_ipi(int cpu)
>  	void __iomem *xics_phys;
>  	unsigned long msg = PPC_DBELL_TYPE(PPC_DBELL_SERVER);
>  
> +	/* For a nested hypervisor, use the XICS via hcall */
> +	if (kvmhv_on_pseries()) {
> +		unsigned long retbuf[PLPAR_HCALL_BUFSIZE];
> +
> +		plpar_hcall_raw(H_IPI, retbuf, get_hard_smp_processor_id(cpu),
> +				IPI_PRIORITY);
> +		return;
> +	}
> +
>  	/* On POWER9 we can use msgsnd for any destination cpu. */
>  	if (cpu_has_feature(CPU_FTR_ARCH_300)) {
>  		msg |= get_hard_smp_processor_id(cpu);
> @@ -460,12 +469,19 @@ static long kvmppc_read_one_intr(bool *again)
>  		return 1;
>  
>  	/* Now read the interrupt from the ICP */
> -	xics_phys = local_paca->kvm_hstate.xics_phys;
> -	rc = 0;
> -	if (!xics_phys)
> -		rc = opal_int_get_xirr(&xirr, false);
> -	else
> -		xirr = __raw_rm_readl(xics_phys + XICS_XIRR);
> +	if (kvmhv_on_pseries()) {
> +		unsigned long retbuf[PLPAR_HCALL_BUFSIZE];
> +
> +		rc = plpar_hcall_raw(H_XIRR, retbuf, 0xFF);
> +		xirr = cpu_to_be32(retbuf[0]);
> +	} else {
> +		xics_phys = local_paca->kvm_hstate.xics_phys;
> +		rc = 0;
> +		if (!xics_phys)
> +			rc = opal_int_get_xirr(&xirr, false);
> +		else
> +			xirr = __raw_rm_readl(xics_phys + XICS_XIRR);
> +	}
>  	if (rc < 0)
>  		return 1;
>  
> @@ -494,7 +510,13 @@ static long kvmppc_read_one_intr(bool *again)
>  	 */
>  	if (xisr == XICS_IPI) {
>  		rc = 0;
> -		if (xics_phys) {
> +		if (kvmhv_on_pseries()) {
> +			unsigned long retbuf[PLPAR_HCALL_BUFSIZE];
> +
> +			plpar_hcall_raw(H_IPI, retbuf,
> +					hard_smp_processor_id(), 0xff);
> +			plpar_hcall_raw(H_EOI, retbuf, h_xirr);
> +		} else if (xics_phys) {
>  			__raw_rm_writeb(0xff, xics_phys + XICS_MFRR);
>  			__raw_rm_writel(xirr, xics_phys + XICS_XIRR);
>  		} else {
> @@ -520,7 +542,13 @@ static long kvmppc_read_one_intr(bool *again)
>  			/* We raced with the host,
>  			 * we need to resend that IPI, bummer
>  			 */
> -			if (xics_phys)
> +			if (kvmhv_on_pseries()) {
> +				unsigned long retbuf[PLPAR_HCALL_BUFSIZE];
> +
> +				plpar_hcall_raw(H_IPI, retbuf,
> +						hard_smp_processor_id(),
> +						IPI_PRIORITY);
> +			} else if (xics_phys)
>  				__raw_rm_writeb(IPI_PRIORITY,
>  						xics_phys + XICS_MFRR);
>  			else
> diff --git a/arch/powerpc/kvm/book3s_hv_rm_xics.c b/arch/powerpc/kvm/book3s_hv_rm_xics.c
> index 8b9f356..b3f5786 100644
> --- a/arch/powerpc/kvm/book3s_hv_rm_xics.c
> +++ b/arch/powerpc/kvm/book3s_hv_rm_xics.c
> @@ -767,6 +767,14 @@ static void icp_eoi(struct irq_chip *c, u32 hwirq, __be32 xirr, bool *again)
>  	void __iomem *xics_phys;
>  	int64_t rc;
>  
> +	if (kvmhv_on_pseries()) {
> +		unsigned long retbuf[PLPAR_HCALL_BUFSIZE];
> +
> +		iosync();
> +		plpar_hcall_raw(H_EOI, retbuf, hwirq);
> +		return;
> +	}
> +
>  	rc = pnv_opal_pci_msi_eoi(c, hwirq);
>  
>  	if (rc)

-- 
David Gibson			| I'll have my music baroque, and my code
david AT gibson.dropbear.id.au	| minimalist, thank you.  NOT _the_ _other_
				| _way_ _around_!
http://www.ozlabs.org/~dgibson


* Re: [PATCH v2 18/33] KVM: PPC: Book3S HV: Framework and hcall stubs for nested virtualization
  2018-10-02  6:01   ` [PATCH v2 18/33] KVM: PPC: Book3S HV: Framework and hcall stubs for nested virtualization David Gibson
@ 2018-10-02  7:48     ` Paul Mackerras
  0 siblings, 0 replies; 8+ messages in thread
From: Paul Mackerras @ 2018-10-02  7:48 UTC (permalink / raw)
  To: David Gibson; +Cc: linuxppc-dev, kvm-ppc, kvm

On Tue, Oct 02, 2018 at 04:01:52PM +1000, David Gibson wrote:
> On Fri, Sep 28, 2018 at 07:45:48PM +1000, Paul Mackerras wrote:
> > This starts the process of adding the code to support nested HV-style
> > virtualization.  It defines a new H_SET_PARTITION_TABLE hypercall which
> > a nested hypervisor can use to set the base address and size of a
> > partition table in its memory (analogous to the PTCR register).
> > On the host (level 0 hypervisor) side, the H_SET_PARTITION_TABLE
> > hypercall from the guest is handled by code that saves the virtual
> > PTCR value for the guest.
> > 
> > This also adds code for creating and destroying nested guests and for
> > reading the partition table entry for a nested guest from L1 memory.
> > Each nested guest has its own shadow LPID value, different in general
> > from the LPID value used by the nested hypervisor to refer to it.  The
> > shadow LPID value is allocated at nested guest creation time.
> > 
> > Nested hypervisor functionality is only available for a radix guest,
> > which therefore means a radix host on a POWER9 (or later) processor.
> > 
> > Signed-off-by: Paul Mackerras <paulus@ozlabs.org>
> 
> Reviewed-by: David Gibson <david@gibson.dropbear.id.au>
> 
> I've made a number of comments below, but they're all pretty minor
> things.  They might be worth including if we have to respin for
> whatever reason, or as follow-up improvements, but I don't think we
> need to hold this up for them.

I have some other changes that will mean I'll be sending a v3.

> 
> [snip]
> > @@ -287,6 +288,7 @@ struct kvm_arch {
> >  	u8 radix;
> >  	u8 fwnmi_enabled;
> >  	bool threads_indep;
> > +	bool nested_enable;
> >  	pgd_t *pgtable;
> >  	u64 process_table;
> >  	struct dentry *debugfs_dir;
> > @@ -312,6 +314,9 @@ struct kvm_arch {
> >  #endif
> >  	struct kvmppc_ops *kvm_ops;
> >  #ifdef CONFIG_KVM_BOOK3S_HV_POSSIBLE
> > +	u64 l1_ptcr;
> > +	int max_nested_lpid;
> > +	struct kvm_nested_guest *nested_guests[KVM_MAX_NESTED_GUESTS];
> 
> This array could be quite large.  As a followup, would it be worth
> dynamically allocating it, so it can be skipped for L1s with no
> nesting allowed, and/or dynamically resized as the L1 adds/removes L2s?

True.

> >  	/* This array can grow quite large, keep it at the end */
> >  	struct kvmppc_vcore *vcores[KVM_MAX_VCORES];
> >  #endif
> 
> [snip]
> > diff --git a/arch/powerpc/kvm/book3s_hv_nested.c b/arch/powerpc/kvm/book3s_hv_nested.c
> > new file mode 100644
> > index 0000000..5341052
> > --- /dev/null
> > +++ b/arch/powerpc/kvm/book3s_hv_nested.c
> > @@ -0,0 +1,283 @@
> > +// SPDX-License-Identifier: GPL-2.0
> > +/*
> > + * Copyright IBM Corporation, 2018
> > + * Authors Suraj Jitindar Singh <sjitindarsingh@gmail.com>
> > + *	   Paul Mackerras <paulus@ozlabs.org>
> > + *
> > + * Description: KVM functions specific to running nested KVM-HV guests
> > + * on Book3S processors (specifically POWER9 and later).
> > + */
> > +
> > +#include <linux/kernel.h>
> > +#include <linux/kvm_host.h>
> > +
> > +#include <asm/kvm_ppc.h>
> > +#include <asm/mmu.h>
> > +#include <asm/pgtable.h>
> > +#include <asm/pgalloc.h>
> > +
> > +static struct patb_entry *pseries_partition_tb;
> > +
> > +static void kvmhv_update_ptbl_cache(struct kvm_nested_guest *gp);
> > +
> > +/* Only called when we're not in hypervisor mode */
> 
> This comment isn't strictly accurate: the function is called, but
> exits trivially.

Right, I'll change the comment.

> > +bool kvmhv_nested_init(void)
> > +{
> > +	long int ptb_order;
> > +	unsigned long ptcr;
> > +	long rc;
> > +
> > +	if (!kvmhv_on_pseries())
> > +		return true;
> > +	if (!radix_enabled())
> > +		return false;
> > +
> > +	/* find log base 2 of KVMPPC_NR_LPIDS, rounding up */
> > +	ptb_order = __ilog2(KVMPPC_NR_LPIDS - 1) + 1;
> > +	if (ptb_order < 8)
> > +		ptb_order = 8;
> > +	pseries_partition_tb = kmalloc(sizeof(struct patb_entry) << ptb_order,
> > +				       GFP_KERNEL);
> > +	if (!pseries_partition_tb) {
> > +		pr_err("kvm-hv: failed to allocate nested partition table\n");
> > +		return false;
> 
> Since this can fail in several different ways, it seems like returning
> an errno rather than a bool would make sense.

OK.

> > +	}
> > +
> > +	ptcr = __pa(pseries_partition_tb) | (ptb_order - 8);
> > +	rc = plpar_hcall_norets(H_SET_PARTITION_TABLE, ptcr);
> > +	if (rc != H_SUCCESS) {
> > +		pr_err("kvm-hv: Parent hypervisor does not support nesting (rc=%ld)\n",
> > +		       rc);
> > +		kfree(pseries_partition_tb);
> > +		pseries_partition_tb = NULL;
> > +		return false;
> > +	}
> > +
> > +	return true;
> > +}
> > +
> > +void kvmhv_nested_exit(void)
> > +{
> > +	if (kvmhv_on_pseries() && pseries_partition_tb) {
> 
> First clause is redundant there, isn't it, since pseries_partition_tb
> can only be set if we're on pseries?

It is, but the subtlety here is that we're relying on the compiler
removing the call to plpar_hcall_norets() in configs with
CONFIG_PPC_PSERIES=n (because there is no definition of
plpar_hcall_norets in such configs).  The compiler can tell that
kvmhv_on_pseries() is always false for those configs, but it can't
tell that pseries_partition_tb is always NULL.
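
(For reference, kvmhv_on_pseries() is defined earlier in the series
as, roughly:

	#ifdef CONFIG_PPC_PSERIES
	static inline bool kvmhv_on_pseries(void)
	{
		return !cpu_has_feature(CPU_FTR_HVMODE);
	}
	#else
	static inline bool kvmhv_on_pseries(void)
	{
		return false;
	}
	#endif

so with CONFIG_PPC_PSERIES=n it folds to a constant false and the
whole branch, including the hcall, disappears at compile time.)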

> > +		plpar_hcall_norets(H_SET_PARTITION_TABLE, 0);
> > +		kfree(pseries_partition_tb);
> > +		pseries_partition_tb = NULL;
> > +	}
> > +}
> > +
> > +void kvmhv_set_ptbl_entry(unsigned int lpid, u64 dw0, u64 dw1)
> > +{
> > +	if (cpu_has_feature(CPU_FTR_HVMODE)) {
> > +		mmu_partition_table_set_entry(lpid, dw0, dw1);
> > +	} else {
> > +		pseries_partition_tb[lpid].patb0 = cpu_to_be64(dw0);
> > +		pseries_partition_tb[lpid].patb1 = cpu_to_be64(dw1);
> > +		/* this will be emulated, L0 will do the necessary barriers */
> > +		asm volatile(PPC_TLBIE_5(%0, %1, 2, 0, 1) : :
> > +			     "r" (TLBIEL_INVAL_SET_LPID), "r" (lpid));
> 
> I think in this version you were using a paravirt TLB flush, instead
> of emulation?

This gets converted to a paravirt flush in the last patch, yes.  I
suppose I could introduce that earlier.

> > +	}
> > +}
> > +
> > +static void kvmhv_set_nested_ptbl(struct kvm_nested_guest *gp)
> > +{
> > +	unsigned long dw0;
> > +
> > +	dw0 = PATB_HR | radix__get_tree_size() |
> > +		__pa(gp->shadow_pgtable) | RADIX_PGD_INDEX_SIZE;
> > +	kvmhv_set_ptbl_entry(gp->shadow_lpid, dw0, gp->process_table);
> > +}
> > +
> > +void kvmhv_vm_nested_init(struct kvm *kvm)
> > +{
> > +	kvm->arch.max_nested_lpid = -1;
> > +}
> > +
> > +/*
> > + * Handle the H_SET_PARTITION_TABLE hcall.
> > + * r4 = guest real address of partition table + log_2(size) - 12
> > + * (formatted as for the PTCR).
> > + */
> > +long kvmhv_set_partition_table(struct kvm_vcpu *vcpu)
> > +{
> > +	struct kvm *kvm = vcpu->kvm;
> > +	unsigned long ptcr = kvmppc_get_gpr(vcpu, 4);
> > +
> > +	kvm->arch.l1_ptcr = ptcr;
> 
> I don't think it's actually dangerous, since we validate the L1
> addresses when we read from the table, but it would probably be better
> for debugging a guest if this hcall failed when the PTCR didn't make
> sense (out-of-bounds order, or not within the L1 memory size).

OK.

> > +	return H_SUCCESS;
> > +}
> 
> [snip]
> > +/*
> > + * Free up any resources allocated for a nested guest.
> > + */
> > +static void kvmhv_release_nested(struct kvm_nested_guest *gp)
> > +{
> > +	kvmhv_set_ptbl_entry(gp->shadow_lpid, 0, 0);
> > +	kvmppc_free_lpid(gp->shadow_lpid);
> > +	if (gp->shadow_pgtable)
> > +		pgd_free(gp->l1_host->mm, gp->shadow_pgtable);
> > +	kfree(gp);
> > +}
> > +
> > +static void kvmhv_remove_nested(struct kvm_nested_guest *gp)
> > +{
> > +	struct kvm *kvm = gp->l1_host;
> > +	int lpid = gp->l1_lpid;
> > +	long ref;
> > +
> > +	spin_lock(&kvm->mmu_lock);
> > +	if (gp == kvm->arch.nested_guests[lpid]) {
> 
> This is to protect against a race with another remove, since kvm
> and lpid are read before you take the lock.  Is that right?

Basically, yes.  The lock is taken and dropped in kvmhv_get_nested()
and another CPU could have done kvmhv_remove_nested() in the
meanwhile.

> > +		kvm->arch.nested_guests[lpid] = NULL;
> > +		if (lpid == kvm->arch.max_nested_lpid) {
> > +			while (--lpid >= 0 && !kvm->arch.nested_guests[lpid])
> > +				;
> > +			kvm->arch.max_nested_lpid = lpid;
> > +		}
> > +		--gp->refcnt;
> > +	}
> > +	ref = gp->refcnt;
> > +	spin_unlock(&kvm->mmu_lock);
> > +	if (ref == 0)
> > +		kvmhv_release_nested(gp);
> > +}
> 
> [snip]
> > +struct kvm_nested_guest *kvmhv_get_nested(struct kvm *kvm, int l1_lpid,
> > +					  bool create)
> > +{
> > +	struct kvm_nested_guest *gp, *newgp;
> > +
> > +	if (l1_lpid >= KVM_MAX_NESTED_GUESTS ||
> > +	    l1_lpid >= (1ul << ((kvm->arch.l1_ptcr & PRTS_MASK) + 12 - 4)))
> > +		return NULL;
> > +
> > +	spin_lock(&kvm->mmu_lock);
> > +	gp = kvm->arch.nested_guests[l1_lpid];
> > +	if (gp)
> > +		++gp->refcnt;
> > +	spin_unlock(&kvm->mmu_lock);
> > +
> > +	if (gp || !create)
> > +		return gp;
> > +
> > +	newgp = kvmhv_alloc_nested(kvm, l1_lpid);
> > +	if (!newgp)
> > +		return NULL;
> > +	spin_lock(&kvm->mmu_lock);
> > +	if (kvm->arch.nested_guests[l1_lpid]) {
> > +		/* someone else beat us to it */
> 
> Should we print a message in this case?  It's no skin off the host's
> nose, but wouldn't this mean the guest is concurrently trying to start
> two guests with the same lpid?  That seems like a dubious thing for it
> to be doing.

No, it could just be starting two vcpus from the same guest.  This
could happen when the L1 guest has just migrated in and now all of its
vcpus are getting started concurrently.

Paul.


* Re: [PATCH v2 19/33] KVM: PPC: Book3S HV: Nested guest entry via hypercall
  2018-10-02  7:00   ` [PATCH v2 19/33] KVM: PPC: Book3S HV: Nested guest entry via hypercall David Gibson
@ 2018-10-02  8:00     ` Paul Mackerras
  2018-10-03  5:09       ` David Gibson
  0 siblings, 1 reply; 8+ messages in thread
From: Paul Mackerras @ 2018-10-02  8:00 UTC (permalink / raw)
  To: David Gibson; +Cc: linuxppc-dev, kvm-ppc, kvm

On Tue, Oct 02, 2018 at 05:00:09PM +1000, David Gibson wrote:
> On Fri, Sep 28, 2018 at 07:45:49PM +1000, Paul Mackerras wrote:
> > This adds a new hypercall, H_ENTER_NESTED, which is used by a nested
> > hypervisor to enter one of its nested guests.  The hypercall supplies
> > register values in two structs.  Those values are copied by the level 0
> > (L0) hypervisor (the one which is running in hypervisor mode) into the
> > vcpu struct of the L1 guest, and then the guest is run until an
> > interrupt or error occurs which needs to be reported to L1 via the
> > hypercall return value.
> > 
> > Currently this assumes that the L0 and L1 hypervisors are the same
> > endianness, and the structs passed as arguments are in native
> > endianness.  If they are of different endianness, the version number
> > check will fail and the hcall will be rejected.
> > 
> > Nested hypervisors do not support indep_threads_mode=N, so this adds
> > code to print a warning message if the administrator has set
> > indep_threads_mode=N, and treat it as Y.
> > 
> > Signed-off-by: Paul Mackerras <paulus@ozlabs.org>
> 
> [snip]
> > +/* Register state for entering a nested guest with H_ENTER_NESTED */
> > +struct hv_guest_state {
> > +	u64 version;		/* version of this structure layout */
> > +	u32 lpid;
> > +	u32 vcpu_token;
> > +	/* These registers are hypervisor privileged (at least for writing) */
> > +	u64 lpcr;
> > +	u64 pcr;
> > +	u64 amor;
> > +	u64 dpdes;
> > +	u64 hfscr;
> > +	s64 tb_offset;
> > +	u64 dawr0;
> > +	u64 dawrx0;
> > +	u64 ciabr;
> > +	u64 hdec_expiry;
> > +	u64 purr;
> > +	u64 spurr;
> > +	u64 ic;
> > +	u64 vtb;
> > +	u64 hdar;
> > +	u64 hdsisr;
> > +	u64 heir;
> > +	u64 asdr;
> > +	/* These are OS privileged but need to be set late in guest entry */
> > +	u64 srr0;
> > +	u64 srr1;
> > +	u64 sprg[4];
> > +	u64 pidr;
> > +	u64 cfar;
> > +	u64 ppr;
> > +};
> 
> I'm guessing the implication here is that most supervisor privileged
> registers need to be set by the L1 to the L2 values, before making the
> H_ENTER_NESTED call.  Is that right?

Right - the supervisor privileged registers that are here are the ones
that the L1 guest needs to have valid at all times (e.g. sprgN), or
that can get clobbered at any time (e.g. srr0/1), or that can't be
set to guest values until just before guest entry (cfar, ppr), or that
are not writable by the supervisor (purr, spurr, dpdes, ic, vtb).

> [snip]
> > +static int kvmppc_handle_nested_exit(struct kvm_vcpu *vcpu)
> > +{
> > +	int r;
> > +	int srcu_idx;
> > +
> > +	vcpu->stat.sum_exits++;
> > +
> > +	/*
> > +	 * This can happen if an interrupt occurs in the last stages
> > +	 * of guest entry or the first stages of guest exit (i.e. after
> > +	 * setting paca->kvm_hstate.in_guest to KVM_GUEST_MODE_GUEST_HV
> > +	 * and before setting it to KVM_GUEST_MODE_HOST_HV).
> > +	 * That can happen due to a bug, or due to a machine check
> > +	 * occurring at just the wrong time.
> > +	 */
> > +	if (vcpu->arch.shregs.msr & MSR_HV) {
> > +		pr_emerg("KVM trap in HV mode while nested!\n");
> > +		pr_emerg("trap=0x%x | pc=0x%lx | msr=0x%llx\n",
> > +			 vcpu->arch.trap, kvmppc_get_pc(vcpu),
> > +			 vcpu->arch.shregs.msr);
> > +		kvmppc_dump_regs(vcpu);
> > +		return RESUME_HOST;
> 
> To make sure I'm understanding right here, RESUME_HOST will
> effectively mean resume the L0, and RESUME_GUEST (without additional
> processing) will mean resume the L2, right?

RESUME_HOST means resume L1 in fact, and RESUME_GUEST means resume L2.
We never go straight out from L2 to L0 because that would leave L1 in
the middle of a hypercall and we would have to have some sort of extra
state to record that fact.  Instead, if we need to do anything like
that (e.g. because of a signal pending for the task), we get to the
point where the H_ENTER_NESTED is finished and the return code is
stored in L1's r3 before exiting to the L0 userspace.
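
As a toy model of that control flow (the constants and names below
are stand-ins, not the kernel's definitions):

	#define RESUME_GUEST	0	/* made-up values */
	#define RESUME_HOST	1

	enum next { ENTER_L2, RETURN_TO_L1 };

	static enum next after_l2_exit(int r, long *l1_gpr3, long hret)
	{
		if (r == RESUME_GUEST)
			return ENTER_L2;	/* straight back into L2 */
		/*
		 * Complete H_ENTER_NESTED by storing its return value
		 * where L1 expects it (L1's r3); L0 userspace is only
		 * reached afterwards via L1's own exit path, so L1 is
		 * never left in the middle of a hypercall.
		 */
		*l1_gpr3 = hret;
		return RETURN_TO_L1;
	}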

> > @@ -3098,7 +3203,8 @@ static noinline void kvmppc_run_core(struct kvmppc_vcore *vc)
> >  /*
> >   * Load up hypervisor-mode registers on P9.
> >   */
> > -static int kvmhv_load_hv_regs_and_go(struct kvm_vcpu *vcpu, u64 time_limit)
> > +static int kvmhv_load_hv_regs_and_go(struct kvm_vcpu *vcpu, u64 time_limit,
> > +				     unsigned long lpcr)
> 
> I don't understand this change: you've added a parameter, but there's
> no change to the body of the function, so you're not actually using
> the new parameter.

Oops, it should indeed be used.  Thanks for pointing that out.
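
Presumably the fix is a one-liner along these lines inside
kvmhv_load_hv_regs_and_go() (the '-' line is my guess at the current
code, not quoted from the patch):

	-	mtspr(SPRN_LPCR, vc->lpcr);
	+	mtspr(SPRN_LPCR, lpcr);

i.e. the caller-supplied value, which can differ from the vcore's
when entering a nested guest, is what should reach the hardware LPCR.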

> [snip]
> > +/*
> > + * This never fails for a radix guest, as none of the operations it does
> > + * for a radix guest can fail or have a way to report failure.
> > + * kvmhv_run_single_vcpu() relies on this fact.
> > + */
> >  static int kvmhv_setup_mmu(struct kvm_vcpu *vcpu)
> >  {
> >  	int r = 0;
> > @@ -3684,12 +3814,15 @@ static int kvmppc_run_vcpu(struct kvm_run *kvm_run, struct kvm_vcpu *vcpu)
> >  	return vcpu->arch.ret;
> >  }
> >  
> > -static int kvmppc_run_single_vcpu(struct kvm_run *kvm_run,
> > -				  struct kvm_vcpu *vcpu, u64 time_limit)
> > +int kvmhv_run_single_vcpu(struct kvm_run *kvm_run,
> > +			  struct kvm_vcpu *vcpu, u64 time_limit,
> > +			  unsigned long lpcr)
> 
> IIRC this is the streamlined entry path introduced earlier in the
> series, yes?  Making the name change where it was introduced would
> avoid the extra churn here.

True.

> It'd be nice to have something here to make it more obvious that this
> path is only for radix guests, but I'm not really sure how to
> accomplish that.

It will probably end up getting used for nested HPT guests, when we
implement support for them.

Paul.

^ permalink raw reply	[flat|nested] 8+ messages in thread

* Re: [PATCH v2 19/33] KVM: PPC: Book3S HV: Nested guest entry via hypercall
  2018-10-02  8:00     ` Paul Mackerras
@ 2018-10-03  5:09       ` David Gibson
  0 siblings, 0 replies; 8+ messages in thread
From: David Gibson @ 2018-10-03  5:09 UTC (permalink / raw)
  To: Paul Mackerras; +Cc: linuxppc-dev, kvm-ppc, kvm

[-- Attachment #1: Type: text/plain, Size: 4741 bytes --]

On Tue, Oct 02, 2018 at 06:00:16PM +1000, Paul Mackerras wrote:
> On Tue, Oct 02, 2018 at 05:00:09PM +1000, David Gibson wrote:
> > On Fri, Sep 28, 2018 at 07:45:49PM +1000, Paul Mackerras wrote:
> > > This adds a new hypercall, H_ENTER_NESTED, which is used by a nested
> > > hypervisor to enter one of its nested guests.  The hypercall supplies
> > > register values in two structs.  Those values are copied by the level 0
> > > (L0) hypervisor (the one which is running in hypervisor mode) into the
> > > vcpu struct of the L1 guest, and then the guest is run until an
> > > interrupt or error occurs which needs to be reported to L1 via the
> > > hypercall return value.
> > > 
> > > Currently this assumes that the L0 and L1 hypervisors are the same
> > > endianness, and the structs passed as arguments are in native
> > > endianness.  If they are of different endianness, the version number
> > > check will fail and the hcall will be rejected.
> > > 
> > > Nested hypervisors do not support indep_threads_mode=N, so this adds
> > > code to print a warning message if the administrator has set
> > > indep_threads_mode=N, and treat it as Y.
> > > 
> > > Signed-off-by: Paul Mackerras <paulus@ozlabs.org>
> > 
> > [snip]
> > > +/* Register state for entering a nested guest with H_ENTER_NESTED */
> > > +struct hv_guest_state {
> > > +	u64 version;		/* version of this structure layout */
> > > +	u32 lpid;
> > > +	u32 vcpu_token;
> > > +	/* These registers are hypervisor privileged (at least for writing) */
> > > +	u64 lpcr;
> > > +	u64 pcr;
> > > +	u64 amor;
> > > +	u64 dpdes;
> > > +	u64 hfscr;
> > > +	s64 tb_offset;
> > > +	u64 dawr0;
> > > +	u64 dawrx0;
> > > +	u64 ciabr;
> > > +	u64 hdec_expiry;
> > > +	u64 purr;
> > > +	u64 spurr;
> > > +	u64 ic;
> > > +	u64 vtb;
> > > +	u64 hdar;
> > > +	u64 hdsisr;
> > > +	u64 heir;
> > > +	u64 asdr;
> > > +	/* These are OS privileged but need to be set late in guest entry */
> > > +	u64 srr0;
> > > +	u64 srr1;
> > > +	u64 sprg[4];
> > > +	u64 pidr;
> > > +	u64 cfar;
> > > +	u64 ppr;
> > > +};
> > 
> > I'm guessing the implication here is that most supervisor privileged
> > registers need to be set by the L1 to the L2 values, before making the
> > H_ENTER_NESTED call.  Is that right?
> 
> Right - the supervisor privileged registers that are here are the ones
> that the L1 guest needs to have valid at all times (e.g. sprgN), or
> that can get clobbered at any time (e.g. srr0/1), or that can't be
> set to guest values until just before guest entry (cfar, ppr), or that
> are not writable by the supervisor (purr, spurr, dpdes, ic, vtb).
> 
> > [snip]
> > > +static int kvmppc_handle_nested_exit(struct kvm_vcpu *vcpu)
> > > +{
> > > +	int r;
> > > +	int srcu_idx;
> > > +
> > > +	vcpu->stat.sum_exits++;
> > > +
> > > +	/*
> > > +	 * This can happen if an interrupt occurs in the last stages
> > > +	 * of guest entry or the first stages of guest exit (i.e. after
> > > +	 * setting paca->kvm_hstate.in_guest to KVM_GUEST_MODE_GUEST_HV
> > > +	 * and before setting it to KVM_GUEST_MODE_HOST_HV).
> > > +	 * That can happen due to a bug, or due to a machine check
> > > +	 * occurring at just the wrong time.
> > > +	 */
> > > +	if (vcpu->arch.shregs.msr & MSR_HV) {
> > > +		pr_emerg("KVM trap in HV mode while nested!\n");
> > > +		pr_emerg("trap=0x%x | pc=0x%lx | msr=0x%llx\n",
> > > +			 vcpu->arch.trap, kvmppc_get_pc(vcpu),
> > > +			 vcpu->arch.shregs.msr);
> > > +		kvmppc_dump_regs(vcpu);
> > > +		return RESUME_HOST;
> > 
> > To make sure I'm understanding right here, RESUME_HOST will
> > effectively mean resume the L0, and RESUME_GUEST (without additional
> > processing) will mean resume the L2, right?
> 
> RESUME_HOST means resume L1 in fact, and RESUME_GUEST means resume L2.

Hm, ok.  A comment saying that might make this easier to read.

> We never go straight out from L2 to L0 because that would leave L1 in
> the middle of a hypercall and we would have to have some sort of extra
> state to record that fact.  Instead, if we need to do anything like
> that (e.g. because of a signal pending for the task), we get to the
> point where the H_ENTER_NESTED is finished and the return code is
> stored in L1's r3 before exiting to the L0 userspace.

To clarify: if an L2 is running, but the L1 vcpu it's running on is
descheduled by the L0, does that mean we have to force an L2 exit to
L1 before we can continue scheduling whatever else on the L0?

-- 
David Gibson			| I'll have my music baroque, and my code
david AT gibson.dropbear.id.au	| minimalist, thank you.  NOT _the_ _other_
				| _way_ _around_!
http://www.ozlabs.org/~dgibson

[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 833 bytes --]

^ permalink raw reply	[flat|nested] 8+ messages in thread

end of thread, other threads:[~2018-10-03  5:18 UTC | newest]

Thread overview: 8+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
     [not found] <1538127963-15645-1-git-send-email-paulus@ozlabs.org>
     [not found] ` <1538127963-15645-3-git-send-email-paulus@ozlabs.org>
2018-10-02  4:49   ` [PATCH v2 02/33] KVM: PPC: Book3S HV: Remove left-over code in XICS-on-XIVE emulation David Gibson
     [not found] ` <1538127963-15645-7-git-send-email-paulus@ozlabs.org>
2018-10-02  5:15   ` [PATCH v2 06/33] KVM: PPC: Book3S: Rework TM save/restore code and make it C-callable David Gibson
     [not found] ` <1538127963-15645-19-git-send-email-paulus@ozlabs.org>
2018-10-02  6:01   ` [PATCH v2 18/33] KVM: PPC: Book3S HV: Framework and hcall stubs for nested virtualization David Gibson
2018-10-02  7:48     ` Paul Mackerras
     [not found] ` <1538127963-15645-20-git-send-email-paulus@ozlabs.org>
2018-10-02  7:00   ` [PATCH v2 19/33] KVM: PPC: Book3S HV: Nested guest entry via hypercall David Gibson
2018-10-02  8:00     ` Paul Mackerras
2018-10-03  5:09       ` David Gibson
     [not found] ` <1538127963-15645-21-git-send-email-paulus@ozlabs.org>
2018-10-02  7:02   ` [PATCH v2 20/33] KVM: PPC: Book3S HV: Use XICS hypercalls when running as a nested hypervisor David Gibson
