* [PATCH 0/5] KVM: PPC: Fix various bugs and vulnerabilities in HV KVM
@ 2012-11-22 9:24 Paul Mackerras
2012-11-22 9:25 ` [PATCH 1/5] KVM: PPC: Book3S HV: Handle guest-caused machine checks on POWER7 without panicking Paul Mackerras
` (5 more replies)
0 siblings, 6 replies; 33+ messages in thread
From: Paul Mackerras @ 2012-11-22 9:24 UTC (permalink / raw)
To: Alexander Graf; +Cc: kvm-ppc, kvm
This series of patches fixes various bugs that we have found recently.
The bugs fixed in patches 1, 3 and 4 are also vulnerabilities where
the guest could cause the host to crash or could access host memory
inappropriately. The bug fixed in patch 2 could cause the host to
hang or crash after the guest reboots. The bug fixed in patch 5 is a
simple thinko in the recently-added HPT reading code.
These patches are against Alex Graf's kvm-ppc-next branch. They only
affect HV code.
The first two patches were posted previously but received no comments.
Please apply - given the nature of these bugs I'd really like this
series to make it into the 3.8 merge window.
Paul.
arch/powerpc/include/asm/kvm_host.h | 5 +-
arch/powerpc/kernel/asm-offsets.c | 4 +-
arch/powerpc/kvm/Makefile | 1 +
arch/powerpc/kvm/book3s_64_mmu_hv.c | 29 ++++++--
arch/powerpc/kvm/book3s_hv.c | 9 ++-
arch/powerpc/kvm/book3s_hv_ras.c | 115 ++++++++++++++++++++++++++++++
arch/powerpc/kvm/book3s_hv_rm_mmu.c | 59 ++++++++++++++--
arch/powerpc/kvm/book3s_hv_rmhandlers.S | 117 +++++++++++++++++--------------
8 files changed, 271 insertions(+), 68 deletions(-)
^ permalink raw reply [flat|nested] 33+ messages in thread
* [PATCH 1/5] KVM: PPC: Book3S HV: Handle guest-caused machine checks on POWER7 without panicking
2012-11-22 9:24 [PATCH 0/5] KVM: PPC: Fix various bugs and vulnerabilities in HV KVM Paul Mackerras
@ 2012-11-22 9:25 ` Paul Mackerras
2012-11-23 14:13 ` Alexander Graf
2012-11-22 9:27 ` [PATCH 2/5] KVM: PPC: Book3S HV: Reset reverse-map chains when resetting the HPT Paul Mackerras
` (4 subsequent siblings)
5 siblings, 1 reply; 33+ messages in thread
From: Paul Mackerras @ 2012-11-22 9:25 UTC (permalink / raw)
To: Alexander Graf; +Cc: kvm-ppc, kvm
Currently, if a machine check interrupt happens while we are in the
guest, we exit the guest and call the host's machine check handler,
which tends to cause the host to panic. Some machine checks can be
triggered by the guest; for example, if the guest creates two entries
in the SLB that map the same effective address, and then accesses that
effective address, the CPU will take a machine check interrupt.
To handle this better, we firstly don't call the host's machine check
handler for machine checks that happen inside the guest. Instead we
call a new function, kvmppc_realmode_machine_check(), while still in
real mode before exiting the guest. On POWER7, it handles the cases
that the guest can trigger, either by flushing and reloading the SLB,
or by flushing the TLB, and then it delivers the machine check interrupt
to the guest. (On PPC970 we currently just exit the guest and leave it
up to kvmppc_handle_exit() to handle the condition, which it doesn't,
so the situation is no better -- but no worse -- on PPC970 now.)
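The load/store side of the recovery logic described above can be sketched in plain C. This is an illustrative simplification of the decision made by kvmppc_realmode_mc_power7() in the patch below; the enum and function name are hypothetical, not kernel API, though the bit definitions mirror the POWER7 values added in book3s_hv_ras.c:

```c
#include <assert.h>

/* POWER7 machine-check cause bits, as defined in the patch below */
#define SRR1_MC_LDSTERR        (1ul << (63 - 42))  /* error on load/store */
#define DSISR_MC_DERAT_MULTI   0x800               /* D-ERAT multi-hit */
#define DSISR_MC_TLB_MULTI     0x400               /* D-TLB multi-hit */
#define DSISR_MC_SLB_PARITY    0x100               /* SLB parity error */
#define DSISR_MC_SLB_MULTI     0x080               /* SLB multi-hit */
#define DSISR_MC_SLB_PARMULTI  0x040               /* SLB parity + multi-hit */

/* Illustrative action codes (not part of the kernel code) */
enum mc_action { MC_NONE = 0, MC_RELOAD_SLB = 1, MC_FLUSH_TLB = 2 };

/* Decide which recovery actions a load/store machine check needs:
 * any SLB-related cause means flush and reload the SLB (which also
 * clears the D-ERAT); a D-TLB multi-hit means flush the TLB. */
static inline int mc_ldst_action(unsigned long srr1, unsigned long dsisr)
{
	int action = MC_NONE;

	if (!(srr1 & SRR1_MC_LDSTERR))
		return action;		/* not a load/store error */
	if (dsisr & (DSISR_MC_SLB_PARMULTI | DSISR_MC_SLB_MULTI |
		     DSISR_MC_SLB_PARITY | DSISR_MC_DERAT_MULTI))
		action |= MC_RELOAD_SLB;
	if (dsisr & DSISR_MC_TLB_MULTI)
		action |= MC_FLUSH_TLB;
	return action;
}
```

The instruction-fetch causes in SRR1 are handled analogously by a switch on the IFETCH field.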
Signed-off-by: Paul Mackerras <paulus@samba.org>
---
arch/powerpc/kvm/Makefile | 1 +
arch/powerpc/kvm/book3s_hv_ras.c | 115 +++++++++++++++++++++++++++++++
arch/powerpc/kvm/book3s_hv_rmhandlers.S | 61 +++++++++-------
3 files changed, 153 insertions(+), 24 deletions(-)
create mode 100644 arch/powerpc/kvm/book3s_hv_ras.c
diff --git a/arch/powerpc/kvm/Makefile b/arch/powerpc/kvm/Makefile
index cd89658..1e473d4 100644
--- a/arch/powerpc/kvm/Makefile
+++ b/arch/powerpc/kvm/Makefile
@@ -73,6 +73,7 @@ kvm-book3s_64-builtin-objs-$(CONFIG_KVM_BOOK3S_64_HV) := \
book3s_hv_rmhandlers.o \
book3s_hv_rm_mmu.o \
book3s_64_vio_hv.o \
+ book3s_hv_ras.o \
book3s_hv_builtin.o
kvm-book3s_64-module-objs := \
diff --git a/arch/powerpc/kvm/book3s_hv_ras.c b/arch/powerpc/kvm/book3s_hv_ras.c
new file mode 100644
index 0000000..2646d07
--- /dev/null
+++ b/arch/powerpc/kvm/book3s_hv_ras.c
@@ -0,0 +1,115 @@
+/*
+ * This program is free software; you can redistribute it and/or modify
+ * it under the terms of the GNU General Public License, version 2, as
+ * published by the Free Software Foundation.
+ *
+ * Copyright 2012 Paul Mackerras, IBM Corp. <paulus@au1.ibm.com>
+ */
+
+#include <linux/types.h>
+#include <linux/string.h>
+#include <linux/kvm.h>
+#include <linux/kvm_host.h>
+
+/* SRR1 bits for machine check on POWER7 */
+#define SRR1_MC_LDSTERR (1ul << (63-42))
+#define SRR1_MC_IFETCH_SH (63-45)
+#define SRR1_MC_IFETCH_MASK 0x7
+#define SRR1_MC_IFETCH_SLBPAR 2 /* SLB parity error */
+#define SRR1_MC_IFETCH_SLBMULTI 3 /* SLB multi-hit */
+#define SRR1_MC_IFETCH_SLBPARMULTI 4 /* SLB parity + multi-hit */
+#define SRR1_MC_IFETCH_TLBMULTI 5 /* I-TLB multi-hit */
+
+/* DSISR bits for machine check on POWER7 */
+#define DSISR_MC_DERAT_MULTI 0x800 /* D-ERAT multi-hit */
+#define DSISR_MC_TLB_MULTI 0x400 /* D-TLB multi-hit */
+#define DSISR_MC_SLB_PARITY 0x100 /* SLB parity error */
+#define DSISR_MC_SLB_MULTI 0x080 /* SLB multi-hit */
+#define DSISR_MC_SLB_PARMULTI 0x040 /* SLB parity + multi-hit */
+
+/* POWER7 SLB flush and reload */
+static void reload_slb(struct kvm_vcpu *vcpu)
+{
+ struct slb_shadow *slb;
+ unsigned long i, n;
+
+ /* First clear out SLB */
+ asm volatile("slbmte %0,%0; slbia" : : "r" (0));
+
+ /* Do they have an SLB shadow buffer registered? */
+ slb = vcpu->arch.slb_shadow.pinned_addr;
+ if (!slb)
+ return;
+
+ /* Sanity check */
+ n = slb->persistent;
+ if (n > SLB_MIN_SIZE)
+ n = SLB_MIN_SIZE;
+ if ((void *) &slb->save_area[n] > vcpu->arch.slb_shadow.pinned_end)
+ return;
+
+ /* Load up the SLB from that */
+ for (i = 0; i < n; ++i) {
+ unsigned long rb = slb->save_area[i].esid;
+ unsigned long rs = slb->save_area[i].vsid;
+
+ rb = (rb & ~0xFFFul) | i; /* insert entry number */
+ asm volatile("slbmte %0,%1" : : "r" (rs), "r" (rb));
+ }
+}
+
+/* POWER7 TLB flush */
+static void flush_tlb_power7(struct kvm_vcpu *vcpu)
+{
+ unsigned long i, rb;
+
+ rb = 0x800; /* IS field = 0b10, flush congruence class */
+ for (i = 0; i < 128; ++i) {
+ asm volatile("tlbiel %0" : : "r" (rb));
+ rb += 0x1000;
+ }
+}
+
+/*
+ * On POWER7, see if we can handle a machine check that occurred inside
+ * the guest in real mode, without switching to the host partition.
+ *
+ * Returns: 0 => exit guest, 1 => deliver machine check to guest
+ */
+static long kvmppc_realmode_mc_power7(struct kvm_vcpu *vcpu)
+{
+ unsigned long srr1 = vcpu->arch.shregs.msr;
+
+ if (srr1 & SRR1_MC_LDSTERR) {
+ /* error on load/store */
+ unsigned long dsisr = vcpu->arch.shregs.dsisr;
+
+ if (dsisr & (DSISR_MC_SLB_PARMULTI | DSISR_MC_SLB_MULTI |
+ DSISR_MC_SLB_PARITY | DSISR_MC_DERAT_MULTI))
+ /* flush and reload SLB; flushes D-ERAT too */
+ reload_slb(vcpu);
+ if (dsisr & DSISR_MC_TLB_MULTI)
+ flush_tlb_power7(vcpu);
+ }
+
+ switch ((srr1 >> SRR1_MC_IFETCH_SH) & SRR1_MC_IFETCH_MASK) {
+ case SRR1_MC_IFETCH_SLBPAR:
+ case SRR1_MC_IFETCH_SLBMULTI:
+ case SRR1_MC_IFETCH_SLBPARMULTI:
+ reload_slb(vcpu);
+ break;
+ case SRR1_MC_IFETCH_TLBMULTI:
+ flush_tlb_power7(vcpu);
+ break;
+ }
+
+ return 1;
+}
+
+long kvmppc_realmode_machine_check(struct kvm_vcpu *vcpu)
+{
+ if (cpu_has_feature(CPU_FTR_ARCH_206))
+ return kvmppc_realmode_mc_power7(vcpu);
+
+ return 0;
+}
diff --git a/arch/powerpc/kvm/book3s_hv_rmhandlers.S b/arch/powerpc/kvm/book3s_hv_rmhandlers.S
index 690d112..b31fb4f 100644
--- a/arch/powerpc/kvm/book3s_hv_rmhandlers.S
+++ b/arch/powerpc/kvm/book3s_hv_rmhandlers.S
@@ -682,8 +682,7 @@ BEGIN_FTR_SECTION
1:
END_FTR_SECTION_IFSET(CPU_FTR_ARCH_206)
-nohpte_cont:
-hcall_real_cont: /* r9 = vcpu, r12 = trap, r13 = paca */
+guest_exit_cont: /* r9 = vcpu, r12 = trap, r13 = paca */
/* Save DEC */
mfspr r5,SPRN_DEC
mftb r6
@@ -704,6 +703,11 @@ END_FTR_SECTION_IFSET(CPU_FTR_ARCH_206)
std r6, VCPU_FAULT_DAR(r9)
stw r7, VCPU_FAULT_DSISR(r9)
+ /* See if it is a machine check */
+ cmpwi r12, BOOK3S_INTERRUPT_MACHINE_CHECK
+ beq machine_check_realmode
+mc_cont:
+
/* Save guest CTRL register, set runlatch to 1 */
6: mfspr r6,SPRN_CTRLF
stw r6,VCPU_CTRL(r9)
@@ -1114,20 +1118,21 @@ END_FTR_SECTION_IFSET(CPU_FTR_ARCH_201)
isync
23:
/*
- * For external and machine check interrupts, we need
+ * For external interrupts, we need
* to call the Linux handler to process the interrupt.
* We do that by jumping to the interrupt vector address
* which we have in r12. The [h]rfid at the end of the
* handler will return to the book3s_hv_interrupts.S code.
* For other interrupts we do the rfid to get back
- * to the book3s_interrupts.S code here.
+ * to the book3s_hv_interrupts.S code here.
*/
ld r8, HSTATE_VMHANDLER(r13)
ld r7, HSTATE_HOST_MSR(r13)
cmpwi r12, BOOK3S_INTERRUPT_EXTERNAL
+BEGIN_FTR_SECTION
beq 11f
- cmpwi r12, BOOK3S_INTERRUPT_MACHINE_CHECK
+END_FTR_SECTION_IFSET(CPU_FTR_ARCH_206)
/* RFI into the highmem handler, or branch to interrupt handler */
12: mfmsr r6
@@ -1140,11 +1145,8 @@ END_FTR_SECTION_IFSET(CPU_FTR_ARCH_201)
beqctr
RFI
-11:
-BEGIN_FTR_SECTION
- b 12b
-END_FTR_SECTION_IFSET(CPU_FTR_ARCH_201)
- mtspr SPRN_HSRR0, r8
+ /* On POWER7, we have external interrupts set to use HSRR0/1 */
+11: mtspr SPRN_HSRR0, r8
mtspr SPRN_HSRR1, r7
ba 0x500
@@ -1180,7 +1182,7 @@ kvmppc_hdsi:
cmpdi r3, 0 /* retry the instruction */
beq 6f
cmpdi r3, -1 /* handle in kernel mode */
- beq nohpte_cont
+ beq guest_exit_cont
cmpdi r3, -2 /* MMIO emulation; need instr word */
beq 2f
@@ -1194,6 +1196,7 @@ kvmppc_hdsi:
li r10, BOOK3S_INTERRUPT_DATA_STORAGE
li r11, (MSR_ME << 1) | 1 /* synthesize MSR_SF | MSR_ME */
rotldi r11, r11, 63
+fast_interrupt_c_return:
6: ld r7, VCPU_CTR(r9)
lwz r8, VCPU_XER(r9)
mtctr r7
@@ -1226,7 +1229,7 @@ kvmppc_hdsi:
/* Unset guest mode. */
li r0, KVM_GUEST_MODE_NONE
stb r0, HSTATE_IN_GUEST(r13)
- b nohpte_cont
+ b guest_exit_cont
/*
* Similarly for an HISI, reflect it to the guest as an ISI unless
@@ -1252,9 +1255,9 @@ kvmppc_hisi:
ld r11, VCPU_MSR(r9)
li r12, BOOK3S_INTERRUPT_H_INST_STORAGE
cmpdi r3, 0 /* retry the instruction */
- beq 6f
+ beq fast_interrupt_c_return
cmpdi r3, -1 /* handle in kernel mode */
- beq nohpte_cont
+ beq guest_exit_cont
/* Synthesize an ISI for the guest */
mr r11, r3
@@ -1263,12 +1266,7 @@ kvmppc_hisi:
li r10, BOOK3S_INTERRUPT_INST_STORAGE
li r11, (MSR_ME << 1) | 1 /* synthesize MSR_SF | MSR_ME */
rotldi r11, r11, 63
-6: ld r7, VCPU_CTR(r9)
- lwz r8, VCPU_XER(r9)
- mtctr r7
- mtxer r8
- mr r4, r9
- b fast_guest_return
+ b fast_interrupt_c_return
3: ld r6, VCPU_KVM(r9) /* not relocated, use VRMA */
ld r5, KVM_VRMA_SLB_V(r6)
@@ -1284,14 +1282,14 @@ kvmppc_hisi:
hcall_try_real_mode:
ld r3,VCPU_GPR(R3)(r9)
andi. r0,r11,MSR_PR
- bne hcall_real_cont
+ bne guest_exit_cont
clrrdi r3,r3,2
cmpldi r3,hcall_real_table_end - hcall_real_table
- bge hcall_real_cont
+ bge guest_exit_cont
LOAD_REG_ADDR(r4, hcall_real_table)
lwzx r3,r3,r4
cmpwi r3,0
- beq hcall_real_cont
+ beq guest_exit_cont
add r3,r3,r4
mtctr r3
mr r3,r9 /* get vcpu pointer */
@@ -1312,7 +1310,7 @@ hcall_real_fallback:
li r12,BOOK3S_INTERRUPT_SYSCALL
ld r9, HSTATE_KVM_VCPU(r13)
- b hcall_real_cont
+ b guest_exit_cont
.globl hcall_real_table
hcall_real_table:
@@ -1571,6 +1569,21 @@ kvm_cede_exit:
li r3,H_TOO_HARD
blr
+ /* Try to handle a machine check in real mode */
+machine_check_realmode:
+ mr r3, r9 /* get vcpu pointer */
+ bl .kvmppc_realmode_machine_check
+ nop
+ cmpdi r3, 0 /* continue exiting from guest? */
+ ld r9, HSTATE_KVM_VCPU(r13)
+ li r12, BOOK3S_INTERRUPT_MACHINE_CHECK
+ beq mc_cont
+ /* If not, deliver a machine check. SRR0/1 are already set */
+ li r10, BOOK3S_INTERRUPT_MACHINE_CHECK
+ li r11, (MSR_ME << 1) | 1 /* synthesize MSR_SF | MSR_ME */
+ rotldi r11, r11, 63
+ b fast_interrupt_c_return
+
secondary_too_late:
ld r5,HSTATE_KVM_VCORE(r13)
HMT_LOW
--
1.7.10.rc3.219.g53414
* [PATCH 2/5] KVM: PPC: Book3S HV: Reset reverse-map chains when resetting the HPT
2012-11-22 9:24 [PATCH 0/5] KVM: PPC: Fix various bugs and vulnerabilities in HV KVM Paul Mackerras
2012-11-22 9:25 ` [PATCH 1/5] KVM: PPC: Book3S HV: Handle guest-caused machine checks on POWER7 without panicking Paul Mackerras
@ 2012-11-22 9:27 ` Paul Mackerras
2012-11-22 9:28 ` [PATCH 3/5] KVM: PPC: Book3S HV: Improve handling of local vs. global TLB invalidations Paul Mackerras
` (3 subsequent siblings)
5 siblings, 0 replies; 33+ messages in thread
From: Paul Mackerras @ 2012-11-22 9:27 UTC (permalink / raw)
To: Alexander Graf; +Cc: kvm-ppc, kvm
With HV-style KVM, we maintain reverse-mapping lists that enable us to
find all the HPT (hashed page table) entries that reference each guest
physical page, with the heads of the lists in the memslot->arch.rmap
arrays. When we reset the HPT (i.e. when we reboot the VM), we clear
out all the HPT entries but we were not clearing out the reverse
mapping lists. The result is that as we create new HPT entries, the
lists get corrupted, which can easily lead to loops, resulting in the
host kernel hanging when it tries to traverse those lists.
This fixes the problem by zeroing out all the reverse mapping lists
when we zero out the HPT. This incidentally means that we are also
zeroing our record of the referenced and changed bits (not the bits
in the Linux PTEs, used by the Linux MM subsystem, but the bits used
by the KVM_GET_DIRTY_LOG ioctl, and those used by kvm_age_hva() and
kvm_test_age_hva()).
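The essence of the fix can be sketched with a toy model (illustrative names, not kernel code): each memslot carries an array of reverse-map list heads, and resetting the HPT must also zero every head so no new HPT entry can be chained onto a stale list:

```c
#include <assert.h>
#include <string.h>

/* Toy stand-in for the relevant fields of a kvm_memory_slot */
struct toy_memslot {
	unsigned long npages;
	unsigned long *rmap;	/* one list head per guest page; 0 = empty */
};

/* Analogue of the new kvmppc_rmap_reset(): forget every reverse-map
 * chain.  This also discards the referenced/changed bits kept in the
 * rmap entries, which is acceptable across a full VM reset. */
static void toy_rmap_reset(struct toy_memslot *slots, int nslots)
{
	int i;

	for (i = 0; i < nslots; ++i)
		memset(slots[i].rmap, 0,
		       slots[i].npages * sizeof(*slots[i].rmap));
}
```

With the heads zeroed, rebuilding the HPT after a reboot starts from empty chains instead of linking new entries onto indices left over from the old HPT.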
Signed-off-by: Paul Mackerras <paulus@samba.org>
---
arch/powerpc/kvm/book3s_64_mmu_hv.c | 24 ++++++++++++++++++++++++
1 file changed, 24 insertions(+)
diff --git a/arch/powerpc/kvm/book3s_64_mmu_hv.c b/arch/powerpc/kvm/book3s_64_mmu_hv.c
index 0aa4073..1029e22 100644
--- a/arch/powerpc/kvm/book3s_64_mmu_hv.c
+++ b/arch/powerpc/kvm/book3s_64_mmu_hv.c
@@ -46,6 +46,7 @@
static long kvmppc_virtmode_do_h_enter(struct kvm *kvm, unsigned long flags,
long pte_index, unsigned long pteh,
unsigned long ptel, unsigned long *pte_idx_ret);
+static void kvmppc_rmap_reset(struct kvm *kvm);
long kvmppc_alloc_hpt(struct kvm *kvm, u32 *htab_orderp)
{
@@ -144,6 +145,10 @@ long kvmppc_alloc_reset_hpt(struct kvm *kvm, u32 *htab_orderp)
/* Set the entire HPT to 0, i.e. invalid HPTEs */
memset((void *)kvm->arch.hpt_virt, 0, 1ul << order);
/*
+ * Reset all the reverse-mapping chains for all memslots
+ */
+ kvmppc_rmap_reset(kvm);
+ /*
* Set the whole last_vcpu array to an invalid vcpu number.
* This ensures that each vcpu will flush its TLB on next entry.
*/
@@ -772,6 +777,25 @@ int kvmppc_book3s_hv_page_fault(struct kvm_run *run, struct kvm_vcpu *vcpu,
goto out_put;
}
+static void kvmppc_rmap_reset(struct kvm *kvm)
+{
+ struct kvm_memslots *slots;
+ struct kvm_memory_slot *memslot;
+ int srcu_idx;
+
+ srcu_idx = srcu_read_lock(&kvm->srcu);
+ slots = kvm->memslots;
+ kvm_for_each_memslot(memslot, slots) {
+ /*
+ * This assumes it is acceptable to lose reference and
+ * change bits across a reset.
+ */
+ memset(memslot->arch.rmap, 0,
+ memslot->npages * sizeof(*memslot->arch.rmap));
+ }
+ srcu_read_unlock(&kvm->srcu, srcu_idx);
+}
+
static int kvm_handle_hva_range(struct kvm *kvm,
unsigned long start,
unsigned long end,
--
1.7.10.rc3.219.g53414
* [PATCH 3/5] KVM: PPC: Book3S HV: Improve handling of local vs. global TLB invalidations
2012-11-22 9:24 [PATCH 0/5] KVM: PPC: Fix various bugs and vulnerabilities in HV KVM Paul Mackerras
2012-11-22 9:25 ` [PATCH 1/5] KVM: PPC: Book3S HV: Handle guest-caused machine checks on POWER7 without panicking Paul Mackerras
2012-11-22 9:27 ` [PATCH 2/5] KVM: PPC: Book3S HV: Reset reverse-map chains when resetting the HPT Paul Mackerras
@ 2012-11-22 9:28 ` Paul Mackerras
2012-11-23 15:43 ` Alexander Graf
2012-11-22 9:28 ` [PATCH 4/5] KVM: PPC: Book3S HV: Don't give the guest RW access to RO pages Paul Mackerras
` (2 subsequent siblings)
5 siblings, 1 reply; 33+ messages in thread
From: Paul Mackerras @ 2012-11-22 9:28 UTC (permalink / raw)
To: Alexander Graf; +Cc: kvm-ppc, kvm
When we change or remove a HPT (hashed page table) entry, we can do
either a global TLB invalidation (tlbie) that works across the whole
machine, or a local invalidation (tlbiel) that only affects this core.
Currently we do local invalidations if the VM has only one vcpu or if
the guest requests it with the H_LOCAL flag, though the guest Linux
kernel currently doesn't ever use H_LOCAL. Then, to cope with the
possibility that vcpus moving around to different physical cores might
expose stale TLB entries, there is some code in kvmppc_hv_entry to
flush the whole TLB of entries for this VM if either this vcpu is now
running on a different physical core from where it last ran, or if this
physical core last ran a different vcpu.
There are a number of problems on POWER7 with this as it stands:
- The TLB invalidation is done per thread, whereas it only needs to be
done per core, since the TLB is shared between the threads.
- With the possibility of the host paging out guest pages, the use of
H_LOCAL by an SMP guest is dangerous since the guest could possibly
retain and use a stale TLB entry pointing to a page that had been
removed from the guest.
- The TLB invalidations that we do when a vcpu moves from one physical
core to another are unnecessary in the case of an SMP guest that isn't
using H_LOCAL.
- The optimization of using local invalidations rather than global should
apply to guests with one virtual core, not just one vcpu.
(None of this applies on PPC970, since there we always have to
invalidate the whole TLB when entering and leaving the guest, and we
can't support paging out guest memory.)
To fix these problems and simplify the code, we now maintain a simple
cpumask of which cpus need to flush the TLB on entry to the guest.
(This is indexed by cpu, though we only ever use the bits for thread
0 of each core.) Whenever we do a local TLB invalidation, we set the
bits for every cpu except the bit for thread 0 of the core that we're
currently running on. Whenever we enter a guest, we test and clear the
bit for our core, and flush the TLB if it was set.
On initial startup of the VM, and when resetting the HPT, we set all the
bits in the need_tlb_flush cpumask, since any core could potentially have
stale TLB entries left over from a previous VM that used the same LPID,
or from the previous contents of the HPT.
Then, we maintain a count of the number of online virtual cores, and use
that, rather than the number of online vcpus, when deciding whether to
use a local invalidation. The code that makes this decision is extracted
into a new function, global_invalidates(). For multi-core guests on
POWER7 (i.e. when we are using mmu notifiers), we now never do local
invalidations regardless of the H_LOCAL flag.
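The decision table implemented by global_invalidates() in the patch below can be written out as a pure function. This is a hedged sketch with illustrative parameter names (the real function takes the kvm struct and hcall flags, and also marks the other cores' bits in need_tlb_flush when it chooses tlbiel, which is omitted here):

```c
#include <assert.h>

/* Return 1 if a global tlbie is needed, 0 if tlbiel suffices.
 * Mirrors the logic of the new global_invalidates() helper. */
static int needs_global_tlbie(int online_vcores, int vcore_running,
			      int using_mmu_notifiers, int h_local_flag)
{
	/* A single, currently running virtual core can use tlbiel,
	 * provided all other cores are flagged to flush the TLB
	 * before they next enter this guest (not shown here). */
	if (online_vcores == 1 && vcore_running)
		return 0;
	/* With mmu notifiers the host may page out guest memory, so
	 * an SMP guest's H_LOCAL hint cannot be trusted: go global. */
	if (using_mmu_notifiers)
		return 1;
	/* Otherwise (e.g. PPC970) honour the guest's H_LOCAL flag. */
	return !h_local_flag;
}
```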
Signed-off-by: Paul Mackerras <paulus@samba.org>
---
arch/powerpc/include/asm/kvm_host.h | 5 +--
arch/powerpc/kernel/asm-offsets.c | 4 +--
arch/powerpc/kvm/book3s_64_mmu_hv.c | 7 ++--
arch/powerpc/kvm/book3s_hv.c | 9 ++++-
arch/powerpc/kvm/book3s_hv_rm_mmu.c | 37 +++++++++++++++++---
arch/powerpc/kvm/book3s_hv_rmhandlers.S | 56 ++++++++++++++-----------------
6 files changed, 73 insertions(+), 45 deletions(-)
diff --git a/arch/powerpc/include/asm/kvm_host.h b/arch/powerpc/include/asm/kvm_host.h
index 58c7264..62fbd38 100644
--- a/arch/powerpc/include/asm/kvm_host.h
+++ b/arch/powerpc/include/asm/kvm_host.h
@@ -246,11 +246,12 @@ struct kvm_arch {
int using_mmu_notifiers;
u32 hpt_order;
atomic_t vcpus_running;
+ u32 online_vcores;
unsigned long hpt_npte;
unsigned long hpt_mask;
atomic_t hpte_mod_interest;
spinlock_t slot_phys_lock;
- unsigned short last_vcpu[NR_CPUS];
+ cpumask_t need_tlb_flush;
struct kvmppc_vcore *vcores[KVM_MAX_VCORES];
struct kvmppc_linear_info *hpt_li;
#endif /* CONFIG_KVM_BOOK3S_64_HV */
@@ -275,6 +276,7 @@ struct kvmppc_vcore {
int nap_count;
int napping_threads;
u16 pcpu;
+ u16 last_cpu;
u8 vcore_state;
u8 in_guest;
struct list_head runnable_threads;
@@ -523,7 +525,6 @@ struct kvm_vcpu_arch {
u64 dec_jiffies;
u64 dec_expires;
unsigned long pending_exceptions;
- u16 last_cpu;
u8 ceded;
u8 prodded;
u32 last_inst;
diff --git a/arch/powerpc/kernel/asm-offsets.c b/arch/powerpc/kernel/asm-offsets.c
index 7523539..4e23ba2 100644
--- a/arch/powerpc/kernel/asm-offsets.c
+++ b/arch/powerpc/kernel/asm-offsets.c
@@ -441,8 +441,7 @@ int main(void)
DEFINE(KVM_HOST_LPCR, offsetof(struct kvm, arch.host_lpcr));
DEFINE(KVM_HOST_SDR1, offsetof(struct kvm, arch.host_sdr1));
DEFINE(KVM_TLBIE_LOCK, offsetof(struct kvm, arch.tlbie_lock));
- DEFINE(KVM_ONLINE_CPUS, offsetof(struct kvm, online_vcpus.counter));
- DEFINE(KVM_LAST_VCPU, offsetof(struct kvm, arch.last_vcpu));
+ DEFINE(KVM_NEED_FLUSH, offsetof(struct kvm, arch.need_tlb_flush.bits));
DEFINE(KVM_LPCR, offsetof(struct kvm, arch.lpcr));
DEFINE(KVM_RMOR, offsetof(struct kvm, arch.rmor));
DEFINE(KVM_VRMA_SLB_V, offsetof(struct kvm, arch.vrma_slb_v));
@@ -470,7 +469,6 @@ int main(void)
DEFINE(VCPU_SLB, offsetof(struct kvm_vcpu, arch.slb));
DEFINE(VCPU_SLB_MAX, offsetof(struct kvm_vcpu, arch.slb_max));
DEFINE(VCPU_SLB_NR, offsetof(struct kvm_vcpu, arch.slb_nr));
- DEFINE(VCPU_LAST_CPU, offsetof(struct kvm_vcpu, arch.last_cpu));
DEFINE(VCPU_FAULT_DSISR, offsetof(struct kvm_vcpu, arch.fault_dsisr));
DEFINE(VCPU_FAULT_DAR, offsetof(struct kvm_vcpu, arch.fault_dar));
DEFINE(VCPU_LAST_INST, offsetof(struct kvm_vcpu, arch.last_inst));
diff --git a/arch/powerpc/kvm/book3s_64_mmu_hv.c b/arch/powerpc/kvm/book3s_64_mmu_hv.c
index 1029e22..2d61e01 100644
--- a/arch/powerpc/kvm/book3s_64_mmu_hv.c
+++ b/arch/powerpc/kvm/book3s_64_mmu_hv.c
@@ -148,11 +148,8 @@ long kvmppc_alloc_reset_hpt(struct kvm *kvm, u32 *htab_orderp)
* Reset all the reverse-mapping chains for all memslots
*/
kvmppc_rmap_reset(kvm);
- /*
- * Set the whole last_vcpu array to an invalid vcpu number.
- * This ensures that each vcpu will flush its TLB on next entry.
- */
- memset(kvm->arch.last_vcpu, 0xff, sizeof(kvm->arch.last_vcpu));
+ /* Ensure that each vcpu will flush its TLB on next entry. */
+ cpumask_setall(&kvm->arch.need_tlb_flush);
*htab_orderp = order;
err = 0;
} else {
diff --git a/arch/powerpc/kvm/book3s_hv.c b/arch/powerpc/kvm/book3s_hv.c
index a4f59db..ddbec60 100644
--- a/arch/powerpc/kvm/book3s_hv.c
+++ b/arch/powerpc/kvm/book3s_hv.c
@@ -853,7 +853,6 @@ struct kvm_vcpu *kvmppc_core_vcpu_create(struct kvm *kvm, unsigned int id)
goto free_vcpu;
vcpu->arch.shared = &vcpu->arch.shregs;
- vcpu->arch.last_cpu = -1;
vcpu->arch.mmcr[0] = MMCR0_FC;
vcpu->arch.ctrl = CTRL_RUNLATCH;
/* default to host PVR, since we can't spoof it */
@@ -880,6 +879,7 @@ struct kvm_vcpu *kvmppc_core_vcpu_create(struct kvm *kvm, unsigned int id)
vcore->preempt_tb = TB_NIL;
}
kvm->arch.vcores[core] = vcore;
+ kvm->arch.online_vcores++;
}
mutex_unlock(&kvm->lock);
@@ -1802,6 +1802,13 @@ int kvmppc_core_init_vm(struct kvm *kvm)
return -ENOMEM;
kvm->arch.lpid = lpid;
+ /*
+ * Since we don't flush the TLB when tearing down a VM,
+ * and this lpid might have previously been used,
+ * make sure we flush on each core before running the new VM.
+ */
+ cpumask_setall(&kvm->arch.need_tlb_flush);
+
INIT_LIST_HEAD(&kvm->arch.spapr_tce_tables);
kvm->arch.rma = NULL;
diff --git a/arch/powerpc/kvm/book3s_hv_rm_mmu.c b/arch/powerpc/kvm/book3s_hv_rm_mmu.c
index fc3da32..7e1f7e2 100644
--- a/arch/powerpc/kvm/book3s_hv_rm_mmu.c
+++ b/arch/powerpc/kvm/book3s_hv_rm_mmu.c
@@ -35,6 +35,37 @@ static void *real_vmalloc_addr(void *x)
return __va(addr);
}
+/* Return 1 if we need to do a global tlbie, 0 if we can use tlbiel */
+static int global_invalidates(struct kvm *kvm, unsigned long flags)
+{
+ int global;
+
+ /*
+ * If there is only one vcore, and it's currently running,
+ * we can use tlbiel as long as we mark all other physical
+ * cores as potentially having stale TLB entries for this lpid.
+ * If we're not using MMU notifiers, we never take pages away
+ * from the guest, so we can use tlbiel if requested.
+ * Otherwise, don't use tlbiel.
+ */
+ if (kvm->arch.online_vcores == 1 && local_paca->kvm_hstate.kvm_vcore)
+ global = 0;
+ else if (kvm->arch.using_mmu_notifiers)
+ global = 1;
+ else
+ global = !(flags & H_LOCAL);
+
+ if (!global) {
+ /* any other core might now have stale TLB entries... */
+ smp_wmb();
+ cpumask_setall(&kvm->arch.need_tlb_flush);
+ cpumask_clear_cpu(local_paca->kvm_hstate.kvm_vcore->pcpu,
+ &kvm->arch.need_tlb_flush);
+ }
+
+ return global;
+}
+
/*
* Add this HPTE into the chain for the real page.
* Must be called with the chain locked; it unlocks the chain.
@@ -390,7 +421,7 @@ long kvmppc_do_h_remove(struct kvm *kvm, unsigned long flags,
if (v & HPTE_V_VALID) {
hpte[0] &= ~HPTE_V_VALID;
rb = compute_tlbie_rb(v, hpte[1], pte_index);
- if (!(flags & H_LOCAL) && atomic_read(&kvm->online_vcpus) > 1) {
+ if (global_invalidates(kvm, flags)) {
while (!try_lock_tlbie(&kvm->arch.tlbie_lock))
cpu_relax();
asm volatile("ptesync" : : : "memory");
@@ -565,8 +596,6 @@ long kvmppc_h_protect(struct kvm_vcpu *vcpu, unsigned long flags,
return H_NOT_FOUND;
}
- if (atomic_read(&kvm->online_vcpus) == 1)
- flags |= H_LOCAL;
v = hpte[0];
bits = (flags << 55) & HPTE_R_PP0;
bits |= (flags << 48) & HPTE_R_KEY_HI;
@@ -587,7 +616,7 @@ long kvmppc_h_protect(struct kvm_vcpu *vcpu, unsigned long flags,
if (v & HPTE_V_VALID) {
rb = compute_tlbie_rb(v, r, pte_index);
hpte[0] = v & ~HPTE_V_VALID;
- if (!(flags & H_LOCAL)) {
+ if (global_invalidates(kvm, flags)) {
while(!try_lock_tlbie(&kvm->arch.tlbie_lock))
cpu_relax();
asm volatile("ptesync" : : : "memory");
diff --git a/arch/powerpc/kvm/book3s_hv_rmhandlers.S b/arch/powerpc/kvm/book3s_hv_rmhandlers.S
index b31fb4f..dbb5ced 100644
--- a/arch/powerpc/kvm/book3s_hv_rmhandlers.S
+++ b/arch/powerpc/kvm/book3s_hv_rmhandlers.S
@@ -313,7 +313,33 @@ END_FTR_SECTION_IFSET(CPU_FTR_ARCH_201)
mtspr SPRN_SDR1,r6 /* switch to partition page table */
mtspr SPRN_LPID,r7
isync
+
+ /* See if we need to flush the TLB */
+ lhz r6,PACAPACAINDEX(r13) /* test_bit(cpu, need_tlb_flush) */
+ clrldi r7,r6,64-6 /* extract bit number (6 bits) */
+ srdi r6,r6,6 /* doubleword number */
+ sldi r6,r6,3 /* address offset */
+ add r6,r6,r9
+ addi r6,r6,KVM_NEED_FLUSH /* dword in kvm->arch.need_tlb_flush */
li r0,1
+ sld r0,r0,r7
+ ld r7,0(r6)
+ and. r7,r7,r0
+ beq 22f
+23: ldarx r7,0,r6 /* if set, clear the bit */
+ andc r7,r7,r0
+ stdcx. r7,0,r6
+ bne 23b
+ li r6,128 /* and flush the TLB */
+ mtctr r6
+ li r7,0x800 /* IS field = 0b10 */
+ ptesync
+28: tlbiel r7
+ addi r7,r7,0x1000
+ bdnz 28b
+ ptesync
+
+22: li r0,1
stb r0,VCORE_IN_GUEST(r5) /* signal secondaries to continue */
b 10f
@@ -336,36 +362,6 @@ END_FTR_SECTION_IFSET(CPU_FTR_ARCH_201)
mr r9,r4
blt hdec_soon
- /*
- * Invalidate the TLB if we could possibly have stale TLB
- * entries for this partition on this core due to the use
- * of tlbiel.
- * XXX maybe only need this on primary thread?
- */
- ld r9,VCPU_KVM(r4) /* pointer to struct kvm */
- lwz r5,VCPU_VCPUID(r4)
- lhz r6,PACAPACAINDEX(r13)
- rldimi r6,r5,0,62 /* XXX map as if threads 1:1 p:v */
- lhz r8,VCPU_LAST_CPU(r4)
- sldi r7,r6,1 /* see if this is the same vcpu */
- add r7,r7,r9 /* as last ran on this pcpu */
- lhz r0,KVM_LAST_VCPU(r7)
- cmpw r6,r8 /* on the same cpu core as last time? */
- bne 3f
- cmpw r0,r5 /* same vcpu as this core last ran? */
- beq 1f
-3: sth r6,VCPU_LAST_CPU(r4) /* if not, invalidate partition TLB */
- sth r5,KVM_LAST_VCPU(r7)
- li r6,128
- mtctr r6
- li r7,0x800 /* IS field = 0b10 */
- ptesync
-2: tlbiel r7
- addi r7,r7,0x1000
- bdnz 2b
- ptesync
-1:
-
/* Save purr/spurr */
mfspr r5,SPRN_PURR
mfspr r6,SPRN_SPURR
--
1.7.10.rc3.219.g53414
* [PATCH 4/5] KVM: PPC: Book3S HV: Don't give the guest RW access to RO pages
2012-11-22 9:24 [PATCH 0/5] KVM: PPC: Fix various bugs and vulnerabilities in HV KVM Paul Mackerras
` (2 preceding siblings ...)
2012-11-22 9:28 ` [PATCH 3/5] KVM: PPC: Book3S HV: Improve handling of local vs. global TLB invalidations Paul Mackerras
@ 2012-11-22 9:28 ` Paul Mackerras
2012-11-23 15:47 ` Alexander Graf
2012-11-22 9:29 ` [PATCH 5/5] KVM: PPC: Book3S HV: Report correct HPT entry index when reading HPT Paul Mackerras
2012-11-23 15:48 ` [PATCH 0/5] KVM: PPC: Fix various bugs and vulnerabilities in HV KVM Alexander Graf
5 siblings, 1 reply; 33+ messages in thread
From: Paul Mackerras @ 2012-11-22 9:28 UTC (permalink / raw)
To: Alexander Graf; +Cc: kvm-ppc, kvm
Currently, if the guest does an H_PROTECT hcall requesting that the
permissions on a HPT entry be changed to allow writing, we make the
requested change even if the page is marked read-only in the host
Linux page tables. This is a problem since it would for instance
allow a guest to modify a page that KSM has decided can be shared
between multiple guests.
To fix this, if the new permissions for the page allow writing, we need
to look up the memslot for the page, work out the host virtual address,
and look up the Linux page tables to get the PTE for the page. If that
PTE is read-only, we reduce the HPTE permissions to read-only.
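The permission clamp can be modelled as a small pure function. This is an illustrative sketch only: the flag value and names below are toy stand-ins for the real hpte_is_writable()/hpte_make_readonly() helpers and the Linux PTE lookup described above:

```c
#include <assert.h>

#define TOY_HPTE_WRITABLE 0x2	/* stand-in for the HPTE pp-bit test */

/* Given the guest-requested second HPTE doubleword and the state of
 * the host Linux PTE for the backing page, degrade a requested
 * read/write mapping to read-only when the host page is read-only. */
static unsigned long toy_adjust_hpte(unsigned long hpte_r,
				     int host_pte_present,
				     int host_pte_writable)
{
	if ((hpte_r & TOY_HPTE_WRITABLE) &&
	    host_pte_present && !host_pte_writable)
		hpte_r &= ~TOY_HPTE_WRITABLE;	/* make readonly */
	return hpte_r;
}
```

So a KSM-shared (read-only) host page can never be handed to the guest with write permission via H_PROTECT.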
Signed-off-by: Paul Mackerras <paulus@samba.org>
---
arch/powerpc/kvm/book3s_hv_rm_mmu.c | 22 ++++++++++++++++++++++
1 file changed, 22 insertions(+)
diff --git a/arch/powerpc/kvm/book3s_hv_rm_mmu.c b/arch/powerpc/kvm/book3s_hv_rm_mmu.c
index 7e1f7e2..19c93ba 100644
--- a/arch/powerpc/kvm/book3s_hv_rm_mmu.c
+++ b/arch/powerpc/kvm/book3s_hv_rm_mmu.c
@@ -629,6 +629,28 @@ long kvmppc_h_protect(struct kvm_vcpu *vcpu, unsigned long flags,
asm volatile("tlbiel %0" : : "r" (rb));
asm volatile("ptesync" : : : "memory");
}
+ /*
+ * If the host has this page as readonly but the guest
+ * wants to make it read/write, reduce the permissions.
+ * Checking the host permissions involves finding the
+ * memslot and then the Linux PTE for the page.
+ */
+ if (hpte_is_writable(r) && kvm->arch.using_mmu_notifiers) {
+ unsigned long psize, gfn, hva;
+ struct kvm_memory_slot *memslot;
+ pgd_t *pgdir = vcpu->arch.pgdir;
+ pte_t pte;
+
+ psize = hpte_page_size(v, r);
+ gfn = ((r & HPTE_R_RPN) & ~(psize - 1)) >> PAGE_SHIFT;
+ memslot = __gfn_to_memslot(kvm_memslots(kvm), gfn);
+ if (memslot) {
+ hva = __gfn_to_hva_memslot(memslot, gfn);
+ pte = lookup_linux_pte(pgdir, hva, 1, &psize);
+ if (pte_present(pte) && !pte_write(pte))
+ r = hpte_make_readonly(r);
+ }
+ }
}
hpte[1] = r;
eieio();
--
1.7.10.rc3.219.g53414
* [PATCH 5/5] KVM: PPC: Book3S HV: Report correct HPT entry index when reading HPT
2012-11-22 9:24 [PATCH 0/5] KVM: PPC: Fix various bugs and vulnerabilities in HV KVM Paul Mackerras
` (3 preceding siblings ...)
2012-11-22 9:28 ` [PATCH 4/5] KVM: PPC: Book3S HV: Don't give the guest RW access to RO pages Paul Mackerras
@ 2012-11-22 9:29 ` Paul Mackerras
2012-11-23 15:48 ` [PATCH 0/5] KVM: PPC: Fix various bugs and vulnerabilities in HV KVM Alexander Graf
5 siblings, 0 replies; 33+ messages in thread
From: Paul Mackerras @ 2012-11-22 9:29 UTC (permalink / raw)
To: Alexander Graf; +Cc: kvm-ppc, kvm
This fixes a bug in the code that allows userspace to read out the
contents of the guest's hashed page table (HPT). On the second and
subsequent passes through the HPT, when we are reporting only those
entries that have changed, we were incorrectly initializing the index
field of the header with the index of the first entry we skipped,
rather than with the index of the first changed entry.
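A toy model of the off-by-initialization makes the fix concrete (illustrative code, not the kernel loop): the header index must be captured *after* the skip over unchanged entries, which is exactly where the patch moves the `hdr.index = i` assignment:

```c
#include <assert.h>

/* Return the index that the header should report: the first entry,
 * at or after 'start', whose changed flag is set.  Recording 'start'
 * itself (the pre-fix behaviour) would be wrong whenever unchanged
 * entries are skipped first. */
static int first_changed_index(const int *changed, int n, int start)
{
	int i = start;

	/* skip entries that have not changed since the last pass */
	while (i < n && !changed[i])
		++i;
	/* the fix: this, not 'start', goes into hdr.index */
	return i;
}
```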
Signed-off-by: Paul Mackerras <paulus@samba.org>
---
arch/powerpc/kvm/book3s_64_mmu_hv.c | 2 +-
1 file changed, 1 insertion(+), 1 deletion(-)
diff --git a/arch/powerpc/kvm/book3s_64_mmu_hv.c b/arch/powerpc/kvm/book3s_64_mmu_hv.c
index 2d61e01..8cc18ab 100644
--- a/arch/powerpc/kvm/book3s_64_mmu_hv.c
+++ b/arch/powerpc/kvm/book3s_64_mmu_hv.c
@@ -1279,7 +1279,6 @@ static ssize_t kvm_htab_read(struct file *file, char __user *buf,
while (nb + sizeof(hdr) + HPTE_SIZE < count) {
/* Initialize header */
hptr = (struct kvm_get_htab_header __user *)buf;
- hdr.index = i;
hdr.n_valid = 0;
hdr.n_invalid = 0;
nw = nb;
@@ -1295,6 +1294,7 @@ static ssize_t kvm_htab_read(struct file *file, char __user *buf,
++revp;
}
}
+ hdr.index = i;
/* Grab a series of valid entries */
while (i < kvm->arch.hpt_npte &&
--
1.7.10.rc3.219.g53414
* Re: [PATCH 1/5] KVM: PPC: Book3S HV: Handle guest-caused machine checks on POWER7 without panicking
2012-11-22 9:25 ` [PATCH 1/5] KVM: PPC: Book3S HV: Handle guest-caused machine checks on POWER7 without panicking Paul Mackerras
@ 2012-11-23 14:13 ` Alexander Graf
2012-11-23 21:42 ` Paul Mackerras
2012-11-24 8:37 ` [PATCH v2] " Paul Mackerras
0 siblings, 2 replies; 33+ messages in thread
From: Alexander Graf @ 2012-11-23 14:13 UTC (permalink / raw)
To: Paul Mackerras; +Cc: kvm-ppc, kvm
On 22.11.2012, at 10:25, Paul Mackerras wrote:
> Currently, if a machine check interrupt happens while we are in the
> guest, we exit the guest and call the host's machine check handler,
> which tends to cause the host to panic. Some machine checks can be
> triggered by the guest; for example, if the guest creates two entries
> in the SLB that map the same effective address, and then accesses that
> effective address, the CPU will take a machine check interrupt.
>
> To handle this better, we firstly don't call the host's machine check
> handler for machine checks that happen inside the guest. Instead we
> call a new function, kvmppc_realmode_machine_check(), while still in
> real mode before exiting the guest. On POWER7, it handles the cases
> that the guest can trigger, either by flushing and reloading the SLB,
> or by flushing the TLB, and then it delivers the machine check interrupt
> to the guest. (On PPC970 we currently just exit the guest and leave it
> up to kvmppc_handle_exit() to handle the condition, which it doesn't,
> so the situation is no better -- but no worse -- on PPC970 now.)
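For reference, the SRR1 machine-check decode that the quoted patch performs can be sketched in plain C; the bit positions are copied from the `#define`s the patch introduces in `book3s_hv_ras.c`:

```c
#include <assert.h>
#include <stdint.h>

/* Decode of the machine-check cause fields tested by
 * kvmppc_realmode_mc_power7(); constants as defined in the patch. */
#define SRR1_MC_LDSTERR     (1ULL << (63 - 42))
#define SRR1_MC_IFETCH_SH   (63 - 45)
#define SRR1_MC_IFETCH_MASK 0x7ULL

/* 1 if the machine check was on a load/store (cause is then in DSISR) */
static int mc_is_ldst(uint64_t srr1)
{
        return (srr1 & SRR1_MC_LDSTERR) != 0;
}

/* 3-bit instruction-fetch cause code (e.g. 3 = SLB multi-hit) */
static unsigned int mc_ifetch_cause(uint64_t srr1)
{
        return (srr1 >> SRR1_MC_IFETCH_SH) & SRR1_MC_IFETCH_MASK;
}
```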
>
> Signed-off-by: Paul Mackerras <paulus@samba.org>
> ---
> arch/powerpc/kvm/Makefile | 1 +
> arch/powerpc/kvm/book3s_hv_ras.c | 115 +++++++++++++++++++++++++++++++
> arch/powerpc/kvm/book3s_hv_rmhandlers.S | 61 +++++++++-------
> 3 files changed, 153 insertions(+), 24 deletions(-)
> create mode 100644 arch/powerpc/kvm/book3s_hv_ras.c
>
> diff --git a/arch/powerpc/kvm/Makefile b/arch/powerpc/kvm/Makefile
> index cd89658..1e473d4 100644
> --- a/arch/powerpc/kvm/Makefile
> +++ b/arch/powerpc/kvm/Makefile
> @@ -73,6 +73,7 @@ kvm-book3s_64-builtin-objs-$(CONFIG_KVM_BOOK3S_64_HV) := \
> book3s_hv_rmhandlers.o \
> book3s_hv_rm_mmu.o \
> book3s_64_vio_hv.o \
> + book3s_hv_ras.o \
> book3s_hv_builtin.o
>
> kvm-book3s_64-module-objs := \
> diff --git a/arch/powerpc/kvm/book3s_hv_ras.c b/arch/powerpc/kvm/book3s_hv_ras.c
> new file mode 100644
> index 0000000..2646d07
> --- /dev/null
> +++ b/arch/powerpc/kvm/book3s_hv_ras.c
> @@ -0,0 +1,115 @@
> +/*
> + * This program is free software; you can redistribute it and/or modify
> + * it under the terms of the GNU General Public License, version 2, as
> + * published by the Free Software Foundation.
> + *
> + * Copyright 2012 Paul Mackerras, IBM Corp. <paulus@au1.ibm.com>
> + */
> +
> +#include <linux/types.h>
> +#include <linux/string.h>
> +#include <linux/kvm.h>
> +#include <linux/kvm_host.h>
> +
> +/* SRR1 bits for machine check on POWER7 */
> +#define SRR1_MC_LDSTERR (1ul << (63-42))
> +#define SRR1_MC_IFETCH_SH (63-45)
> +#define SRR1_MC_IFETCH_MASK 0x7
> +#define SRR1_MC_IFETCH_SLBPAR 2 /* SLB parity error */
> +#define SRR1_MC_IFETCH_SLBMULTI 3 /* SLB multi-hit */
> +#define SRR1_MC_IFETCH_SLBPARMULTI 4 /* SLB parity + multi-hit */
> +#define SRR1_MC_IFETCH_TLBMULTI 5 /* I-TLB multi-hit */
> +
> +/* DSISR bits for machine check on POWER7 */
> +#define DSISR_MC_DERAT_MULTI 0x800 /* D-ERAT multi-hit */
> +#define DSISR_MC_TLB_MULTI 0x400 /* D-TLB multi-hit */
> +#define DSISR_MC_SLB_PARITY 0x100 /* SLB parity error */
> +#define DSISR_MC_SLB_MULTI 0x080 /* SLB multi-hit */
> +#define DSISR_MC_SLB_PARMULTI 0x040 /* SLB parity + multi-hit */
> +
> +/* POWER7 SLB flush and reload */
> +static void reload_slb(struct kvm_vcpu *vcpu)
> +{
> + struct slb_shadow *slb;
> + unsigned long i, n;
> +
> + /* First clear out SLB */
> + asm volatile("slbmte %0,%0; slbia" : : "r" (0));
> +
> + /* Do they have an SLB shadow buffer registered? */
> + slb = vcpu->arch.slb_shadow.pinned_addr;
> + if (!slb)
> + return;
Mind to explain this case? What happens here? Do we leave the guest with an empty SLB? Why would this ever happen? What happens next as soon as we go back into the guest?
> +
> + /* Sanity check */
> + n = slb->persistent;
> + if (n > SLB_MIN_SIZE)
> + n = SLB_MIN_SIZE;
Please use a max() macro here.
> + if ((void *) &slb->save_area[n] > vcpu->arch.slb_shadow.pinned_end)
> + return;
> +
> + /* Load up the SLB from that */
> + for (i = 0; i < n; ++i) {
> + unsigned long rb = slb->save_area[i].esid;
> + unsigned long rs = slb->save_area[i].vsid;
> +
> + rb = (rb & ~0xFFFul) | i; /* insert entry number */
> + asm volatile("slbmte %0,%1" : : "r" (rs), "r" (rb));
> + }
> +}
> +
> +/* POWER7 TLB flush */
> +static void flush_tlb_power7(struct kvm_vcpu *vcpu)
> +{
> + unsigned long i, rb;
> +
> + rb = 0x800; /* IS field = 0b10, flush congruence class */
> + for (i = 0; i < 128; ++i) {
Please put up a #define for this. POWER7_TLB_SIZE or so. Is there any way to fetch that number from an SPR? I don't really want to have a p7+ and a p8 function in here too ;).
> + asm volatile("tlbiel %0" : : "r" (rb));
> + rb += 0x1000;
I assume this also means (1 << TLBIE_ENTRY_SHIFT)? Would be nice to keep the code readable without guessing :).
So I take it that p7 does not implement tlbia?
> + }
> +}
> +
> +/*
> + * On POWER7, see if we can handle a machine check that occurred inside
> + * the guest in real mode, without switching to the host partition.
> + *
> + * Returns: 0 => exit guest, 1 => deliver machine check to guest
> + */
> +static long kvmppc_realmode_mc_power7(struct kvm_vcpu *vcpu)
> +{
> + unsigned long srr1 = vcpu->arch.shregs.msr;
> +
> + if (srr1 & SRR1_MC_LDSTERR) {
> + /* error on load/store */
> + unsigned long dsisr = vcpu->arch.shregs.dsisr;
> +
> + if (dsisr & (DSISR_MC_SLB_PARMULTI | DSISR_MC_SLB_MULTI |
> + DSISR_MC_SLB_PARITY | DSISR_MC_DERAT_MULTI))
> + /* flush and reload SLB; flushes D-ERAT too */
> + reload_slb(vcpu);
> + if (dsisr & DSISR_MC_TLB_MULTI)
> + flush_tlb_power7(vcpu);
> + }
> +
> + switch ((srr1 >> SRR1_MC_IFETCH_SH) & SRR1_MC_IFETCH_MASK) {
> + case SRR1_MC_IFETCH_SLBPAR:
> + case SRR1_MC_IFETCH_SLBMULTI:
> + case SRR1_MC_IFETCH_SLBPARMULTI:
> + reload_slb(vcpu);
> + break;
> + case SRR1_MC_IFETCH_TLBMULTI:
> + flush_tlb_power7(vcpu);
> + break;
> + }
> +
> + return 1;
So we never return 0? How about ECC errors and the likes? Wouldn't those also be #MCs that the host needs to handle?
> +}
> +
> +long kvmppc_realmode_machine_check(struct kvm_vcpu *vcpu)
> +{
> + if (cpu_has_feature(CPU_FTR_ARCH_206))
> + return kvmppc_realmode_mc_power7(vcpu);
> +
> + return 0;
> +}
> diff --git a/arch/powerpc/kvm/book3s_hv_rmhandlers.S b/arch/powerpc/kvm/book3s_hv_rmhandlers.S
> index 690d112..b31fb4f 100644
> --- a/arch/powerpc/kvm/book3s_hv_rmhandlers.S
> +++ b/arch/powerpc/kvm/book3s_hv_rmhandlers.S
> @@ -682,8 +682,7 @@ BEGIN_FTR_SECTION
> 1:
> END_FTR_SECTION_IFSET(CPU_FTR_ARCH_206)
>
> -nohpte_cont:
> -hcall_real_cont: /* r9 = vcpu, r12 = trap, r13 = paca */
> +guest_exit_cont: /* r9 = vcpu, r12 = trap, r13 = paca */
> /* Save DEC */
> mfspr r5,SPRN_DEC
> mftb r6
> @@ -704,6 +703,11 @@ END_FTR_SECTION_IFSET(CPU_FTR_ARCH_206)
> std r6, VCPU_FAULT_DAR(r9)
> stw r7, VCPU_FAULT_DSISR(r9)
>
> + /* See if it is a machine check */
> + cmpwi r12, BOOK3S_INTERRUPT_MACHINE_CHECK
> + beq machine_check_realmode
> +mc_cont:
> +
> /* Save guest CTRL register, set runlatch to 1 */
> 6: mfspr r6,SPRN_CTRLF
> stw r6,VCPU_CTRL(r9)
> @@ -1114,20 +1118,21 @@ END_FTR_SECTION_IFSET(CPU_FTR_ARCH_201)
> isync
> 23:
> /*
> - * For external and machine check interrupts, we need
> + * For external interrupts, we need
> * to call the Linux handler to process the interrupt.
> * We do that by jumping to the interrupt vector address
> * which we have in r12. The [h]rfid at the end of the
> * handler will return to the book3s_hv_interrupts.S code.
> * For other interrupts we do the rfid to get back
> - * to the book3s_interrupts.S code here.
> + * to the book3s_hv_interrupts.S code here.
> */
> ld r8, HSTATE_VMHANDLER(r13)
> ld r7, HSTATE_HOST_MSR(r13)
>
> cmpwi r12, BOOK3S_INTERRUPT_EXTERNAL
> +BEGIN_FTR_SECTION
> beq 11f
> - cmpwi r12, BOOK3S_INTERRUPT_MACHINE_CHECK
> +END_FTR_SECTION_IFSET(CPU_FTR_ARCH_206)
Mind to explain the logic that happens here? Do we get external interrupts on 970? If not, the cmpwi should also be inside the FTR section. Also, if we do a beq here, why do the beqctr below again?
Alex
>
> /* RFI into the highmem handler, or branch to interrupt handler */
> 12: mfmsr r6
> @@ -1140,11 +1145,8 @@ END_FTR_SECTION_IFSET(CPU_FTR_ARCH_201)
> beqctr
> RFI
>
> -11:
> -BEGIN_FTR_SECTION
> - b 12b
> -END_FTR_SECTION_IFSET(CPU_FTR_ARCH_201)
> - mtspr SPRN_HSRR0, r8
> + /* On POWER7, we have external interrupts set to use HSRR0/1 */
> +11: mtspr SPRN_HSRR0, r8
> mtspr SPRN_HSRR1, r7
> ba 0x500
>
> @@ -1180,7 +1182,7 @@ kvmppc_hdsi:
> cmpdi r3, 0 /* retry the instruction */
> beq 6f
> cmpdi r3, -1 /* handle in kernel mode */
> - beq nohpte_cont
> + beq guest_exit_cont
> cmpdi r3, -2 /* MMIO emulation; need instr word */
> beq 2f
>
> @@ -1194,6 +1196,7 @@ kvmppc_hdsi:
> li r10, BOOK3S_INTERRUPT_DATA_STORAGE
> li r11, (MSR_ME << 1) | 1 /* synthesize MSR_SF | MSR_ME */
> rotldi r11, r11, 63
> +fast_interrupt_c_return:
> 6: ld r7, VCPU_CTR(r9)
> lwz r8, VCPU_XER(r9)
> mtctr r7
> @@ -1226,7 +1229,7 @@ kvmppc_hdsi:
> /* Unset guest mode. */
> li r0, KVM_GUEST_MODE_NONE
> stb r0, HSTATE_IN_GUEST(r13)
> - b nohpte_cont
> + b guest_exit_cont
>
> /*
> * Similarly for an HISI, reflect it to the guest as an ISI unless
> @@ -1252,9 +1255,9 @@ kvmppc_hisi:
> ld r11, VCPU_MSR(r9)
> li r12, BOOK3S_INTERRUPT_H_INST_STORAGE
> cmpdi r3, 0 /* retry the instruction */
> - beq 6f
> + beq fast_interrupt_c_return
> cmpdi r3, -1 /* handle in kernel mode */
> - beq nohpte_cont
> + beq guest_exit_cont
>
> /* Synthesize an ISI for the guest */
> mr r11, r3
> @@ -1263,12 +1266,7 @@ kvmppc_hisi:
> li r10, BOOK3S_INTERRUPT_INST_STORAGE
> li r11, (MSR_ME << 1) | 1 /* synthesize MSR_SF | MSR_ME */
> rotldi r11, r11, 63
> -6: ld r7, VCPU_CTR(r9)
> - lwz r8, VCPU_XER(r9)
> - mtctr r7
> - mtxer r8
> - mr r4, r9
> - b fast_guest_return
> + b fast_interrupt_c_return
>
> 3: ld r6, VCPU_KVM(r9) /* not relocated, use VRMA */
> ld r5, KVM_VRMA_SLB_V(r6)
> @@ -1284,14 +1282,14 @@ kvmppc_hisi:
> hcall_try_real_mode:
> ld r3,VCPU_GPR(R3)(r9)
> andi. r0,r11,MSR_PR
> - bne hcall_real_cont
> + bne guest_exit_cont
> clrrdi r3,r3,2
> cmpldi r3,hcall_real_table_end - hcall_real_table
> - bge hcall_real_cont
> + bge guest_exit_cont
> LOAD_REG_ADDR(r4, hcall_real_table)
> lwzx r3,r3,r4
> cmpwi r3,0
> - beq hcall_real_cont
> + beq guest_exit_cont
> add r3,r3,r4
> mtctr r3
> mr r3,r9 /* get vcpu pointer */
> @@ -1312,7 +1310,7 @@ hcall_real_fallback:
> li r12,BOOK3S_INTERRUPT_SYSCALL
> ld r9, HSTATE_KVM_VCPU(r13)
>
> - b hcall_real_cont
> + b guest_exit_cont
>
> .globl hcall_real_table
> hcall_real_table:
> @@ -1571,6 +1569,21 @@ kvm_cede_exit:
> li r3,H_TOO_HARD
> blr
>
> + /* Try to handle a machine check in real mode */
> +machine_check_realmode:
> + mr r3, r9 /* get vcpu pointer */
> + bl .kvmppc_realmode_machine_check
> + nop
> + cmpdi r3, 0 /* continue exiting from guest? */
> + ld r9, HSTATE_KVM_VCPU(r13)
> + li r12, BOOK3S_INTERRUPT_MACHINE_CHECK
> + beq mc_cont
> + /* If not, deliver a machine check. SRR0/1 are already set */
> + li r10, BOOK3S_INTERRUPT_MACHINE_CHECK
> + li r11, (MSR_ME << 1) | 1 /* synthesize MSR_SF | MSR_ME */
> + rotldi r11, r11, 63
> + b fast_interrupt_c_return
> +
> secondary_too_late:
> ld r5,HSTATE_KVM_VCORE(r13)
> HMT_LOW
> --
> 1.7.10.rc3.219.g53414
>
* Re: [PATCH 3/5] KVM: PPC: Book3S HV: Improve handling of local vs. global TLB invalidations
2012-11-22 9:28 ` [PATCH 3/5] KVM: PPC: Book3S HV: Improve handling of local vs. global TLB invalidations Paul Mackerras
@ 2012-11-23 15:43 ` Alexander Graf
2012-11-23 22:07 ` Paul Mackerras
0 siblings, 1 reply; 33+ messages in thread
From: Alexander Graf @ 2012-11-23 15:43 UTC (permalink / raw)
To: Paul Mackerras; +Cc: kvm-ppc, kvm
On 22.11.2012, at 10:28, Paul Mackerras wrote:
> When we change or remove a HPT (hashed page table) entry, we can do
> either a global TLB invalidation (tlbie) that works across the whole
> machine, or a local invalidation (tlbiel) that only affects this core.
> Currently we do local invalidations if the VM has only one vcpu or if
> the guest requests it with the H_LOCAL flag, though the guest Linux
> kernel currently doesn't ever use H_LOCAL. Then, to cope with the
> possibility that vcpus moving around to different physical cores might
> expose stale TLB entries, there is some code in kvmppc_hv_entry to
> flush the whole TLB of entries for this VM if either this vcpu is now
> running on a different physical core from where it last ran, or if this
> physical core last ran a different vcpu.
>
> There are a number of problems on POWER7 with this as it stands:
>
> - The TLB invalidation is done per thread, whereas it only needs to be
> done per core, since the TLB is shared between the threads.
> - With the possibility of the host paging out guest pages, the use of
> H_LOCAL by an SMP guest is dangerous since the guest could possibly
> retain and use a stale TLB entry pointing to a page that had been
> removed from the guest.
I don't understand this part. Don't we flush the TLB when the page gets evicted from the shadow HTAB?
Alex
> - The TLB invalidations that we do when a vcpu moves from one physical
> core to another are unnecessary in the case of an SMP guest that isn't
> using H_LOCAL.
> - The optimization of using local invalidations rather than global should
> apply to guests with one virtual core, not just one vcpu.
>
> (None of this applies on PPC970, since there we always have to
> invalidate the whole TLB when entering and leaving the guest, and we
> can't support paging out guest memory.)
>
> To fix these problems and simplify the code, we now maintain a simple
> cpumask of which cpus need to flush the TLB on entry to the guest.
> (This is indexed by cpu, though we only ever use the bits for thread
> 0 of each core.) Whenever we do a local TLB invalidation, we set the
> bits for every cpu except the bit for thread 0 of the core that we're
> currently running on. Whenever we enter a guest, we test and clear the
> bit for our core, and flush the TLB if it was set.
>
> On initial startup of the VM, and when resetting the HPT, we set all the
> bits in the need_tlb_flush cpumask, since any core could potentially have
> stale TLB entries from the previous VM to use the same LPID, or the
> previous contents of the HPT.
>
> Then, we maintain a count of the number of online virtual cores, and use
> that when deciding whether to use a local invalidation rather than the
> number of online vcpus. The code to make that decision is extracted out
> into a new function, global_invalidates(). For multi-core guests on
> POWER7 (i.e. when we are using mmu notifiers), we now never do local
> invalidations regardless of the H_LOCAL flag.
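The bookkeeping scheme the commit message describes can be modelled in a few lines of userspace C. This is a sketch only: a plain 64-bit mask stands in for the kernel's `cpumask_t`, and the function names are illustrative:

```c
#include <assert.h>
#include <stdint.h>

/* Model of the need_tlb_flush bookkeeping: one bit per core
 * (only thread-0 bits are ever consulted in the real code). */
#define NCORES 64
static uint64_t need_tlb_flush;

/* After a local invalidation (tlbiel) on 'core', every *other*
 * core may hold stale TLB entries for this LPID. */
static void note_local_invalidate(int core)
{
        need_tlb_flush = ~0ULL;             /* set all cores... */
        need_tlb_flush &= ~(1ULL << core);  /* ...except our own */
}

/* On guest entry, test-and-clear our bit; flush the TLB if it was set. */
static int need_flush_on_entry(int core)
{
        uint64_t bit = 1ULL << core;
        int flush = (need_tlb_flush & bit) != 0;
        need_tlb_flush &= ~bit;
        return flush;
}
```

In the kernel this test-and-clear must be atomic (the assembly uses `ldarx`/`stdcx.`), since other threads can set bits concurrently.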
>
> Signed-off-by: Paul Mackerras <paulus@samba.org>
> ---
> arch/powerpc/include/asm/kvm_host.h | 5 +--
> arch/powerpc/kernel/asm-offsets.c | 4 +--
> arch/powerpc/kvm/book3s_64_mmu_hv.c | 7 ++--
> arch/powerpc/kvm/book3s_hv.c | 9 ++++-
> arch/powerpc/kvm/book3s_hv_rm_mmu.c | 37 +++++++++++++++++---
> arch/powerpc/kvm/book3s_hv_rmhandlers.S | 56 ++++++++++++++-----------------
> 6 files changed, 73 insertions(+), 45 deletions(-)
>
> diff --git a/arch/powerpc/include/asm/kvm_host.h b/arch/powerpc/include/asm/kvm_host.h
> index 58c7264..62fbd38 100644
> --- a/arch/powerpc/include/asm/kvm_host.h
> +++ b/arch/powerpc/include/asm/kvm_host.h
> @@ -246,11 +246,12 @@ struct kvm_arch {
> int using_mmu_notifiers;
> u32 hpt_order;
> atomic_t vcpus_running;
> + u32 online_vcores;
> unsigned long hpt_npte;
> unsigned long hpt_mask;
> atomic_t hpte_mod_interest;
> spinlock_t slot_phys_lock;
> - unsigned short last_vcpu[NR_CPUS];
> + cpumask_t need_tlb_flush;
> struct kvmppc_vcore *vcores[KVM_MAX_VCORES];
> struct kvmppc_linear_info *hpt_li;
> #endif /* CONFIG_KVM_BOOK3S_64_HV */
> @@ -275,6 +276,7 @@ struct kvmppc_vcore {
> int nap_count;
> int napping_threads;
> u16 pcpu;
> + u16 last_cpu;
> u8 vcore_state;
> u8 in_guest;
> struct list_head runnable_threads;
> @@ -523,7 +525,6 @@ struct kvm_vcpu_arch {
> u64 dec_jiffies;
> u64 dec_expires;
> unsigned long pending_exceptions;
> - u16 last_cpu;
> u8 ceded;
> u8 prodded;
> u32 last_inst;
> diff --git a/arch/powerpc/kernel/asm-offsets.c b/arch/powerpc/kernel/asm-offsets.c
> index 7523539..4e23ba2 100644
> --- a/arch/powerpc/kernel/asm-offsets.c
> +++ b/arch/powerpc/kernel/asm-offsets.c
> @@ -441,8 +441,7 @@ int main(void)
> DEFINE(KVM_HOST_LPCR, offsetof(struct kvm, arch.host_lpcr));
> DEFINE(KVM_HOST_SDR1, offsetof(struct kvm, arch.host_sdr1));
> DEFINE(KVM_TLBIE_LOCK, offsetof(struct kvm, arch.tlbie_lock));
> - DEFINE(KVM_ONLINE_CPUS, offsetof(struct kvm, online_vcpus.counter));
> - DEFINE(KVM_LAST_VCPU, offsetof(struct kvm, arch.last_vcpu));
> + DEFINE(KVM_NEED_FLUSH, offsetof(struct kvm, arch.need_tlb_flush.bits));
> DEFINE(KVM_LPCR, offsetof(struct kvm, arch.lpcr));
> DEFINE(KVM_RMOR, offsetof(struct kvm, arch.rmor));
> DEFINE(KVM_VRMA_SLB_V, offsetof(struct kvm, arch.vrma_slb_v));
> @@ -470,7 +469,6 @@ int main(void)
> DEFINE(VCPU_SLB, offsetof(struct kvm_vcpu, arch.slb));
> DEFINE(VCPU_SLB_MAX, offsetof(struct kvm_vcpu, arch.slb_max));
> DEFINE(VCPU_SLB_NR, offsetof(struct kvm_vcpu, arch.slb_nr));
> - DEFINE(VCPU_LAST_CPU, offsetof(struct kvm_vcpu, arch.last_cpu));
> DEFINE(VCPU_FAULT_DSISR, offsetof(struct kvm_vcpu, arch.fault_dsisr));
> DEFINE(VCPU_FAULT_DAR, offsetof(struct kvm_vcpu, arch.fault_dar));
> DEFINE(VCPU_LAST_INST, offsetof(struct kvm_vcpu, arch.last_inst));
> diff --git a/arch/powerpc/kvm/book3s_64_mmu_hv.c b/arch/powerpc/kvm/book3s_64_mmu_hv.c
> index 1029e22..2d61e01 100644
> --- a/arch/powerpc/kvm/book3s_64_mmu_hv.c
> +++ b/arch/powerpc/kvm/book3s_64_mmu_hv.c
> @@ -148,11 +148,8 @@ long kvmppc_alloc_reset_hpt(struct kvm *kvm, u32 *htab_orderp)
> * Reset all the reverse-mapping chains for all memslots
> */
> kvmppc_rmap_reset(kvm);
> - /*
> - * Set the whole last_vcpu array to an invalid vcpu number.
> - * This ensures that each vcpu will flush its TLB on next entry.
> - */
> - memset(kvm->arch.last_vcpu, 0xff, sizeof(kvm->arch.last_vcpu));
> + /* Ensure that each vcpu will flush its TLB on next entry. */
> + cpumask_setall(&kvm->arch.need_tlb_flush);
> *htab_orderp = order;
> err = 0;
> } else {
> diff --git a/arch/powerpc/kvm/book3s_hv.c b/arch/powerpc/kvm/book3s_hv.c
> index a4f59db..ddbec60 100644
> --- a/arch/powerpc/kvm/book3s_hv.c
> +++ b/arch/powerpc/kvm/book3s_hv.c
> @@ -853,7 +853,6 @@ struct kvm_vcpu *kvmppc_core_vcpu_create(struct kvm *kvm, unsigned int id)
> goto free_vcpu;
>
> vcpu->arch.shared = &vcpu->arch.shregs;
> - vcpu->arch.last_cpu = -1;
> vcpu->arch.mmcr[0] = MMCR0_FC;
> vcpu->arch.ctrl = CTRL_RUNLATCH;
> /* default to host PVR, since we can't spoof it */
> @@ -880,6 +879,7 @@ struct kvm_vcpu *kvmppc_core_vcpu_create(struct kvm *kvm, unsigned int id)
> vcore->preempt_tb = TB_NIL;
> }
> kvm->arch.vcores[core] = vcore;
> + kvm->arch.online_vcores++;
> }
> mutex_unlock(&kvm->lock);
>
> @@ -1802,6 +1802,13 @@ int kvmppc_core_init_vm(struct kvm *kvm)
> return -ENOMEM;
> kvm->arch.lpid = lpid;
>
> + /*
> + * Since we don't flush the TLB when tearing down a VM,
> + * and this lpid might have previously been used,
> + * make sure we flush on each core before running the new VM.
> + */
> + cpumask_setall(&kvm->arch.need_tlb_flush);
> +
> INIT_LIST_HEAD(&kvm->arch.spapr_tce_tables);
>
> kvm->arch.rma = NULL;
> diff --git a/arch/powerpc/kvm/book3s_hv_rm_mmu.c b/arch/powerpc/kvm/book3s_hv_rm_mmu.c
> index fc3da32..7e1f7e2 100644
> --- a/arch/powerpc/kvm/book3s_hv_rm_mmu.c
> +++ b/arch/powerpc/kvm/book3s_hv_rm_mmu.c
> @@ -35,6 +35,37 @@ static void *real_vmalloc_addr(void *x)
> return __va(addr);
> }
>
> +/* Return 1 if we need to do a global tlbie, 0 if we can use tlbiel */
> +static int global_invalidates(struct kvm *kvm, unsigned long flags)
> +{
> + int global;
> +
> + /*
> + * If there is only one vcore, and it's currently running,
> + * we can use tlbiel as long as we mark all other physical
> + * cores as potentially having stale TLB entries for this lpid.
> + * If we're not using MMU notifiers, we never take pages away
> + * from the guest, so we can use tlbiel if requested.
> + * Otherwise, don't use tlbiel.
> + */
> + if (kvm->arch.online_vcores == 1 && local_paca->kvm_hstate.kvm_vcore)
> + global = 0;
> + else if (kvm->arch.using_mmu_notifiers)
> + global = 1;
> + else
> + global = !(flags & H_LOCAL);
> +
> + if (!global) {
> + /* any other core might now have stale TLB entries... */
> + smp_wmb();
> + cpumask_setall(&kvm->arch.need_tlb_flush);
> + cpumask_clear_cpu(local_paca->kvm_hstate.kvm_vcore->pcpu,
> + &kvm->arch.need_tlb_flush);
> + }
> +
> + return global;
> +}
> +
> /*
> * Add this HPTE into the chain for the real page.
> * Must be called with the chain locked; it unlocks the chain.
> @@ -390,7 +421,7 @@ long kvmppc_do_h_remove(struct kvm *kvm, unsigned long flags,
> if (v & HPTE_V_VALID) {
> hpte[0] &= ~HPTE_V_VALID;
> rb = compute_tlbie_rb(v, hpte[1], pte_index);
> - if (!(flags & H_LOCAL) && atomic_read(&kvm->online_vcpus) > 1) {
> + if (global_invalidates(kvm, flags)) {
> while (!try_lock_tlbie(&kvm->arch.tlbie_lock))
> cpu_relax();
> asm volatile("ptesync" : : : "memory");
> @@ -565,8 +596,6 @@ long kvmppc_h_protect(struct kvm_vcpu *vcpu, unsigned long flags,
> return H_NOT_FOUND;
> }
>
> - if (atomic_read(&kvm->online_vcpus) == 1)
> - flags |= H_LOCAL;
> v = hpte[0];
> bits = (flags << 55) & HPTE_R_PP0;
> bits |= (flags << 48) & HPTE_R_KEY_HI;
> @@ -587,7 +616,7 @@ long kvmppc_h_protect(struct kvm_vcpu *vcpu, unsigned long flags,
> if (v & HPTE_V_VALID) {
> rb = compute_tlbie_rb(v, r, pte_index);
> hpte[0] = v & ~HPTE_V_VALID;
> - if (!(flags & H_LOCAL)) {
> + if (global_invalidates(kvm, flags)) {
> while(!try_lock_tlbie(&kvm->arch.tlbie_lock))
> cpu_relax();
> asm volatile("ptesync" : : : "memory");
> diff --git a/arch/powerpc/kvm/book3s_hv_rmhandlers.S b/arch/powerpc/kvm/book3s_hv_rmhandlers.S
> index b31fb4f..dbb5ced 100644
> --- a/arch/powerpc/kvm/book3s_hv_rmhandlers.S
> +++ b/arch/powerpc/kvm/book3s_hv_rmhandlers.S
> @@ -313,7 +313,33 @@ END_FTR_SECTION_IFSET(CPU_FTR_ARCH_201)
> mtspr SPRN_SDR1,r6 /* switch to partition page table */
> mtspr SPRN_LPID,r7
> isync
> +
> + /* See if we need to flush the TLB */
> + lhz r6,PACAPACAINDEX(r13) /* test_bit(cpu, need_tlb_flush) */
> + clrldi r7,r6,64-6 /* extract bit number (6 bits) */
> + srdi r6,r6,6 /* doubleword number */
> + sldi r6,r6,3 /* address offset */
> + add r6,r6,r9
> + addi r6,r6,KVM_NEED_FLUSH /* dword in kvm->arch.need_tlb_flush */
> li r0,1
> + sld r0,r0,r7
> + ld r7,0(r6)
> + and. r7,r7,r0
> + beq 22f
> +23: ldarx r7,0,r6 /* if set, clear the bit */
> + andc r7,r7,r0
> + stdcx. r7,0,r6
> + bne 23b
> + li r6,128 /* and flush the TLB */
> + mtctr r6
> + li r7,0x800 /* IS field = 0b10 */
> + ptesync
> +28: tlbiel r7
> + addi r7,r7,0x1000
> + bdnz 28b
> + ptesync
> +
> +22: li r0,1
> stb r0,VCORE_IN_GUEST(r5) /* signal secondaries to continue */
> b 10f
>
> @@ -336,36 +362,6 @@ END_FTR_SECTION_IFSET(CPU_FTR_ARCH_201)
> mr r9,r4
> blt hdec_soon
>
> - /*
> - * Invalidate the TLB if we could possibly have stale TLB
> - * entries for this partition on this core due to the use
> - * of tlbiel.
> - * XXX maybe only need this on primary thread?
> - */
> - ld r9,VCPU_KVM(r4) /* pointer to struct kvm */
> - lwz r5,VCPU_VCPUID(r4)
> - lhz r6,PACAPACAINDEX(r13)
> - rldimi r6,r5,0,62 /* XXX map as if threads 1:1 p:v */
> - lhz r8,VCPU_LAST_CPU(r4)
> - sldi r7,r6,1 /* see if this is the same vcpu */
> - add r7,r7,r9 /* as last ran on this pcpu */
> - lhz r0,KVM_LAST_VCPU(r7)
> - cmpw r6,r8 /* on the same cpu core as last time? */
> - bne 3f
> - cmpw r0,r5 /* same vcpu as this core last ran? */
> - beq 1f
> -3: sth r6,VCPU_LAST_CPU(r4) /* if not, invalidate partition TLB */
> - sth r5,KVM_LAST_VCPU(r7)
> - li r6,128
> - mtctr r6
> - li r7,0x800 /* IS field = 0b10 */
> - ptesync
> -2: tlbiel r7
> - addi r7,r7,0x1000
> - bdnz 2b
> - ptesync
> -1:
> -
> /* Save purr/spurr */
> mfspr r5,SPRN_PURR
> mfspr r6,SPRN_SPURR
> --
> 1.7.10.rc3.219.g53414
>
* Re: [PATCH 4/5] KVM: PPC: Book3S HV: Don't give the guest RW access to RO pages
2012-11-22 9:28 ` [PATCH 4/5] KVM: PPC: Book3S HV: Don't give the guest RW access to RO pages Paul Mackerras
@ 2012-11-23 15:47 ` Alexander Graf
2012-11-23 22:13 ` Paul Mackerras
0 siblings, 1 reply; 33+ messages in thread
From: Alexander Graf @ 2012-11-23 15:47 UTC (permalink / raw)
To: Paul Mackerras; +Cc: kvm-ppc, kvm
On 22.11.2012, at 10:28, Paul Mackerras wrote:
> Currently, if the guest does an H_PROTECT hcall requesting that the
> permissions on a HPT entry be changed to allow writing, we make the
> requested change even if the page is marked read-only in the host
> Linux page tables. This is a problem since it would for instance
> allow a guest to modify a page that KSM has decided can be shared
> between multiple guests.
>
> To fix this, if the new permissions for the page allow writing, we need
> to look up the memslot for the page, work out the host virtual address,
> and look up the Linux page tables to get the PTE for the page. If that
> PTE is read-only, we reduce the HPTE permissions to read-only.
How does KSM handle this usually? If you reduce the permissions to R/O, how do you ever get a R/W page from a deduplicated one?
Alex
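The guest-frame-number computation used by the quoted patch can be sketched in isolation. This is a userspace model: `PAGE_SHIFT` and the `HPTE_R_RPN` mask here are illustrative stand-ins, not the kernel's exact definitions:

```c
#include <assert.h>
#include <stdint.h>

#define PAGE_SHIFT 12
#define HPTE_R_RPN 0x0ffffffffffff000ULL  /* real page number field */

/* Map the second doubleword of an HPTE to a guest frame number:
 * mask off the RPN field, align down to the (possibly large) page
 * size the HPTE maps, then convert bytes to 4k frame number. */
static uint64_t hpte_gfn(uint64_t r, uint64_t psize)
{
        return ((r & HPTE_R_RPN) & ~(psize - 1)) >> PAGE_SHIFT;
}
```

Given the gfn, the patch then walks memslot → host virtual address → Linux PTE, and downgrades the HPTE to read-only if the PTE is not writable.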
>
> Signed-off-by: Paul Mackerras <paulus@samba.org>
> ---
> arch/powerpc/kvm/book3s_hv_rm_mmu.c | 22 ++++++++++++++++++++++
> 1 file changed, 22 insertions(+)
>
> diff --git a/arch/powerpc/kvm/book3s_hv_rm_mmu.c b/arch/powerpc/kvm/book3s_hv_rm_mmu.c
> index 7e1f7e2..19c93ba 100644
> --- a/arch/powerpc/kvm/book3s_hv_rm_mmu.c
> +++ b/arch/powerpc/kvm/book3s_hv_rm_mmu.c
> @@ -629,6 +629,28 @@ long kvmppc_h_protect(struct kvm_vcpu *vcpu, unsigned long flags,
> asm volatile("tlbiel %0" : : "r" (rb));
> asm volatile("ptesync" : : : "memory");
> }
> + /*
> + * If the host has this page as readonly but the guest
> + * wants to make it read/write, reduce the permissions.
> + * Checking the host permissions involves finding the
> + * memslot and then the Linux PTE for the page.
> + */
> + if (hpte_is_writable(r) && kvm->arch.using_mmu_notifiers) {
> + unsigned long psize, gfn, hva;
> + struct kvm_memory_slot *memslot;
> + pgd_t *pgdir = vcpu->arch.pgdir;
> + pte_t pte;
> +
> + psize = hpte_page_size(v, r);
> + gfn = ((r & HPTE_R_RPN) & ~(psize - 1)) >> PAGE_SHIFT;
> + memslot = __gfn_to_memslot(kvm_memslots(kvm), gfn);
> + if (memslot) {
> + hva = __gfn_to_hva_memslot(memslot, gfn);
> + pte = lookup_linux_pte(pgdir, hva, 1, &psize);
> + if (pte_present(pte) && !pte_write(pte))
> + r = hpte_make_readonly(r);
> + }
> + }
> }
> hpte[1] = r;
> eieio();
> --
> 1.7.10.rc3.219.g53414
>
> --
> To unsubscribe from this list: send the line "unsubscribe kvm-ppc" in
> the body of a message to majordomo@vger.kernel.org
> More majordomo info at http://vger.kernel.org/majordomo-info.html
* Re: [PATCH 0/5] KVM: PPC: Fix various bugs and vulnerabilities in HV KVM
2012-11-22 9:24 [PATCH 0/5] KVM: PPC: Fix various bugs and vulnerabilities in HV KVM Paul Mackerras
` (4 preceding siblings ...)
2012-11-22 9:29 ` [PATCH 5/5] KVM: PPC: Book3S HV: Report correct HPT entry index when reading HPT Paul Mackerras
@ 2012-11-23 15:48 ` Alexander Graf
5 siblings, 0 replies; 33+ messages in thread
From: Alexander Graf @ 2012-11-23 15:48 UTC (permalink / raw)
To: Paul Mackerras; +Cc: kvm-ppc, kvm
On 22.11.2012, at 10:24, Paul Mackerras wrote:
> This series of patches fixes various bugs that we have found recently.
> The bugs fixed in patches 1, 3 and 4 are also vulnerabilities where
> the guest could cause the host to crash or could access host memory
> inappropriately. The bug fixed in patch 2 could cause the host to
> hang or crash after the guest reboots. The bug fixed in patch 5 is a
> simple thinko in the recently-added HPT reading code.
>
> These patches are against Alex Graf's kvm-ppc-next branch. They only
> affect HV code.
>
> The first two patches have been posted previously but got no comments.
>
> Please apply - given the nature of these bugs I'd really like this
> series to make it into the 3.8 merge window.
Thanks, applied 2/5 and 5/5 to kvm-ppc-next :)
Alex
>
> Paul.
>
> arch/powerpc/include/asm/kvm_host.h | 5 +-
> arch/powerpc/kernel/asm-offsets.c | 4 +-
> arch/powerpc/kvm/Makefile | 1 +
> arch/powerpc/kvm/book3s_64_mmu_hv.c | 29 ++++++--
> arch/powerpc/kvm/book3s_hv.c | 9 ++-
> arch/powerpc/kvm/book3s_hv_ras.c | 115 ++++++++++++++++++++++++++++++
> arch/powerpc/kvm/book3s_hv_rm_mmu.c | 59 ++++++++++++++--
> arch/powerpc/kvm/book3s_hv_rmhandlers.S | 117 +++++++++++++++++--------------
> 8 files changed, 271 insertions(+), 68 deletions(-)
* Re: [PATCH 1/5] KVM: PPC: Book3S HV: Handle guest-caused machine checks on POWER7 without panicking
2012-11-23 14:13 ` Alexander Graf
@ 2012-11-23 21:42 ` Paul Mackerras
2012-11-26 13:15 ` Alexander Graf
2012-11-24 8:37 ` [PATCH v2] " Paul Mackerras
1 sibling, 1 reply; 33+ messages in thread
From: Paul Mackerras @ 2012-11-23 21:42 UTC (permalink / raw)
To: Alexander Graf; +Cc: kvm-ppc, kvm
On Fri, Nov 23, 2012 at 03:13:09PM +0100, Alexander Graf wrote:
>
> On 22.11.2012, at 10:25, Paul Mackerras wrote:
>
> > + /* Do they have an SLB shadow buffer registered? */
> > + slb = vcpu->arch.slb_shadow.pinned_addr;
> > + if (!slb)
> > + return;
>
> Mind to explain this case? What happens here? Do we leave the guest with an empty SLB? Why would this ever happen? What happens next as soon as we go back into the guest?
Yes, we leave the guest with an empty SLB, the access gets retried and
this time the guest gets an SLB miss interrupt, which it can hopefully
handle using an SLB miss handler that runs entirely in real mode.
This could happen for instance while the guest is in SLOF or yaboot or
some other code that runs basically in real mode but occasionally
turns the MMU on for some accesses, and happens to have a bug where it
creates a duplicate SLB entry.
> > + /* Sanity check */
> > + n = slb->persistent;
> > + if (n > SLB_MIN_SIZE)
> > + n = SLB_MIN_SIZE;
>
> Please use a max() macro here.
OK.
> > + rb = 0x800; /* IS field = 0b10, flush congruence class */
> > + for (i = 0; i < 128; ++i) {
>
> Please put up a #define for this. POWER7_TLB_SIZE or so. Is there any way to fetch that number from an SPR? I don't really want to have a p7+ and a p8 function in here too ;).
>
> > + asm volatile("tlbiel %0" : : "r" (rb));
> > + rb += 0x1000;
>
> I assume this also means (1 << TLBIE_ENTRY_SHIFT)? Would be nice to keep the code readable without guessing :).
The 0x800 and 0x1000 are taken from the architecture - it defines
fields in the RB value for the flush type and TLB index. The 128 is
POWER7-specific and isn't in any SPR that I know of. Eventually we'll
probably have to put it (the number of TLB congruence classes) in the
cputable, but for now I'll just do a define.
> So I take it that p7 does not implement tlbia?
Correct.
> So we never return 0? How about ECC errors and the likes? Wouldn't those also be #MCs that the host needs to handle?
Yes, true. In fact the OPAL firmware gets to see the machine checks
before we do (see the opal_register_exception_handler() calls in
arch/powerpc/platforms/powernv/opal.c), so it should have already
handled recoverable things like L1 cache parity errors.
I'll make the function return 0 if there's an error that it doesn't
know about.
> > ld r8, HSTATE_VMHANDLER(r13)
> > ld r7, HSTATE_HOST_MSR(r13)
> >
> > cmpwi r12, BOOK3S_INTERRUPT_EXTERNAL
> > +BEGIN_FTR_SECTION
> > beq 11f
> > - cmpwi r12, BOOK3S_INTERRUPT_MACHINE_CHECK
> > +END_FTR_SECTION_IFSET(CPU_FTR_ARCH_206)
>
> Mind to explain the logic that happens here? Do we get external interrupts on 970? If not, the cmpwi should also be inside the FTR section. Also, if we do a beq here, why do the beqctr below again?
I was making it not call the host kernel machine check handler if it
was a machine check that pulled us out of the guest. In fact we
probably do still want to call the handler, but we don't want to jump
to 0x200, since that has been patched by OPAL, and we don't want to
make OPAL think we just got another machine check. Instead we would
need to jump to machine_check_pSeries.
The feature section is because POWER7 sets HSRR0/1 on external
interrupts, whereas PPC970 sets SRR0/1.
Paul.
* Re: [PATCH 3/5] KVM: PPC: Book3S HV: Improve handling of local vs. global TLB invalidations
2012-11-23 15:43 ` Alexander Graf
@ 2012-11-23 22:07 ` Paul Mackerras
2012-11-26 13:10 ` Alexander Graf
0 siblings, 1 reply; 33+ messages in thread
From: Paul Mackerras @ 2012-11-23 22:07 UTC (permalink / raw)
To: Alexander Graf; +Cc: kvm-ppc, kvm
On Fri, Nov 23, 2012 at 04:43:03PM +0100, Alexander Graf wrote:
>
> On 22.11.2012, at 10:28, Paul Mackerras wrote:
>
> > - With the possibility of the host paging out guest pages, the use of
> > H_LOCAL by an SMP guest is dangerous since the guest could possibly
> > retain and use a stale TLB entry pointing to a page that had been
> > removed from the guest.
>
> I don't understand this part. Don't we flush the TLB when the page gets evicted from the shadow HTAB?
The H_LOCAL flag is something that we invented to allow the guest to
tell the host "I only ever used this translation (HPTE) on the current
vcpu" when it's removing or modifying an HPTE. The idea is that this
would then let the host use the tlbiel instruction (local TLB
invalidate) rather than the usual global tlbie instruction. Tlbiel is
faster because it doesn't need to go out on the fabric and get
processed by all cpus. In fact our guests don't use it at present,
but we put it in because we thought we should be able to get a
performance improvement, particularly on large machines.
However, the catch is that the guest's setting of H_LOCAL might be
incorrect, in which case we could have a stale TLB entry on another
physical cpu. While the physical page that it refers to is still
owned by the guest, that stale entry doesn't matter from the host's
point of view. But if the host wants to take that page away from the
guest, the stale entry becomes a problem.
The solution I implement here is just not to use tlbiel in SMP guests.
UP guests are not so much of a problem because the potential attack
from the guest relies on having one cpu remove the HPTE and do tlbiel
while another cpu uses the stale TLB entry, which you can't do if you
only have one cpu.
Paul.
* Re: [PATCH 4/5] KVM: PPC: Book3S HV: Don't give the guest RW access to RO pages
2012-11-23 15:47 ` Alexander Graf
@ 2012-11-23 22:13 ` Paul Mackerras
2012-11-24 9:05 ` Alexander Graf
0 siblings, 1 reply; 33+ messages in thread
From: Paul Mackerras @ 2012-11-23 22:13 UTC (permalink / raw)
To: Alexander Graf; +Cc: kvm-ppc, kvm
On Fri, Nov 23, 2012 at 04:47:45PM +0100, Alexander Graf wrote:
>
> On 22.11.2012, at 10:28, Paul Mackerras wrote:
>
> > Currently, if the guest does an H_PROTECT hcall requesting that the
> > permissions on a HPT entry be changed to allow writing, we make the
> > requested change even if the page is marked read-only in the host
> > Linux page tables. This is a problem since it would for instance
> > allow a guest to modify a page that KSM has decided can be shared
> > between multiple guests.
> >
> > To fix this, if the new permissions for the page allow writing, we need
> > to look up the memslot for the page, work out the host virtual address,
> > and look up the Linux page tables to get the PTE for the page. If that
> > PTE is read-only, we reduce the HPTE permissions to read-only.
>
> How does KSM handle this usually? If you reduce the permissions to R/O, how do you ever get a R/W page from a deduplicated one?
The scenario goes something like this:
1. Guest creates an HPTE with RO permissions.
2. KSM decides the page is identical to another page and changes the
HPTE to point to a shared copy. Permissions are still RO.
3. Guest decides it wants write access to the page and does an
H_PROTECT hcall to change the permissions on the HPTE to RW.
The bug is that we actually make the requested change in step 3.
Instead we should leave it at RO, then when the guest tries to write
to the page, we take a hypervisor page fault, copy the page and give
the guest write access to its own copy of the page.
So what this patch does is add code to H_PROTECT so that if the guest
is requesting RW access, we check the Linux PTE to see if the
underlying guest page is RO, and if so reduce the permissions in the
HPTE to RO.
Paul.
* [PATCH v2] KVM: PPC: Book3S HV: Handle guest-caused machine checks on POWER7 without panicking
2012-11-23 14:13 ` Alexander Graf
2012-11-23 21:42 ` Paul Mackerras
@ 2012-11-24 8:37 ` Paul Mackerras
2012-11-26 23:16 ` Alexander Graf
2012-12-22 14:09 ` [PATCH] KVM: PPC: Book3S HV: Fix compilation without CONFIG_PPC_POWERNV Andreas Schwab
1 sibling, 2 replies; 33+ messages in thread
From: Paul Mackerras @ 2012-11-24 8:37 UTC (permalink / raw)
To: Alexander Graf; +Cc: kvm-ppc, kvm
Currently, if a machine check interrupt happens while we are in the
guest, we exit the guest and call the host's machine check handler,
which tends to cause the host to panic. Some machine checks can be
triggered by the guest; for example, if the guest creates two entries
in the SLB that map the same effective address, and then accesses that
effective address, the CPU will take a machine check interrupt.
To handle this better, when a machine check happens inside the guest,
we call a new function, kvmppc_realmode_machine_check(), while still in
real mode before exiting the guest. On POWER7, it handles the cases
that the guest can trigger, either by flushing and reloading the SLB,
or by flushing the TLB, and then it delivers the machine check interrupt
directly to the guest without going back to the host. On POWER7, the
OPAL firmware patches the machine check interrupt vector so that it
gets control first, and it leaves behind its analysis of the situation
in a structure pointed to by the opal_mc_evt field of the paca. The
kvmppc_realmode_machine_check() function looks at this, and if OPAL
reports that there was no error, or that it has handled the error, we
also go straight back to the guest with a machine check. We have to
deliver a machine check to the guest since the machine check interrupt
might have trashed valid values in SRR0/1.
If the machine check is one we can't handle in real mode, and one that
OPAL hasn't already handled, or on PPC970, we exit the guest and call
the host's machine check handler. We do this by jumping to the
machine_check_fwnmi label, rather than absolute address 0x200, because
we don't want to re-execute OPAL's handler on POWER7. On PPC970, the
two are equivalent because address 0x200 just contains a branch.
Then, if the host machine check handler decides that the system can
continue executing, kvmppc_handle_exit() delivers a machine check
interrupt to the guest -- once again to let the guest know that SRR0/1
have been modified.
Signed-off-by: Paul Mackerras <paulus@samba.org>
---
Alex: let me know if you want me to redo 3/5 and 4/5 at all. They
should still apply cleanly on top of this.
arch/powerpc/include/asm/mmu-hash64.h | 10 +++
arch/powerpc/kvm/Makefile | 1 +
arch/powerpc/kvm/book3s_hv.c | 11 +++
arch/powerpc/kvm/book3s_hv_ras.c | 144 +++++++++++++++++++++++++++++++
arch/powerpc/kvm/book3s_hv_rmhandlers.S | 75 ++++++++++------
5 files changed, 213 insertions(+), 28 deletions(-)
create mode 100644 arch/powerpc/kvm/book3s_hv_ras.c
diff --git a/arch/powerpc/include/asm/mmu-hash64.h b/arch/powerpc/include/asm/mmu-hash64.h
index 9673f73..2fdb47a 100644
--- a/arch/powerpc/include/asm/mmu-hash64.h
+++ b/arch/powerpc/include/asm/mmu-hash64.h
@@ -121,6 +121,16 @@ extern char initial_stab[];
#define PP_RXRX 3 /* Supervisor read, User read */
#define PP_RXXX (HPTE_R_PP0 | 2) /* Supervisor read, user none */
+/* Fields for tlbiel instruction in architecture 2.06 */
+#define TLBIEL_INVAL_SEL_MASK 0xc00 /* invalidation selector */
+#define TLBIEL_INVAL_PAGE 0x000 /* invalidate a single page */
+#define TLBIEL_INVAL_SET_LPID 0x800 /* invalidate a set for current LPID */
+#define TLBIEL_INVAL_SET 0xc00 /* invalidate a set for all LPIDs */
+#define TLBIEL_INVAL_SET_MASK 0xfff000 /* set number to inval. */
+#define TLBIEL_INVAL_SET_SHIFT 12
+
+#define POWER7_TLB_SETS 128 /* # sets in POWER7 TLB */
+
#ifndef __ASSEMBLY__
struct hash_pte {
diff --git a/arch/powerpc/kvm/Makefile b/arch/powerpc/kvm/Makefile
index cd89658..1e473d4 100644
--- a/arch/powerpc/kvm/Makefile
+++ b/arch/powerpc/kvm/Makefile
@@ -73,6 +73,7 @@ kvm-book3s_64-builtin-objs-$(CONFIG_KVM_BOOK3S_64_HV) := \
book3s_hv_rmhandlers.o \
book3s_hv_rm_mmu.o \
book3s_64_vio_hv.o \
+ book3s_hv_ras.o \
book3s_hv_builtin.o
kvm-book3s_64-module-objs := \
diff --git a/arch/powerpc/kvm/book3s_hv.c b/arch/powerpc/kvm/book3s_hv.c
index a4f59db..483c503 100644
--- a/arch/powerpc/kvm/book3s_hv.c
+++ b/arch/powerpc/kvm/book3s_hv.c
@@ -545,6 +545,17 @@ static int kvmppc_handle_exit(struct kvm_run *run, struct kvm_vcpu *vcpu,
case BOOK3S_INTERRUPT_PERFMON:
r = RESUME_GUEST;
break;
+ case BOOK3S_INTERRUPT_MACHINE_CHECK:
+ /*
+ * Deliver a machine check interrupt to the guest.
+ * We have to do this, even if the host has handled the
+ * machine check, because machine checks use SRR0/1 and
+ * the interrupt might have trashed guest state in them.
+ */
+ kvmppc_book3s_queue_irqprio(vcpu,
+ BOOK3S_INTERRUPT_MACHINE_CHECK);
+ r = RESUME_GUEST;
+ break;
case BOOK3S_INTERRUPT_PROGRAM:
{
ulong flags;
diff --git a/arch/powerpc/kvm/book3s_hv_ras.c b/arch/powerpc/kvm/book3s_hv_ras.c
new file mode 100644
index 0000000..e6645f6
--- /dev/null
+++ b/arch/powerpc/kvm/book3s_hv_ras.c
@@ -0,0 +1,144 @@
+/*
+ * This program is free software; you can redistribute it and/or modify
+ * it under the terms of the GNU General Public License, version 2, as
+ * published by the Free Software Foundation.
+ *
+ * Copyright 2012 Paul Mackerras, IBM Corp. <paulus@au1.ibm.com>
+ */
+
+#include <linux/types.h>
+#include <linux/string.h>
+#include <linux/kvm.h>
+#include <linux/kvm_host.h>
+#include <linux/kernel.h>
+#include <asm/opal.h>
+
+/* SRR1 bits for machine check on POWER7 */
+#define SRR1_MC_LDSTERR (1ul << (63-42))
+#define SRR1_MC_IFETCH_SH (63-45)
+#define SRR1_MC_IFETCH_MASK 0x7
+#define SRR1_MC_IFETCH_SLBPAR 2 /* SLB parity error */
+#define SRR1_MC_IFETCH_SLBMULTI 3 /* SLB multi-hit */
+#define SRR1_MC_IFETCH_SLBPARMULTI 4 /* SLB parity + multi-hit */
+#define SRR1_MC_IFETCH_TLBMULTI 5 /* I-TLB multi-hit */
+
+/* DSISR bits for machine check on POWER7 */
+#define DSISR_MC_DERAT_MULTI 0x800 /* D-ERAT multi-hit */
+#define DSISR_MC_TLB_MULTI 0x400 /* D-TLB multi-hit */
+#define DSISR_MC_SLB_PARITY 0x100 /* SLB parity error */
+#define DSISR_MC_SLB_MULTI 0x080 /* SLB multi-hit */
+#define DSISR_MC_SLB_PARMULTI 0x040 /* SLB parity + multi-hit */
+
+/* POWER7 SLB flush and reload */
+static void reload_slb(struct kvm_vcpu *vcpu)
+{
+ struct slb_shadow *slb;
+ unsigned long i, n;
+
+ /* First clear out SLB */
+ asm volatile("slbmte %0,%0; slbia" : : "r" (0));
+
+ /* Do they have an SLB shadow buffer registered? */
+ slb = vcpu->arch.slb_shadow.pinned_addr;
+ if (!slb)
+ return;
+
+ /* Sanity check */
+ n = min(slb->persistent, (u32) SLB_MIN_SIZE);
+ if ((void *) &slb->save_area[n] > vcpu->arch.slb_shadow.pinned_end)
+ return;
+
+ /* Load up the SLB from that */
+ for (i = 0; i < n; ++i) {
+ unsigned long rb = slb->save_area[i].esid;
+ unsigned long rs = slb->save_area[i].vsid;
+
+ rb = (rb & ~0xFFFul) | i; /* insert entry number */
+ asm volatile("slbmte %0,%1" : : "r" (rs), "r" (rb));
+ }
+}
+
+/* POWER7 TLB flush */
+static void flush_tlb_power7(struct kvm_vcpu *vcpu)
+{
+ unsigned long i, rb;
+
+ rb = TLBIEL_INVAL_SET_LPID;
+ for (i = 0; i < POWER7_TLB_SETS; ++i) {
+ asm volatile("tlbiel %0" : : "r" (rb));
+ rb += 1 << TLBIEL_INVAL_SET_SHIFT;
+ }
+}
+
+/*
+ * On POWER7, see if we can handle a machine check that occurred inside
+ * the guest in real mode, without switching to the host partition.
+ *
+ * Returns: 0 => exit guest, 1 => deliver machine check to guest
+ */
+static long kvmppc_realmode_mc_power7(struct kvm_vcpu *vcpu)
+{
+ unsigned long srr1 = vcpu->arch.shregs.msr;
+ struct opal_machine_check_event *opal_evt;
+ long handled = 1;
+
+ if (srr1 & SRR1_MC_LDSTERR) {
+ /* error on load/store */
+ unsigned long dsisr = vcpu->arch.shregs.dsisr;
+
+ if (dsisr & (DSISR_MC_SLB_PARMULTI | DSISR_MC_SLB_MULTI |
+ DSISR_MC_SLB_PARITY | DSISR_MC_DERAT_MULTI)) {
+ /* flush and reload SLB; flushes D-ERAT too */
+ reload_slb(vcpu);
+ dsisr &= ~(DSISR_MC_SLB_PARMULTI | DSISR_MC_SLB_MULTI |
+ DSISR_MC_SLB_PARITY | DSISR_MC_DERAT_MULTI);
+ }
+ if (dsisr & DSISR_MC_TLB_MULTI) {
+ flush_tlb_power7(vcpu);
+ dsisr &= ~DSISR_MC_TLB_MULTI;
+ }
+ /* Any other errors we don't understand? */
+ if (dsisr & 0xffffffffUL)
+ handled = 0;
+ }
+
+ switch ((srr1 >> SRR1_MC_IFETCH_SH) & SRR1_MC_IFETCH_MASK) {
+ case 0:
+ break;
+ case SRR1_MC_IFETCH_SLBPAR:
+ case SRR1_MC_IFETCH_SLBMULTI:
+ case SRR1_MC_IFETCH_SLBPARMULTI:
+ reload_slb(vcpu);
+ break;
+ case SRR1_MC_IFETCH_TLBMULTI:
+ flush_tlb_power7(vcpu);
+ break;
+ default:
+ handled = 0;
+ }
+
+ /*
+ * See if OPAL has already handled the condition.
+ * We assume that if the condition is recovered then OPAL
+ * will have generated an error log event that we will pick
+ * up and log later.
+ */
+ opal_evt = local_paca->opal_mc_evt;
+ if (opal_evt->version == OpalMCE_V1 &&
+ (opal_evt->severity == OpalMCE_SEV_NO_ERROR ||
+ opal_evt->disposition == OpalMCE_DISPOSITION_RECOVERED))
+ handled = 1;
+
+ if (handled)
+ opal_evt->in_use = 0;
+
+ return handled;
+}
+
+long kvmppc_realmode_machine_check(struct kvm_vcpu *vcpu)
+{
+ if (cpu_has_feature(CPU_FTR_ARCH_206))
+ return kvmppc_realmode_mc_power7(vcpu);
+
+ return 0;
+}
diff --git a/arch/powerpc/kvm/book3s_hv_rmhandlers.S b/arch/powerpc/kvm/book3s_hv_rmhandlers.S
index 690d112..d4d6d8a 100644
--- a/arch/powerpc/kvm/book3s_hv_rmhandlers.S
+++ b/arch/powerpc/kvm/book3s_hv_rmhandlers.S
@@ -27,6 +27,7 @@
#include <asm/asm-offsets.h>
#include <asm/exception-64s.h>
#include <asm/kvm_book3s_asm.h>
+#include <asm/mmu-hash64.h>
/*****************************************************************************
* *
@@ -682,8 +683,7 @@ BEGIN_FTR_SECTION
1:
END_FTR_SECTION_IFSET(CPU_FTR_ARCH_206)
-nohpte_cont:
-hcall_real_cont: /* r9 = vcpu, r12 = trap, r13 = paca */
+guest_exit_cont: /* r9 = vcpu, r12 = trap, r13 = paca */
/* Save DEC */
mfspr r5,SPRN_DEC
mftb r6
@@ -704,6 +704,11 @@ END_FTR_SECTION_IFSET(CPU_FTR_ARCH_206)
std r6, VCPU_FAULT_DAR(r9)
stw r7, VCPU_FAULT_DSISR(r9)
+ /* See if it is a machine check */
+ cmpwi r12, BOOK3S_INTERRUPT_MACHINE_CHECK
+ beq machine_check_realmode
+mc_cont:
+
/* Save guest CTRL register, set runlatch to 1 */
6: mfspr r6,SPRN_CTRLF
stw r6,VCPU_CTRL(r9)
@@ -1116,38 +1121,41 @@ END_FTR_SECTION_IFSET(CPU_FTR_ARCH_201)
/*
* For external and machine check interrupts, we need
* to call the Linux handler to process the interrupt.
- * We do that by jumping to the interrupt vector address
- * which we have in r12. The [h]rfid at the end of the
+ * We do that by jumping to absolute address 0x500 for
+ * external interrupts, or the machine_check_fwnmi label
+ * for machine checks (since firmware might have patched
+ * the vector area at 0x200). The [h]rfid at the end of the
* handler will return to the book3s_hv_interrupts.S code.
* For other interrupts we do the rfid to get back
- * to the book3s_interrupts.S code here.
+ * to the book3s_hv_interrupts.S code here.
*/
ld r8, HSTATE_VMHANDLER(r13)
ld r7, HSTATE_HOST_MSR(r13)
+ cmpwi cr1, r12, BOOK3S_INTERRUPT_MACHINE_CHECK
cmpwi r12, BOOK3S_INTERRUPT_EXTERNAL
+BEGIN_FTR_SECTION
beq 11f
- cmpwi r12, BOOK3S_INTERRUPT_MACHINE_CHECK
+END_FTR_SECTION_IFSET(CPU_FTR_ARCH_206)
/* RFI into the highmem handler, or branch to interrupt handler */
-12: mfmsr r6
- mtctr r12
+ mfmsr r6
li r0, MSR_RI
andc r6, r6, r0
mtmsrd r6, 1 /* Clear RI in MSR */
mtsrr0 r8
mtsrr1 r7
- beqctr
+ beqa 0x500 /* external interrupt (PPC970) */
+ beq cr1, 13f /* machine check */
RFI
-11:
-BEGIN_FTR_SECTION
- b 12b
-END_FTR_SECTION_IFSET(CPU_FTR_ARCH_201)
- mtspr SPRN_HSRR0, r8
+ /* On POWER7, we have external interrupts set to use HSRR0/1 */
+11: mtspr SPRN_HSRR0, r8
mtspr SPRN_HSRR1, r7
ba 0x500
+13: b machine_check_fwnmi
+
/*
* Check whether an HDSI is an HPTE not found fault or something else.
* If it is an HPTE not found fault that is due to the guest accessing
@@ -1180,7 +1188,7 @@ kvmppc_hdsi:
cmpdi r3, 0 /* retry the instruction */
beq 6f
cmpdi r3, -1 /* handle in kernel mode */
- beq nohpte_cont
+ beq guest_exit_cont
cmpdi r3, -2 /* MMIO emulation; need instr word */
beq 2f
@@ -1194,6 +1202,7 @@ kvmppc_hdsi:
li r10, BOOK3S_INTERRUPT_DATA_STORAGE
li r11, (MSR_ME << 1) | 1 /* synthesize MSR_SF | MSR_ME */
rotldi r11, r11, 63
+fast_interrupt_c_return:
6: ld r7, VCPU_CTR(r9)
lwz r8, VCPU_XER(r9)
mtctr r7
@@ -1226,7 +1235,7 @@ kvmppc_hdsi:
/* Unset guest mode. */
li r0, KVM_GUEST_MODE_NONE
stb r0, HSTATE_IN_GUEST(r13)
- b nohpte_cont
+ b guest_exit_cont
/*
* Similarly for an HISI, reflect it to the guest as an ISI unless
@@ -1252,9 +1261,9 @@ kvmppc_hisi:
ld r11, VCPU_MSR(r9)
li r12, BOOK3S_INTERRUPT_H_INST_STORAGE
cmpdi r3, 0 /* retry the instruction */
- beq 6f
+ beq fast_interrupt_c_return
cmpdi r3, -1 /* handle in kernel mode */
- beq nohpte_cont
+ beq guest_exit_cont
/* Synthesize an ISI for the guest */
mr r11, r3
@@ -1263,12 +1272,7 @@ kvmppc_hisi:
li r10, BOOK3S_INTERRUPT_INST_STORAGE
li r11, (MSR_ME << 1) | 1 /* synthesize MSR_SF | MSR_ME */
rotldi r11, r11, 63
-6: ld r7, VCPU_CTR(r9)
- lwz r8, VCPU_XER(r9)
- mtctr r7
- mtxer r8
- mr r4, r9
- b fast_guest_return
+ b fast_interrupt_c_return
3: ld r6, VCPU_KVM(r9) /* not relocated, use VRMA */
ld r5, KVM_VRMA_SLB_V(r6)
@@ -1284,14 +1288,14 @@ kvmppc_hisi:
hcall_try_real_mode:
ld r3,VCPU_GPR(R3)(r9)
andi. r0,r11,MSR_PR
- bne hcall_real_cont
+ bne guest_exit_cont
clrrdi r3,r3,2
cmpldi r3,hcall_real_table_end - hcall_real_table
- bge hcall_real_cont
+ bge guest_exit_cont
LOAD_REG_ADDR(r4, hcall_real_table)
lwzx r3,r3,r4
cmpwi r3,0
- beq hcall_real_cont
+ beq guest_exit_cont
add r3,r3,r4
mtctr r3
mr r3,r9 /* get vcpu pointer */
@@ -1312,7 +1316,7 @@ hcall_real_fallback:
li r12,BOOK3S_INTERRUPT_SYSCALL
ld r9, HSTATE_KVM_VCPU(r13)
- b hcall_real_cont
+ b guest_exit_cont
.globl hcall_real_table
hcall_real_table:
@@ -1571,6 +1575,21 @@ kvm_cede_exit:
li r3,H_TOO_HARD
blr
+ /* Try to handle a machine check in real mode */
+machine_check_realmode:
+ mr r3, r9 /* get vcpu pointer */
+ bl .kvmppc_realmode_machine_check
+ nop
+ cmpdi r3, 0 /* continue exiting from guest? */
+ ld r9, HSTATE_KVM_VCPU(r13)
+ li r12, BOOK3S_INTERRUPT_MACHINE_CHECK
+ beq mc_cont
+ /* If not, deliver a machine check. SRR0/1 are already set */
+ li r10, BOOK3S_INTERRUPT_MACHINE_CHECK
+ li r11, (MSR_ME << 1) | 1 /* synthesize MSR_SF | MSR_ME */
+ rotldi r11, r11, 63
+ b fast_interrupt_c_return
+
secondary_too_late:
ld r5,HSTATE_KVM_VCORE(r13)
HMT_LOW
--
1.7.10.rc3.219.g53414
* Re: [PATCH 4/5] KVM: PPC: Book3S HV: Don't give the guest RW access to RO pages
2012-11-23 22:13 ` Paul Mackerras
@ 2012-11-24 9:05 ` Alexander Graf
2012-11-24 9:32 ` Paul Mackerras
0 siblings, 1 reply; 33+ messages in thread
From: Alexander Graf @ 2012-11-24 9:05 UTC (permalink / raw)
To: Paul Mackerras; +Cc: kvm-ppc@vger.kernel.org, kvm@vger.kernel.org
On 23.11.2012, at 23:13, Paul Mackerras <paulus@samba.org> wrote:
> On Fri, Nov 23, 2012 at 04:47:45PM +0100, Alexander Graf wrote:
>>
>> On 22.11.2012, at 10:28, Paul Mackerras wrote:
>>
>>> Currently, if the guest does an H_PROTECT hcall requesting that the
>>> permissions on a HPT entry be changed to allow writing, we make the
>>> requested change even if the page is marked read-only in the host
>>> Linux page tables. This is a problem since it would for instance
>>> allow a guest to modify a page that KSM has decided can be shared
>>> between multiple guests.
>>>
>>> To fix this, if the new permissions for the page allow writing, we need
>>> to look up the memslot for the page, work out the host virtual address,
>>> and look up the Linux page tables to get the PTE for the page. If that
>>> PTE is read-only, we reduce the HPTE permissions to read-only.
>>
>> How does KSM handle this usually? If you reduce the permissions to R/O, how do you ever get a R/W page from a deduplicated one?
>
> The scenario goes something like this:
>
> 1. Guest creates an HPTE with RO permissions.
> 2. KSM decides the page is identical to another page and changes the
> HPTE to point to a shared copy. Permissions are still RO.
> 3. Guest decides it wants write access to the page and does an
> H_PROTECT hcall to change the permissions on the HPTE to RW.
>
> The bug is that we actually make the requested change in step 3.
> Instead we should leave it at RO, then when the guest tries to write
> to the page, we take a hypervisor page fault, copy the page and give
> the guest write access to its own copy of the page.
>
> So what this patch does is add code to H_PROTECT so that if the guest
> is requesting RW access, we check the Linux PTE to see if the
> underlying guest page is RO, and if so reduce the permissions in the
> HPTE to RO.
But this will be guest visible, because now H_PROTECT doesn't actually mark the page R/W in the HTAB, right?
So the flow with this patch is:
- guest page permission fault
- guest does H_PROTECT to mark page r/w
- H_PROTECT doesn't do anything
- guest returns from permission handler, triggers write fault
2 questions here:
How does the host know that the page is actually r/w?
How does this work on 970? I thought page faults always go straight to the guest there.
Alex
* Re: [PATCH 4/5] KVM: PPC: Book3S HV: Don't give the guest RW access to RO pages
2012-11-24 9:05 ` Alexander Graf
@ 2012-11-24 9:32 ` Paul Mackerras
2012-11-26 13:09 ` Alexander Graf
0 siblings, 1 reply; 33+ messages in thread
From: Paul Mackerras @ 2012-11-24 9:32 UTC (permalink / raw)
To: Alexander Graf; +Cc: kvm-ppc@vger.kernel.org, kvm@vger.kernel.org
On Sat, Nov 24, 2012 at 10:05:37AM +0100, Alexander Graf wrote:
>
>
> On 23.11.2012, at 23:13, Paul Mackerras <paulus@samba.org> wrote:
>
> > On Fri, Nov 23, 2012 at 04:47:45PM +0100, Alexander Graf wrote:
> >>
> >> On 22.11.2012, at 10:28, Paul Mackerras wrote:
> >>
> >>> Currently, if the guest does an H_PROTECT hcall requesting that the
> >>> permissions on a HPT entry be changed to allow writing, we make the
> >>> requested change even if the page is marked read-only in the host
> >>> Linux page tables. This is a problem since it would for instance
> >>> allow a guest to modify a page that KSM has decided can be shared
> >>> between multiple guests.
> >>>
> >>> To fix this, if the new permissions for the page allow writing, we need
> >>> to look up the memslot for the page, work out the host virtual address,
> >>> and look up the Linux page tables to get the PTE for the page. If that
> >>> PTE is read-only, we reduce the HPTE permissions to read-only.
> >>
> >> How does KSM handle this usually? If you reduce the permissions to R/O, how do you ever get a R/W page from a deduplicated one?
> >
> > The scenario goes something like this:
> >
> > 1. Guest creates an HPTE with RO permissions.
> > 2. KSM decides the page is identical to another page and changes the
> > HPTE to point to a shared copy. Permissions are still RO.
> > 3. Guest decides it wants write access to the page and does an
> > H_PROTECT hcall to change the permissions on the HPTE to RW.
> >
> > The bug is that we actually make the requested change in step 3.
> > Instead we should leave it at RO, then when the guest tries to write
> > to the page, we take a hypervisor page fault, copy the page and give
> > the guest write access to its own copy of the page.
> >
> > So what this patch does is add code to H_PROTECT so that if the guest
> > is requesting RW access, we check the Linux PTE to see if the
> > underlying guest page is RO, and if so reduce the permissions in the
> > HPTE to RO.
>
> But this will be guest visible, because now H_PROTECT doesn't actually mark the page R/W in the HTAB, right?
No - the guest view of the HPTE has R/W permissions. The guest view
of the HPTE is made up of doubleword 0 from the real HPT plus
rev->guest_rpte for doubleword 1 (where rev is the entry in the revmap
array, kvm->arch.revmap, for the HPTE). The guest view can be
different from the host/hardware view, which is in the real HPT. For
instance, the guest view of a HPTE might be valid but the host view
might be invalid because the underlying real page has been paged out -
in that case we use a software bit which we call HPTE_V_ABSENT to
remind ourselves that there is something valid there from the guest's
point of view. Or the guest view can be R/W but the host view is RO,
as in the case where KSM has merged the page.
> So the flow with this patch is:
>
> - guest page permission fault
This comes through the host (kvmppc_hpte_hv_fault()) which looks at
the guest view of the HPTE, sees that it has RO permissions, and sends
the page fault to the guest.
> - guest does H_PROTECT to mark page r/w
> - H_PROTECT doesn't do anything
> - guest returns from permission handler, triggers write fault
This comes once again to kvmppc_hpte_hv_fault(), which sees that the
guest view of the HPTE has R/W permissions now, and sends the page
fault to kvmppc_book3s_hv_page_fault(), which requests write access to
the page, possibly triggering copy-on-write or whatever, and updates
the real HPTE to have R/W permissions and possibly point to a new page
of memory.
>
> 2 questions here:
>
> How does the host know that the page is actually r/w?
I assume you mean RO? It looks up the memslot for the guest physical
address (which it gets from rev->guest_rpte), uses that to work out
the host virtual address (i.e. the address in qemu's address space),
looks up the Linux PTE in qemu's Linux page tables, and looks at the
_PAGE_RW bit there.
> How does this work on 970? I thought page faults always go straight to the guest there.
They do, which is why PPC970 can't do any of this. On PPC970 we have
kvm->arch.using_mmu_notifiers == 0, and that makes the code pin every
page of guest memory that is mapped by a guest HPTE (with a Linux
guest, that means every page, because of the linear mapping). On
POWER7 we have kvm->arch.using_mmu_notifiers == 1, which enables
host paging and deduplication of guest memory.
Paul.
* Re: [PATCH 4/5] KVM: PPC: Book3S HV: Don't give the guest RW access to RO pages
2012-11-24 9:32 ` Paul Mackerras
@ 2012-11-26 13:09 ` Alexander Graf
0 siblings, 0 replies; 33+ messages in thread
From: Alexander Graf @ 2012-11-26 13:09 UTC (permalink / raw)
To: Paul Mackerras; +Cc: kvm-ppc@vger.kernel.org, kvm@vger.kernel.org
On 24.11.2012, at 10:32, Paul Mackerras wrote:
> On Sat, Nov 24, 2012 at 10:05:37AM +0100, Alexander Graf wrote:
>>
>>
>> On 23.11.2012, at 23:13, Paul Mackerras <paulus@samba.org> wrote:
>>
>>> On Fri, Nov 23, 2012 at 04:47:45PM +0100, Alexander Graf wrote:
>>>>
>>>> On 22.11.2012, at 10:28, Paul Mackerras wrote:
>>>>
>>>>> Currently, if the guest does an H_PROTECT hcall requesting that the
>>>>> permissions on a HPT entry be changed to allow writing, we make the
>>>>> requested change even if the page is marked read-only in the host
>>>>> Linux page tables. This is a problem since it would for instance
>>>>> allow a guest to modify a page that KSM has decided can be shared
>>>>> between multiple guests.
>>>>>
>>>>> To fix this, if the new permissions for the page allow writing, we need
>>>>> to look up the memslot for the page, work out the host virtual address,
>>>>> and look up the Linux page tables to get the PTE for the page. If that
>>>>> PTE is read-only, we reduce the HPTE permissions to read-only.
>>>>
>>>> How does KSM handle this usually? If you reduce the permissions to R/O, how do you ever get a R/W page from a deduplicated one?
>>>
>>> The scenario goes something like this:
>>>
>>> 1. Guest creates an HPTE with RO permissions.
>>> 2. KSM decides the page is identical to another page and changes the
>>> HPTE to point to a shared copy. Permissions are still RO.
>>> 3. Guest decides it wants write access to the page and does an
>>> H_PROTECT hcall to change the permissions on the HPTE to RW.
>>>
>>> The bug is that we actually make the requested change in step 3.
>>> Instead we should leave it at RO, then when the guest tries to write
>>> to the page, we take a hypervisor page fault, copy the page and give
>>> the guest write access to its own copy of the page.
>>>
>>> So what this patch does is add code to H_PROTECT so that if the guest
>>> is requesting RW access, we check the Linux PTE to see if the
>>> underlying guest page is RO, and if so reduce the permissions in the
>>> HPTE to RO.
>>
>> But this will be guest visible, because now H_PROTECT doesn't actually mark the page R/W in the HTAB, right?
>
> No - the guest view of the HPTE has R/W permissions. The guest view
> of the HPTE is made up of doubleword 0 from the real HPT plus
> rev->guest_rpte for doubleword 1 (where rev is the entry in the revmap
> array, kvm->arch.revmap, for the HPTE). The guest view can be
> different from the host/hardware view, which is in the real HPT. For
> instance, the guest view of a HPTE might be valid but the host view
> might be invalid because the underlying real page has been paged out -
> in that case we use a software bit which we call HPTE_V_ABSENT to
> remind ourselves that there is something valid there from the guest's
> point of view. Or the guest view can be R/W but the host view is RO,
> as in the case where KSM has merged the page.
>
>> So the flow with this patch is:
>>
>> - guest page permission fault
>
> This comes through the host (kvmppc_hpte_hv_fault()) which looks at
> the guest view of the HPTE, sees that it has RO permissions, and sends
> the page fault to the guest.
>
>> - guest does H_PROTECT to mark page r/w
>> - H_PROTECT doesn't do anything
>> - guest returns from permission handler, triggers write fault
>
> This comes once again to kvmppc_hpte_hv_fault(), which sees that the
> guest view of the HPTE has R/W permissions now, and sends the page
> fault to kvmppc_book3s_hv_page_fault(), which requests write access to
> the page, possibly triggering copy-on-write or whatever, and updates
> the real HPTE to have R/W permissions and possibly point to a new page
> of memory.
>
>>
>> 2 questions here:
>>
>> How does the host know that the page is actually r/w?
>
> I assume you mean RO? It looks up the memslot for the guest physical
> address (which it gets from rev->guest_rpte), uses that to work out
> the host virtual address (i.e. the address in qemu's address space),
> looks up the Linux PTE in qemu's Linux page tables, and looks at the
> _PAGE_RW bit there.
>
>> How does this work on 970? I thought page faults always go straight to the guest there.
>
> They do, which is why PPC970 can't do any of this. On PPC970 we have
> kvm->arch.using_mmu_notifiers == 0, and that makes the code pin every
> page of guest memory that is mapped by a guest HPTE (with a Linux
> guest, that means every page, because of the linear mapping). On
> POWER7 we have kvm->arch.using_mmu_notifiers == 1, which enables
> host paging and deduplication of guest memory.
Thanks a lot for the detailed explanation! Maybe you guys should just release an HV capable p7 system publicly, so we can deprecate 970 support. That would make a few things quite a bit easier ;)
Thanks, applied to kvm-ppc-next.
Alex
* Re: [PATCH 3/5] KVM: PPC: Book3S HV: Improve handling of local vs. global TLB invalidations
2012-11-23 22:07 ` Paul Mackerras
@ 2012-11-26 13:10 ` Alexander Graf
2012-11-26 21:48 ` Paul Mackerras
0 siblings, 1 reply; 33+ messages in thread
From: Alexander Graf @ 2012-11-26 13:10 UTC (permalink / raw)
To: Paul Mackerras; +Cc: kvm-ppc, kvm
On 23.11.2012, at 23:07, Paul Mackerras wrote:
> On Fri, Nov 23, 2012 at 04:43:03PM +0100, Alexander Graf wrote:
>>
>> On 22.11.2012, at 10:28, Paul Mackerras wrote:
>>
>>> - With the possibility of the host paging out guest pages, the use of
>>> H_LOCAL by an SMP guest is dangerous since the guest could possibly
>>> retain and use a stale TLB entry pointing to a page that had been
>>> removed from the guest.
>>
>> I don't understand this part. Don't we flush the TLB when the page gets evicted from the shadow HTAB?
>
> The H_LOCAL flag is something that we invented to allow the guest to
> tell the host "I only ever used this translation (HPTE) on the current
> vcpu" when it's removing or modifying an HPTE. The idea is that that
> would then let the host use the tlbiel instruction (local TLB
> invalidate) rather than the usual global tlbie instruction. Tlbiel is
> faster because it doesn't need to go out on the fabric and get
> processed by all cpus. In fact our guests don't use it at present,
> but we put it in because we thought we should be able to get a
> performance improvement, particularly on large machines.
>
> However, the catch is that the guest's setting of H_LOCAL might be
> incorrect, in which case we could have a stale TLB entry on another
> physical cpu. While the physical page that it refers to is still
> owned by the guest, that stale entry doesn't matter from the host's
> point of view. But if the host wants to take that page away from the
> guest, the stale entry becomes a problem.
That's exactly where my question lies. Does that mean we don't flush the TLB entry regardless when we take the page away from the guest?
Alex
>
> The solution I implement here is just not to use tlbiel in SMP guests.
> UP guests are not so much of a problem because the potential attack
> from the guest relies on having one cpu remove the HPTE and do tlbiel
> while another cpu uses the stale TLB entry, which you can't do if you
> only have one cpu.
>
> Paul.
* Re: [PATCH 1/5] KVM: PPC: Book3S HV: Handle guest-caused machine checks on POWER7 without panicking
2012-11-23 21:42 ` Paul Mackerras
@ 2012-11-26 13:15 ` Alexander Graf
2012-11-26 21:33 ` Paul Mackerras
0 siblings, 1 reply; 33+ messages in thread
From: Alexander Graf @ 2012-11-26 13:15 UTC (permalink / raw)
To: Paul Mackerras; +Cc: kvm-ppc, kvm
On 23.11.2012, at 22:42, Paul Mackerras wrote:
> On Fri, Nov 23, 2012 at 03:13:09PM +0100, Alexander Graf wrote:
>>
>> On 22.11.2012, at 10:25, Paul Mackerras wrote:
>>
>>> + /* Do they have an SLB shadow buffer registered? */
>>> + slb = vcpu->arch.slb_shadow.pinned_addr;
>>> + if (!slb)
>>> + return;
>>
>> Mind to explain this case? What happens here? Do we leave the guest with an empty SLB? Why would this ever happen? What happens next as soon as we go back into the guest?
>
> Yes, we leave the guest with an empty SLB, the access gets retried and
> this time the guest gets an SLB miss interrupt, which it can hopefully
> handle using an SLB miss handler that runs entirely in real mode.
> This could happen for instance while the guest is in SLOF or yaboot or
> some other code that runs basically in real mode but occasionally
> turns the MMU on for some accesses, and happens to have a bug where it
> creates a duplicate SLB entry.
Is this what pHyp does? Also, is this what we want? Why don't we populate an #MC into the guest so it knows it did something wrong?
Alex
>
>>> + /* Sanity check */
>>> + n = slb->persistent;
>>> + if (n > SLB_MIN_SIZE)
>>> + n = SLB_MIN_SIZE;
>>
>> Please use a max() macro here.
>
> OK.
>
>>> + rb = 0x800; /* IS field = 0b10, flush congruence class */
>>> + for (i = 0; i < 128; ++i) {
>>
>> Please put up a #define for this. POWER7_TLB_SIZE or so. Is there any way to fetch that number from an SPR? I don't really want to have a p7+ and a p8 function in here too ;).
>>
>>> + asm volatile("tlbiel %0" : : "r" (rb));
>>> + rb += 0x1000;
>>
>> I assume this also means (1 << TLBIE_ENTRY_SHIFT)? Would be nice to keep the code readable without guessing :).
>
> The 0x800 and 0x1000 are taken from the architecture - it defines
> fields in the RB value for the flush type and TLB index. The 128 is
> POWER7-specific and isn't in any SPR that I know of. Eventually we'll
> probably have to put it (the number of TLB congruence classes) in the
> cputable, but for now I'll just do a define.
>
>> So I take it that p7 does not implement tlbia?
>
> Correct.
>
>> So we never return 0? How about ECC errors and the likes? Wouldn't those also be #MCs that the host needs to handle?
>
> Yes, true. In fact the OPAL firmware gets to see the machine checks
> before we do (see the opal_register_exception_handler() calls in
> arch/powerpc/platforms/powernv/opal.c), so it should have already
> handled recoverable things like L1 cache parity errors.
>
> I'll make the function return 0 if there's an error that it doesn't
> know about.
>
>>> ld r8, HSTATE_VMHANDLER(r13)
>>> ld r7, HSTATE_HOST_MSR(r13)
>>>
>>> cmpwi r12, BOOK3S_INTERRUPT_EXTERNAL
>>> +BEGIN_FTR_SECTION
>>> beq 11f
>>> - cmpwi r12, BOOK3S_INTERRUPT_MACHINE_CHECK
>>> +END_FTR_SECTION_IFSET(CPU_FTR_ARCH_206)
>>
>> Mind to explain the logic that happens here? Do we get external interrupts on 970? If not, the cmpwi should also be inside the FTR section. Also, if we do a beq here, why do the beqctr below again?
>
> I was making it not call the host kernel machine check handler if it
> was a machine check that pulled us out of the guest. In fact we
> probably do still want to call the handler, but we don't want to jump
> to 0x200, since that has been patched by OPAL, and we don't want to
> make OPAL think we just got another machine check. Instead we would
> need to jump to machine_check_pSeries.
>
> The feature section is because POWER7 sets HSRR0/1 on external
> interrupts, whereas PPC970 sets SRR0/1.
>
> Paul.
* Re: [PATCH 1/5] KVM: PPC: Book3S HV: Handle guest-caused machine checks on POWER7 without panicking
2012-11-26 13:15 ` Alexander Graf
@ 2012-11-26 21:33 ` Paul Mackerras
2012-11-26 21:55 ` Alexander Graf
0 siblings, 1 reply; 33+ messages in thread
From: Paul Mackerras @ 2012-11-26 21:33 UTC (permalink / raw)
To: Alexander Graf; +Cc: kvm-ppc, kvm
On Mon, Nov 26, 2012 at 02:15:16PM +0100, Alexander Graf wrote:
>
> On 23.11.2012, at 22:42, Paul Mackerras wrote:
>
> > On Fri, Nov 23, 2012 at 03:13:09PM +0100, Alexander Graf wrote:
> >>
> >> On 22.11.2012, at 10:25, Paul Mackerras wrote:
> >>
> >>> + /* Do they have an SLB shadow buffer registered? */
> >>> + slb = vcpu->arch.slb_shadow.pinned_addr;
> >>> + if (!slb)
> >>> + return;
> >>
> >> Mind to explain this case? What happens here? Do we leave the guest with an empty SLB? Why would this ever happen? What happens next as soon as we go back into the guest?
> >
> > Yes, we leave the guest with an empty SLB, the access gets retried and
> > this time the guest gets an SLB miss interrupt, which it can hopefully
> > handle using an SLB miss handler that runs entirely in real mode.
> > This could happen for instance while the guest is in SLOF or yaboot or
> > some other code that runs basically in real mode but occasionally
> > turns the MMU on for some accesses, and happens to have a bug where it
> > creates a duplicate SLB entry.
>
> Is this what pHyp does? Also, is this what we want? Why don't we populate an #MC into the guest so it knows it did something wrong?
Yes, yes and we do. Anytime we get a machine check while in the guest
we give the guest a machine check interrupt.
Ultimately we want to implement the "FWNMI" (Firmware-assisted NMI)
thing defined in PAPR which makes the handling of system reset and
machine check slightly nicer for the guest, but that's for later. It
will build on top of the stuff in this patch.
Paul.
* Re: [PATCH 3/5] KVM: PPC: Book3S HV: Improve handling of local vs. global TLB invalidations
2012-11-26 13:10 ` Alexander Graf
@ 2012-11-26 21:48 ` Paul Mackerras
2012-11-26 22:03 ` Alexander Graf
0 siblings, 1 reply; 33+ messages in thread
From: Paul Mackerras @ 2012-11-26 21:48 UTC (permalink / raw)
To: Alexander Graf; +Cc: kvm-ppc, kvm
On Mon, Nov 26, 2012 at 02:10:33PM +0100, Alexander Graf wrote:
>
> On 23.11.2012, at 23:07, Paul Mackerras wrote:
>
> > On Fri, Nov 23, 2012 at 04:43:03PM +0100, Alexander Graf wrote:
> >>
> >> On 22.11.2012, at 10:28, Paul Mackerras wrote:
> >>
> >>> - With the possibility of the host paging out guest pages, the use of
> >>> H_LOCAL by an SMP guest is dangerous since the guest could possibly
> >>> retain and use a stale TLB entry pointing to a page that had been
> >>> removed from the guest.
> >>
> >> I don't understand this part. Don't we flush the TLB when the page gets evicted from the shadow HTAB?
> >
> > The H_LOCAL flag is something that we invented to allow the guest to
> > tell the host "I only ever used this translation (HPTE) on the current
> > vcpu" when it's removing or modifying an HPTE. The idea is that that
> > would then let the host use the tlbiel instruction (local TLB
> > invalidate) rather than the usual global tlbie instruction. Tlbiel is
> > faster because it doesn't need to go out on the fabric and get
> > processed by all cpus. In fact our guests don't use it at present,
> > but we put it in because we thought we should be able to get a
> > performance improvement, particularly on large machines.
> >
> > However, the catch is that the guest's setting of H_LOCAL might be
> > incorrect, in which case we could have a stale TLB entry on another
> > physical cpu. While the physical page that it refers to is still
> > owned by the guest, that stale entry doesn't matter from the host's
> > point of view. But if the host wants to take that page away from the
> > guest, the stale entry becomes a problem.
>
> That's exactly where my question lies. Does that mean we don't flush the TLB entry regardless when we take the page away from the guest?
The question is how to find the TLB entry if the HPTE it came from is
no longer present. Flushing a TLB entry requires a virtual address.
When we're taking a page away from the guest we have the real address
of the page, not the virtual address. We can use the reverse-mapping
chains to loop through all the HPTEs that map the page, and from each
HPTE we can (and do) calculate a virtual address and do a TLBIE on
that virtual address (each HPTE could be at a different virtual
address).
The difficulty comes when we no longer have the HPTE but we
potentially have a stale TLB entry, due to having used tlbiel when we
removed the HPTE. Without the HPTE the only way to get rid of the
stale TLB entry would be to completely flush all the TLB entries for
the guest's LPID on every physical CPU it had ever run on. Since I
don't want to go to that much effort, what I am proposing, and what
this patch implements, is to not ever use tlbiel when removing HPTEs
in SMP guests on POWER7.
In other words, what this patch is about is making sure we don't get
these troublesome stale TLB entries.
Paul.
* Re: [PATCH 1/5] KVM: PPC: Book3S HV: Handle guest-caused machine checks on POWER7 without panicking
2012-11-26 21:33 ` Paul Mackerras
@ 2012-11-26 21:55 ` Alexander Graf
2012-11-26 22:03 ` Alexander Graf
0 siblings, 1 reply; 33+ messages in thread
From: Alexander Graf @ 2012-11-26 21:55 UTC (permalink / raw)
To: Paul Mackerras; +Cc: kvm-ppc, kvm
On 26.11.2012, at 22:33, Paul Mackerras wrote:
> On Mon, Nov 26, 2012 at 02:15:16PM +0100, Alexander Graf wrote:
>>
>> On 23.11.2012, at 22:42, Paul Mackerras wrote:
>>
>>> On Fri, Nov 23, 2012 at 03:13:09PM +0100, Alexander Graf wrote:
>>>>
>>>> On 22.11.2012, at 10:25, Paul Mackerras wrote:
>>>>
>>>>> + /* Do they have an SLB shadow buffer registered? */
>>>>> + slb = vcpu->arch.slb_shadow.pinned_addr;
>>>>> + if (!slb)
>>>>> + return;
>>>>
>>>> Mind to explain this case? What happens here? Do we leave the guest with an empty SLB? Why would this ever happen? What happens next as soon as we go back into the guest?
>>>
>>> Yes, we leave the guest with an empty SLB, the access gets retried and
>>> this time the guest gets an SLB miss interrupt, which it can hopefully
>>> handle using an SLB miss handler that runs entirely in real mode.
>>> This could happen for instance while the guest is in SLOF or yaboot or
>>> some other code that runs basically in real mode but occasionally
>>> turns the MMU on for some accesses, and happens to have a bug where it
>>> creates a duplicate SLB entry.
>>
>> Is this what pHyp does? Also, is this what we want? Why don't we populate an #MC into the guest so it knows it did something wrong?
>
> Yes, yes and we do. Anytime we get a machine check while in the guest
> we give the guest a machine check interrupt.
>
> Ultimately we want to implement the "FWNMI" (Firmware-assisted NMI)
> thing defined in PAPR which makes the handling of system reset and
> machine check slightly nicer for the guest, but that's for later. It
> will build on top of the stuff in this patch.
So why would the function return 1 then which means "MC is handled, forget about it" rather than 0, which means "inject MC into the guest"?
Alex
* Re: [PATCH 3/5] KVM: PPC: Book3S HV: Improve handling of local vs. global TLB invalidations
2012-11-26 21:48 ` Paul Mackerras
@ 2012-11-26 22:03 ` Alexander Graf
2012-11-26 23:16 ` Paul Mackerras
0 siblings, 1 reply; 33+ messages in thread
From: Alexander Graf @ 2012-11-26 22:03 UTC (permalink / raw)
To: Paul Mackerras; +Cc: kvm-ppc, kvm
On 26.11.2012, at 22:48, Paul Mackerras wrote:
> On Mon, Nov 26, 2012 at 02:10:33PM +0100, Alexander Graf wrote:
>>
>> On 23.11.2012, at 23:07, Paul Mackerras wrote:
>>
>>> On Fri, Nov 23, 2012 at 04:43:03PM +0100, Alexander Graf wrote:
>>>>
>>>> On 22.11.2012, at 10:28, Paul Mackerras wrote:
>>>>
>>>>> - With the possibility of the host paging out guest pages, the use of
>>>>> H_LOCAL by an SMP guest is dangerous since the guest could possibly
>>>>> retain and use a stale TLB entry pointing to a page that had been
>>>>> removed from the guest.
>>>>
>>>> I don't understand this part. Don't we flush the TLB when the page gets evicted from the shadow HTAB?
>>>
>>> The H_LOCAL flag is something that we invented to allow the guest to
>>> tell the host "I only ever used this translation (HPTE) on the current
>>> vcpu" when it's removing or modifying an HPTE. The idea is that that
>>> would then let the host use the tlbiel instruction (local TLB
>>> invalidate) rather than the usual global tlbie instruction. Tlbiel is
>>> faster because it doesn't need to go out on the fabric and get
>>> processed by all cpus. In fact our guests don't use it at present,
>>> but we put it in because we thought we should be able to get a
>>> performance improvement, particularly on large machines.
>>>
>>> However, the catch is that the guest's setting of H_LOCAL might be
>>> incorrect, in which case we could have a stale TLB entry on another
>>> physical cpu. While the physical page that it refers to is still
>>> owned by the guest, that stale entry doesn't matter from the host's
>>> point of view. But if the host wants to take that page away from the
>>> guest, the stale entry becomes a problem.
>>
>> That's exactly where my question lies. Does that mean we don't flush the TLB entry regardless when we take the page away from the guest?
>
> The question is how to find the TLB entry if the HPTE it came from is
> no longer present. Flushing a TLB entry requires a virtual address.
> When we're taking a page away from the guest we have the real address
> of the page, not the virtual address. We can use the reverse-mapping
> chains to loop through all the HPTEs that map the page, and from each
> HPTE we can (and do) calculate a virtual address and do a TLBIE on
> that virtual address (each HPTE could be at a different virtual
> address).
>
> The difficulty comes when we no longer have the HPTE but we
> potentially have a stale TLB entry, due to having used tlbiel when we
> removed the HPTE. Without the HPTE the only way to get rid of the
> stale TLB entry would be to completely flush all the TLB entries for
> the guest's LPID on every physical CPU it had ever run on. Since I
> don't want to go to that much effort, what I am proposing, and what
> this patch implements, is to not ever use tlbiel when removing HPTEs
> in SMP guests on POWER7.
>
> In other words, what this patch is about is making sure we don't get
> these troublesome stale TLB entries.
I see. You could keep a list of to-be-flushed VAs around that you could skim through when taking a page away from the guest. That way you make the fast case fast (add/remove of page from the guest) and the slow path slow (paging).
But I'm fine with disallowing local flushes on remove completely for now. It would be nice to get performance data on how much this would be a net win though. There are certainly ways of keeping local flushes alive with the scheme above.
Thanks, applied to kvm-ppc-next.
Alex
* Re: [PATCH 1/5] KVM: PPC: Book3S HV: Handle guest-caused machine checks on POWER7 without panicking
2012-11-26 21:55 ` Alexander Graf
@ 2012-11-26 22:03 ` Alexander Graf
2012-11-26 23:11 ` Paul Mackerras
0 siblings, 1 reply; 33+ messages in thread
From: Alexander Graf @ 2012-11-26 22:03 UTC (permalink / raw)
To: Paul Mackerras; +Cc: kvm-ppc, kvm
On 26.11.2012, at 22:55, Alexander Graf wrote:
>
> On 26.11.2012, at 22:33, Paul Mackerras wrote:
>
>> On Mon, Nov 26, 2012 at 02:15:16PM +0100, Alexander Graf wrote:
>>>
>>> On 23.11.2012, at 22:42, Paul Mackerras wrote:
>>>
>>>> On Fri, Nov 23, 2012 at 03:13:09PM +0100, Alexander Graf wrote:
>>>>>
>>>>> On 22.11.2012, at 10:25, Paul Mackerras wrote:
>>>>>
>>>>>> + /* Do they have an SLB shadow buffer registered? */
>>>>>> + slb = vcpu->arch.slb_shadow.pinned_addr;
>>>>>> + if (!slb)
>>>>>> + return;
>>>>>
>>>>> Mind to explain this case? What happens here? Do we leave the guest with an empty SLB? Why would this ever happen? What happens next as soon as we go back into the guest?
>>>>
>>>> Yes, we leave the guest with an empty SLB, the access gets retried and
>>>> this time the guest gets an SLB miss interrupt, which it can hopefully
>>>> handle using an SLB miss handler that runs entirely in real mode.
>>>> This could happen for instance while the guest is in SLOF or yaboot or
>>>> some other code that runs basically in real mode but occasionally
>>>> turns the MMU on for some accesses, and happens to have a bug where it
>>>> creates a duplicate SLB entry.
>>>
>>> Is this what pHyp does? Also, is this what we want? Why don't we populate an #MC into the guest so it knows it did something wrong?
>>
>> Yes, yes and we do. Anytime we get a machine check while in the guest
>> we give the guest a machine check interrupt.
>>
>> Ultimately we want to implement the "FWNMI" (Firmware-assisted NMI)
>> thing defined in PAPR which makes the handling of system reset and
>> machine check slightly nicer for the guest, but that's for later. It
>> will build on top of the stuff in this patch.
>
> So why would the function return 1 then which means "MC is handled, forget about it" rather than 0, which means "inject MC into the guest"?
Oh wait - 1 means "have the host handle it". Let me check up the code again.
Alex
* Re: [PATCH 1/5] KVM: PPC: Book3S HV: Handle guest-caused machine checks on POWER7 without panicking
2012-11-26 22:03 ` Alexander Graf
@ 2012-11-26 23:11 ` Paul Mackerras
0 siblings, 0 replies; 33+ messages in thread
From: Paul Mackerras @ 2012-11-26 23:11 UTC (permalink / raw)
To: Alexander Graf; +Cc: kvm-ppc, kvm
On Mon, Nov 26, 2012 at 11:03:48PM +0100, Alexander Graf wrote:
>
> On 26.11.2012, at 22:55, Alexander Graf wrote:
>
> >
> > On 26.11.2012, at 22:33, Paul Mackerras wrote:
> >
> >> On Mon, Nov 26, 2012 at 02:15:16PM +0100, Alexander Graf wrote:
> >>>
> >>> On 23.11.2012, at 22:42, Paul Mackerras wrote:
> >>>
> >>>> On Fri, Nov 23, 2012 at 03:13:09PM +0100, Alexander Graf wrote:
> >>>>>
> >>>>> On 22.11.2012, at 10:25, Paul Mackerras wrote:
> >>>>>
> >>>>>> + /* Do they have an SLB shadow buffer registered? */
> >>>>>> + slb = vcpu->arch.slb_shadow.pinned_addr;
> >>>>>> + if (!slb)
> >>>>>> + return;
> >>>>>
> >>>>> Mind to explain this case? What happens here? Do we leave the guest with an empty SLB? Why would this ever happen? What happens next as soon as we go back into the guest?
> >>>>
> >>>> Yes, we leave the guest with an empty SLB, the access gets retried and
> >>>> this time the guest gets an SLB miss interrupt, which it can hopefully
> >>>> handle using an SLB miss handler that runs entirely in real mode.
> >>>> This could happen for instance while the guest is in SLOF or yaboot or
> >>>> some other code that runs basically in real mode but occasionally
> >>>> turns the MMU on for some accesses, and happens to have a bug where it
> >>>> creates a duplicate SLB entry.
> >>>
> >>> Is this what pHyp does? Also, is this what we want? Why don't we populate an #MC into the guest so it knows it did something wrong?
> >>
> >> Yes, yes and we do. Anytime we get a machine check while in the guest
> >> we give the guest a machine check interrupt.
> >>
> >> Ultimately we want to implement the "FWNMI" (Firmware-assisted NMI)
> >> thing defined in PAPR which makes the handling of system reset and
> >> machine check slightly nicer for the guest, but that's for later. It
> >> will build on top of the stuff in this patch.
> >
> > So why would the function return 1 then which means "MC is handled, forget about it" rather than 0, which means "inject MC into the guest"?
>
> Oh wait - 1 means "have the host handle it". Let me check up the code again.
1 means "the problem is fixed, now give the guest a machine check
interrupt".
0 means "exit the guest, have the host's MC handler look at it, then
give the guest a machine check". In this case the delivery of the MC
to the guest happens in kvmppc_handle_exit().
Paul.
* Re: [PATCH 3/5] KVM: PPC: Book3S HV: Improve handling of local vs. global TLB invalidations
2012-11-26 22:03 ` Alexander Graf
@ 2012-11-26 23:16 ` Paul Mackerras
2012-11-26 23:18 ` Alexander Graf
0 siblings, 1 reply; 33+ messages in thread
From: Paul Mackerras @ 2012-11-26 23:16 UTC (permalink / raw)
To: Alexander Graf; +Cc: kvm-ppc, kvm
On Mon, Nov 26, 2012 at 11:03:19PM +0100, Alexander Graf wrote:
>
> On 26.11.2012, at 22:48, Paul Mackerras wrote:
>
> > On Mon, Nov 26, 2012 at 02:10:33PM +0100, Alexander Graf wrote:
> >>
> >> On 23.11.2012, at 23:07, Paul Mackerras wrote:
> >>
> >>> On Fri, Nov 23, 2012 at 04:43:03PM +0100, Alexander Graf wrote:
> >>>>
> >>>> On 22.11.2012, at 10:28, Paul Mackerras wrote:
> >>>>
> >>>>> - With the possibility of the host paging out guest pages, the use of
> >>>>> H_LOCAL by an SMP guest is dangerous since the guest could possibly
> >>>>> retain and use a stale TLB entry pointing to a page that had been
> >>>>> removed from the guest.
> >>>>
> >>>> I don't understand this part. Don't we flush the TLB when the page gets evicted from the shadow HTAB?
> >>>
> >>> The H_LOCAL flag is something that we invented to allow the guest to
> >>> tell the host "I only ever used this translation (HPTE) on the current
> >>> vcpu" when it's removing or modifying an HPTE. The idea is that that
> >>> would then let the host use the tlbiel instruction (local TLB
> >>> invalidate) rather than the usual global tlbie instruction. Tlbiel is
> >>> faster because it doesn't need to go out on the fabric and get
> >>> processed by all cpus. In fact our guests don't use it at present,
> >>> but we put it in because we thought we should be able to get a
> >>> performance improvement, particularly on large machines.
> >>>
> >>> However, the catch is that the guest's setting of H_LOCAL might be
> >>> incorrect, in which case we could have a stale TLB entry on another
> >>> physical cpu. While the physical page that it refers to is still
> >>> owned by the guest, that stale entry doesn't matter from the host's
> >>> point of view. But if the host wants to take that page away from the
> >>> guest, the stale entry becomes a problem.
> >>
> >> That's exactly where my question lies. Does that mean we don't flush the TLB entry regardless when we take the page away from the guest?
> >
> > The question is how to find the TLB entry if the HPTE it came from is
> > no longer present. Flushing a TLB entry requires a virtual address.
> > When we're taking a page away from the guest we have the real address
> > of the page, not the virtual address. We can use the reverse-mapping
> > chains to loop through all the HPTEs that map the page, and from each
> > HPTE we can (and do) calculate a virtual address and do a TLBIE on
> > that virtual address (each HPTE could be at a different virtual
> > address).
> >
> > The difficulty comes when we no longer have the HPTE but we
> > potentially have a stale TLB entry, due to having used tlbiel when we
> > removed the HPTE. Without the HPTE the only way to get rid of the
> > stale TLB entry would be to completely flush all the TLB entries for
> > the guest's LPID on every physical CPU it had ever run on. Since I
> > don't want to go to that much effort, what I am proposing, and what
> > this patch implements, is to not ever use tlbiel when removing HPTEs
> > in SMP guests on POWER7.
> >
> > In other words, what this patch is about is making sure we don't get
> > these troublesome stale TLB entries.
>
> I see. You could keep a list of to-be-flushed VAs around that you could skim through when taking a page away from the guest. That way you make the fast case fast (add/remove of page from the guest) and the slow path slow (paging).
Yes, I thought about that, but the problem is that the list of VAs
could get arbitrarily long and take up a lot of host memory.
> But I'm fine with disallowing local flushes on remove completely for now. It would be nice to get performance data on how much this would be a net win though. There are certainly ways of keeping local flushes alive with the scheme above.
Yes, I definitely want to get some good performance data to see how
much of a win it would be, and if there is a good win, work out some
scheme to let us use the local flushes.
> Thanks, applied to kvm-ppc-next.
Thanks,
Paul.
* Re: [PATCH v2] KVM: PPC: Book3S HV: Handle guest-caused machine checks on POWER7 without panicking
2012-11-24 8:37 ` [PATCH v2] " Paul Mackerras
@ 2012-11-26 23:16 ` Alexander Graf
2012-11-26 23:18 ` Paul Mackerras
2012-12-22 14:09 ` [PATCH] KVM: PPC: Book3S HV: Fix compilation without CONFIG_PPC_POWERNV Andreas Schwab
1 sibling, 1 reply; 33+ messages in thread
From: Alexander Graf @ 2012-11-26 23:16 UTC (permalink / raw)
To: Paul Mackerras; +Cc: kvm-ppc, kvm
On 24.11.2012, at 09:37, Paul Mackerras wrote:
> Currently, if a machine check interrupt happens while we are in the
> guest, we exit the guest and call the host's machine check handler,
> which tends to cause the host to panic. Some machine checks can be
> triggered by the guest; for example, if the guest creates two entries
> in the SLB that map the same effective address, and then accesses that
> effective address, the CPU will take a machine check interrupt.
>
> To handle this better, when a machine check happens inside the guest,
> we call a new function, kvmppc_realmode_machine_check(), while still in
> real mode before exiting the guest. On POWER7, it handles the cases
> that the guest can trigger, either by flushing and reloading the SLB,
> or by flushing the TLB, and then it delivers the machine check interrupt
> directly to the guest without going back to the host. On POWER7, the
> OPAL firmware patches the machine check interrupt vector so that it
> gets control first, and it leaves behind its analysis of the situation
> in a structure pointed to by the opal_mc_evt field of the paca. The
> kvmppc_realmode_machine_check() function looks at this, and if OPAL
> reports that there was no error, or that it has handled the error, we
> also go straight back to the guest with a machine check. We have to
> deliver a machine check to the guest since the machine check interrupt
> might have trashed valid values in SRR0/1.
>
> If the machine check is one we can't handle in real mode, and one that
> OPAL hasn't already handled, or on PPC970, we exit the guest and call
> the host's machine check handler. We do this by jumping to the
> machine_check_fwnmi label, rather than absolute address 0x200, because
> we don't want to re-execute OPAL's handler on POWER7. On PPC970, the
> two are equivalent because address 0x200 just contains a branch.
>
> Then, if the host machine check handler decides that the system can
> continue executing, kvmppc_handle_exit() delivers a machine check
> interrupt to the guest -- once again to let the guest know that SRR0/1
> have been modified.
>
> Signed-off-by: Paul Mackerras <paulus@samba.org>
Thanks for the semantic explanations :). From that POV things are clear and good with me now. That leaves only checkpatch ;)
WARNING: please, no space before tabs
#142: FILE: arch/powerpc/kvm/book3s_hv_ras.c:21:
+#define SRR1_MC_IFETCH_SLBMULTI ^I3^I/* SLB multi-hit */$
WARNING: please, no space before tabs
#143: FILE: arch/powerpc/kvm/book3s_hv_ras.c:22:
+#define SRR1_MC_IFETCH_SLBPARMULTI ^I4^I/* SLB parity + multi-hit */$
WARNING: min() should probably be min_t(u32, slb->persistent, SLB_MIN_SIZE)
#168: FILE: arch/powerpc/kvm/book3s_hv_ras.c:47:
+ n = min(slb->persistent, (u32) SLB_MIN_SIZE);
total: 0 errors, 3 warnings, 357 lines checked
Alex
* Re: [PATCH v2] KVM: PPC: Book3S HV: Handle guest-caused machine checks on POWER7 without panicking
2012-11-26 23:16 ` Alexander Graf
@ 2012-11-26 23:18 ` Paul Mackerras
2012-11-26 23:20 ` Alexander Graf
0 siblings, 1 reply; 33+ messages in thread
From: Paul Mackerras @ 2012-11-26 23:18 UTC (permalink / raw)
To: Alexander Graf; +Cc: kvm-ppc, kvm
On Tue, Nov 27, 2012 at 12:16:28AM +0100, Alexander Graf wrote:
>
> On 24.11.2012, at 09:37, Paul Mackerras wrote:
>
> > Currently, if a machine check interrupt happens while we are in the
> > guest, we exit the guest and call the host's machine check handler,
> > which tends to cause the host to panic. Some machine checks can be
> > triggered by the guest; for example, if the guest creates two entries
> > in the SLB that map the same effective address, and then accesses that
> > effective address, the CPU will take a machine check interrupt.
> >
> > To handle this better, when a machine check happens inside the guest,
> > we call a new function, kvmppc_realmode_machine_check(), while still in
> > real mode before exiting the guest. On POWER7, it handles the cases
> > that the guest can trigger, either by flushing and reloading the SLB,
> > or by flushing the TLB, and then it delivers the machine check interrupt
> > directly to the guest without going back to the host. On POWER7, the
> > OPAL firmware patches the machine check interrupt vector so that it
> > gets control first, and it leaves behind its analysis of the situation
> > in a structure pointed to by the opal_mc_evt field of the paca. The
> > kvmppc_realmode_machine_check() function looks at this, and if OPAL
> > reports that there was no error, or that it has handled the error, we
> > also go straight back to the guest with a machine check. We have to
> > deliver a machine check to the guest since the machine check interrupt
> > might have trashed valid values in SRR0/1.
> >
> > If the machine check is one we can't handle in real mode, and one that
> > OPAL hasn't already handled, or on PPC970, we exit the guest and call
> > the host's machine check handler. We do this by jumping to the
> > machine_check_fwnmi label, rather than absolute address 0x200, because
> > we don't want to re-execute OPAL's handler on POWER7. On PPC970, the
> > two are equivalent because address 0x200 just contains a branch.
> >
> > Then, if the host machine check handler decides that the system can
> > continue executing, kvmppc_handle_exit() delivers a machine check
> > interrupt to the guest -- once again to let the guest know that SRR0/1
> > have been modified.
> >
> > Signed-off-by: Paul Mackerras <paulus@samba.org>
>
> Thanks for the semantic explanations :). From that POV things are clear and good with me now. That leaves only checkpatch ;)
>
>
> WARNING: please, no space before tabs
> #142: FILE: arch/powerpc/kvm/book3s_hv_ras.c:21:
> +#define SRR1_MC_IFETCH_SLBMULTI ^I3^I/* SLB multi-hit */$
>
> WARNING: please, no space before tabs
> #143: FILE: arch/powerpc/kvm/book3s_hv_ras.c:22:
> +#define SRR1_MC_IFETCH_SLBPARMULTI ^I4^I/* SLB parity + multi-hit */$
>
> WARNING: min() should probably be min_t(u32, slb->persistent, SLB_MIN_SIZE)
> #168: FILE: arch/powerpc/kvm/book3s_hv_ras.c:47:
> + n = min(slb->persistent, (u32) SLB_MIN_SIZE);
>
> total: 0 errors, 3 warnings, 357 lines checked
Phooey. Do you want me to resubmit the patch, or will you fix it up?
Paul.
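The real-mode dispatch described in the commit message above can be sketched as a small decision function. This is a hypothetical model for illustration only: the struct fields, helper name, and control flow are assumptions based on the commit message, not the actual code in arch/powerpc/kvm/book3s_hv_ras.c.

```c
#include <assert.h>

/* Much-simplified model of the real-mode machine-check path. */
struct mc_state {
	int slb_multihit;	/* guest-triggerable: duplicate SLB entries */
	int tlb_error;		/* guest-triggerable: bad TLB state */
	int opal_recovered;	/* OPAL's verdict left in paca->opal_mc_evt */
};

/* Nonzero: deliver the machine check straight to the guest, since the
 * interrupt may have trashed valid values in SRR0/1.  Zero: exit the
 * guest and jump to the host's machine_check_fwnmi label (not absolute
 * address 0x200, so OPAL's handler is not re-executed on POWER7). */
static long realmode_machine_check(const struct mc_state *mc)
{
	long handled = 0;

	if (mc->slb_multihit)		/* fix: flush and reload the SLB */
		handled = 1;
	else if (mc->tlb_error)		/* fix: flush the TLB */
		handled = 1;

	if (mc->opal_recovered)		/* OPAL reports no error, or handled */
		handled = 1;

	return handled;
}
```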
* Re: [PATCH 3/5] KVM: PPC: Book3S HV: Improve handling of local vs. global TLB invalidations
2012-11-26 23:16 ` Paul Mackerras
@ 2012-11-26 23:18 ` Alexander Graf
0 siblings, 0 replies; 33+ messages in thread
From: Alexander Graf @ 2012-11-26 23:18 UTC (permalink / raw)
To: Paul Mackerras; +Cc: kvm-ppc, kvm
On 27.11.2012, at 00:16, Paul Mackerras wrote:
> On Mon, Nov 26, 2012 at 11:03:19PM +0100, Alexander Graf wrote:
>>
>> On 26.11.2012, at 22:48, Paul Mackerras wrote:
>>
>>> On Mon, Nov 26, 2012 at 02:10:33PM +0100, Alexander Graf wrote:
>>>>
>>>> On 23.11.2012, at 23:07, Paul Mackerras wrote:
>>>>
>>>>> On Fri, Nov 23, 2012 at 04:43:03PM +0100, Alexander Graf wrote:
>>>>>>
>>>>>> On 22.11.2012, at 10:28, Paul Mackerras wrote:
>>>>>>
>>>>>>> - With the possibility of the host paging out guest pages, the use of
>>>>>>> H_LOCAL by an SMP guest is dangerous since the guest could possibly
>>>>>>> retain and use a stale TLB entry pointing to a page that had been
>>>>>>> removed from the guest.
>>>>>>
>>>>>> I don't understand this part. Don't we flush the TLB when the page gets evicted from the shadow HTAB?
>>>>>
>>>>> The H_LOCAL flag is something that we invented to allow the guest to
>>>>> tell the host "I only ever used this translation (HPTE) on the current
>>>>> vcpu" when it's removing or modifying an HPTE. The idea is that that
>>>>> would then let the host use the tlbiel instruction (local TLB
>>>>> invalidate) rather than the usual global tlbie instruction. Tlbiel is
>>>>> faster because it doesn't need to go out on the fabric and get
>>>>> processed by all cpus. In fact our guests don't use it at present,
>>>>> but we put it in because we thought we should be able to get a
>>>>> performance improvement, particularly on large machines.
>>>>>
>>>>> However, the catch is that the guest's setting of H_LOCAL might be
>>>>> incorrect, in which case we could have a stale TLB entry on another
>>>>> physical cpu. While the physical page that it refers to is still
>>>>> owned by the guest, that stale entry doesn't matter from the host's
>>>>> point of view. But if the host wants to take that page away from the
>>>>> guest, the stale entry becomes a problem.
>>>>
>>>> That's exactly where my question lies. Does that mean we don't flush the TLB entry regardless when we take the page away from the guest?
>>>
>>> The question is how to find the TLB entry if the HPTE it came from is
>>> no longer present. Flushing a TLB entry requires a virtual address.
>>> When we're taking a page away from the guest we have the real address
>>> of the page, not the virtual address. We can use the reverse-mapping
>>> chains to loop through all the HPTEs that map the page, and from each
>>> HPTE we can (and do) calculate a virtual address and do a TLBIE on
>>> that virtual address (each HPTE could be at a different virtual
>>> address).
>>>
>>> The difficulty comes when we no longer have the HPTE but we
>>> potentially have a stale TLB entry, due to having used tlbiel when we
>>> removed the HPTE. Without the HPTE the only way to get rid of the
>>> stale TLB entry would be to completely flush all the TLB entries for
>>> the guest's LPID on every physical CPU it had ever run on. Since I
>>> don't want to go to that much effort, what I am proposing, and what
>>> this patch implements, is to not ever use tlbiel when removing HPTEs
>>> in SMP guests on POWER7.
>>>
>>> In other words, what this patch is about is making sure we don't get
>>> these troublesome stale TLB entries.
>>
>> I see. You could keep a list of to-be-flushed VAs around that you could skim through when taking a page away from the guest. That way you make the fast case fast (add/remove of page from the guest) and the slow path slow (paging).
>
> Yes, I thought about that, but the problem is that the list of VAs
> could get arbitrarily long and take up a lot of host memory.
You can always cap it at an arbitrary number, similar to how the TLB itself is limited.
Alex
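The policy that emerges from this exchange can be summarized in a small decision function: the host cannot trust a guest's H_LOCAL hint for an SMP guest on POWER7, because a wrong hint plus tlbiel would leave a stale TLB entry on another cpu with no HPTE left from which to compute its virtual address. This sketch is illustrative only; the names and the exact condition are assumptions based on the discussion, not the patch itself.

```c
#include <assert.h>

enum tlb_op { TLBIE_GLOBAL, TLBIEL_LOCAL };

/* Illustrative model of the invalidation choice when removing an HPTE. */
static enum tlb_op choose_invalidate(int guest_hints_local,
				     int smp_guest, int on_power7)
{
	/* Honor H_LOCAL only when a wrong hint cannot strand a stale
	 * entry on another physical cpu. */
	if (guest_hints_local && !(smp_guest && on_power7))
		return TLBIEL_LOCAL;	/* fast: no broadcast on the fabric */
	return TLBIE_GLOBAL;		/* safe: processed by all cpus */
}
```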
* Re: [PATCH v2] KVM: PPC: Book3S HV: Handle guest-caused machine checks on POWER7 without panicking
2012-11-26 23:18 ` Paul Mackerras
@ 2012-11-26 23:20 ` Alexander Graf
2012-11-27 0:20 ` Paul Mackerras
0 siblings, 1 reply; 33+ messages in thread
From: Alexander Graf @ 2012-11-26 23:20 UTC (permalink / raw)
To: Paul Mackerras; +Cc: kvm-ppc, kvm
On 27.11.2012, at 00:18, Paul Mackerras wrote:
> On Tue, Nov 27, 2012 at 12:16:28AM +0100, Alexander Graf wrote:
>>
>> On 24.11.2012, at 09:37, Paul Mackerras wrote:
>>
>>> [...]
>
> Phooey. Do you want me to resubmit the patch, or will you fix it up?
Hrm. Promise to run checkpatch yourself next time and I'll fix it up for you this time ;)
Alex
* Re: [PATCH v2] KVM: PPC: Book3S HV: Handle guest-caused machine checks on POWER7 without panicking
2012-11-26 23:20 ` Alexander Graf
@ 2012-11-27 0:20 ` Paul Mackerras
0 siblings, 0 replies; 33+ messages in thread
From: Paul Mackerras @ 2012-11-27 0:20 UTC (permalink / raw)
To: Alexander Graf; +Cc: kvm-ppc, kvm
On Tue, Nov 27, 2012 at 12:20:08AM +0100, Alexander Graf wrote:
> Hrm. Promise to run checkpatch yourself next time and I'll fix it up for you this time ;)
OK, will do, thanks. :)
Paul.
* [PATCH] KVM: PPC: Book3S HV: Fix compilation without CONFIG_PPC_POWERNV
2012-11-24 8:37 ` [PATCH v2] " Paul Mackerras
2012-11-26 23:16 ` Alexander Graf
@ 2012-12-22 14:09 ` Andreas Schwab
2013-01-06 13:05 ` Alexander Graf
1 sibling, 1 reply; 33+ messages in thread
From: Andreas Schwab @ 2012-12-22 14:09 UTC (permalink / raw)
To: Paul Mackerras; +Cc: Alexander Graf, kvm-ppc, kvm
Fixes this build breakage:
arch/powerpc/kvm/book3s_hv_ras.c: In function ‘kvmppc_realmode_mc_power7’:
arch/powerpc/kvm/book3s_hv_ras.c:126:23: error: ‘struct paca_struct’ has no member named ‘opal_mc_evt’
Signed-off-by: Andreas Schwab <schwab@linux-m68k.org>
---
arch/powerpc/kvm/book3s_hv_ras.c | 4 ++++
1 file changed, 4 insertions(+)
diff --git a/arch/powerpc/kvm/book3s_hv_ras.c b/arch/powerpc/kvm/book3s_hv_ras.c
index 35f3cf0..a353c48 100644
--- a/arch/powerpc/kvm/book3s_hv_ras.c
+++ b/arch/powerpc/kvm/book3s_hv_ras.c
@@ -79,7 +79,9 @@ static void flush_tlb_power7(struct kvm_vcpu *vcpu)
static long kvmppc_realmode_mc_power7(struct kvm_vcpu *vcpu)
{
unsigned long srr1 = vcpu->arch.shregs.msr;
+#ifdef CONFIG_PPC_POWERNV
struct opal_machine_check_event *opal_evt;
+#endif
long handled = 1;
if (srr1 & SRR1_MC_LDSTERR) {
@@ -117,6 +119,7 @@ static long kvmppc_realmode_mc_power7(struct kvm_vcpu *vcpu)
handled = 0;
}
+#ifdef CONFIG_PPC_POWERNV
/*
* See if OPAL has already handled the condition.
* We assume that if the condition is recovered then OPAL
@@ -131,6 +134,7 @@ static long kvmppc_realmode_mc_power7(struct kvm_vcpu *vcpu)
if (handled)
opal_evt->in_use = 0;
+#endif
return handled;
}
--
1.8.0.2
--
Andreas Schwab, schwab@linux-m68k.org
GPG Key fingerprint = 58CA 54C7 6D53 942B 1756 01D3 44D5 214B 8276 4ED5
"And now for something completely different."
* Re: [PATCH] KVM: PPC: Book3S HV: Fix compilation without CONFIG_PPC_POWERNV
2012-12-22 14:09 ` [PATCH] KVM: PPC: Book3S HV: Fix compilation without CONFIG_PPC_POWERNV Andreas Schwab
@ 2013-01-06 13:05 ` Alexander Graf
0 siblings, 0 replies; 33+ messages in thread
From: Alexander Graf @ 2013-01-06 13:05 UTC (permalink / raw)
To: Andreas Schwab; +Cc: Paul Mackerras, kvm-ppc, kvm
On 22.12.2012, at 15:09, Andreas Schwab wrote:
> Fixes this build breakage:
>
> arch/powerpc/kvm/book3s_hv_ras.c: In function ‘kvmppc_realmode_mc_power7’:
> arch/powerpc/kvm/book3s_hv_ras.c:126:23: error: ‘struct paca_struct’ has no member named ‘opal_mc_evt’
>
> Signed-off-by: Andreas Schwab <schwab@linux-m68k.org>
Thanks, applied to kvm-ppc-3.8.
Paul, in the long run it might make sense to make HV KVM conditional on OPAL and just completely remove 970 support. At least if it makes your life significantly easier.
Alex
Thread overview: 33+ messages
2012-11-22 9:24 [PATCH 0/5] KVM: PPC: Fix various bugs and vulnerabilities in HV KVM Paul Mackerras
2012-11-22 9:25 ` [PATCH 1/5] KVM: PPC: Book3S HV: Handle guest-caused machine checks on POWER7 without panicking Paul Mackerras
2012-11-23 14:13 ` Alexander Graf
2012-11-23 21:42 ` Paul Mackerras
2012-11-26 13:15 ` Alexander Graf
2012-11-26 21:33 ` Paul Mackerras
2012-11-26 21:55 ` Alexander Graf
2012-11-26 22:03 ` Alexander Graf
2012-11-26 23:11 ` Paul Mackerras
2012-11-24 8:37 ` [PATCH v2] " Paul Mackerras
2012-11-26 23:16 ` Alexander Graf
2012-11-26 23:18 ` Paul Mackerras
2012-11-26 23:20 ` Alexander Graf
2012-11-27 0:20 ` Paul Mackerras
2012-12-22 14:09 ` [PATCH] KVM: PPC: Book3S HV: Fix compilation without CONFIG_PPC_POWERNV Andreas Schwab
2013-01-06 13:05 ` Alexander Graf
2012-11-22 9:27 ` [PATCH 2/5] KVM: PPC: Book3S HV: Reset reverse-map chains when resetting the HPT Paul Mackerras
2012-11-22 9:28 ` [PATCH 3/5] KVM: PPC: Book3S HV: Improve handling of local vs. global TLB invalidations Paul Mackerras
2012-11-23 15:43 ` Alexander Graf
2012-11-23 22:07 ` Paul Mackerras
2012-11-26 13:10 ` Alexander Graf
2012-11-26 21:48 ` Paul Mackerras
2012-11-26 22:03 ` Alexander Graf
2012-11-26 23:16 ` Paul Mackerras
2012-11-26 23:18 ` Alexander Graf
2012-11-22 9:28 ` [PATCH 4/5] KVM: PPC: Book3S HV: Don't give the guest RW access to RO pages Paul Mackerras
2012-11-23 15:47 ` Alexander Graf
2012-11-23 22:13 ` Paul Mackerras
2012-11-24 9:05 ` Alexander Graf
2012-11-24 9:32 ` Paul Mackerras
2012-11-26 13:09 ` Alexander Graf
2012-11-22 9:29 ` [PATCH 5/5] KVM: PPC: Book3S HV: Report correct HPT entry index when reading HPT Paul Mackerras
2012-11-23 15:48 ` [PATCH 0/5] KVM: PPC: Fix various bugs and vulnerabilities in HV KVM Alexander Graf