[PATCH v5 5/5] KVM: PPC: e500: MMU API

* [PATCH v5 5/5] KVM: PPC: e500: MMU API
@ 2011-07-07 23:41 Scott Wood
  2011-07-08 12:57 ` Alexander Graf
                   ` (17 more replies)
  0 siblings, 18 replies; 19+ messages in thread
From: Scott Wood @ 2011-07-07 23:41 UTC
  To: kvm-ppc

This implements a shared-memory API for giving host userspace access to
the guest's TLB.

Signed-off-by: Scott Wood <scottwood@freescale.com>
---
v5:
 - respin on top of fixes
 - remove unused kvm_dump_tlbs() now that there's another way to get the
   data
 - clarify in the documentation that even though hardware ignores tsize
   on a fixed-size array, KVM wants it to be set properly in the shared
   array.
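
For readers of the documentation hunk in this patch, the shared array
ordering and the dirty bitmap layout can be sketched from the userspace
side roughly as follows.  This is only an illustration of the rules stated
in the api.txt text, not code from the patch: the helper names (tlb0_set,
shared_index, dirty_bitmap_bytes, mark_dirty) are hypothetical, and the
sketch assumes tlb0_size and tlb0_ways already satisfy the constraints
that KVM_CONFIG_TLB checks (power-of-two ways and sets).

```c
#include <stddef.h>
#include <stdint.h>

/*
 * Hypothetical userspace helpers; only the layout rules are taken from
 * the api.txt text, the names are made up for this sketch.
 */

/* Set number of a TLB0 entry: (MAS2 >> 12) & (num_sets - 1). */
static uint32_t tlb0_set(uint64_t mas2, uint32_t tlb0_size, uint32_t tlb0_ways)
{
	uint32_t num_sets = tlb0_size / tlb0_ways;

	return (uint32_t)(mas2 >> 12) & (num_sets - 1);
}

/*
 * Index into the shared array.  TLB0 entries come first, ordered by set
 * and then by way (ESEL); TLB1 entries follow all of TLB0.
 */
static uint32_t shared_index(int tlbsel, uint64_t mas2, uint32_t esel,
			     uint32_t tlb0_size, uint32_t tlb0_ways)
{
	if (tlbsel == 0)
		return tlb0_set(mas2, tlb0_size, tlb0_ways) * tlb0_ways + esel;

	return tlb0_size + esel;
}

/* One bit per entry, rounded up to a multiple of 64 bits; result in bytes. */
static size_t dirty_bitmap_bytes(uint32_t total_entries)
{
	return ((size_t)total_entries + 63) / 64 * 8;
}

/*
 * Set the bit for one entry.  Bit 0 is the least significant bit of the
 * first byte, bit 8 the least significant bit of the second byte.
 * Returns 1 if the bit was newly set, so callers can maintain num_dirty.
 */
static int mark_dirty(uint8_t *bitmap, uint32_t index)
{
	uint8_t mask = (uint8_t)(1u << (index % 8));
	int newly_set = !(bitmap[index / 8] & mask);

	bitmap[index / 8] |= mask;
	return newly_set;
}
```

Userspace would add up the calls to mark_dirty() that return 1 and pass
that total as num_dirty, since the documentation requires it to equal the
number of set bits in the bitmap.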

 Documentation/virtual/kvm/api.txt   |   86 ++++++++-
 arch/powerpc/include/asm/kvm.h      |   35 ++++
 arch/powerpc/include/asm/kvm_e500.h |   24 ++--
 arch/powerpc/include/asm/kvm_ppc.h  |    7 +
 arch/powerpc/kvm/e500.c             |    5 +-
 arch/powerpc/kvm/e500_emulate.c     |   12 +-
 arch/powerpc/kvm/e500_tlb.c         |  372 ++++++++++++++++++++++++----------
 arch/powerpc/kvm/e500_tlb.h         |   38 ++--
 arch/powerpc/kvm/powerpc.c          |   28 +++
 include/linux/kvm.h                 |   19 ++
 10 files changed, 473 insertions(+), 153 deletions(-)

diff --git a/Documentation/virtual/kvm/api.txt b/Documentation/virtual/kvm/api.txt
index b251136..31df5b0 100644
--- a/Documentation/virtual/kvm/api.txt
+++ b/Documentation/virtual/kvm/api.txt
@@ -1265,7 +1265,7 @@ struct kvm_assigned_msix_entry {
 	__u16 padding[3];
 };
 
-4.54 KVM_SET_TSC_KHZ
+4.55 KVM_SET_TSC_KHZ
 
 Capability: KVM_CAP_TSC_CONTROL
 Architectures: x86
@@ -1276,7 +1276,7 @@ Returns: 0 on success, -1 on error
 Specifies the tsc frequency for the virtual machine. The unit of the
 frequency is KHz.
 
-4.55 KVM_GET_TSC_KHZ
+4.56 KVM_GET_TSC_KHZ
 
 Capability: KVM_CAP_GET_TSC_KHZ
 Architectures: x86
@@ -1288,7 +1288,7 @@ Returns the tsc frequency of the guest. The unit of the return value is
 KHz. If the host has unstable tsc this ioctl returns -EIO instead as an
 error.
 
-4.56 KVM_GET_LAPIC
+4.57 KVM_GET_LAPIC
 
 Capability: KVM_CAP_IRQCHIP
 Architectures: x86
@@ -1304,7 +1304,7 @@ struct kvm_lapic_state {
 Reads the Local APIC registers and copies them into the input argument.  The
 data format and layout are the same as documented in the architecture manual.
 
-4.57 KVM_SET_LAPIC
+4.58 KVM_SET_LAPIC
 
 Capability: KVM_CAP_IRQCHIP
 Architectures: x86
@@ -1320,7 +1320,7 @@ struct kvm_lapic_state {
 Copies the input argument into the the Local APIC registers.  The data format
 and layout are the same as documented in the architecture manual.
 
-4.58 KVM_IOEVENTFD
+4.59 KVM_IOEVENTFD
 
 Capability: KVM_CAP_IOEVENTFD
 Architectures: all
@@ -1350,6 +1350,82 @@ The following flags are defined:
 If datamatch flag is set, the event will be signaled only if the written value
 to the registered address is equal to datamatch in struct kvm_ioeventfd.
 
+4.60 KVM_CONFIG_TLB
+
+Capability: KVM_CAP_SW_TLB
+Architectures: ppc
+Type: vcpu ioctl
+Parameters: struct kvm_config_tlb (in)
+Returns: 0 on success, -1 on error
+
+struct kvm_config_tlb {
+	__u64 params;
+	__u64 array;
+	__u32 mmu_type;
+	__u32 array_len;
+};
+
+Configures the virtual CPU's TLB array, establishing a shared memory area
+between userspace and KVM.  The "params" and "array" fields are userspace
+addresses of mmu-type-specific data structures.  The "array_len" field is a
+safety mechanism, and should be set to the size in bytes of the memory that
+userspace has reserved for the array.  It must be at least the size dictated
+by "mmu_type" and "params".
+
+While KVM_RUN is active, the shared region is under control of KVM.  Its
+contents are undefined, and any modification by userspace results in
+boundedly undefined behavior.
+
+On return from KVM_RUN, the shared region will reflect the current state of
+the guest's TLB.  If userspace makes any changes, it must call KVM_DIRTY_TLB
+to tell KVM which entries have been changed, prior to calling KVM_RUN again
+on this vcpu.
+
+For mmu types KVM_MMU_FSL_BOOKE_NOHV and KVM_MMU_FSL_BOOKE_HV:
+ - The "params" field is of type "struct kvm_book3e_206_tlb_params".
+ - The "array" field points to an array of type "struct
+   kvm_book3e_206_tlb_entry".
+ - The array consists of all entries in the first TLB, followed by all
+   entries in the second TLB.
+ - Within a TLB, entries are ordered first by increasing set number.  Within a
+   set, entries are ordered by way (increasing ESEL).
+ - The hash for determining set number in TLB0 is: (MAS2 >> 12) & (num_sets - 1)
+   where "num_sets" is the tlb_sizes[] value divided by the tlb_ways[] value.
+ - The tsize field of mas1 shall be set to 4K on TLB0, even though the
+   hardware ignores this value for TLB0.
+
+4.61 KVM_DIRTY_TLB
+
+Capability: KVM_CAP_SW_TLB
+Architectures: ppc
+Type: vcpu ioctl
+Parameters: struct kvm_dirty_tlb (in)
+Returns: 0 on success, -1 on error
+
+struct kvm_dirty_tlb {
+	__u64 bitmap;
+	__u32 num_dirty;
+};
+
+This must be called whenever userspace has changed an entry in the shared
+TLB, prior to calling KVM_RUN on the associated vcpu.
+
+The "bitmap" field is the userspace address of an array.  This array
+consists of a number of bits, equal to the total number of TLB entries as
+determined by the last successful call to KVM_CONFIG_TLB, rounded up to the
+nearest multiple of 64.
+
+Each bit corresponds to one TLB entry, ordered the same as in the shared TLB
+array.
+
+The array is little-endian: bit 0 is the least significant bit of the
+first byte, bit 8 is the least significant bit of the second byte, etc.
+This avoids any complications with differing word sizes.
+
+The "num_dirty" field is a performance hint for KVM to determine whether it
+should skip processing the bitmap and just invalidate everything.  It must
+be set to the number of set bits in the bitmap.
+
 5. The kvm_run structure
 
 Application code obtains a pointer to the kvm_run structure by
diff --git a/arch/powerpc/include/asm/kvm.h b/arch/powerpc/include/asm/kvm.h
index d2ca5ed..1a6dedf 100644
--- a/arch/powerpc/include/asm/kvm.h
+++ b/arch/powerpc/include/asm/kvm.h
@@ -272,4 +272,39 @@ struct kvm_guest_debug_arch {
 #define KVM_INTERRUPT_UNSET	-2U
 #define KVM_INTERRUPT_SET_LEVEL	-3U
 
+struct kvm_book3e_206_tlb_entry {
+	__u32 mas8;
+	__u32 mas1;
+	__u64 mas2;
+	__u64 mas7_3;
+};
+
+struct kvm_book3e_206_tlb_params {
+	/*
+	 * For mmu types KVM_MMU_FSL_BOOKE_NOHV and KVM_MMU_FSL_BOOKE_HV:
+	 *
+	 * - The number of ways of TLB0 must be a power of two between 2 and
+	 *   16.
+	 * - TLB1 must be fully associative.
+	 * - The size of TLB0 must be a multiple of the number of ways, and
+	 *   the number of sets must be a power of two.
+	 * - The size of TLB1 may not exceed 64 entries.
+	 * - TLB0 supports 4 KiB pages.
+	 * - The page sizes supported by TLB1 are as indicated by
+	 *   TLB1CFG (if MMUCFG[MAVN] = 0) or TLB1PS (if MMUCFG[MAVN] = 1)
+	 *   as returned by KVM_GET_SREGS.
+	 * - TLB2 and TLB3 are reserved, and their entries in tlb_sizes[]
+	 *   and tlb_ways[] must be zero.
+	 *
+	 * tlb_ways[n] = tlb_sizes[n] means the array is fully associative.
+	 *
+	 * KVM will adjust TLBnCFG based on the sizes configured here,
+	 * though arrays greater than 2048 entries will have TLBnCFG[NENTRY]
+	 * set to zero.
+	 */
+	__u32 tlb_sizes[4];
+	__u32 tlb_ways[4];
+	__u32 reserved[8];
+};
+
 #endif /* __LINUX_KVM_POWERPC_H */
diff --git a/arch/powerpc/include/asm/kvm_e500.h b/arch/powerpc/include/asm/kvm_e500.h
index 87aab98..e1ac268 100644
--- a/arch/powerpc/include/asm/kvm_e500.h
+++ b/arch/powerpc/include/asm/kvm_e500.h
@@ -22,13 +22,6 @@
 #define E500_PID_NUM   3
 #define E500_TLB_NUM   2
 
-struct tlbe{
-	u32 mas1;
-	u32 mas2;
-	u32 mas3;
-	u32 mas7;
-};
-
 #define E500_TLB_VALID 1
 #define E500_TLB_DIRTY 2
 
@@ -44,8 +37,11 @@ struct tlbe_priv {
 struct vcpu_id_table;
 
 struct kvmppc_vcpu_e500 {
-	/* Unmodified copy of the guest's TLB. */
-	struct tlbe *gtlb_arch[E500_TLB_NUM];
+	/* Unmodified copy of the guest's TLB -- shared with host userspace. */
+	struct kvm_book3e_206_tlb_entry *gtlb_arch;
+
+	/* Starting entry number in gtlb_arch[] */
+	int gtlb_offset[E500_TLB_NUM];
 
 	/* KVM internal information associated with each guest TLB entry */
 	struct tlbe_priv *gtlb_priv[E500_TLB_NUM];
@@ -53,6 +49,9 @@ struct kvmppc_vcpu_e500 {
 	unsigned int gtlb_size[E500_TLB_NUM];
 	unsigned int gtlb_nv[E500_TLB_NUM];
 
+	unsigned int gtlb0_ways;
+	unsigned int gtlb0_sets;
+
 	/*
 	 * information associated with each host TLB entry --
 	 * TLB1 only for now.  If/when guest TLB1 entries can be
@@ -64,7 +63,6 @@ struct kvmppc_vcpu_e500 {
 	 * and back, and our host TLB entries got evicted).
 	 */
 	struct tlbe_ref *tlb_refs[E500_TLB_NUM];
-
 	unsigned int host_tlb1_nv;
 
 	u32 host_pid[E500_PID_NUM];
@@ -74,11 +72,10 @@ struct kvmppc_vcpu_e500 {
 	u32 mas0;
 	u32 mas1;
 	u32 mas2;
-	u32 mas3;
+	u64 mas7_3;
 	u32 mas4;
 	u32 mas5;
 	u32 mas6;
-	u32 mas7;
 
 	/* vcpu id table */
 	struct vcpu_id_table *idt;
@@ -91,6 +88,9 @@ struct kvmppc_vcpu_e500 {
 	u32 tlb1cfg;
 	u64 mcar;
 
+	struct page **shared_tlb_pages;
+	int num_shared_tlb_pages;
+
 	struct kvm_vcpu vcpu;
 };
 
diff --git a/arch/powerpc/include/asm/kvm_ppc.h b/arch/powerpc/include/asm/kvm_ppc.h
index c662f14..bb3d418 100644
--- a/arch/powerpc/include/asm/kvm_ppc.h
+++ b/arch/powerpc/include/asm/kvm_ppc.h
@@ -152,4 +152,11 @@ int kvmppc_set_sregs_ivor(struct kvm_vcpu *vcpu, struct kvm_sregs *sregs);
 
 void kvmppc_set_pid(struct kvm_vcpu *vcpu, u32 pid);
 
+int kvm_vcpu_ioctl_config_tlb(struct kvm_vcpu *vcpu,
+			      struct kvm_config_tlb *cfg);
+int kvm_vcpu_ioctl_dirty_tlb(struct kvm_vcpu *vcpu,
+			     struct kvm_dirty_tlb *cfg);
+
+void kvmppc_core_heavy_exit(struct kvm_vcpu *vcpu);
+
 #endif /* __POWERPC_KVM_PPC_H__ */
diff --git a/arch/powerpc/kvm/e500.c b/arch/powerpc/kvm/e500.c
index 797a744..b8f065c 100644
--- a/arch/powerpc/kvm/e500.c
+++ b/arch/powerpc/kvm/e500.c
@@ -118,7 +118,7 @@ void kvmppc_core_get_sregs(struct kvm_vcpu *vcpu, struct kvm_sregs *sregs)
 	sregs->u.e.mas0 = vcpu_e500->mas0;
 	sregs->u.e.mas1 = vcpu_e500->mas1;
 	sregs->u.e.mas2 = vcpu_e500->mas2;
-	sregs->u.e.mas7_3 = ((u64)vcpu_e500->mas7 << 32) | vcpu_e500->mas3;
+	sregs->u.e.mas7_3 = vcpu_e500->mas7_3;
 	sregs->u.e.mas4 = vcpu_e500->mas4;
 	sregs->u.e.mas6 = vcpu_e500->mas6;
 
@@ -151,8 +151,7 @@ int kvmppc_core_set_sregs(struct kvm_vcpu *vcpu, struct kvm_sregs *sregs)
 		vcpu_e500->mas0 = sregs->u.e.mas0;
 		vcpu_e500->mas1 = sregs->u.e.mas1;
 		vcpu_e500->mas2 = sregs->u.e.mas2;
-		vcpu_e500->mas7 = sregs->u.e.mas7_3 >> 32;
-		vcpu_e500->mas3 = (u32)sregs->u.e.mas7_3;
+		vcpu_e500->mas7_3 = sregs->u.e.mas7_3;
 		vcpu_e500->mas4 = sregs->u.e.mas4;
 		vcpu_e500->mas6 = sregs->u.e.mas6;
 	}
diff --git a/arch/powerpc/kvm/e500_emulate.c b/arch/powerpc/kvm/e500_emulate.c
index d48ae39..e0d3609 100644
--- a/arch/powerpc/kvm/e500_emulate.c
+++ b/arch/powerpc/kvm/e500_emulate.c
@@ -95,13 +95,17 @@ int kvmppc_core_emulate_mtspr(struct kvm_vcpu *vcpu, int sprn, int rs)
 	case SPRN_MAS2:
 		vcpu_e500->mas2 = spr_val; break;
 	case SPRN_MAS3:
-		vcpu_e500->mas3 = spr_val; break;
+		vcpu_e500->mas7_3 &= ~(u64)0xffffffff;
+		vcpu_e500->mas7_3 |= spr_val;
+		break;
 	case SPRN_MAS4:
 		vcpu_e500->mas4 = spr_val; break;
 	case SPRN_MAS6:
 		vcpu_e500->mas6 = spr_val; break;
 	case SPRN_MAS7:
-		vcpu_e500->mas7 = spr_val; break;
+		vcpu_e500->mas7_3 &= (u64)0xffffffff;
+		vcpu_e500->mas7_3 |= (u64)spr_val << 32;
+		break;
 	case SPRN_L1CSR0:
 		vcpu_e500->l1csr0 = spr_val;
 		vcpu_e500->l1csr0 &= ~(L1CSR0_DCFI | L1CSR0_CLFC);
@@ -158,13 +162,13 @@ int kvmppc_core_emulate_mfspr(struct kvm_vcpu *vcpu, int sprn, int rt)
 	case SPRN_MAS2:
 		kvmppc_set_gpr(vcpu, rt, vcpu_e500->mas2); break;
 	case SPRN_MAS3:
-		kvmppc_set_gpr(vcpu, rt, vcpu_e500->mas3); break;
+		kvmppc_set_gpr(vcpu, rt, (u32)vcpu_e500->mas7_3); break;
 	case SPRN_MAS4:
 		kvmppc_set_gpr(vcpu, rt, vcpu_e500->mas4); break;
 	case SPRN_MAS6:
 		kvmppc_set_gpr(vcpu, rt, vcpu_e500->mas6); break;
 	case SPRN_MAS7:
-		kvmppc_set_gpr(vcpu, rt, vcpu_e500->mas7); break;
+		kvmppc_set_gpr(vcpu, rt, vcpu_e500->mas7_3 >> 32); break;
 
 	case SPRN_TLB0CFG:
 		kvmppc_set_gpr(vcpu, rt, vcpu_e500->tlb0cfg); break;
diff --git a/arch/powerpc/kvm/e500_tlb.c b/arch/powerpc/kvm/e500_tlb.c
index 526170f..512a65e 100644
--- a/arch/powerpc/kvm/e500_tlb.c
+++ b/arch/powerpc/kvm/e500_tlb.c
@@ -19,6 +19,11 @@
 #include <linux/kvm.h>
 #include <linux/kvm_host.h>
 #include <linux/highmem.h>
+#include <linux/log2.h>
+#include <linux/uaccess.h>
+#include <linux/sched.h>
+#include <linux/rwsem.h>
+#include <linux/vmalloc.h>
 #include <asm/kvm_ppc.h>
 #include <asm/kvm_e500.h>
 
@@ -68,6 +73,13 @@ static unsigned int tlb_host_entries[2];
 static unsigned int tlb_host_ways[2];
 static unsigned int tlb_host_sets[2];
 
+static struct kvm_book3e_206_tlb_entry *get_entry(
+	struct kvmppc_vcpu_e500 *vcpu_e500, int tlbsel, int entry)
+{
+	int offset = vcpu_e500->gtlb_offset[tlbsel];
+	return &vcpu_e500->gtlb_arch[offset + entry];
+}
+
 /*
  * Allocate a free shadow id and setup a valid sid mapping in given entry.
  * A mapping is only valid when vcpu_id_table and pcpu_id_table are match.
@@ -219,34 +231,13 @@ void kvmppc_e500_recalc_shadow_pid(struct kvmppc_vcpu_e500 *vcpu_e500)
 	preempt_enable();
 }
 
-void kvmppc_dump_tlbs(struct kvm_vcpu *vcpu)
-{
-	struct kvmppc_vcpu_e500 *vcpu_e500 = to_e500(vcpu);
-	struct tlbe *tlbe;
-	int i, tlbsel;
-
-	printk("| %8s | %8s | %8s | %8s | %8s |\n",
-			"nr", "mas1", "mas2", "mas3", "mas7");
-
-	for (tlbsel = 0; tlbsel < 2; tlbsel++) {
-		printk("Guest TLB%d:\n", tlbsel);
-		for (i = 0; i < vcpu_e500->gtlb_size[tlbsel]; i++) {
-			tlbe = &vcpu_e500->gtlb_arch[tlbsel][i];
-			if (tlbe->mas1 & MAS1_VALID)
-				printk(" G[%d][%3d] |  %08X | %08X | %08X | %08X |\n",
-					tlbsel, i, tlbe->mas1, tlbe->mas2,
-					tlbe->mas3, tlbe->mas7);
-		}
-	}
-}
-
 static inline unsigned int gtlb0_get_next_victim(
 		struct kvmppc_vcpu_e500 *vcpu_e500)
 {
 	unsigned int victim;
 
 	victim = vcpu_e500->gtlb_nv[0]++;
-	if (unlikely(vcpu_e500->gtlb_nv[0] >= KVM_E500_TLB0_WAY_NUM))
+	if (unlikely(vcpu_e500->gtlb_nv[0] >= vcpu_e500->gtlb0_ways))
 		vcpu_e500->gtlb_nv[0] = 0;
 
 	return victim;
@@ -258,9 +249,9 @@ static inline unsigned int tlb1_max_shadow_size(void)
 	return tlb_host_entries[1] - tlbcam_index - 1;
 }
 
-static inline int tlbe_is_writable(struct tlbe *tlbe)
+static inline int tlbe_is_writable(struct kvm_book3e_206_tlb_entry *tlbe)
 {
-	return tlbe->mas3 & (MAS3_SW|MAS3_UW);
+	return tlbe->mas7_3 & (MAS3_SW|MAS3_UW);
 }
 
 static inline u32 e500_shadow_mas3_attrib(u32 mas3, int usermode)
@@ -291,39 +282,41 @@ static inline u32 e500_shadow_mas2_attrib(u32 mas2, int usermode)
 /*
  * writing shadow tlb entry to host TLB
  */
-static inline void __write_host_tlbe(struct tlbe *stlbe, uint32_t mas0)
+static inline void __write_host_tlbe(struct kvm_book3e_206_tlb_entry *stlbe,
+				     uint32_t mas0)
 {
 	unsigned long flags;
 
 	local_irq_save(flags);
 	mtspr(SPRN_MAS0, mas0);
 	mtspr(SPRN_MAS1, stlbe->mas1);
-	mtspr(SPRN_MAS2, stlbe->mas2);
-	mtspr(SPRN_MAS3, stlbe->mas3);
-	mtspr(SPRN_MAS7, stlbe->mas7);
+	mtspr(SPRN_MAS2, (unsigned long)stlbe->mas2);
+	mtspr(SPRN_MAS3, (u32)stlbe->mas7_3);
+	mtspr(SPRN_MAS7, (u32)(stlbe->mas7_3 >> 32));
 	asm volatile("isync; tlbwe" : : : "memory");
 	local_irq_restore(flags);
 }
 
 /* esel is index into set, not whole array */
 static inline void write_host_tlbe(struct kvmppc_vcpu_e500 *vcpu_e500,
-		int tlbsel, int esel, struct tlbe *stlbe)
+		int tlbsel, int esel, struct kvm_book3e_206_tlb_entry *stlbe)
 {
 	if (tlbsel == 0) {
-		__write_host_tlbe(stlbe, MAS0_TLBSEL(0) | MAS0_ESEL(esel));
+		int way = esel & (vcpu_e500->gtlb0_ways - 1);
+		__write_host_tlbe(stlbe, MAS0_TLBSEL(0) | MAS0_ESEL(way));
 	} else {
 		__write_host_tlbe(stlbe,
 				  MAS0_TLBSEL(1) |
 				  MAS0_ESEL(to_htlb1_esel(esel)));
 	}
 	trace_kvm_stlb_write(index_of(tlbsel, esel), stlbe->mas1, stlbe->mas2,
-			     stlbe->mas3, stlbe->mas7);
+			     (u32)stlbe->mas7_3, (u32)(stlbe->mas7_3 >> 32));
 }
 
 void kvmppc_map_magic(struct kvm_vcpu *vcpu)
 {
 	struct kvmppc_vcpu_e500 *vcpu_e500 = to_e500(vcpu);
-	struct tlbe magic;
+	struct kvm_book3e_206_tlb_entry magic;
 	ulong shared_page = ((ulong)vcpu->arch.shared) & PAGE_MASK;
 	unsigned int stid;
 	pfn_t pfn;
@@ -337,9 +330,8 @@ void kvmppc_map_magic(struct kvm_vcpu *vcpu)
 	magic.mas1 = MAS1_VALID | MAS1_TS | MAS1_TID(stid) |
 		     MAS1_TSIZE(BOOK3E_PAGESZ_4K);
 	magic.mas2 = vcpu->arch.magic_page_ea | MAS2_M;
-	magic.mas3 = (pfn << PAGE_SHIFT) |
-		     MAS3_SW | MAS3_SR | MAS3_UW | MAS3_UR;
-	magic.mas7 = pfn >> (32 - PAGE_SHIFT);
+	magic.mas7_3 = ((u64)pfn << PAGE_SHIFT) |
+		       MAS3_SW | MAS3_SR | MAS3_UW | MAS3_UR;
 
 	__write_host_tlbe(&magic, MAS0_TLBSEL(1) | MAS0_ESEL(tlbcam_index));
 	preempt_enable();
@@ -360,7 +352,8 @@ void kvmppc_e500_tlb_put(struct kvm_vcpu *vcpu)
 static void inval_gtlbe_on_host(struct kvmppc_vcpu_e500 *vcpu_e500,
 				int tlbsel, int esel)
 {
-	struct tlbe *gtlbe = &vcpu_e500->gtlb_arch[tlbsel][esel];
+	struct kvm_book3e_206_tlb_entry *gtlbe =
+		get_entry(vcpu_e500, tlbsel, esel);
 	struct vcpu_id_table *idt = vcpu_e500->idt;
 	unsigned int pr, tid, ts, pid;
 	u32 val, eaddr;
@@ -426,9 +419,8 @@ static int tlb0_set_base(gva_t addr, int sets, int ways)
 
 static int gtlb0_set_base(struct kvmppc_vcpu_e500 *vcpu_e500, gva_t addr)
 {
-	int sets = KVM_E500_TLB0_SIZE / KVM_E500_TLB0_WAY_NUM;
-
-	return tlb0_set_base(addr, sets, KVM_E500_TLB0_WAY_NUM);
+	return tlb0_set_base(addr, vcpu_e500->gtlb0_sets,
+			     vcpu_e500->gtlb0_ways);
 }
 
 static int htlb0_set_base(gva_t addr)
@@ -441,7 +433,7 @@ static unsigned int get_tlb_esel(struct kvmppc_vcpu_e500 *vcpu_e500, int tlbsel)
 	unsigned int esel = get_tlb_esel_bit(vcpu_e500);
 
 	if (tlbsel == 0) {
-		esel &= KVM_E500_TLB0_WAY_NUM_MASK;
+		esel &= vcpu_e500->gtlb0_ways - 1;
 		esel += gtlb0_set_base(vcpu_e500, vcpu_e500->mas2);
 	} else {
 		esel &= vcpu_e500->gtlb_size[tlbsel] - 1;
@@ -455,18 +447,21 @@ static int kvmppc_e500_tlb_index(struct kvmppc_vcpu_e500 *vcpu_e500,
 		gva_t eaddr, int tlbsel, unsigned int pid, int as)
 {
 	int size = vcpu_e500->gtlb_size[tlbsel];
-	unsigned int set_base;
+	unsigned int set_base, offset;
 	int i;
 
 	if (tlbsel == 0) {
 		set_base = gtlb0_set_base(vcpu_e500, eaddr);
-		size = KVM_E500_TLB0_WAY_NUM;
+		size = vcpu_e500->gtlb0_ways;
 	} else {
 		set_base = 0;
 	}
 
+	offset = vcpu_e500->gtlb_offset[tlbsel];
+
 	for (i = 0; i < size; i++) {
-		struct tlbe *tlbe = &vcpu_e500->gtlb_arch[tlbsel][set_base + i];
+		struct kvm_book3e_206_tlb_entry *tlbe =
+			&vcpu_e500->gtlb_arch[offset + set_base + i];
 		unsigned int tid;
 
 		if (eaddr < get_tlb_eaddr(tlbe))
@@ -492,7 +487,7 @@ static int kvmppc_e500_tlb_index(struct kvmppc_vcpu_e500 *vcpu_e500,
 }
 
 static inline void kvmppc_e500_ref_setup(struct tlbe_ref *ref,
-					 struct tlbe *gtlbe,
+					 struct kvm_book3e_206_tlb_entry *gtlbe,
 					 pfn_t pfn)
 {
 	ref->pfn = pfn;
@@ -531,6 +526,8 @@ static void clear_tlb_refs(struct kvmppc_vcpu_e500 *vcpu_e500)
 	int stlbsel = 1;
 	int i;
 
+	kvmppc_e500_id_table_reset_all(vcpu_e500);
+
 	for (i = 0; i < tlb_host_entries[stlbsel]; i++) {
 		struct tlbe_ref *ref =
 			&vcpu_e500->tlb_refs[stlbsel][i];
@@ -560,18 +557,18 @@ static inline void kvmppc_e500_deliver_tlb_miss(struct kvm_vcpu *vcpu,
 		| MAS1_TSIZE(tsized);
 	vcpu_e500->mas2 = (eaddr & MAS2_EPN)
 		| (vcpu_e500->mas4 & MAS2_ATTRIB_MASK);
-	vcpu_e500->mas3 &= MAS3_U0 | MAS3_U1 | MAS3_U2 | MAS3_U3;
+	vcpu_e500->mas7_3 &= MAS3_U0 | MAS3_U1 | MAS3_U2 | MAS3_U3;
 	vcpu_e500->mas6 = (vcpu_e500->mas6 & MAS6_SPID1)
 		| (get_cur_pid(vcpu) << 16)
 		| (as ? MAS6_SAS : 0);
-	vcpu_e500->mas7 = 0;
 }
 
 /* TID must be supplied by the caller */
-static inline void kvmppc_e500_setup_stlbe(struct kvmppc_vcpu_e500 *vcpu_e500,
-					   struct tlbe *gtlbe, int tsize,
-					   struct tlbe_ref *ref,
-					   u64 gvaddr, struct tlbe *stlbe)
+static inline void kvmppc_e500_setup_stlbe(
+	struct kvmppc_vcpu_e500 *vcpu_e500,
+	struct kvm_book3e_206_tlb_entry *gtlbe,
+	int tsize, struct tlbe_ref *ref, u64 gvaddr,
+	struct kvm_book3e_206_tlb_entry *stlbe)
 {
 	pfn_t pfn = ref->pfn;
 
@@ -582,16 +579,16 @@ static inline void kvmppc_e500_setup_stlbe(struct kvmppc_vcpu_e500 *vcpu_e500,
 	stlbe->mas2 = (gvaddr & MAS2_EPN)
 		| e500_shadow_mas2_attrib(gtlbe->mas2,
 				vcpu_e500->vcpu.arch.shared->msr & MSR_PR);
-	stlbe->mas3 = ((pfn << PAGE_SHIFT) & MAS3_RPN)
-		| e500_shadow_mas3_attrib(gtlbe->mas3,
+	stlbe->mas7_3 = ((u64)pfn << PAGE_SHIFT)
+		| e500_shadow_mas3_attrib(gtlbe->mas7_3,
 				vcpu_e500->vcpu.arch.shared->msr & MSR_PR);
-	stlbe->mas7 = (pfn >> (32 - PAGE_SHIFT)) & MAS7_RPN;
 }
 
 /* sesel is an index into the entire array, not just the set */
 static inline void kvmppc_e500_shadow_map(struct kvmppc_vcpu_e500 *vcpu_e500,
-	u64 gvaddr, gfn_t gfn, struct tlbe *gtlbe, int tlbsel, int sesel,
-	struct tlbe *stlbe, struct tlbe_ref *ref)
+	u64 gvaddr, gfn_t gfn, struct kvm_book3e_206_tlb_entry *gtlbe,
+	int tlbsel, int sesel, struct kvm_book3e_206_tlb_entry *stlbe,
+	struct tlbe_ref *ref)
 {
 	struct kvm_memory_slot *slot;
 	unsigned long pfn, hva;
@@ -701,15 +698,16 @@ static inline void kvmppc_e500_shadow_map(struct kvmppc_vcpu_e500 *vcpu_e500,
 
 /* XXX only map the one-one case, for now use TLB0 */
 static int kvmppc_e500_tlb0_map(struct kvmppc_vcpu_e500 *vcpu_e500,
-				int esel, struct tlbe *stlbe)
+				int esel,
+				struct kvm_book3e_206_tlb_entry *stlbe)
 {
-	struct tlbe *gtlbe;
+	struct kvm_book3e_206_tlb_entry *gtlbe;
 	struct tlbe_ref *ref;
 	int sesel = esel & (tlb_host_ways[0] - 1);
 	int sesel_base;
 	gva_t ea;
 
-	gtlbe = &vcpu_e500->gtlb_arch[0][esel];
+	gtlbe = get_entry(vcpu_e500, 0, esel);
 	ref = &vcpu_e500->gtlb_priv[0][esel].ref;
 
 	ea = get_tlb_eaddr(gtlbe);
@@ -726,7 +724,8 @@ static int kvmppc_e500_tlb0_map(struct kvmppc_vcpu_e500 *vcpu_e500,
  * the shadow TLB. */
 /* XXX for both one-one and one-to-many , for now use TLB1 */
 static int kvmppc_e500_tlb1_map(struct kvmppc_vcpu_e500 *vcpu_e500,
-		u64 gvaddr, gfn_t gfn, struct tlbe *gtlbe, struct tlbe *stlbe)
+		u64 gvaddr, gfn_t gfn, struct kvm_book3e_206_tlb_entry *gtlbe,
+		struct kvm_book3e_206_tlb_entry *stlbe)
 {
 	struct tlbe_ref *ref;
 	unsigned int victim;
@@ -755,7 +754,8 @@ static inline int kvmppc_e500_gtlbe_invalidate(
 				struct kvmppc_vcpu_e500 *vcpu_e500,
 				int tlbsel, int esel)
 {
-	struct tlbe *gtlbe = &vcpu_e500->gtlb_arch[tlbsel][esel];
+	struct kvm_book3e_206_tlb_entry *gtlbe =
+		get_entry(vcpu_e500, tlbsel, esel);
 
 	if (unlikely(get_tlb_iprot(gtlbe)))
 		return -1;
@@ -818,18 +818,17 @@ int kvmppc_e500_emul_tlbre(struct kvm_vcpu *vcpu)
 {
 	struct kvmppc_vcpu_e500 *vcpu_e500 = to_e500(vcpu);
 	int tlbsel, esel;
-	struct tlbe *gtlbe;
+	struct kvm_book3e_206_tlb_entry *gtlbe;
 
 	tlbsel = get_tlb_tlbsel(vcpu_e500);
 	esel = get_tlb_esel(vcpu_e500, tlbsel);
 
-	gtlbe = &vcpu_e500->gtlb_arch[tlbsel][esel];
+	gtlbe = get_entry(vcpu_e500, tlbsel, esel);
 	vcpu_e500->mas0 &= ~MAS0_NV(~0);
 	vcpu_e500->mas0 |= MAS0_NV(vcpu_e500->gtlb_nv[tlbsel]);
 	vcpu_e500->mas1 = gtlbe->mas1;
 	vcpu_e500->mas2 = gtlbe->mas2;
-	vcpu_e500->mas3 = gtlbe->mas3;
-	vcpu_e500->mas7 = gtlbe->mas7;
+	vcpu_e500->mas7_3 = gtlbe->mas7_3;
 
 	return EMULATE_DONE;
 }
@@ -840,7 +839,7 @@ int kvmppc_e500_emul_tlbsx(struct kvm_vcpu *vcpu, int rb)
 	int as = !!get_cur_sas(vcpu_e500);
 	unsigned int pid = get_cur_spid(vcpu_e500);
 	int esel, tlbsel;
-	struct tlbe *gtlbe = NULL;
+	struct kvm_book3e_206_tlb_entry *gtlbe = NULL;
 	gva_t ea;
 
 	ea = kvmppc_get_gpr(vcpu, rb);
@@ -848,21 +847,20 @@ int kvmppc_e500_emul_tlbsx(struct kvm_vcpu *vcpu, int rb)
 	for (tlbsel = 0; tlbsel < 2; tlbsel++) {
 		esel = kvmppc_e500_tlb_index(vcpu_e500, ea, tlbsel, pid, as);
 		if (esel >= 0) {
-			gtlbe = &vcpu_e500->gtlb_arch[tlbsel][esel];
+			gtlbe = get_entry(vcpu_e500, tlbsel, esel);
 			break;
 		}
 	}
 
 	if (gtlbe) {
 		if (tlbsel == 0)
-			esel &= KVM_E500_TLB0_WAY_NUM - 1;
+			esel &= vcpu_e500->gtlb0_ways - 1;
 
 		vcpu_e500->mas0 = MAS0_TLBSEL(tlbsel) | MAS0_ESEL(esel)
 			| MAS0_NV(vcpu_e500->gtlb_nv[tlbsel]);
 		vcpu_e500->mas1 = gtlbe->mas1;
 		vcpu_e500->mas2 = gtlbe->mas2;
-		vcpu_e500->mas3 = gtlbe->mas3;
-		vcpu_e500->mas7 = gtlbe->mas7;
+		vcpu_e500->mas7_3 = gtlbe->mas7_3;
 	} else {
 		int victim;
 
@@ -877,8 +875,7 @@ int kvmppc_e500_emul_tlbsx(struct kvm_vcpu *vcpu, int rb)
 			| (vcpu_e500->mas4 & MAS4_TSIZED(~0));
 		vcpu_e500->mas2 &= MAS2_EPN;
 		vcpu_e500->mas2 |= vcpu_e500->mas4 & MAS2_ATTRIB_MASK;
-		vcpu_e500->mas3 &= MAS3_U0 | MAS3_U1 | MAS3_U2 | MAS3_U3;
-		vcpu_e500->mas7 = 0;
+		vcpu_e500->mas7_3 &= MAS3_U0 | MAS3_U1 | MAS3_U2 | MAS3_U3;
 	}
 
 	kvmppc_set_exit_type(vcpu, EMULATED_TLBSX_EXITS);
@@ -887,8 +884,8 @@ int kvmppc_e500_emul_tlbsx(struct kvm_vcpu *vcpu, int rb)
 
 /* sesel is index into the set, not the whole array */
 static void write_stlbe(struct kvmppc_vcpu_e500 *vcpu_e500,
-			struct tlbe *gtlbe,
-			struct tlbe *stlbe,
+			struct kvm_book3e_206_tlb_entry *gtlbe,
+			struct kvm_book3e_206_tlb_entry *stlbe,
 			int stlbsel, int sesel)
 {
 	int stid;
@@ -906,28 +903,27 @@ static void write_stlbe(struct kvmppc_vcpu_e500 *vcpu_e500,
 int kvmppc_e500_emul_tlbwe(struct kvm_vcpu *vcpu)
 {
 	struct kvmppc_vcpu_e500 *vcpu_e500 = to_e500(vcpu);
-	struct tlbe *gtlbe;
+	struct kvm_book3e_206_tlb_entry *gtlbe;
 	int tlbsel, esel;
 
 	tlbsel = get_tlb_tlbsel(vcpu_e500);
 	esel = get_tlb_esel(vcpu_e500, tlbsel);
 
-	gtlbe = &vcpu_e500->gtlb_arch[tlbsel][esel];
+	gtlbe = get_entry(vcpu_e500, tlbsel, esel);
 
 	if (get_tlb_v(gtlbe))
 		inval_gtlbe_on_host(vcpu_e500, tlbsel, esel);
 
 	gtlbe->mas1 = vcpu_e500->mas1;
 	gtlbe->mas2 = vcpu_e500->mas2;
-	gtlbe->mas3 = vcpu_e500->mas3;
-	gtlbe->mas7 = vcpu_e500->mas7;
+	gtlbe->mas7_3 = vcpu_e500->mas7_3;
 
 	trace_kvm_gtlb_write(vcpu_e500->mas0, gtlbe->mas1, gtlbe->mas2,
-			     gtlbe->mas3, gtlbe->mas7);
+			     (u32)gtlbe->mas7_3, (u32)(gtlbe->mas7_3 >> 32));
 
 	/* Invalidate shadow mappings for the about-to-be-clobbered TLBE. */
 	if (tlbe_is_host_safe(vcpu, gtlbe)) {
-		struct tlbe stlbe;
+		struct kvm_book3e_206_tlb_entry stlbe;
 		int stlbsel, sesel;
 		u64 eaddr;
 		u64 raddr;
@@ -1000,9 +996,11 @@ gpa_t kvmppc_mmu_xlate(struct kvm_vcpu *vcpu, unsigned int index,
 			gva_t eaddr)
 {
 	struct kvmppc_vcpu_e500 *vcpu_e500 = to_e500(vcpu);
-	struct tlbe *gtlbe =
-		&vcpu_e500->gtlb_arch[tlbsel_of(index)][esel_of(index)];
-	u64 pgmask = get_tlb_bytes(gtlbe) - 1;
+	struct kvm_book3e_206_tlb_entry *gtlbe;
+	u64 pgmask;
+
+	gtlbe = get_entry(vcpu_e500, tlbsel_of(index), esel_of(index));
+	pgmask = get_tlb_bytes(gtlbe) - 1;
 
 	return get_tlb_raddr(gtlbe) | (eaddr & pgmask);
 }
@@ -1016,12 +1014,12 @@ void kvmppc_mmu_map(struct kvm_vcpu *vcpu, u64 eaddr, gpa_t gpaddr,
 {
 	struct kvmppc_vcpu_e500 *vcpu_e500 = to_e500(vcpu);
 	struct tlbe_priv *priv;
-	struct tlbe *gtlbe, stlbe;
+	struct kvm_book3e_206_tlb_entry *gtlbe, stlbe;
 	int tlbsel = tlbsel_of(index);
 	int esel = esel_of(index);
 	int stlbsel, sesel;
 
-	gtlbe = &vcpu_e500->gtlb_arch[tlbsel][esel];
+	gtlbe = get_entry(vcpu_e500, tlbsel, esel);
 
 	switch (tlbsel) {
 	case 0:
@@ -1077,25 +1075,186 @@ void kvmppc_set_pid(struct kvm_vcpu *vcpu, u32 pid)
 
 void kvmppc_e500_tlb_setup(struct kvmppc_vcpu_e500 *vcpu_e500)
 {
-	struct tlbe *tlbe;
+	struct kvm_book3e_206_tlb_entry *tlbe;
 
 	/* Insert large initial mapping for guest. */
-	tlbe = &vcpu_e500->gtlb_arch[1][0];
+	tlbe = get_entry(vcpu_e500, 1, 0);
 	tlbe->mas1 = MAS1_VALID | MAS1_TSIZE(BOOK3E_PAGESZ_256M);
 	tlbe->mas2 = 0;
-	tlbe->mas3 = E500_TLB_SUPER_PERM_MASK;
-	tlbe->mas7 = 0;
+	tlbe->mas7_3 = E500_TLB_SUPER_PERM_MASK;
 
 	/* 4K map for serial output. Used by kernel wrapper. */
-	tlbe = &vcpu_e500->gtlb_arch[1][1];
+	tlbe = get_entry(vcpu_e500, 1, 1);
 	tlbe->mas1 = MAS1_VALID | MAS1_TSIZE(BOOK3E_PAGESZ_4K);
 	tlbe->mas2 = (0xe0004500 & 0xFFFFF000) | MAS2_I | MAS2_G;
-	tlbe->mas3 = (0xe0004500 & 0xFFFFF000) | E500_TLB_SUPER_PERM_MASK;
-	tlbe->mas7 = 0;
+	tlbe->mas7_3 = (0xe0004500 & 0xFFFFF000) | E500_TLB_SUPER_PERM_MASK;
+}
+
+static void free_gtlb(struct kvmppc_vcpu_e500 *vcpu_e500)
+{
+	int i;
+
+	clear_tlb_refs(vcpu_e500);
+	kfree(vcpu_e500->gtlb_priv[0]);
+	kfree(vcpu_e500->gtlb_priv[1]);
+
+	if (vcpu_e500->shared_tlb_pages) {
+		vfree((void *)(round_down((uintptr_t)vcpu_e500->gtlb_arch,
+					  PAGE_SIZE)));
+
+		for (i = 0; i < vcpu_e500->num_shared_tlb_pages; i++)
+			put_page(vcpu_e500->shared_tlb_pages[i]);
+
+		vcpu_e500->num_shared_tlb_pages = 0;
+		vcpu_e500->shared_tlb_pages = NULL;
+	} else {
+		kfree(vcpu_e500->gtlb_arch);
+	}
+
+	vcpu_e500->gtlb_arch = NULL;
+}
+
+int kvm_vcpu_ioctl_config_tlb(struct kvm_vcpu *vcpu,
+			      struct kvm_config_tlb *cfg)
+{
+	struct kvmppc_vcpu_e500 *vcpu_e500 = to_e500(vcpu);
+	struct kvm_book3e_206_tlb_params params;
+	char *virt;
+	struct page **pages;
+	struct tlbe_priv *privs[2] = {};
+	size_t array_len;
+	u32 sets;
+	int num_pages, ret, i;
+
+	if (cfg->mmu_type != KVM_MMU_FSL_BOOKE_NOHV)
+		return -EINVAL;
+
+	if (copy_from_user(&params, (void __user *)(uintptr_t)cfg->params,
+			   sizeof(params)))
+		return -EFAULT;
+
+	if (params.tlb_sizes[1] > 64)
+		return -EINVAL;
+	if (params.tlb_ways[1] != params.tlb_sizes[1])
+		return -EINVAL;
+	if (params.tlb_sizes[2] != 0 || params.tlb_sizes[3] != 0)
+		return -EINVAL;
+	if (params.tlb_ways[2] != 0 || params.tlb_ways[3] != 0)
+		return -EINVAL;
+
+	if (!is_power_of_2(params.tlb_ways[0]))
+		return -EINVAL;
+
+	sets = params.tlb_sizes[0] >> ilog2(params.tlb_ways[0]);
+	if (!is_power_of_2(sets))
+		return -EINVAL;
+
+	array_len = params.tlb_sizes[0] + params.tlb_sizes[1];
+	array_len *= sizeof(struct kvm_book3e_206_tlb_entry);
+
+	if (cfg->array_len < array_len)
+		return -EINVAL;
+
+	num_pages = DIV_ROUND_UP(cfg->array + array_len - 1, PAGE_SIZE) -
+		    cfg->array / PAGE_SIZE;
+	pages = kmalloc(sizeof(struct page *) * num_pages, GFP_KERNEL);
+	if (!pages)
+		return -ENOMEM;
+
+	ret = get_user_pages_fast(cfg->array, num_pages, 1, pages);
+	if (ret < 0)
+		goto err_pages;
+
+	if (ret != num_pages) {
+		num_pages = ret;
+		ret = -EFAULT;
+		goto err_put_page;
+	}
+
+	virt = vmap(pages, num_pages, VM_MAP, PAGE_KERNEL);
+	if (!virt)
+		goto err_put_page;
+
+	privs[0] = kzalloc(sizeof(struct tlbe_priv) * params.tlb_sizes[0],
+			   GFP_KERNEL);
+	privs[1] = kzalloc(sizeof(struct tlbe_priv) * params.tlb_sizes[1],
+			   GFP_KERNEL);
+
+	if (!privs[0] || !privs[1])
+		goto err_put_page;
+
+	free_gtlb(vcpu_e500);
+
+	vcpu_e500->gtlb_priv[0] = privs[0];
+	vcpu_e500->gtlb_priv[1] = privs[1];
+
+	vcpu_e500->gtlb_arch = (struct kvm_book3e_206_tlb_entry *)
+		(virt + (cfg->array & (PAGE_SIZE - 1)));
+
+	vcpu_e500->gtlb_size[0] = params.tlb_sizes[0];
+	vcpu_e500->gtlb_size[1] = params.tlb_sizes[1];
+
+	vcpu_e500->gtlb_offset[0] = 0;
+	vcpu_e500->gtlb_offset[1] = params.tlb_sizes[0];
+
+	vcpu_e500->tlb0cfg = mfspr(SPRN_TLB0CFG) & ~0xfffUL;
+	if (params.tlb_sizes[0] <= 2048)
+		vcpu_e500->tlb0cfg |= params.tlb_sizes[0];
+
+	vcpu_e500->tlb1cfg = mfspr(SPRN_TLB1CFG) & ~0xfffUL;
+	vcpu_e500->tlb1cfg |= params.tlb_sizes[1];
+
+	vcpu_e500->shared_tlb_pages = pages;
+	vcpu_e500->num_shared_tlb_pages = num_pages;
+
+	vcpu_e500->gtlb0_ways = params.tlb_ways[0];
+	vcpu_e500->gtlb0_sets = sets;
+
+	return 0;
+
+err_put_page:
+	kfree(privs[0]);
+	kfree(privs[1]);
+
+	for (i = 0; i < num_pages; i++)
+		put_page(pages[i]);
+
+err_pages:
+	kfree(pages);
+	return ret;
+}
+
+int kvm_vcpu_ioctl_dirty_tlb(struct kvm_vcpu *vcpu,
+			     struct kvm_dirty_tlb *dirty)
+{
+	struct kvmppc_vcpu_e500 *vcpu_e500 = to_e500(vcpu);
+
+	clear_tlb_refs(vcpu_e500);
+	return 0;
+}
+
+void kvmppc_core_heavy_exit(struct kvm_vcpu *vcpu)
+{
+	struct kvmppc_vcpu_e500 *vcpu_e500 = to_e500(vcpu);
+	int i;
+
+	/*
+	 * We may have modified the guest TLB, so mark it dirty.
+	 * We only do it on an actual return to userspace, to avoid
+	 * adding more overhead to getting scheduled out -- and avoid
+	 * any locking issues with getting preempted in the middle of
+	 * KVM_CONFIG_TLB, etc.
+	 */
+
+	for (i = 0; i < vcpu_e500->num_shared_tlb_pages; i++)
+		set_page_dirty_lock(vcpu_e500->shared_tlb_pages[i]);
 }
 
 int kvmppc_e500_tlb_init(struct kvmppc_vcpu_e500 *vcpu_e500)
 {
+	int entry_size = sizeof(struct kvm_book3e_206_tlb_entry);
+	int entries = KVM_E500_TLB0_SIZE + KVM_E500_TLB1_SIZE;
+
 	tlb_host_entries[0] = mfspr(SPRN_TLB0CFG) & TLBnCFG_N_ENTRY;
 	tlb_host_entries[1] = mfspr(SPRN_TLB1CFG) & TLBnCFG_N_ENTRY;
 
@@ -1126,16 +1285,17 @@ int kvmppc_e500_tlb_init(struct kvmppc_vcpu_e500 *vcpu_e500)
 	tlb_host_sets[1] = 1;
 
 	vcpu_e500->gtlb_size[0] = KVM_E500_TLB0_SIZE;
-	vcpu_e500->gtlb_arch[0] =
-		kzalloc(sizeof(struct tlbe) * KVM_E500_TLB0_SIZE, GFP_KERNEL);
-	if (vcpu_e500->gtlb_arch[0] == NULL)
-		goto err;
-
 	vcpu_e500->gtlb_size[1] = KVM_E500_TLB1_SIZE;
-	vcpu_e500->gtlb_arch[1] =
-		kzalloc(sizeof(struct tlbe) * KVM_E500_TLB1_SIZE, GFP_KERNEL);
-	if (vcpu_e500->gtlb_arch[1] == NULL)
-		goto err;
+
+	vcpu_e500->gtlb0_ways = KVM_E500_TLB0_WAY_NUM;
+	vcpu_e500->gtlb0_sets = KVM_E500_TLB0_SIZE / KVM_E500_TLB0_WAY_NUM;
+
+	vcpu_e500->gtlb_arch = kmalloc(entries * entry_size, GFP_KERNEL);
+	if (!vcpu_e500->gtlb_arch)
+		return -ENOMEM;
+
+	vcpu_e500->gtlb_offset[0] = 0;
+	vcpu_e500->gtlb_offset[1] = KVM_E500_TLB0_SIZE;
 
 	vcpu_e500->tlb_refs[0] =
 		kzalloc(sizeof(struct tlbe_ref) * tlb_host_entries[0],
@@ -1173,25 +1333,17 @@ int kvmppc_e500_tlb_init(struct kvmppc_vcpu_e500 *vcpu_e500)
 	return 0;
 
 err:
+	free_gtlb(vcpu_e500);
 	kfree(vcpu_e500->tlb_refs[0]);
 	kfree(vcpu_e500->tlb_refs[1]);
-	kfree(vcpu_e500->gtlb_priv[0]);
-	kfree(vcpu_e500->gtlb_priv[1]);
-	kfree(vcpu_e500->gtlb_arch[0]);
-	kfree(vcpu_e500->gtlb_arch[1]);
 	return -1;
 }
 
 void kvmppc_e500_tlb_uninit(struct kvmppc_vcpu_e500 *vcpu_e500)
 {
-	clear_tlb_refs(vcpu_e500);
-
+	free_gtlb(vcpu_e500);
 	kvmppc_e500_id_table_free(vcpu_e500);
 
 	kfree(vcpu_e500->tlb_refs[0]);
 	kfree(vcpu_e500->tlb_refs[1]);
-	kfree(vcpu_e500->gtlb_priv[0]);
-	kfree(vcpu_e500->gtlb_priv[1]);
-	kfree(vcpu_e500->gtlb_arch[1]);
-	kfree(vcpu_e500->gtlb_arch[0]);
 }
diff --git a/arch/powerpc/kvm/e500_tlb.h b/arch/powerpc/kvm/e500_tlb.h
index b587f69..2c29640 100644
--- a/arch/powerpc/kvm/e500_tlb.h
+++ b/arch/powerpc/kvm/e500_tlb.h
@@ -20,13 +20,9 @@
 #include <asm/tlb.h>
 #include <asm/kvm_e500.h>
 
-#define KVM_E500_TLB0_WAY_SIZE_BIT	7	/* Fixed */
-#define KVM_E500_TLB0_WAY_SIZE		(1UL << KVM_E500_TLB0_WAY_SIZE_BIT)
-#define KVM_E500_TLB0_WAY_SIZE_MASK	(KVM_E500_TLB0_WAY_SIZE - 1)
-
-#define KVM_E500_TLB0_WAY_NUM_BIT	1	/* No greater than 7 */
-#define KVM_E500_TLB0_WAY_NUM		(1UL << KVM_E500_TLB0_WAY_NUM_BIT)
-#define KVM_E500_TLB0_WAY_NUM_MASK	(KVM_E500_TLB0_WAY_NUM - 1)
+/* This geometry is the legacy default -- can be overridden by userspace */
+#define KVM_E500_TLB0_WAY_SIZE		128
+#define KVM_E500_TLB0_WAY_NUM		2
 
 #define KVM_E500_TLB0_SIZE  (KVM_E500_TLB0_WAY_SIZE * KVM_E500_TLB0_WAY_NUM)
 #define KVM_E500_TLB1_SIZE  16
@@ -58,50 +54,54 @@ extern void kvmppc_e500_tlb_setup(struct kvmppc_vcpu_e500 *);
 extern void kvmppc_e500_recalc_shadow_pid(struct kvmppc_vcpu_e500 *);
 
 /* TLB helper functions */
-static inline unsigned int get_tlb_size(const struct tlbe *tlbe)
+static inline unsigned int
+get_tlb_size(const struct kvm_book3e_206_tlb_entry *tlbe)
 {
 	return (tlbe->mas1 >> 7) & 0x1f;
 }
 
-static inline gva_t get_tlb_eaddr(const struct tlbe *tlbe)
+static inline gva_t get_tlb_eaddr(const struct kvm_book3e_206_tlb_entry *tlbe)
 {
 	return tlbe->mas2 & 0xfffff000;
 }
 
-static inline u64 get_tlb_bytes(const struct tlbe *tlbe)
+static inline u64 get_tlb_bytes(const struct kvm_book3e_206_tlb_entry *tlbe)
 {
 	unsigned int pgsize = get_tlb_size(tlbe);
 	return 1ULL << 10 << pgsize;
 }
 
-static inline gva_t get_tlb_end(const struct tlbe *tlbe)
+static inline gva_t get_tlb_end(const struct kvm_book3e_206_tlb_entry *tlbe)
 {
 	u64 bytes = get_tlb_bytes(tlbe);
 	return get_tlb_eaddr(tlbe) + bytes - 1;
 }
 
-static inline u64 get_tlb_raddr(const struct tlbe *tlbe)
+static inline u64 get_tlb_raddr(const struct kvm_book3e_206_tlb_entry *tlbe)
 {
-	u64 rpn = tlbe->mas7;
-	return (rpn << 32) | (tlbe->mas3 & 0xfffff000);
+	return tlbe->mas7_3 & ~0xfffULL;
 }
 
-static inline unsigned int get_tlb_tid(const struct tlbe *tlbe)
+static inline unsigned int
+get_tlb_tid(const struct kvm_book3e_206_tlb_entry *tlbe)
 {
 	return (tlbe->mas1 >> 16) & 0xff;
 }
 
-static inline unsigned int get_tlb_ts(const struct tlbe *tlbe)
+static inline unsigned int
+get_tlb_ts(const struct kvm_book3e_206_tlb_entry *tlbe)
 {
 	return (tlbe->mas1 >> 12) & 0x1;
 }
 
-static inline unsigned int get_tlb_v(const struct tlbe *tlbe)
+static inline unsigned int
+get_tlb_v(const struct kvm_book3e_206_tlb_entry *tlbe)
 {
 	return (tlbe->mas1 >> 31) & 0x1;
 }
 
-static inline unsigned int get_tlb_iprot(const struct tlbe *tlbe)
+static inline unsigned int
+get_tlb_iprot(const struct kvm_book3e_206_tlb_entry *tlbe)
 {
 	return (tlbe->mas1 >> 30) & 0x1;
 }
@@ -156,7 +156,7 @@ static inline unsigned int get_tlb_esel_bit(
 }
 
 static inline int tlbe_is_host_safe(const struct kvm_vcpu *vcpu,
-			const struct tlbe *tlbe)
+			const struct kvm_book3e_206_tlb_entry *tlbe)
 {
 	gpa_t gpa;
 
diff --git a/arch/powerpc/kvm/powerpc.c b/arch/powerpc/kvm/powerpc.c
index 24e2b64..02ab74c 100644
--- a/arch/powerpc/kvm/powerpc.c
+++ b/arch/powerpc/kvm/powerpc.c
@@ -187,6 +187,9 @@ int kvm_dev_ioctl_check_extension(long ext)
 	case KVM_CAP_ENABLE_CAP:
 	case KVM_CAP_PPC_OSI:
 	case KVM_CAP_PPC_GET_PVINFO:
+#ifdef CONFIG_KVM_E500
+	case KVM_CAP_SW_TLB:
+#endif
 		r = 1;
 		break;
 	case KVM_CAP_COALESCED_MMIO:
@@ -503,6 +506,10 @@ int kvm_arch_vcpu_ioctl_run(struct kvm_vcpu *vcpu, struct kvm_run *run)
 	kvm_guest_exit();
 	local_irq_enable();
 
+#ifdef CONFIG_KVM_E500
+	kvmppc_core_heavy_exit(vcpu);
+#endif
+
 	if (vcpu->sigset_active)
 		sigprocmask(SIG_SETMASK, &sigsaved, NULL);
 
@@ -583,6 +590,27 @@ long kvm_arch_vcpu_ioctl(struct file *filp,
 		r = kvm_vcpu_ioctl_enable_cap(vcpu, &cap);
 		break;
 	}
+
+#ifdef CONFIG_KVM_E500
+	case KVM_CONFIG_TLB: {
+		struct kvm_config_tlb cfg;
+		r = -EFAULT;
+		if (copy_from_user(&cfg, argp, sizeof(cfg)))
+			goto out;
+		r = kvm_vcpu_ioctl_config_tlb(vcpu, &cfg);
+		break;
+	}
+
+	case KVM_DIRTY_TLB: {
+		struct kvm_dirty_tlb dirty;
+		r = -EFAULT;
+		if (copy_from_user(&dirty, argp, sizeof(dirty)))
+			goto out;
+		r = kvm_vcpu_ioctl_dirty_tlb(vcpu, &dirty);
+		break;
+	}
+#endif
+
 	default:
 		r = -EINVAL;
 	}
diff --git a/include/linux/kvm.h b/include/linux/kvm.h
index 9c9ca7c..d5f40fb 100644
--- a/include/linux/kvm.h
+++ b/include/linux/kvm.h
@@ -544,6 +544,7 @@ struct kvm_ppc_pvinfo {
 #define KVM_CAP_TSC_CONTROL 60
 #define KVM_CAP_GET_TSC_KHZ 61
 #define KVM_CAP_PPC_BOOKE_SREGS 62
+#define KVM_CAP_SW_TLB 63
 
 #ifdef KVM_CAP_IRQ_ROUTING
 
@@ -623,6 +624,21 @@ struct kvm_clock_data {
 	__u32 pad[9];
 };
 
+#define KVM_MMU_FSL_BOOKE_NOHV		0
+#define KVM_MMU_FSL_BOOKE_HV		1
+
+struct kvm_config_tlb {
+	__u64 params;
+	__u64 array;
+	__u32 mmu_type;
+	__u32 array_len;
+};
+
+struct kvm_dirty_tlb {
+	__u64 bitmap;
+	__u32 num_dirty;
+};
+
 /*
  * ioctls for VM fds
  */
@@ -746,6 +762,9 @@ struct kvm_clock_data {
 /* Available with KVM_CAP_XCRS */
 #define KVM_GET_XCRS		  _IOR(KVMIO,  0xa6, struct kvm_xcrs)
 #define KVM_SET_XCRS		  _IOW(KVMIO,  0xa7, struct kvm_xcrs)
+/* Available with KVM_CAP_SW_TLB */
+#define KVM_CONFIG_TLB		  _IOW(KVMIO,  0xa8, struct kvm_config_tlb)
+#define KVM_DIRTY_TLB		  _IOW(KVMIO,  0xa9, struct kvm_dirty_tlb)
 
 #define KVM_DEV_ASSIGN_ENABLE_IOMMU	(1 << 0)
 
-- 
1.7.4.1

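As a rough illustration of the page-span arithmetic that
kvm_vcpu_ioctl_config_tlb() needs before pinning the user array, here is a
minimal userspace sketch. It is not taken from the patch: the helper names
and the fixed 4 KiB page size are assumptions made for the example. It counts
pages as (last page index - first page index + 1). That agrees with the
DIV_ROUND_UP() expression in the patch except when the address of the last
byte is itself a multiple of the page size, where this form yields one page
more.

```c
#include <stdint.h>

/* Assumed page size for the illustration; the kernel uses PAGE_SIZE. */
#define SKETCH_PAGE_SIZE 4096ULL

/* Number of pages touched by the byte range [addr, addr + len), len > 0. */
static inline uint64_t pages_spanned(uint64_t addr, uint64_t len)
{
	uint64_t first = addr / SKETCH_PAGE_SIZE;
	uint64_t last = (addr + len - 1) / SKETCH_PAGE_SIZE;

	return last - first + 1;
}

/*
 * Byte offset of addr within its first page. The patch applies the same
 * mask to cfg->array to locate the array inside the vmap()ed region.
 */
static inline uint64_t offset_in_page(uint64_t addr)
{
	return addr & (SKETCH_PAGE_SIZE - 1);
}
```

For example, a 0x20-byte array starting at 0x1ff0 straddles two pages, and
its first entry sits 0xff0 bytes into the mapping.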

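The KVM_DIRTY_TLB documentation in this patch describes a byte-wise
little-endian bitmap, one bit per TLB entry, padded to a multiple of 64 bits.
A minimal sketch of how userspace might size and fill such a bitmap follows.
The function names are hypothetical, and note that the handler in this
version of the patch does not read the bitmap; it simply calls
clear_tlb_refs().

```c
#include <stddef.h>
#include <stdint.h>

/* Bytes needed for n_entries bits, rounded up to a multiple of 64 bits. */
static inline size_t dirty_bitmap_bytes(unsigned int n_entries)
{
	return ((size_t)n_entries + 63) / 64 * 8;
}

/*
 * Mark entry idx dirty. Bit 0 is the least significant bit of byte 0,
 * bit 8 the least significant bit of byte 1, and so on.
 * Returns 1 if the bit was newly set, 0 if it was already set, so the
 * caller can keep a running count for the num_dirty field.
 */
static inline int dirty_bitmap_set(uint8_t *bitmap, unsigned int idx)
{
	uint8_t mask = (uint8_t)(1u << (idx % 8));
	int newly_set = !(bitmap[idx / 8] & mask);

	bitmap[idx / 8] |= mask;
	return newly_set;
}
```

With the legacy default geometry (256 + 16 entries) the bitmap would occupy
40 bytes.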

* Re: [PATCH v5 5/5] KVM: PPC: e500: MMU API
  2011-07-07 23:41 [PATCH v5 5/5] KVM: PPC: e500: MMU API Scott Wood
@ 2011-07-08 12:57 ` Alexander Graf
  2011-07-18 10:09 ` Alexander Graf
                   ` (16 subsequent siblings)
  17 siblings, 0 replies; 19+ messages in thread
From: Alexander Graf @ 2011-07-08 12:57 UTC (permalink / raw)
  To: kvm-ppc


On 08.07.2011, at 01:41, Scott Wood wrote:

> This implements a shared-memory API for giving host userspace access to
> the guest's TLB.
> 
> Signed-off-by: Scott Wood <scottwood@freescale.com>
> ---
> v5:
> - respin on top of fixes
> - remove unused kvm_dump_tlbs() now that there's another way to get the
>   data
> - clarify in the documentation that even though hardware ignores tsize
>   on a fixed-size array, KVM wants it to be set properly in the shared
>   array.
> 
> Documentation/virtual/kvm/api.txt   |   86 ++++++++-
> arch/powerpc/include/asm/kvm.h      |   35 ++++
> arch/powerpc/include/asm/kvm_e500.h |   24 ++--
> arch/powerpc/include/asm/kvm_ppc.h  |    7 +
> arch/powerpc/kvm/e500.c             |    5 +-
> arch/powerpc/kvm/e500_emulate.c     |   12 +-
> arch/powerpc/kvm/e500_tlb.c         |  372 ++++++++++++++++++++++++----------
> arch/powerpc/kvm/e500_tlb.h         |   38 ++--
> arch/powerpc/kvm/powerpc.c          |   28 +++
> include/linux/kvm.h                 |   19 ++
> 10 files changed, 473 insertions(+), 153 deletions(-)
> 
> diff --git a/Documentation/virtual/kvm/api.txt b/Documentation/virtual/kvm/api.txt
> index b251136..31df5b0 100644
> --- a/Documentation/virtual/kvm/api.txt
> +++ b/Documentation/virtual/kvm/api.txt
> @@ -1265,7 +1265,7 @@ struct kvm_assigned_msix_entry {
> 	__u16 padding[3];
> };
> 
> -4.54 KVM_SET_TSC_KHZ
> +4.55 KVM_SET_TSC_KHZ
> 
> Capability: KVM_CAP_TSC_CONTROL
> Architectures: x86
> @@ -1276,7 +1276,7 @@ Returns: 0 on success, -1 on error
> Specifies the tsc frequency for the virtual machine. The unit of the
> frequency is KHz.
> 
> -4.55 KVM_GET_TSC_KHZ
> +4.56 KVM_GET_TSC_KHZ
> 
> Capability: KVM_CAP_GET_TSC_KHZ
> Architectures: x86
> @@ -1288,7 +1288,7 @@ Returns the tsc frequency of the guest. The unit of the return value is
> KHz. If the host has unstable tsc this ioctl returns -EIO instead as an
> error.
> 
> -4.56 KVM_GET_LAPIC
> +4.57 KVM_GET_LAPIC
> 
> Capability: KVM_CAP_IRQCHIP
> Architectures: x86
> @@ -1304,7 +1304,7 @@ struct kvm_lapic_state {
> Reads the Local APIC registers and copies them into the input argument.  The
> data format and layout are the same as documented in the architecture manual.
> 
> -4.57 KVM_SET_LAPIC
> +4.58 KVM_SET_LAPIC
> 
> Capability: KVM_CAP_IRQCHIP
> Architectures: x86
> @@ -1320,7 +1320,7 @@ struct kvm_lapic_state {
> Copies the input argument into the the Local APIC registers.  The data format
> and layout are the same as documented in the architecture manual.
> 
> -4.58 KVM_IOEVENTFD
> +4.59 KVM_IOEVENTFD
> 
> Capability: KVM_CAP_IOEVENTFD
> Architectures: all
> @@ -1350,6 +1350,82 @@ The following flags are defined:
> If datamatch flag is set, the event will be signaled only if the written value
> to the registered address is equal to datamatch in struct kvm_ioeventfd.
> 
> +4.60 KVM_CONFIG_TLB
> +
> +Capability: KVM_CAP_SW_TLB
> +Architectures: ppc
> +Type: vcpu ioctl
> +Parameters: struct kvm_config_tlb (in)
> +Returns: 0 on success, -1 on error
> +
> +struct kvm_config_tlb {
> +	__u64 params;
> +	__u64 array;
> +	__u32 mmu_type;
> +	__u32 array_len;
> +};

There's no real need to do this through its own ioctl. You could just use ENABLE_CAP and pass the config as arguments :).
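
A minimal sketch of how that could look from userspace, assuming the
convention that args[0] carries the userspace address of the config
structure. That convention, and every name ending in _sketch, is an
assumption for illustration. The structs are local mirrors so the example
stands alone without kernel headers.

```c
#include <stdint.h>
#include <string.h>

/* Local mirror of struct kvm_config_tlb as defined in this patch. */
struct config_tlb_sketch {
	uint64_t params;
	uint64_t array;
	uint32_t mmu_type;
	uint32_t array_len;
};

/* Local mirror of the layout of struct kvm_enable_cap. */
struct enable_cap_sketch {
	uint32_t cap;
	uint32_t flags;
	uint64_t args[4];
	uint8_t pad[64];
};

/* KVM_CAP_SW_TLB as numbered in this patch. */
#define SKETCH_CAP_SW_TLB 63

/* Assumed convention: args[0] holds the address of the TLB config. */
static inline struct enable_cap_sketch
make_sw_tlb_enable_cap(const struct config_tlb_sketch *cfg)
{
	struct enable_cap_sketch cap;

	memset(&cap, 0, sizeof(cap));
	cap.cap = SKETCH_CAP_SW_TLB;
	cap.args[0] = (uint64_t)(uintptr_t)cfg;
	return cap;
}
```

A real caller would fill struct kvm_enable_cap the same way and pass it to
the vcpu fd with the KVM_ENABLE_CAP ioctl.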


Alex



* Re: [PATCH v5 5/5] KVM: PPC: e500: MMU API
  2011-07-07 23:41 [PATCH v5 5/5] KVM: PPC: e500: MMU API Scott Wood
  2011-07-08 12:57 ` Alexander Graf
@ 2011-07-18 10:09 ` Alexander Graf
  2011-07-18 16:18 ` Scott Wood
                   ` (15 subsequent siblings)
  17 siblings, 0 replies; 19+ messages in thread
From: Alexander Graf @ 2011-07-18 10:09 UTC (permalink / raw)
  To: kvm-ppc


On 08.07.2011, at 01:41, Scott Wood wrote:

> This implements a shared-memory API for giving host userspace access to
> the guest's TLB.
> 
> Signed-off-by: Scott Wood <scottwood@freescale.com>
> ---
> v5:
> - respin on top of fixes
> - remove unused kvm_dump_tlbs() now that there's another way to get the
>   data
> - clarify in the documentation that even though hardware ignores tsize
>   on a fixed-size array, KVM wants it to be set properly in the shared
>   array.
> 
> Documentation/virtual/kvm/api.txt   |   86 ++++++++-
> arch/powerpc/include/asm/kvm.h      |   35 ++++
> arch/powerpc/include/asm/kvm_e500.h |   24 ++--
> arch/powerpc/include/asm/kvm_ppc.h  |    7 +
> arch/powerpc/kvm/e500.c             |    5 +-
> arch/powerpc/kvm/e500_emulate.c     |   12 +-
> arch/powerpc/kvm/e500_tlb.c         |  372 ++++++++++++++++++++++++----------
> arch/powerpc/kvm/e500_tlb.h         |   38 ++--
> arch/powerpc/kvm/powerpc.c          |   28 +++
> include/linux/kvm.h                 |   19 ++
> 10 files changed, 473 insertions(+), 153 deletions(-)
> 
> diff --git a/Documentation/virtual/kvm/api.txt b/Documentation/virtual/kvm/api.txt
> index b251136..31df5b0 100644
> --- a/Documentation/virtual/kvm/api.txt
> +++ b/Documentation/virtual/kvm/api.txt
> @@ -1265,7 +1265,7 @@ struct kvm_assigned_msix_entry {
> 	__u16 padding[3];
> };
> 
> -4.54 KVM_SET_TSC_KHZ
> +4.55 KVM_SET_TSC_KHZ
> 
> Capability: KVM_CAP_TSC_CONTROL
> Architectures: x86
> @@ -1276,7 +1276,7 @@ Returns: 0 on success, -1 on error
> Specifies the tsc frequency for the virtual machine. The unit of the
> frequency is KHz.
> 
> -4.55 KVM_GET_TSC_KHZ
> +4.56 KVM_GET_TSC_KHZ
> 
> Capability: KVM_CAP_GET_TSC_KHZ
> Architectures: x86
> @@ -1288,7 +1288,7 @@ Returns the tsc frequency of the guest. The unit of the return value is
> KHz. If the host has unstable tsc this ioctl returns -EIO instead as an
> error.
> 
> -4.56 KVM_GET_LAPIC
> +4.57 KVM_GET_LAPIC
> 
> Capability: KVM_CAP_IRQCHIP
> Architectures: x86
> @@ -1304,7 +1304,7 @@ struct kvm_lapic_state {
> Reads the Local APIC registers and copies them into the input argument.  The
> data format and layout are the same as documented in the architecture manual.
> 
> -4.57 KVM_SET_LAPIC
> +4.58 KVM_SET_LAPIC
> 
> Capability: KVM_CAP_IRQCHIP
> Architectures: x86
> @@ -1320,7 +1320,7 @@ struct kvm_lapic_state {
> Copies the input argument into the the Local APIC registers.  The data format
> and layout are the same as documented in the architecture manual.
> 
> -4.58 KVM_IOEVENTFD
> +4.59 KVM_IOEVENTFD
> 
> Capability: KVM_CAP_IOEVENTFD
> Architectures: all
> @@ -1350,6 +1350,82 @@ The following flags are defined:
> If datamatch flag is set, the event will be signaled only if the written value
> to the registered address is equal to datamatch in struct kvm_ioeventfd.
> 
> +4.60 KVM_CONFIG_TLB
> +
> +Capability: KVM_CAP_SW_TLB
> +Architectures: ppc
> +Type: vcpu ioctl
> +Parameters: struct kvm_config_tlb (in)
> +Returns: 0 on success, -1 on error
> +
> +struct kvm_config_tlb {
> +	__u64 params;
> +	__u64 array;
> +	__u32 mmu_type;
> +	__u32 array_len;
> +};
> +
> +Configures the virtual CPU's TLB array, establishing a shared memory area
> +between userspace and KVM.  The "params" and "array" fields are userspace
> +addresses of mmu-type-specific data structures.  The "array_len" field is a
> +safety mechanism, and should be set to the size in bytes of the memory that
> +userspace has reserved for the array.  It must be at least the size dictated
> +by "mmu_type" and "params".
> +
> +While KVM_RUN is active, the shared region is under control of KVM.  Its
> +contents are undefined, and any modification by userspace results in
> +boundedly undefined behavior.
> +
> +On return from KVM_RUN, the shared region will reflect the current state of
> +the guest's TLB.  If userspace makes any changes, it must call KVM_DIRTY_TLB
> +to tell KVM which entries have been changed, prior to calling KVM_RUN again
> +on this vcpu.
> +
> +For mmu types KVM_MMU_FSL_BOOKE_NOHV and KVM_MMU_FSL_BOOKE_HV:
> + - The "params" field is of type "struct kvm_book3e_206_tlb_params".
> + - The "array" field points to an array of type "struct
> +   kvm_book3e_206_tlb_entry".
> + - The array consists of all entries in the first TLB, followed by all
> +   entries in the second TLB.
> + - Within a TLB, entries are ordered first by increasing set number.  Within a
> +   set, entries are ordered by way (increasing ESEL).
> + - The hash for determining set number in TLB0 is: (MAS2 >> 12) & (num_sets - 1)
> +   where "num_sets" is the tlb_sizes[] value divided by the tlb_ways[] value.
> + - The tsize field of mas1 shall be set to 4K on TLB0, even though the
> +   hardware ignores this value for TLB0.
> +
> +4.61 KVM_DIRTY_TLB
> +
> +Capability: KVM_CAP_SW_TLB
> +Architectures: ppc
> +Type: vcpu ioctl
> +Parameters: struct kvm_dirty_tlb (in)
> +Returns: 0 on success, -1 on error
> +
> +struct kvm_dirty_tlb {
> +	__u64 bitmap;
> +	__u32 num_dirty;
> +};
> +
> +This must be called whenever userspace has changed an entry in the shared
> +TLB, prior to calling KVM_RUN on the associated vcpu.
> +
> +The "bitmap" field is the userspace address of an array.  This array
> +consists of a number of bits, equal to the total number of TLB entries as
> +determined by the last successful call to KVM_CONFIG_TLB, rounded up to the
> +nearest multiple of 64.
> +
> +Each bit corresponds to one TLB entry, ordered the same as in the shared TLB
> +array.
> +
> +The array is little-endian: bit 0 is the least significant bit of the
> +first byte, bit 8 is the least significant bit of the second byte, etc.
> +This avoids any complications with differing word sizes.
> +
> +The "num_dirty" field is a performance hint for KVM to determine whether it
> +should skip processing the bitmap and just invalidate everything.  It must
> +be set to the number of set bits in the bitmap.
> +
> 5. The kvm_run structure
> 
> Application code obtains a pointer to the kvm_run structure by
> diff --git a/arch/powerpc/include/asm/kvm.h b/arch/powerpc/include/asm/kvm.h
> index d2ca5ed..1a6dedf 100644
> --- a/arch/powerpc/include/asm/kvm.h
> +++ b/arch/powerpc/include/asm/kvm.h
> @@ -272,4 +272,39 @@ struct kvm_guest_debug_arch {
> #define KVM_INTERRUPT_UNSET	-2U
> #define KVM_INTERRUPT_SET_LEVEL	-3U
> 
> +struct kvm_book3e_206_tlb_entry {
> +	__u32 mas8;
> +	__u32 mas1;
> +	__u64 mas2;
> +	__u64 mas7_3;
> +};
> +
> +struct kvm_book3e_206_tlb_params {
> +	/*
> +	 * For mmu types KVM_MMU_FSL_BOOKE_NOHV and KVM_MMU_FSL_BOOKE_HV:
> +	 *
> +	 * - The number of ways of TLB0 must be a power of two between 2 and
> +	 *   16.
> +	 * - TLB1 must be fully associative.
> +	 * - The size of TLB0 must be a multiple of the number of ways, and
> +	 *   the number of sets must be a power of two.
> +	 * - The size of TLB1 may not exceed 64 entries.
> +	 * - TLB0 supports 4 KiB pages.
> +	 * - The page sizes supported by TLB1 are as indicated by
> +	 *   TLB1CFG (if MMUCFG[MAVN] = 0) or TLB1PS (if MMUCFG[MAVN] = 1)
> +	 *   as returned by KVM_GET_SREGS.
> +	 * - TLB2 and TLB3 are reserved, and their entries in tlb_sizes[]
> +	 *   and tlb_ways[] must be zero.
> +	 *
> +	 * tlb_ways[n] = tlb_sizes[n] means the array is fully associative.
> +	 *
> +	 * KVM will adjust TLBnCFG based on the sizes configured here,
> +	 * though arrays greater than 2048 entries will have TLBnCFG[NENTRY]
> +	 * set to zero.
> +	 */
> +	__u32 tlb_sizes[4];
> +	__u32 tlb_ways[4];
> +	__u32 reserved[8];
> +};
> +
> #endif /* __LINUX_KVM_POWERPC_H */
> diff --git a/arch/powerpc/include/asm/kvm_e500.h b/arch/powerpc/include/asm/kvm_e500.h
> index 87aab98..e1ac268 100644
> --- a/arch/powerpc/include/asm/kvm_e500.h
> +++ b/arch/powerpc/include/asm/kvm_e500.h
> @@ -22,13 +22,6 @@
> #define E500_PID_NUM   3
> #define E500_TLB_NUM   2
> 
> -struct tlbe{
> -	u32 mas1;
> -	u32 mas2;
> -	u32 mas3;
> -	u32 mas7;
> -};
> -
> #define E500_TLB_VALID 1
> #define E500_TLB_DIRTY 2
> 
> @@ -44,8 +37,11 @@ struct tlbe_priv {
> struct vcpu_id_table;
> 
> struct kvmppc_vcpu_e500 {
> -	/* Unmodified copy of the guest's TLB. */
> -	struct tlbe *gtlb_arch[E500_TLB_NUM];
> +	/* Unmodified copy of the guest's TLB -- shared with host userspace. */
> +	struct kvm_book3e_206_tlb_entry *gtlb_arch;
> +
> +	/* Starting entry number in gtlb_arch[] */
> +	int gtlb_offset[E500_TLB_NUM];
> 
> 	/* KVM internal information associated with each guest TLB entry */
> 	struct tlbe_priv *gtlb_priv[E500_TLB_NUM];
> @@ -53,6 +49,9 @@ struct kvmppc_vcpu_e500 {
> 	unsigned int gtlb_size[E500_TLB_NUM];
> 	unsigned int gtlb_nv[E500_TLB_NUM];
> 
> +	unsigned int gtlb0_ways;
> +	unsigned int gtlb0_sets;
> +
> 	/*
> 	 * information associated with each host TLB entry --
> 	 * TLB1 only for now.  If/when guest TLB1 entries can be
> @@ -64,7 +63,6 @@ struct kvmppc_vcpu_e500 {
> 	 * and back, and our host TLB entries got evicted).
> 	 */
> 	struct tlbe_ref *tlb_refs[E500_TLB_NUM];
> -
> 	unsigned int host_tlb1_nv;
> 
> 	u32 host_pid[E500_PID_NUM];
> @@ -74,11 +72,10 @@ struct kvmppc_vcpu_e500 {
> 	u32 mas0;
> 	u32 mas1;
> 	u32 mas2;
> -	u32 mas3;
> +	u64 mas7_3;
> 	u32 mas4;
> 	u32 mas5;
> 	u32 mas6;
> -	u32 mas7;
> 
> 	/* vcpu id table */
> 	struct vcpu_id_table *idt;
> @@ -91,6 +88,9 @@ struct kvmppc_vcpu_e500 {
> 	u32 tlb1cfg;
> 	u64 mcar;
> 
> +	struct page **shared_tlb_pages;
> +	int num_shared_tlb_pages;
> +
> 	struct kvm_vcpu vcpu;
> };
> 
> diff --git a/arch/powerpc/include/asm/kvm_ppc.h b/arch/powerpc/include/asm/kvm_ppc.h
> index c662f14..bb3d418 100644
> --- a/arch/powerpc/include/asm/kvm_ppc.h
> +++ b/arch/powerpc/include/asm/kvm_ppc.h
> @@ -152,4 +152,11 @@ int kvmppc_set_sregs_ivor(struct kvm_vcpu *vcpu, struct kvm_sregs *sregs);
> 
> void kvmppc_set_pid(struct kvm_vcpu *vcpu, u32 pid);
> 
> +int kvm_vcpu_ioctl_config_tlb(struct kvm_vcpu *vcpu,
> +			      struct kvm_config_tlb *cfg);
> +int kvm_vcpu_ioctl_dirty_tlb(struct kvm_vcpu *vcpu,
> +			     struct kvm_dirty_tlb *cfg);
> +
> +void kvmppc_core_heavy_exit(struct kvm_vcpu *vcpu);
> +
> #endif /* __POWERPC_KVM_PPC_H__ */
> diff --git a/arch/powerpc/kvm/e500.c b/arch/powerpc/kvm/e500.c
> index 797a744..b8f065c 100644
> --- a/arch/powerpc/kvm/e500.c
> +++ b/arch/powerpc/kvm/e500.c
> @@ -118,7 +118,7 @@ void kvmppc_core_get_sregs(struct kvm_vcpu *vcpu, struct kvm_sregs *sregs)
> 	sregs->u.e.mas0 = vcpu_e500->mas0;
> 	sregs->u.e.mas1 = vcpu_e500->mas1;
> 	sregs->u.e.mas2 = vcpu_e500->mas2;
> -	sregs->u.e.mas7_3 = ((u64)vcpu_e500->mas7 << 32) | vcpu_e500->mas3;
> +	sregs->u.e.mas7_3 = vcpu_e500->mas7_3;
> 	sregs->u.e.mas4 = vcpu_e500->mas4;
> 	sregs->u.e.mas6 = vcpu_e500->mas6;
> 
> @@ -151,8 +151,7 @@ int kvmppc_core_set_sregs(struct kvm_vcpu *vcpu, struct kvm_sregs *sregs)
> 		vcpu_e500->mas0 = sregs->u.e.mas0;
> 		vcpu_e500->mas1 = sregs->u.e.mas1;
> 		vcpu_e500->mas2 = sregs->u.e.mas2;
> -		vcpu_e500->mas7 = sregs->u.e.mas7_3 >> 32;
> -		vcpu_e500->mas3 = (u32)sregs->u.e.mas7_3;
> +		vcpu_e500->mas7_3 = sregs->u.e.mas7_3;
> 		vcpu_e500->mas4 = sregs->u.e.mas4;
> 		vcpu_e500->mas6 = sregs->u.e.mas6;
> 	}
> diff --git a/arch/powerpc/kvm/e500_emulate.c b/arch/powerpc/kvm/e500_emulate.c
> index d48ae39..e0d3609 100644
> --- a/arch/powerpc/kvm/e500_emulate.c
> +++ b/arch/powerpc/kvm/e500_emulate.c
> @@ -95,13 +95,17 @@ int kvmppc_core_emulate_mtspr(struct kvm_vcpu *vcpu, int sprn, int rs)
> 	case SPRN_MAS2:
> 		vcpu_e500->mas2 = spr_val; break;
> 	case SPRN_MAS3:
> -		vcpu_e500->mas3 = spr_val; break;
> +		vcpu_e500->mas7_3 &= ~(u64)0xffffffff;
> +		vcpu_e500->mas7_3 |= spr_val;
> +		break;
> 	case SPRN_MAS4:
> 		vcpu_e500->mas4 = spr_val; break;
> 	case SPRN_MAS6:
> 		vcpu_e500->mas6 = spr_val; break;
> 	case SPRN_MAS7:
> -		vcpu_e500->mas7 = spr_val; break;
> +		vcpu_e500->mas7_3 &= (u64)0xffffffff;
> +		vcpu_e500->mas7_3 |= (u64)spr_val << 32;
> +		break;
> 	case SPRN_L1CSR0:
> 		vcpu_e500->l1csr0 = spr_val;
> 		vcpu_e500->l1csr0 &= ~(L1CSR0_DCFI | L1CSR0_CLFC);
> @@ -158,13 +162,13 @@ int kvmppc_core_emulate_mfspr(struct kvm_vcpu *vcpu, int sprn, int rt)
> 	case SPRN_MAS2:
> 		kvmppc_set_gpr(vcpu, rt, vcpu_e500->mas2); break;
> 	case SPRN_MAS3:
> -		kvmppc_set_gpr(vcpu, rt, vcpu_e500->mas3); break;
> +		kvmppc_set_gpr(vcpu, rt, (u32)vcpu_e500->mas7_3); break;
> 	case SPRN_MAS4:
> 		kvmppc_set_gpr(vcpu, rt, vcpu_e500->mas4); break;
> 	case SPRN_MAS6:
> 		kvmppc_set_gpr(vcpu, rt, vcpu_e500->mas6); break;
> 	case SPRN_MAS7:
> -		kvmppc_set_gpr(vcpu, rt, vcpu_e500->mas7); break;
> +		kvmppc_set_gpr(vcpu, rt, vcpu_e500->mas7_3 >> 32); break;
> 
> 	case SPRN_TLB0CFG:
> 		kvmppc_set_gpr(vcpu, rt, vcpu_e500->tlb0cfg); break;
> diff --git a/arch/powerpc/kvm/e500_tlb.c b/arch/powerpc/kvm/e500_tlb.c
> index 526170f..512a65e 100644
> --- a/arch/powerpc/kvm/e500_tlb.c
> +++ b/arch/powerpc/kvm/e500_tlb.c
> @@ -19,6 +19,11 @@
> #include <linux/kvm.h>
> #include <linux/kvm_host.h>
> #include <linux/highmem.h>
> +#include <linux/log2.h>
> +#include <linux/uaccess.h>
> +#include <linux/sched.h>
> +#include <linux/rwsem.h>
> +#include <linux/vmalloc.h>
> #include <asm/kvm_ppc.h>
> #include <asm/kvm_e500.h>
> 
> @@ -68,6 +73,13 @@ static unsigned int tlb_host_entries[2];
> static unsigned int tlb_host_ways[2];
> static unsigned int tlb_host_sets[2];
> 
> +static struct kvm_book3e_206_tlb_entry *get_entry(
> +	struct kvmppc_vcpu_e500 *vcpu_e500, int tlbsel, int entry)
> +{
> +	int offset = vcpu_e500->gtlb_offset[tlbsel];
> +	return &vcpu_e500->gtlb_arch[offset + entry];
> +}
> +
> /*
>  * Allocate a free shadow id and setup a valid sid mapping in given entry.
>  * A mapping is only valid when vcpu_id_table and pcpu_id_table are match.
> @@ -219,34 +231,13 @@ void kvmppc_e500_recalc_shadow_pid(struct kvmppc_vcpu_e500 *vcpu_e500)
> 	preempt_enable();
> }
> 
> -void kvmppc_dump_tlbs(struct kvm_vcpu *vcpu)
> -{
> -	struct kvmppc_vcpu_e500 *vcpu_e500 = to_e500(vcpu);
> -	struct tlbe *tlbe;
> -	int i, tlbsel;
> -
> -	printk("| %8s | %8s | %8s | %8s | %8s |\n",
> -			"nr", "mas1", "mas2", "mas3", "mas7");
> -
> -	for (tlbsel = 0; tlbsel < 2; tlbsel++) {
> -		printk("Guest TLB%d:\n", tlbsel);
> -		for (i = 0; i < vcpu_e500->gtlb_size[tlbsel]; i++) {
> -			tlbe = &vcpu_e500->gtlb_arch[tlbsel][i];
> -			if (tlbe->mas1 & MAS1_VALID)
> -				printk(" G[%d][%3d] |  %08X | %08X | %08X | %08X |\n",
> -					tlbsel, i, tlbe->mas1, tlbe->mas2,
> -					tlbe->mas3, tlbe->mas7);
> -		}
> -	}
> -}
> -
> static inline unsigned int gtlb0_get_next_victim(
> 		struct kvmppc_vcpu_e500 *vcpu_e500)
> {
> 	unsigned int victim;
> 
> 	victim = vcpu_e500->gtlb_nv[0]++;
> -	if (unlikely(vcpu_e500->gtlb_nv[0] >= KVM_E500_TLB0_WAY_NUM))
> +	if (unlikely(vcpu_e500->gtlb_nv[0] >= vcpu_e500->gtlb0_ways))
> 		vcpu_e500->gtlb_nv[0] = 0;
> 
> 	return victim;
> @@ -258,9 +249,9 @@ static inline unsigned int tlb1_max_shadow_size(void)
> 	return tlb_host_entries[1] - tlbcam_index - 1;
> }
> 
> -static inline int tlbe_is_writable(struct tlbe *tlbe)
> +static inline int tlbe_is_writable(struct kvm_book3e_206_tlb_entry *tlbe)
> {
> -	return tlbe->mas3 & (MAS3_SW|MAS3_UW);
> +	return tlbe->mas7_3 & (MAS3_SW|MAS3_UW);
> }
> 
> static inline u32 e500_shadow_mas3_attrib(u32 mas3, int usermode)
> @@ -291,39 +282,41 @@ static inline u32 e500_shadow_mas2_attrib(u32 mas2, int usermode)
> /*
>  * writing shadow tlb entry to host TLB
>  */
> -static inline void __write_host_tlbe(struct tlbe *stlbe, uint32_t mas0)
> +static inline void __write_host_tlbe(struct kvm_book3e_206_tlb_entry *stlbe,
> +				     uint32_t mas0)
> {
> 	unsigned long flags;
> 
> 	local_irq_save(flags);
> 	mtspr(SPRN_MAS0, mas0);
> 	mtspr(SPRN_MAS1, stlbe->mas1);
> -	mtspr(SPRN_MAS2, stlbe->mas2);
> -	mtspr(SPRN_MAS3, stlbe->mas3);
> -	mtspr(SPRN_MAS7, stlbe->mas7);
> +	mtspr(SPRN_MAS2, (unsigned long)stlbe->mas2);
> +	mtspr(SPRN_MAS3, (u32)stlbe->mas7_3);
> +	mtspr(SPRN_MAS7, (u32)(stlbe->mas7_3 >> 32));
> 	asm volatile("isync; tlbwe" : : : "memory");
> 	local_irq_restore(flags);
> }
> 
> /* esel is index into set, not whole array */
> static inline void write_host_tlbe(struct kvmppc_vcpu_e500 *vcpu_e500,
> -		int tlbsel, int esel, struct tlbe *stlbe)
> +		int tlbsel, int esel, struct kvm_book3e_206_tlb_entry *stlbe)
> {
> 	if (tlbsel == 0) {
> -		__write_host_tlbe(stlbe, MAS0_TLBSEL(0) | MAS0_ESEL(esel));
> +		int way = esel & (vcpu_e500->gtlb0_ways - 1);
> +		__write_host_tlbe(stlbe, MAS0_TLBSEL(0) | MAS0_ESEL(way));

Didn't you just change this to not mask in 4/5? Why go back to masking?

> 	} else {
> 		__write_host_tlbe(stlbe,
> 				  MAS0_TLBSEL(1) |
> 				  MAS0_ESEL(to_htlb1_esel(esel)));
> 	}
> 	trace_kvm_stlb_write(index_of(tlbsel, esel), stlbe->mas1, stlbe->mas2,
> -			     stlbe->mas3, stlbe->mas7);
> +			     (u32)stlbe->mas7_3, (u32)(stlbe->mas7_3 >> 32));

Better change the trace definition :).
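
For illustration only, the MAS3/MAS7 split that the trace call above performs
on mas7_3 (and the read-modify-write done in the mtspr emulation) can be
sketched with helpers like these. The helper names are hypothetical and do
not appear in the patch.

```c
#include <stdint.h>

/* MAS3 lives in the low 32 bits of the combined value. */
static inline uint32_t mas7_3_get_mas3(uint64_t mas7_3)
{
	return (uint32_t)mas7_3;
}

/* MAS7 lives in the high 32 bits of the combined value. */
static inline uint32_t mas7_3_get_mas7(uint64_t mas7_3)
{
	return (uint32_t)(mas7_3 >> 32);
}

/* Replace the MAS3 half, leaving MAS7 untouched. */
static inline uint64_t mas7_3_set_mas3(uint64_t mas7_3, uint32_t mas3)
{
	return (mas7_3 & ~(uint64_t)0xffffffff) | mas3;
}

/* Replace the MAS7 half, leaving MAS3 untouched. */
static inline uint64_t mas7_3_set_mas7(uint64_t mas7_3, uint32_t mas7)
{
	return (mas7_3 & (uint64_t)0xffffffff) | ((uint64_t)mas7 << 32);
}
```

A trace definition that takes the u64 directly would make the two casts at
the call site unnecessary.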

> }
> 
> void kvmppc_map_magic(struct kvm_vcpu *vcpu)
> {
> 	struct kvmppc_vcpu_e500 *vcpu_e500 = to_e500(vcpu);
> -	struct tlbe magic;
> +	struct kvm_book3e_206_tlb_entry magic;
> 	ulong shared_page = ((ulong)vcpu->arch.shared) & PAGE_MASK;
> 	unsigned int stid;
> 	pfn_t pfn;
> @@ -337,9 +330,8 @@ void kvmppc_map_magic(struct kvm_vcpu *vcpu)
> 	magic.mas1 = MAS1_VALID | MAS1_TS | MAS1_TID(stid) |
> 		     MAS1_TSIZE(BOOK3E_PAGESZ_4K);
> 	magic.mas2 = vcpu->arch.magic_page_ea | MAS2_M;
> -	magic.mas3 = (pfn << PAGE_SHIFT) |
> -		     MAS3_SW | MAS3_SR | MAS3_UW | MAS3_UR;
> -	magic.mas7 = pfn >> (32 - PAGE_SHIFT);
> +	magic.mas7_3 = ((u64)pfn << PAGE_SHIFT) |
> +		       MAS3_SW | MAS3_SR | MAS3_UW | MAS3_UR;
> 
> 	__write_host_tlbe(&magic, MAS0_TLBSEL(1) | MAS0_ESEL(tlbcam_index));
> 	preempt_enable();
> @@ -360,7 +352,8 @@ void kvmppc_e500_tlb_put(struct kvm_vcpu *vcpu)
> static void inval_gtlbe_on_host(struct kvmppc_vcpu_e500 *vcpu_e500,
> 				int tlbsel, int esel)
> {
> -	struct tlbe *gtlbe = &vcpu_e500->gtlb_arch[tlbsel][esel];
> +	struct kvm_book3e_206_tlb_entry *gtlbe =
> +		get_entry(vcpu_e500, tlbsel, esel);
> 	struct vcpu_id_table *idt = vcpu_e500->idt;
> 	unsigned int pr, tid, ts, pid;
> 	u32 val, eaddr;
> @@ -426,9 +419,8 @@ static int tlb0_set_base(gva_t addr, int sets, int ways)
> 
> static int gtlb0_set_base(struct kvmppc_vcpu_e500 *vcpu_e500, gva_t addr)
> {
> -	int sets = KVM_E500_TLB0_SIZE / KVM_E500_TLB0_WAY_NUM;
> -
> -	return tlb0_set_base(addr, sets, KVM_E500_TLB0_WAY_NUM);
> +	return tlb0_set_base(addr, vcpu_e500->gtlb0_sets,
> +			     vcpu_e500->gtlb0_ways);

Yeah, the more I see code like this the more I believe that it'd be a good idea to have some sort of tlb description struct that we keep around for host and guest.
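
A minimal sketch of what such a description struct might look like, assuming
the set hash documented in this patch ((MAS2 >> 12) & (num_sets - 1)) and the
set-major, way-minor entry ordering. The struct and function names are
hypothetical and nothing like this exists in the patch.

```c
#include <stdint.h>

/* Hypothetical description of one TLB array's geometry. */
struct tlb_geometry_sketch {
	unsigned int entries;	/* total number of entries */
	unsigned int ways;	/* power of two */
	unsigned int sets;	/* entries / ways, power of two */
};

/* Index of the first entry of the set that addr hashes to. */
static inline unsigned int
tlb_set_base(const struct tlb_geometry_sketch *g, uint64_t addr)
{
	unsigned int set = (unsigned int)(addr >> 12) & (g->sets - 1);

	return set * g->ways;
}

/* Index of a specific way within the set that addr hashes to. */
static inline unsigned int
tlb_entry_index(const struct tlb_geometry_sketch *g, uint64_t addr,
		unsigned int esel)
{
	return tlb_set_base(g, addr) + (esel & (g->ways - 1));
}
```

The same struct could describe the host arrays as well, so the guest and
host set-base helpers would collapse into one.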

> }
> 
> static int htlb0_set_base(gva_t addr)
> @@ -441,7 +433,7 @@ static unsigned int get_tlb_esel(struct kvmppc_vcpu_e500 *vcpu_e500, int tlbsel)
> 	unsigned int esel = get_tlb_esel_bit(vcpu_e500);
> 
> 	if (tlbsel == 0) {
> -		esel &= KVM_E500_TLB0_WAY_NUM_MASK;
> +		esel &= vcpu_e500->gtlb0_ways - 1;
> 		esel += gtlb0_set_base(vcpu_e500, vcpu_e500->mas2);
> 	} else {
> 		esel &= vcpu_e500->gtlb_size[tlbsel] - 1;
> @@ -455,18 +447,21 @@ static int kvmppc_e500_tlb_index(struct kvmppc_vcpu_e500 *vcpu_e500,
> 		gva_t eaddr, int tlbsel, unsigned int pid, int as)
> {
> 	int size = vcpu_e500->gtlb_size[tlbsel];
> -	unsigned int set_base;
> +	unsigned int set_base, offset;
> 	int i;
> 
> 	if (tlbsel == 0) {
> 		set_base = gtlb0_set_base(vcpu_e500, eaddr);
> -		size = KVM_E500_TLB0_WAY_NUM;
> +		size = vcpu_e500->gtlb0_ways;
> 	} else {
> 		set_base = 0;
> 	}
> 
> +	offset = vcpu_e500->gtlb_offset[tlbsel];
> +
> 	for (i = 0; i < size; i++) {
> -		struct tlbe *tlbe = &vcpu_e500->gtlb_arch[tlbsel][set_base + i];
> +		struct kvm_book3e_206_tlb_entry *tlbe =
> +			&vcpu_e500->gtlb_arch[offset + set_base + i];
> 		unsigned int tid;
> 
> 		if (eaddr < get_tlb_eaddr(tlbe))
> @@ -492,7 +487,7 @@ static int kvmppc_e500_tlb_index(struct kvmppc_vcpu_e500 *vcpu_e500,
> }
> 
> static inline void kvmppc_e500_ref_setup(struct tlbe_ref *ref,
> -					 struct tlbe *gtlbe,
> +					 struct kvm_book3e_206_tlb_entry *gtlbe,
> 					 pfn_t pfn)
> {
> 	ref->pfn = pfn;
> @@ -531,6 +526,8 @@ static void clear_tlb_refs(struct kvmppc_vcpu_e500 *vcpu_e500)
> 	int stlbsel = 1;
> 	int i;
> 
> +	kvmppc_e500_id_table_reset_all(vcpu_e500);
> +
> 	for (i = 0; i < tlb_host_entries[stlbsel]; i++) {
> 		struct tlbe_ref *ref =
> 			&vcpu_e500->tlb_refs[stlbsel][i];
> @@ -560,18 +557,18 @@ static inline void kvmppc_e500_deliver_tlb_miss(struct kvm_vcpu *vcpu,
> 		| MAS1_TSIZE(tsized);
> 	vcpu_e500->mas2 = (eaddr & MAS2_EPN)
> 		| (vcpu_e500->mas4 & MAS2_ATTRIB_MASK);
> -	vcpu_e500->mas3 &= MAS3_U0 | MAS3_U1 | MAS3_U2 | MAS3_U3;
> +	vcpu_e500->mas7_3 &= MAS3_U0 | MAS3_U1 | MAS3_U2 | MAS3_U3;
> 	vcpu_e500->mas6 = (vcpu_e500->mas6 & MAS6_SPID1)
> 		| (get_cur_pid(vcpu) << 16)
> 		| (as ? MAS6_SAS : 0);
> -	vcpu_e500->mas7 = 0;
> }
> 
> /* TID must be supplied by the caller */
> -static inline void kvmppc_e500_setup_stlbe(struct kvmppc_vcpu_e500 *vcpu_e500,
> -					   struct tlbe *gtlbe, int tsize,
> -					   struct tlbe_ref *ref,
> -					   u64 gvaddr, struct tlbe *stlbe)
> +static inline void kvmppc_e500_setup_stlbe(
> +	struct kvmppc_vcpu_e500 *vcpu_e500,
> +	struct kvm_book3e_206_tlb_entry *gtlbe,
> +	int tsize, struct tlbe_ref *ref, u64 gvaddr,
> +	struct kvm_book3e_206_tlb_entry *stlbe)
> {
> 	pfn_t pfn = ref->pfn;
> 
> @@ -582,16 +579,16 @@ static inline void kvmppc_e500_setup_stlbe(struct kvmppc_vcpu_e500 *vcpu_e500,
> 	stlbe->mas2 = (gvaddr & MAS2_EPN)
> 		| e500_shadow_mas2_attrib(gtlbe->mas2,
> 				vcpu_e500->vcpu.arch.shared->msr & MSR_PR);
> -	stlbe->mas3 = ((pfn << PAGE_SHIFT) & MAS3_RPN)
> -		| e500_shadow_mas3_attrib(gtlbe->mas3,
> +	stlbe->mas7_3 = ((u64)pfn << PAGE_SHIFT)
> +		| e500_shadow_mas3_attrib(gtlbe->mas7_3,
> 				vcpu_e500->vcpu.arch.shared->msr & MSR_PR);
> -	stlbe->mas7 = (pfn >> (32 - PAGE_SHIFT)) & MAS7_RPN;
> }
> 
> /* sesel is an index into the entire array, not just the set */
> static inline void kvmppc_e500_shadow_map(struct kvmppc_vcpu_e500 *vcpu_e500,
> -	u64 gvaddr, gfn_t gfn, struct tlbe *gtlbe, int tlbsel, int sesel,
> -	struct tlbe *stlbe, struct tlbe_ref *ref)
> +	u64 gvaddr, gfn_t gfn, struct kvm_book3e_206_tlb_entry *gtlbe,
> +	int tlbsel, int sesel, struct kvm_book3e_206_tlb_entry *stlbe,
> +	struct tlbe_ref *ref)
> {
> 	struct kvm_memory_slot *slot;
> 	unsigned long pfn, hva;
> @@ -701,15 +698,16 @@ static inline void kvmppc_e500_shadow_map(struct kvmppc_vcpu_e500 *vcpu_e500,
> 
> /* XXX only map the one-one case, for now use TLB0 */
> static int kvmppc_e500_tlb0_map(struct kvmppc_vcpu_e500 *vcpu_e500,
> -				int esel, struct tlbe *stlbe)
> +				int esel,
> +				struct kvm_book3e_206_tlb_entry *stlbe)
> {
> -	struct tlbe *gtlbe;
> +	struct kvm_book3e_206_tlb_entry *gtlbe;
> 	struct tlbe_ref *ref;
> 	int sesel = esel & (tlb_host_ways[0] - 1);
> 	int sesel_base;
> 	gva_t ea;
> 
> -	gtlbe = &vcpu_e500->gtlb_arch[0][esel];
> +	gtlbe = get_entry(vcpu_e500, 0, esel);
> 	ref = &vcpu_e500->gtlb_priv[0][esel].ref;
> 
> 	ea = get_tlb_eaddr(gtlbe);
> @@ -726,7 +724,8 @@ static int kvmppc_e500_tlb0_map(struct kvmppc_vcpu_e500 *vcpu_e500,
>  * the shadow TLB. */
> /* XXX for both one-one and one-to-many , for now use TLB1 */
> static int kvmppc_e500_tlb1_map(struct kvmppc_vcpu_e500 *vcpu_e500,
> -		u64 gvaddr, gfn_t gfn, struct tlbe *gtlbe, struct tlbe *stlbe)
> +		u64 gvaddr, gfn_t gfn, struct kvm_book3e_206_tlb_entry *gtlbe,
> +		struct kvm_book3e_206_tlb_entry *stlbe)
> {
> 	struct tlbe_ref *ref;
> 	unsigned int victim;
> @@ -755,7 +754,8 @@ static inline int kvmppc_e500_gtlbe_invalidate(
> 				struct kvmppc_vcpu_e500 *vcpu_e500,
> 				int tlbsel, int esel)
> {
> -	struct tlbe *gtlbe = &vcpu_e500->gtlb_arch[tlbsel][esel];
> +	struct kvm_book3e_206_tlb_entry *gtlbe =
> +		get_entry(vcpu_e500, tlbsel, esel);
> 
> 	if (unlikely(get_tlb_iprot(gtlbe)))
> 		return -1;
> @@ -818,18 +818,17 @@ int kvmppc_e500_emul_tlbre(struct kvm_vcpu *vcpu)
> {
> 	struct kvmppc_vcpu_e500 *vcpu_e500 = to_e500(vcpu);
> 	int tlbsel, esel;
> -	struct tlbe *gtlbe;
> +	struct kvm_book3e_206_tlb_entry *gtlbe;
> 
> 	tlbsel = get_tlb_tlbsel(vcpu_e500);
> 	esel = get_tlb_esel(vcpu_e500, tlbsel);
> 
> -	gtlbe = &vcpu_e500->gtlb_arch[tlbsel][esel];
> +	gtlbe = get_entry(vcpu_e500, tlbsel, esel);
> 	vcpu_e500->mas0 &= ~MAS0_NV(~0);
> 	vcpu_e500->mas0 |= MAS0_NV(vcpu_e500->gtlb_nv[tlbsel]);
> 	vcpu_e500->mas1 = gtlbe->mas1;
> 	vcpu_e500->mas2 = gtlbe->mas2;
> -	vcpu_e500->mas3 = gtlbe->mas3;
> -	vcpu_e500->mas7 = gtlbe->mas7;
> +	vcpu_e500->mas7_3 = gtlbe->mas7_3;
> 
> 	return EMULATE_DONE;
> }
> @@ -840,7 +839,7 @@ int kvmppc_e500_emul_tlbsx(struct kvm_vcpu *vcpu, int rb)
> 	int as = !!get_cur_sas(vcpu_e500);
> 	unsigned int pid = get_cur_spid(vcpu_e500);
> 	int esel, tlbsel;
> -	struct tlbe *gtlbe = NULL;
> +	struct kvm_book3e_206_tlb_entry *gtlbe = NULL;
> 	gva_t ea;
> 
> 	ea = kvmppc_get_gpr(vcpu, rb);
> @@ -848,21 +847,20 @@ int kvmppc_e500_emul_tlbsx(struct kvm_vcpu *vcpu, int rb)
> 	for (tlbsel = 0; tlbsel < 2; tlbsel++) {
> 		esel = kvmppc_e500_tlb_index(vcpu_e500, ea, tlbsel, pid, as);
> 		if (esel >= 0) {
> -			gtlbe = &vcpu_e500->gtlb_arch[tlbsel][esel];
> +			gtlbe = get_entry(vcpu_e500, tlbsel, esel);
> 			break;
> 		}
> 	}
> 
> 	if (gtlbe) {
> 		if (tlbsel == 0)
> -			esel &= KVM_E500_TLB0_WAY_NUM - 1;
> +			esel &= vcpu_e500->gtlb0_ways - 1;
> 
> 		vcpu_e500->mas0 = MAS0_TLBSEL(tlbsel) | MAS0_ESEL(esel)
> 			| MAS0_NV(vcpu_e500->gtlb_nv[tlbsel]);
> 		vcpu_e500->mas1 = gtlbe->mas1;
> 		vcpu_e500->mas2 = gtlbe->mas2;
> -		vcpu_e500->mas3 = gtlbe->mas3;
> -		vcpu_e500->mas7 = gtlbe->mas7;
> +		vcpu_e500->mas7_3 = gtlbe->mas7_3;
> 	} else {
> 		int victim;
> 
> @@ -877,8 +875,7 @@ int kvmppc_e500_emul_tlbsx(struct kvm_vcpu *vcpu, int rb)
> 			| (vcpu_e500->mas4 & MAS4_TSIZED(~0));
> 		vcpu_e500->mas2 &= MAS2_EPN;
> 		vcpu_e500->mas2 |= vcpu_e500->mas4 & MAS2_ATTRIB_MASK;
> -		vcpu_e500->mas3 &= MAS3_U0 | MAS3_U1 | MAS3_U2 | MAS3_U3;
> -		vcpu_e500->mas7 = 0;
> +		vcpu_e500->mas7_3 &= MAS3_U0 | MAS3_U1 | MAS3_U2 | MAS3_U3;
> 	}
> 
> 	kvmppc_set_exit_type(vcpu, EMULATED_TLBSX_EXITS);
> @@ -887,8 +884,8 @@ int kvmppc_e500_emul_tlbsx(struct kvm_vcpu *vcpu, int rb)
> 
> /* sesel is index into the set, not the whole array */
> static void write_stlbe(struct kvmppc_vcpu_e500 *vcpu_e500,
> -			struct tlbe *gtlbe,
> -			struct tlbe *stlbe,
> +			struct kvm_book3e_206_tlb_entry *gtlbe,
> +			struct kvm_book3e_206_tlb_entry *stlbe,
> 			int stlbsel, int sesel)
> {
> 	int stid;
> @@ -906,28 +903,27 @@ static void write_stlbe(struct kvmppc_vcpu_e500 *vcpu_e500,
> int kvmppc_e500_emul_tlbwe(struct kvm_vcpu *vcpu)
> {
> 	struct kvmppc_vcpu_e500 *vcpu_e500 = to_e500(vcpu);
> -	struct tlbe *gtlbe;
> +	struct kvm_book3e_206_tlb_entry *gtlbe;
> 	int tlbsel, esel;
> 
> 	tlbsel = get_tlb_tlbsel(vcpu_e500);
> 	esel = get_tlb_esel(vcpu_e500, tlbsel);
> 
> -	gtlbe = &vcpu_e500->gtlb_arch[tlbsel][esel];
> +	gtlbe = get_entry(vcpu_e500, tlbsel, esel);
> 
> 	if (get_tlb_v(gtlbe))
> 		inval_gtlbe_on_host(vcpu_e500, tlbsel, esel);
> 
> 	gtlbe->mas1 = vcpu_e500->mas1;
> 	gtlbe->mas2 = vcpu_e500->mas2;
> -	gtlbe->mas3 = vcpu_e500->mas3;
> -	gtlbe->mas7 = vcpu_e500->mas7;
> +	gtlbe->mas7_3 = vcpu_e500->mas7_3;
> 
> 	trace_kvm_gtlb_write(vcpu_e500->mas0, gtlbe->mas1, gtlbe->mas2,
> -			     gtlbe->mas3, gtlbe->mas7);
> +			     (u32)gtlbe->mas7_3, (u32)(gtlbe->mas7_3 >> 32));
> 
> 	/* Invalidate shadow mappings for the about-to-be-clobbered TLBE. */
> 	if (tlbe_is_host_safe(vcpu, gtlbe)) {
> -		struct tlbe stlbe;
> +		struct kvm_book3e_206_tlb_entry stlbe;
> 		int stlbsel, sesel;
> 		u64 eaddr;
> 		u64 raddr;
> @@ -1000,9 +996,11 @@ gpa_t kvmppc_mmu_xlate(struct kvm_vcpu *vcpu, unsigned int index,
> 			gva_t eaddr)
> {
> 	struct kvmppc_vcpu_e500 *vcpu_e500 = to_e500(vcpu);
> -	struct tlbe *gtlbe =
> -		&vcpu_e500->gtlb_arch[tlbsel_of(index)][esel_of(index)];
> -	u64 pgmask = get_tlb_bytes(gtlbe) - 1;
> +	struct kvm_book3e_206_tlb_entry *gtlbe;
> +	u64 pgmask;
> +
> +	gtlbe = get_entry(vcpu_e500, tlbsel_of(index), esel_of(index));
> +	pgmask = get_tlb_bytes(gtlbe) - 1;
> 
> 	return get_tlb_raddr(gtlbe) | (eaddr & pgmask);
> }
> @@ -1016,12 +1014,12 @@ void kvmppc_mmu_map(struct kvm_vcpu *vcpu, u64 eaddr, gpa_t gpaddr,
> {
> 	struct kvmppc_vcpu_e500 *vcpu_e500 = to_e500(vcpu);
> 	struct tlbe_priv *priv;
> -	struct tlbe *gtlbe, stlbe;
> +	struct kvm_book3e_206_tlb_entry *gtlbe, stlbe;
> 	int tlbsel = tlbsel_of(index);
> 	int esel = esel_of(index);
> 	int stlbsel, sesel;
> 
> -	gtlbe = &vcpu_e500->gtlb_arch[tlbsel][esel];
> +	gtlbe = get_entry(vcpu_e500, tlbsel, esel);
> 
> 	switch (tlbsel) {
> 	case 0:
> @@ -1077,25 +1075,186 @@ void kvmppc_set_pid(struct kvm_vcpu *vcpu, u32 pid)
> 
> void kvmppc_e500_tlb_setup(struct kvmppc_vcpu_e500 *vcpu_e500)
> {
> -	struct tlbe *tlbe;
> +	struct kvm_book3e_206_tlb_entry *tlbe;
> 
> 	/* Insert large initial mapping for guest. */
> -	tlbe = &vcpu_e500->gtlb_arch[1][0];
> +	tlbe = get_entry(vcpu_e500, 1, 0);
> 	tlbe->mas1 = MAS1_VALID | MAS1_TSIZE(BOOK3E_PAGESZ_256M);
> 	tlbe->mas2 = 0;
> -	tlbe->mas3 = E500_TLB_SUPER_PERM_MASK;
> -	tlbe->mas7 = 0;
> +	tlbe->mas7_3 = E500_TLB_SUPER_PERM_MASK;
> 
> 	/* 4K map for serial output. Used by kernel wrapper. */
> -	tlbe = &vcpu_e500->gtlb_arch[1][1];
> +	tlbe = get_entry(vcpu_e500, 1, 1);
> 	tlbe->mas1 = MAS1_VALID | MAS1_TSIZE(BOOK3E_PAGESZ_4K);
> 	tlbe->mas2 = (0xe0004500 & 0xFFFFF000) | MAS2_I | MAS2_G;
> -	tlbe->mas3 = (0xe0004500 & 0xFFFFF000) | E500_TLB_SUPER_PERM_MASK;
> -	tlbe->mas7 = 0;
> +	tlbe->mas7_3 = (0xe0004500 & 0xFFFFF000) | E500_TLB_SUPER_PERM_MASK;
> +}
> +
> +static void free_gtlb(struct kvmppc_vcpu_e500 *vcpu_e500)
> +{
> +	int i;
> +
> +	clear_tlb_refs(vcpu_e500);
> +	kfree(vcpu_e500->gtlb_priv[0]);
> +	kfree(vcpu_e500->gtlb_priv[1]);
> +
> +	if (vcpu_e500->shared_tlb_pages) {
> +		vfree((void *)(round_down((uintptr_t)vcpu_e500->gtlb_arch,
> +					  PAGE_SIZE)));
> +
> +		for (i = 0; i < vcpu_e500->num_shared_tlb_pages; i++)
> +			put_page(vcpu_e500->shared_tlb_pages[i]);
> +
> +		vcpu_e500->num_shared_tlb_pages = 0;
> +		vcpu_e500->shared_tlb_pages = NULL;
> +	} else {
> +		kfree(vcpu_e500->gtlb_arch);
> +	}
> +
> +	vcpu_e500->gtlb_arch = NULL;
> +}
> +
> +int kvm_vcpu_ioctl_config_tlb(struct kvm_vcpu *vcpu,
> +			      struct kvm_config_tlb *cfg)
> +{
> +	struct kvmppc_vcpu_e500 *vcpu_e500 = to_e500(vcpu);
> +	struct kvm_book3e_206_tlb_params params;
> +	char *virt;
> +	struct page **pages;
> +	struct tlbe_priv *privs[2] = {};
> +	size_t array_len;
> +	u32 sets;
> +	int num_pages, ret, i;
> +
> +	if (cfg->mmu_type != KVM_MMU_FSL_BOOKE_NOHV)
> +		return -EINVAL;
> +
> +	if (copy_from_user(&params, (void __user *)(uintptr_t)cfg->params,
> +			   sizeof(params)))
> +		return -EFAULT;
> +
> +	if (params.tlb_sizes[1] > 64)
> +		return -EINVAL;
> +	if (params.tlb_ways[1] != params.tlb_sizes[1])
> +		return -EINVAL;
> +	if (params.tlb_sizes[2] != 0 || params.tlb_sizes[3] != 0)
> +		return -EINVAL;
> +	if (params.tlb_ways[2] != 0 || params.tlb_ways[3] != 0)
> +		return -EINVAL;
> +
> +	if (!is_power_of_2(params.tlb_ways[0]))
> +		return -EINVAL;
> +
> +	sets = params.tlb_sizes[0] >> ilog2(params.tlb_ways[0]);
> +	if (!is_power_of_2(sets))
> +		return -EINVAL;
> +
> +	array_len = params.tlb_sizes[0] + params.tlb_sizes[1];
> +	array_len *= sizeof(struct kvm_book3e_206_tlb_entry);
> +
> +	if (cfg->array_len < array_len)
> +		return -EINVAL;
> +
> +	num_pages = DIV_ROUND_UP(cfg->array + array_len - 1, PAGE_SIZE) -
> +		    cfg->array / PAGE_SIZE;
> +	pages = kmalloc(sizeof(struct page *) * num_pages, GFP_KERNEL);
> +	if (!pages)
> +		return -ENOMEM;
> +
> +	ret = get_user_pages_fast(cfg->array, num_pages, 1, pages);
> +	if (ret < 0)
> +		goto err_pages;
> +
> +	if (ret != num_pages) {
> +		num_pages = ret;
> +		ret = -EFAULT;
> +		goto err_put_page;
> +	}
> +
> +	virt = vmap(pages, num_pages, VM_MAP, PAGE_KERNEL);
> +	if (!virt)
> +		goto err_put_page;
> +
> +	privs[0] = kzalloc(sizeof(struct tlbe_priv) * params.tlb_sizes[0],
> +			   GFP_KERNEL);
> +	privs[1] = kzalloc(sizeof(struct tlbe_priv) * params.tlb_sizes[1],
> +			   GFP_KERNEL);
> +
> +	if (!privs[0] || !privs[1])
> +		goto err_put_page;
> +
> +	free_gtlb(vcpu_e500);
> +
> +	vcpu_e500->gtlb_priv[0] = privs[0];
> +	vcpu_e500->gtlb_priv[1] = privs[1];
> +
> +	vcpu_e500->gtlb_arch = (struct kvm_book3e_206_tlb_entry *)
> +		(virt + (cfg->array & (PAGE_SIZE - 1)));
> +
> +	vcpu_e500->gtlb_size[0] = params.tlb_sizes[0];
> +	vcpu_e500->gtlb_size[1] = params.tlb_sizes[1];
> +
> +	vcpu_e500->gtlb_offset[0] = 0;
> +	vcpu_e500->gtlb_offset[1] = params.tlb_sizes[0];
> +
> +	vcpu_e500->tlb0cfg = mfspr(SPRN_TLB0CFG) & ~0xfffUL;
> +	if (params.tlb_sizes[0] <= 2048)
> +		vcpu_e500->tlb0cfg |= params.tlb_sizes[0];
> +
> +	vcpu_e500->tlb1cfg = mfspr(SPRN_TLB1CFG) & ~0xfffUL;
> +	vcpu_e500->tlb1cfg |= params.tlb_sizes[1];
> +
> +	vcpu_e500->shared_tlb_pages = pages;
> +	vcpu_e500->num_shared_tlb_pages = num_pages;
> +
> +	vcpu_e500->gtlb0_ways = params.tlb_ways[0];
> +	vcpu_e500->gtlb0_sets = sets;
> +
> +	return 0;
> +
> +err_put_page:
> +	kfree(privs[0]);
> +	kfree(privs[1]);
> +
> +	for (i = 0; i < num_pages; i++)
> +		put_page(pages[i]);
> +
> +err_pages:
> +	kfree(pages);
> +	return ret;
> +}
> +
> +int kvm_vcpu_ioctl_dirty_tlb(struct kvm_vcpu *vcpu,
> +			     struct kvm_dirty_tlb *dirty)
> +{
> +	struct kvmppc_vcpu_e500 *vcpu_e500 = to_e500(vcpu);
> +
> +	clear_tlb_refs(vcpu_e500);
> +	return 0;
> +}
> +
> +void kvmppc_core_heavy_exit(struct kvm_vcpu *vcpu)
> +{
> +	struct kvmppc_vcpu_e500 *vcpu_e500 = to_e500(vcpu);
> +	int i;
> +
> +	/*
> +	 * We may have modified the guest TLB, so mark it dirty.
> +	 * We only do it on an actual return to userspace, to avoid
> +	 * adding more overhead to getting scheduled out -- and avoid
> +	 * any locking issues with getting preempted in the middle of
> +	 * KVM_CONFIG_TLB, etc.
> +	 */
> +
> +	for (i = 0; i < vcpu_e500->num_shared_tlb_pages; i++)
> +		set_page_dirty_lock(vcpu_e500->shared_tlb_pages[i]);

Does this work? Why do we need to set them dirty in the first place? If the shared tlb pages are on file backed storage, we're screwed under memory pressure either way and they'd just get evicted despite us writing to them. Or does vmap pin them? Either way, they're either pinned or not. And if they're not, dirtying them here shouldn't really buy us anything, no?


Alex



* Re: [PATCH v5 5/5] KVM: PPC: e500: MMU API
From: Scott Wood @ 2011-07-18 16:18 UTC (permalink / raw)
  To: kvm-ppc

On Mon, 18 Jul 2011 12:09:53 +0200
Alexander Graf <agraf@suse.de> wrote:

> On 08.07.2011, at 01:41, Scott Wood wrote:
> 
> > +void kvmppc_core_heavy_exit(struct kvm_vcpu *vcpu)
> > +{
> > +	struct kvmppc_vcpu_e500 *vcpu_e500 = to_e500(vcpu);
> > +	int i;
> > +
> > +	/*
> > +	 * We may have modified the guest TLB, so mark it dirty.
> > +	 * We only do it on an actual return to userspace, to avoid
> > +	 * adding more overhead to getting scheduled out -- and avoid
> > +	 * any locking issues with getting preempted in the middle of
> > +	 * KVM_CONFIG_TLB, etc.
> > +	 */
> > +
> > +	for (i = 0; i < vcpu_e500->num_shared_tlb_pages; i++)
> > +		set_page_dirty_lock(vcpu_e500->shared_tlb_pages[i]);
> 
> Does this work? Why do we need to set them dirty in the first place? If the shared tlb pages are on file backed storage, we're screwed under memory pressure either way and they'd just get evicted despite us writing to them. Or does vmap pin them? Either way, they're either pinned or not. And if they're not, dirtying them here shouldn't really buy us anything, no?

They're pinned by get_user_pages_fast().  We (potentially) write to them, so
we should mark them dirty, because they are dirty.  It's up to the rest
of Linux what to do with that.  Will being pinned stop updates from being
written out if it is file-backed?  And eventually the vm will be destroyed
(or the tlb reconfigured) and the pages will be unpinned.
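
To make the lifecycle concrete, here is a toy userspace model, not kernel code. struct toy_page and the toy_* helpers are invented for illustration; they only mimic the roles of get_user_pages_fast(), the set_page_dirty_lock() loop in kvmppc_core_heavy_exit(), and the put_page() loop in free_gtlb(). The model assumes only that a pinned page cannot be freed, and says nothing about when writeback happens.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Toy stand-in for a page; this is not the kernel's struct page. */
struct toy_page {
	int pin_count;	/* references held by the "driver" */
	bool dirty;	/* contents declared modified */
};

/* Take one reference on each page (role of get_user_pages_fast()). */
static void toy_pin(struct toy_page *pages, size_t n)
{
	size_t i;

	for (i = 0; i < n; i++)
		pages[i].pin_count++;
}

/* Declare every pinned page dirty (role of the set_page_dirty_lock() loop). */
static void toy_heavy_exit(struct toy_page *pages, size_t n)
{
	size_t i;

	for (i = 0; i < n; i++) {
		assert(pages[i].pin_count > 0);
		pages[i].dirty = true;
	}
}

/* Drop the references again (role of the put_page() loop in free_gtlb()). */
static void toy_unpin(struct toy_page *pages, size_t n)
{
	size_t i;

	for (i = 0; i < n; i++) {
		assert(pages[i].pin_count > 0);
		pages[i].pin_count--;
	}
}

/* In this model a page can be freed only once nobody holds a pin. */
static bool toy_can_free(const struct toy_page *p)
{
	return p->pin_count == 0;
}
```

In this model the dirty flag survives the unpin, which is the point: the pin keeps the page around, while the dirty mark tells the rest of the system that its contents changed.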

-Scott



* Re: [PATCH v5 5/5] KVM: PPC: e500: MMU API
From: Alexander Graf @ 2011-07-18 16:33 UTC (permalink / raw)
  To: kvm-ppc


On 18.07.2011, at 18:18, Scott Wood wrote:

> On Mon, 18 Jul 2011 12:09:53 +0200
> Alexander Graf <agraf@suse.de> wrote:
> 
>> On 08.07.2011, at 01:41, Scott Wood wrote:
>> 
>>> +void kvmppc_core_heavy_exit(struct kvm_vcpu *vcpu)
>>> +{
>>> +	struct kvmppc_vcpu_e500 *vcpu_e500 = to_e500(vcpu);
>>> +	int i;
>>> +
>>> +	/*
>>> +	 * We may have modified the guest TLB, so mark it dirty.
>>> +	 * We only do it on an actual return to userspace, to avoid
>>> +	 * adding more overhead to getting scheduled out -- and avoid
>>> +	 * any locking issues with getting preempted in the middle of
>>> +	 * KVM_CONFIG_TLB, etc.
>>> +	 */
>>> +
>>> +	for (i = 0; i < vcpu_e500->num_shared_tlb_pages; i++)
>>> +		set_page_dirty_lock(vcpu_e500->shared_tlb_pages[i]);
>> 
>> Does this work? Why do we need to set them dirty in the first place? If the shared tlb pages are on file backed storage, we're screwed under memory pressure either way and they'd just get evicted despite us writing to them. Or does vmap pin them? Either way, they're either pinned or not. And if they're not, dirtying them here shouldn't really buy us anything, no?
> 
> They're pinned by get_user_pages_fast().  We (potentially) write to them, so
> we should mark them dirty, because they are dirty.  It's up to the rest
> of Linux what to do with that.  Will being pinned stop updates from being
> written out if it is file-backed?  And eventually the vm will be destroyed
> (or the tlb reconfigured) and the pages will be unpinned.

Hrm. How much overhead do we add to the exit-to-userspace path with this? I completely agree that we should mark them dirty when closing, but I'm not fully convinced a "we dirty them so we should declare them dirty at random times" pays off against possible substantial slowdowns due to the marking. Keep in mind that this is the MMIO case which isn't _that_ seldom.


Alex



* Re: [PATCH v5 5/5] KVM: PPC: e500: MMU API
From: Scott Wood @ 2011-07-18 18:08 UTC (permalink / raw)
  To: kvm-ppc

On Mon, 18 Jul 2011 18:33:58 +0200
Alexander Graf <agraf@suse.de> wrote:

> 
> On 18.07.2011, at 18:18, Scott Wood wrote:
> 
> > They're pinned by get_user_pages_fast().  We (potentially) write to them, so
> > we should mark them dirty, because they are dirty.  It's up to the rest
> > of Linux what to do with that.  Will being pinned stop updates from being
> > written out if it is file-backed?  And eventually the vm will be destroyed
> > (or the tlb reconfigured) and the pages will be unpinned.
> 
> Hrm. How much overhead do we add to the exit-to-userspace path with this?

Not sure -- probably not too much for anonymous memory, compared to the
rest of the cost of a heavyweight exit.  On e500 the tlb array is 4 pages,
and each set_page_dirty_lock() will just do a few bit operations.
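
As a rough check of the "4 pages" figure, the arithmetic can be sketched in userspace C. Several things here are assumptions rather than facts from this thread: the 24-byte entry layout (mas8, mas1, mas2, mas7_3), a geometry of 512 TLB0 plus 16 TLB1 entries, and 4K pages. The helper names are made up. The page count is computed from the first and last page index touched, which gives the same result as the expression in kvm_vcpu_ioctl_config_tlb() for the example values.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define SKETCH_PAGE_SIZE 4096u	/* assumption: 4K host pages */

/* Assumed layout of one shared TLB entry; 24 bytes with natural alignment. */
struct tlb_entry_sketch {
	uint32_t mas8;
	uint32_t mas1;
	uint64_t mas2;
	uint64_t mas7_3;
};

/* Bytes needed for both guest TLB arrays laid out back to back. */
static size_t tlb_array_len(uint32_t tlb0_entries, uint32_t tlb1_entries)
{
	return ((size_t)tlb0_entries + tlb1_entries) *
	       sizeof(struct tlb_entry_sketch);
}

/* Number of pages touched by the byte range [base, base + len). */
static size_t pages_spanned(uint64_t base, size_t len)
{
	uint64_t first, last;

	assert(len > 0);
	first = base / SKETCH_PAGE_SIZE;
	last = (base + len - 1) / SKETCH_PAGE_SIZE;
	return (size_t)(last - first + 1);
}
```

Under those assumptions the array is 528 * 24 = 12672 bytes, which spans 4 pages when the userspace buffer is page-aligned and can span 5 when it starts near the end of a page. Either way the loop in kvmppc_core_heavy_exit() touches only a handful of pages.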

> I completely agree that we should mark them dirty when closing, but I'm
> not fully convinced a "we dirty them so we should declare them dirty at
> random times" pays off against possible substantial slowdowns due to the
> marking. Keep in mind that this is the MMIO case which isn't _that_ seldom.

If we can convince ourselves nothing bad can happen, fine.  I did it here
because this is the point at which the API says the contents of the memory
are well-defined.  If it is file-backed, and userspace does a sync on a
heavyweight exit, shouldn't the right thing get written to disk?  Could
any other weird things happen?  I'm not familiar enough with that part of
the kernel to say right away that it's safe.

If we need to start making assumptions about what userspace is going to do
with this memory in order for it to be safe, then the restrictions should
be written into the API, and we should be sure that the performance gain is
worth it.

-Scott


* Re: [PATCH v5 5/5] KVM: PPC: e500: MMU API
From: Alexander Graf @ 2011-07-18 21:44 UTC (permalink / raw)
  To: kvm-ppc


On 18.07.2011, at 20:08, Scott Wood wrote:

> On Mon, 18 Jul 2011 18:33:58 +0200
> Alexander Graf <agraf@suse.de> wrote:
> 
>> 
>> On 18.07.2011, at 18:18, Scott Wood wrote:
>> 
>>> They're pinned by get_user_pages_fast().  We (potentially) write to them, so
>>> we should mark them dirty, because they are dirty.  It's up to the rest
>>> of Linux what to do with that.  Will being pinned stop updates from being
>>> written out if it is file-backed?  And eventually the vm will be destroyed
>>> (or the tlb reconfigured) and the pages will be unpinned.
>> 
>> Hrm. How much overhead do we add to the exit-to-userspace path with this?
> 
> Not sure -- probably not too much for anonymous memory, compared to the
> rest of the cost of a heavyweight exit.  On e500 the tlb array is 4 pages,
> and each set_page_dirty_lock() will just do a few bit operations.

Hm, ok.

> 
>> I completely agree that we should mark them dirty when closing, but I'm
>> not fully convinced a "we dirty them so we should declare them dirty at
>> random times" pays off against possible substantial slowdowns due to the
>> marking. Keep in mind that this is the MMIO case which isn't _that_ seldom.
> 
> If we can convince ourselves nothing bad can happen, fine.  I did it here
> because this is the point at which the API says the contents of the memory
> are well-defined.  If it is file-backed, and userspace does a sync on a
> heavyweight exit, shouldn't the right thing get written to disk?  Could
> any other weird things happen?  I'm not familiar enough with that part of
> the kernel to say right away that it's safe.

Neither am I; this is pretty subtle ground. CC'ing Andrea and Johannes. Guys, would you please take a look at that patch and tell us whether what's being done here is safe and a good thing to do?

We're talking about the following patch: http://www.spinics.net/lists/kvm-ppc/msg02961.html
and specifically about:

> +void kvmppc_core_heavy_exit(struct kvm_vcpu *vcpu)
> +{
> +	struct kvmppc_vcpu_e500 *vcpu_e500 = to_e500(vcpu);
> +	int i;
> +
> +	/*
> +	 * We may have modified the guest TLB, so mark it dirty.
> +	 * We only do it on an actual return to userspace, to avoid
> +	 * adding more overhead to getting scheduled out -- and avoid
> +	 * any locking issues with getting preempted in the middle of
> +	 * KVM_CONFIG_TLB, etc.
> +	 */
> +
> +	for (i = 0; i < vcpu_e500->num_shared_tlb_pages; i++)
> +		set_page_dirty_lock(vcpu_e500->shared_tlb_pages[i]);
>  }
 
The background is that we want to have a few shared pages between kernel and user space to keep guest TLB information in. It's too much to shove around every time we need it in user space.

> If we need to start making assumptions about what userspace is going to do
> with this memory in order for it to be safe, then the restrictions should
> be written into the API, and we should be sure that the performance gain is
> worth it.

Yes, I agree. I just have the feeling that what the dirty setting is trying to achieve is already guaranteed implicitly, but I also feel better asking some mm gurus :).


Alex



* Re: [PATCH v5 5/5] KVM: PPC: e500: MMU API
From: Johannes Weiner @ 2011-07-19  8:36 UTC (permalink / raw)
  To: kvm-ppc

On Mon, Jul 18, 2011 at 11:44:02PM +0200, Alexander Graf wrote:
> 
> On 18.07.2011, at 20:08, Scott Wood wrote:
> 
> > On Mon, 18 Jul 2011 18:33:58 +0200
> > Alexander Graf <agraf@suse.de> wrote:
> > 
> >> 
> >> On 18.07.2011, at 18:18, Scott Wood wrote:
> >> 
> >>> They're pinned by get_user_pages_fast().  We (potentially) write to them, so
> >>> we should mark them dirty, because they are dirty.  It's up to the rest
> >>> of Linux what to do with that.  Will being pinned stop updates from being
> >>> written out if it is file-backed?  And eventually the vm will be destroyed
> >>> (or the tlb reconfigured) and the pages will be unpinned.
> >> 
> >> Hrm. How much overhead do we add to the exit-to-userspace path with this?
> > 
> > Not sure -- probably not too much for anonymous memory, compared to the
> > rest of the cost of a heavyweight exit.  On e500 the tlb array is 4 pages,
> > and each set_page_dirty_lock() will just do a few bit operations.
> 
> Hm, ok.
> 
> > 
> >> I completely agree that we should mark them dirty when closing, but I'm
> >> not fully convinced a "we dirty them so we should declare them dirty at
> >> random times" pays off against possible substantial slowdowns due to the
> >> marking. Keep in mind that this is the MMIO case which isn't _that_ seldom.
> > 
> > If we can convince ourselves nothing bad can happen, fine.  I did it here
> > because this is the point at which the API says the contents of the memory
> > are well-defined.  If it is file-backed, and userspace does a sync on a
> > heavyweight exit, shouldn't the right thing get written to disk?  Could
> > any other weird things happen?  I'm not familiar enough with that part of
> > the kernel to say right away that it's safe.
> 
> I'm neither, these are pretty subtle grounds. CC'ing Andrea and
> Johannes. Guys, would you please take a look at that patch and tell
> us if it's safe and a good thing to do what's being done here?
> 
> We're talking about the following patch: http://www.spinics.net/lists/kvm-ppc/msg02961.html
> and specifically about:
> 
> > +void kvmppc_core_heavy_exit(struct kvm_vcpu *vcpu)
> > +{
> > +	struct kvmppc_vcpu_e500 *vcpu_e500 = to_e500(vcpu);
> > +	int i;
> > +
> > +	/*
> > +	 * We may have modified the guest TLB, so mark it dirty.
> > +	 * We only do it on an actual return to userspace, to avoid
> > +	 * adding more overhead to getting scheduled out -- and avoid
> > +	 * any locking issues with getting preempted in the middle of
> > +	 * KVM_CONFIG_TLB, etc.
> > +	 */
> > +
> > +	for (i = 0; i < vcpu_e500->num_shared_tlb_pages; i++)
> > +		set_page_dirty_lock(vcpu_e500->shared_tlb_pages[i]);
> >  }
>  
> The background is that we want to have a few shared pages between
> kernel and user space to keep guest TLB information in. It's too
> much to shove around every time we need it in user space.

Is there a strict requirement to have these pages originate from
userspace?  Usually, shared memory between kernel and userspace is
owned by the driver and kept away from the mm subsystem completely.

You could allocate the memory in the driver when userspace issues an
ioctl like KVM_ALLOC_TLB_CONFIG and return a file handle that can be
mmap'd.  The array length info is maintained in the vma for the
kernel, userspace must remember the size of mmap regions anyway.


* Re: [PATCH v5 5/5] KVM: PPC: e500: MMU API
From: Alexander Graf @ 2011-07-19  8:51 UTC (permalink / raw)
  To: kvm-ppc


On 19.07.2011, at 10:36, Johannes Weiner wrote:

> On Mon, Jul 18, 2011 at 11:44:02PM +0200, Alexander Graf wrote:
>> 
>> On 18.07.2011, at 20:08, Scott Wood wrote:
>> 
>>> On Mon, 18 Jul 2011 18:33:58 +0200
>>> Alexander Graf <agraf@suse.de> wrote:
>>> 
>>>> 
>>>> On 18.07.2011, at 18:18, Scott Wood wrote:
>>>> 
>>>>> They're pinned by get_user_pages_fast().  We (potentially) write to them, so
>>>>> we should mark them dirty, because they are dirty.  It's up to the rest
>>>>> of Linux what to do with that.  Will being pinned stop updates from being
>>>>> written out if it is file-backed?  And eventually the vm will be destroyed
>>>>> (or the tlb reconfigured) and the pages will be unpinned.
>>>> 
>>>> Hrm. How much overhead do we add to the exit-to-userspace path with this?
>>> 
>>> Not sure -- probably not too much for anonymous memory, compared to the
>>> rest of the cost of a heavyweight exit.  On e500 the tlb array is 4 pages,
>>> and each set_page_dirty_lock() will just do a few bit operations.
>> 
>> Hm, ok.
>> 
>>> 
>>>> I completely agree that we should mark them dirty when closing, but I'm
>>>> not fully convinced a "we dirty them so we should declare them dirty at
>>>> random times" pays off against possible substantial slowdowns due to the
>>>> marking. Keep in mind that this is the MMIO case which isn't _that_ seldom.
>>> 
>>> If we can convince ourselves nothing bad can happen, fine.  I did it here
>>> because this is the point at which the API says the contents of the memory
>>> are well-defined.  If it is file-backed, and userspace does a sync on a
>>> heavyweight exit, shouldn't the right thing get written to disk?  Could
>>> any other weird things happen?  I'm not familiar enough with that part of
>>> the kernel to say right away that it's safe.
>> 
>> I'm neither, these are pretty subtle grounds. CC'ing Andrea and
>> Johannes. Guys, would you please take a look at that patch and tell
>> us if it's safe and a good thing to do what's being done here?
>> 
>> We're talking about the following patch: http://www.spinics.net/lists/kvm-ppc/msg02961.html
>> and specifically about:
>> 
>>> +void kvmppc_core_heavy_exit(struct kvm_vcpu *vcpu)
>>> +{
>>> +	struct kvmppc_vcpu_e500 *vcpu_e500 = to_e500(vcpu);
>>> +	int i;
>>> +
>>> +	/*
>>> +	 * We may have modified the guest TLB, so mark it dirty.
>>> +	 * We only do it on an actual return to userspace, to avoid
>>> +	 * adding more overhead to getting scheduled out -- and avoid
>>> +	 * any locking issues with getting preempted in the middle of
>>> +	 * KVM_CONFIG_TLB, etc.
>>> +	 */
>>> +
>>> +	for (i = 0; i < vcpu_e500->num_shared_tlb_pages; i++)
>>> +		set_page_dirty_lock(vcpu_e500->shared_tlb_pages[i]);
>>> }
>> 
>> The background is that we want to have a few shared pages between
>> kernel and user space to keep guest TLB information in. It's too
>> much to shove around every time we need it in user space.
> 
> Is there a strict requirement to have these pages originate from
> userspace?  Usually, shared memory between kernel and userspace is
> owned by the driver and kept away from the mm subsystem completely.
> 
> You could allocate the memory in the driver when userspace issues an
> ioctl like KVM_ALLOC_TLB_CONFIG and return a file handle that can be
> mmap'd.  The array length info is maintained in the vma for the
> kernel, userspace must remember the size of mmap regions anyway.

Hrm. What's the advantage of doing it this way around vs the other?


Alex



* Re: [PATCH v5 5/5] KVM: PPC: e500: MMU API
  2011-07-07 23:41 [PATCH v5 5/5] KVM: PPC: e500: MMU API Scott Wood
                   ` (7 preceding siblings ...)
  2011-07-19  8:51 ` Alexander Graf
@ 2011-07-19 11:20 ` Johannes Weiner
  2011-07-24  9:16 ` Alexander Graf
                   ` (8 subsequent siblings)
  17 siblings, 0 replies; 19+ messages in thread
From: Johannes Weiner @ 2011-07-19 11:20 UTC (permalink / raw)
  To: kvm-ppc

On Tue, Jul 19, 2011 at 10:51:40AM +0200, Alexander Graf wrote:
> 
> On 19.07.2011, at 10:36, Johannes Weiner wrote:
> 
> > On Mon, Jul 18, 2011 at 11:44:02PM +0200, Alexander Graf wrote:
> >> 
> >> On 18.07.2011, at 20:08, Scott Wood wrote:
> >> 
> >>> On Mon, 18 Jul 2011 18:33:58 +0200
> >>> Alexander Graf <agraf@suse.de> wrote:
> >>> 
> >>>> 
> >>>> On 18.07.2011, at 18:18, Scott Wood wrote:
> >>>> 
> >>>>> They're pinned by get_user_pages_fast().  We (potentially) write to them, so
> >>>>> we should mark them dirty, because they are dirty.  It's up to the rest
> >>>>> of Linux what to do with that.  Will being pinned stop updates from being
> >>>>> written out if it is file-backed?  And eventually the vm will be destroyed
> >>>>> (or the tlb reconfigured) and the pages will be unpinned.
> >>>> 
> >>>> Hrm. How much overhead do we add to the exit-to-userspace path with this?
> >>> 
> >>> Not sure -- probably not too much for anonymous memory, compared to the
> >>> rest of the cost of a heavyweight exit.  On e500 the tlb array is 4 pages,
> >>> and each set_page_dirty_lock() will just do a few bit operations.
> >> 
> >> Hm, ok.
> >> 
> >>> 
> >>>> I completely agree that we should mark them dirty when closing, but I'm
> >>>> not fully convinced a "we dirty them so we should declare them dirty at
> >>>> random times" pays off against possible substantial slowdowns due to the
> >>>> marking. Keep in mind that this is the MMIO case, which isn't _that_ rare.
> >>> 
> >>> If we can convince ourselves nothing bad can happen, fine.  I did it here
> >>> because this is the point at which the API says the contents of the memory
> >>> are well-defined.  If it is file-backed, and userspace does a sync on a
> >>> heavyweight exit, shouldn't the right thing get written to disk?  Could
> >>> any other weird things happen?  I'm not familiar enough with that part of
> >>> the kernel to say right away that it's safe.
> >> 
> >> Neither am I; this is pretty subtle ground. CC'ing Andrea and
> >> Johannes. Guys, would you please take a look at that patch and tell
> >> us if it's safe and a good thing to do what's being done here?
> >> 
> >> We're talking about the following patch: http://www.spinics.net/lists/kvm-ppc/msg02961.html
> >> and specifically about:
> >> 
> >>> +void kvmppc_core_heavy_exit(struct kvm_vcpu *vcpu)
> >>> +{
> >>> +	struct kvmppc_vcpu_e500 *vcpu_e500 = to_e500(vcpu);
> >>> +	int i;
> >>> +
> >>> +	/*
> >>> +	 * We may have modified the guest TLB, so mark it dirty.
> >>> +	 * We only do it on an actual return to userspace, to avoid
> >>> +	 * adding more overhead to getting scheduled out -- and avoid
> >>> +	 * any locking issues with getting preempted in the middle of
> >>> +	 * KVM_CONFIG_TLB, etc.
> >>> +	 */
> >>> +
> >>> +	for (i = 0; i < vcpu_e500->num_shared_tlb_pages; i++)
> >>> +		set_page_dirty_lock(vcpu_e500->shared_tlb_pages[i]);
> >>> }
> >> 
> >> The background is that we want to have a few shared pages between
> >> kernel and user space to keep guest TLB information in. It's too
> >> much to shove around every time we need it in user space.
> > 
> > Is there a strict requirement to have these pages originate from
> > userspace?  Usually, shared memory between kernel and userspace is
> > owned by the driver and kept away from the mm subsystem completely.
> > 
> > You could allocate the memory in the driver when userspace issues an
> > ioctl like KVM_ALLOC_TLB_CONFIG and return a file handle that can be
> > mmap'd.  The array length info is maintained in the vma for the
> > kernel, userspace must remember the size of mmap regions anyway.
> 
> Hrm. What's the advantage of doing it this way around vs the other?

You don't have to work around the mm subsystem trying to reclaim your
memory, maintain disk coherency that is guaranteed by the file-backed
memory semantics, etc.

If your driver provides the memory, there are far fewer assumptions
from userspace that you have to consider, and memory management will
not interfere either.
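
As a rough illustration of the userspace side of the scheme suggested
here (not code from the patch), the sketch below maps driver-owned
memory through a file descriptor.  It assumes a hypothetical ioctl, such
as the KVM_ALLOC_TLB_CONFIG named above, has already returned tlb_fd;
that ioctl does not exist, and the helper names map_shared_tlb() and
unmap_shared_tlb() are made up for the example.

```c
#include <stddef.h>
#include <sys/mman.h>

/*
 * Map 'len' bytes of the shared TLB array for reading and writing.
 * 'tlb_fd' is assumed to come from a hypothetical driver ioctl; any
 * mmap-able file descriptor behaves the same way for this sketch.
 * Returns NULL on failure instead of MAP_FAILED.
 */
void *map_shared_tlb(int tlb_fd, size_t len)
{
	void *p = mmap(NULL, len, PROT_READ | PROT_WRITE, MAP_SHARED,
		       tlb_fd, 0);

	return p == MAP_FAILED ? NULL : p;
}

/* Undo map_shared_tlb(); the caller must remember the mapping length. */
void unmap_shared_tlb(void *p, size_t len)
{
	if (p)
		munmap(p, len);
}
```

With this shape, the kernel never has to pin or dirty user pages: the
driver allocates the backing memory and userspace only holds a mapping
of it.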


* Re: [PATCH v5 5/5] KVM: PPC: e500: MMU API
  2011-07-07 23:41 [PATCH v5 5/5] KVM: PPC: e500: MMU API Scott Wood
                   ` (8 preceding siblings ...)
  2011-07-19 11:20 ` Johannes Weiner
@ 2011-07-24  9:16 ` Alexander Graf
  2011-07-25 19:25 ` Scott Wood
                   ` (7 subsequent siblings)
  17 siblings, 0 replies; 19+ messages in thread
From: Alexander Graf @ 2011-07-24  9:16 UTC (permalink / raw)
  To: kvm-ppc


On 19.07.2011, at 13:20, Johannes Weiner wrote:

> On Tue, Jul 19, 2011 at 10:51:40AM +0200, Alexander Graf wrote:
>> 
>> On 19.07.2011, at 10:36, Johannes Weiner wrote:
>> 
>>> On Mon, Jul 18, 2011 at 11:44:02PM +0200, Alexander Graf wrote:
>>>> 
>>>> On 18.07.2011, at 20:08, Scott Wood wrote:
>>>> 
>>>>> On Mon, 18 Jul 2011 18:33:58 +0200
>>>>> Alexander Graf <agraf@suse.de> wrote:
>>>>> 
>>>>>> 
>>>>>> On 18.07.2011, at 18:18, Scott Wood wrote:
>>>>>> 
>>>>>>> They're pinned by get_user_pages_fast().  We (potentially) write to them, so
>>>>>>> we should mark them dirty, because they are dirty.  It's up to the rest
>>>>>>> of Linux what to do with that.  Will being pinned stop updates from being
>>>>>>> written out if it is file-backed?  And eventually the vm will be destroyed
>>>>>>> (or the tlb reconfigured) and the pages will be unpinned.
>>>>>> 
>>>>>> Hrm. How much overhead do we add to the exit-to-userspace path with this?
>>>>> 
>>>>> Not sure -- probably not too much for anonymous memory, compared to the
>>>>> rest of the cost of a heavyweight exit.  On e500 the tlb array is 4 pages,
>>>>> and each set_page_dirty_lock() will just do a few bit operations.
>>>> 
>>>> Hm, ok.
>>>> 
>>>>> 
>>>>>> I completely agree that we should mark them dirty when closing, but I'm
>>>>>> not fully convinced a "we dirty them so we should declare them dirty at
>>>>>> random times" pays off against possible substantial slowdowns due to the
>>>>>> marking. Keep in mind that this is the MMIO case, which isn't _that_ rare.
>>>>> 
>>>>> If we can convince ourselves nothing bad can happen, fine.  I did it here
>>>>> because this is the point at which the API says the contents of the memory
>>>>> are well-defined.  If it is file-backed, and userspace does a sync on a
>>>>> heavyweight exit, shouldn't the right thing get written to disk?  Could
>>>>> any other weird things happen?  I'm not familiar enough with that part of
>>>>> the kernel to say right away that it's safe.
>>>> 
>>>> Neither am I; this is pretty subtle ground. CC'ing Andrea and
>>>> Johannes. Guys, would you please take a look at that patch and tell
>>>> us if it's safe and a good thing to do what's being done here?
>>>> 
>>>> We're talking about the following patch: http://www.spinics.net/lists/kvm-ppc/msg02961.html
>>>> and specifically about:
>>>> 
>>>>> +void kvmppc_core_heavy_exit(struct kvm_vcpu *vcpu)
>>>>> +{
>>>>> +	struct kvmppc_vcpu_e500 *vcpu_e500 = to_e500(vcpu);
>>>>> +	int i;
>>>>> +
>>>>> +	/*
>>>>> +	 * We may have modified the guest TLB, so mark it dirty.
>>>>> +	 * We only do it on an actual return to userspace, to avoid
>>>>> +	 * adding more overhead to getting scheduled out -- and avoid
>>>>> +	 * any locking issues with getting preempted in the middle of
>>>>> +	 * KVM_CONFIG_TLB, etc.
>>>>> +	 */
>>>>> +
>>>>> +	for (i = 0; i < vcpu_e500->num_shared_tlb_pages; i++)
>>>>> +		set_page_dirty_lock(vcpu_e500->shared_tlb_pages[i]);
>>>>> }
>>>> 
>>>> The background is that we want to have a few shared pages between
>>>> kernel and user space to keep guest TLB information in. It's too
>>>> much to shove around every time we need it in user space.
>>> 
>>> Is there a strict requirement to have these pages originate from
>>> userspace?  Usually, shared memory between kernel and userspace is
>>> owned by the driver and kept away from the mm subsystem completely.
>>> 
>>> You could allocate the memory in the driver when userspace issues an
>>> ioctl like KVM_ALLOC_TLB_CONFIG and return a file handle that can be
>>> mmap'd.  The array length info is maintained in the vma for the
>>> kernel, userspace must remember the size of mmap regions anyway.
>> 
>> Hrm. What's the advantage of doing it this way around vs the other?
> 
> You don't have to work around the mm subsystem trying to reclaim your
> memory, maintain disk coherency that is guaranteed by the file-backed
> memory semantics, etc.
> 
> If your driver provides the memory, there are far fewer assumptions
> from userspace that you have to consider, and memory management will
> not interfere either.

Ah, thanks a lot. Scott, would you mind switching this to the normal scheme then? It sounds like we wouldn't need to set the dirty bits then either.


Alex



* Re: [PATCH v5 5/5] KVM: PPC: e500: MMU API
  2011-07-07 23:41 [PATCH v5 5/5] KVM: PPC: e500: MMU API Scott Wood
                   ` (9 preceding siblings ...)
  2011-07-24  9:16 ` Alexander Graf
@ 2011-07-25 19:25 ` Scott Wood
  2011-07-25 21:50 ` Alexander Graf
                   ` (6 subsequent siblings)
  17 siblings, 0 replies; 19+ messages in thread
From: Scott Wood @ 2011-07-25 19:25 UTC (permalink / raw)
  To: kvm-ppc

On Sun, 24 Jul 2011 11:16:32 +0200
Alexander Graf <agraf@suse.de> wrote:

> On 19.07.2011, at 13:20, Johannes Weiner wrote:
> 
> > You don't have to work around the mm subsystem trying to reclaim your
> > memory,

The pages are pinned by get_user_pages_fast().

> > maintain disk coherency that is guaranteed by the file-backed
> > memory semantics, etc.
> > 
> > If your driver provides the memory, there are far fewer assumptions
> > from userspace that you have to consider, and memory management will
> > not interfere either.
> 
> Ah, thanks a lot. Scott, would you mind switching this to the normal scheme then? It sounds like we wouldn't need to set the dirty bits then either.

That's a fair bit of churn and added complexity, both here and in qemu.  Is
it really worth redesigning this API again, to avoid setting a few dirty
bits on an already-slow heavyweight exit?

-Scott



* Re: [PATCH v5 5/5] KVM: PPC: e500: MMU API
  2011-07-07 23:41 [PATCH v5 5/5] KVM: PPC: e500: MMU API Scott Wood
                   ` (10 preceding siblings ...)
  2011-07-25 19:25 ` Scott Wood
@ 2011-07-25 21:50 ` Alexander Graf
  2011-08-08  8:49 ` Johannes Weiner
                   ` (5 subsequent siblings)
  17 siblings, 0 replies; 19+ messages in thread
From: Alexander Graf @ 2011-07-25 21:50 UTC (permalink / raw)
  To: kvm-ppc


On 25.07.2011, at 21:25, Scott Wood wrote:

> On Sun, 24 Jul 2011 11:16:32 +0200
> Alexander Graf <agraf@suse.de> wrote:
> 
>> On 19.07.2011, at 13:20, Johannes Weiner wrote:
>> 
>>> You don't have to work around the mm subsystem trying to reclaim your
>>> memory,
> 
> The pages are pinned by get_user_pages_fast().
> 
>>> maintain disk coherency that is guaranteed by the file-backed
>>> memory semantics, etc.
>>> 
>>> If your driver provides the memory, there are far fewer assumptions
>>> from userspace that you have to consider, and memory management will
>>> not interfere either.
>> 
>> Ah, thanks a lot. Scott, would you mind switching this to the normal scheme then? It sounds like we wouldn't need to set the dirty bits then either.
> 
> That's a fair bit of churn and added complexity, both here and in qemu.  Is
> it really worth redesigning this API again, to avoid setting a few dirty
> bits on an already-slow heavyweight exit?

Well, alternatively we could simply bail out if the memory is not anonymous, right? Then the pinning on get_user_pages_fast should be enough. Johannes, would there be any downside to this approach?


Alex



* Re: [PATCH v5 5/5] KVM: PPC: e500: MMU API
  2011-07-07 23:41 [PATCH v5 5/5] KVM: PPC: e500: MMU API Scott Wood
                   ` (11 preceding siblings ...)
  2011-07-25 21:50 ` Alexander Graf
@ 2011-08-08  8:49 ` Johannes Weiner
  2011-08-08 23:13 ` Scott Wood
                   ` (4 subsequent siblings)
  17 siblings, 0 replies; 19+ messages in thread
From: Johannes Weiner @ 2011-08-08  8:49 UTC (permalink / raw)
  To: kvm-ppc

On Mon, Jul 25, 2011 at 11:50:50PM +0200, Alexander Graf wrote:
> 
> On 25.07.2011, at 21:25, Scott Wood wrote:
> 
> > On Sun, 24 Jul 2011 11:16:32 +0200
> > Alexander Graf <agraf@suse.de> wrote:
> > 
> >> On 19.07.2011, at 13:20, Johannes Weiner wrote:
> >> 
> >>> You don't have to work around the mm subsystem trying to reclaim your
> >>> memory,
> > 
> > The pages are pinned by get_user_pages_fast().
> > 
> >>> maintain disk coherency that is guaranteed by the file-backed
> >>> memory semantics, etc.
> >>> 
> >>> If your driver provides the memory, there are far fewer assumptions
> >>> from userspace that you have to consider, and memory management will
> >>> not interfere either.
> >> 
> >> Ah, thanks a lot. Scott, would you mind switching this to the normal scheme then? It sounds like we wouldn't need to set the dirty bits then either.
> > 
> > That's a fair bit of churn and added complexity, both here and in qemu.  Is
> > it really worth redesigning this API again, to avoid setting a few dirty
> > bits on an already-slow heavyweight exit?
> 
> Well, alternatively we could simply bail out if the memory is not
> anonymous, right? Then the pinning on get_user_pages_fast should be
> enough. Johannes, would there be any downside to this approach?

I don't see any correctness issues.  Maybe Andrea does?

While the userspace pages are never freed because of your reference,
it does not prevent reclaim from writing them to swap and unmapping
them from the user's page tables.

So even if it's working, it's still a bit of a hack...


* Re: [PATCH v5 5/5] KVM: PPC: e500: MMU API
  2011-07-07 23:41 [PATCH v5 5/5] KVM: PPC: e500: MMU API Scott Wood
                   ` (12 preceding siblings ...)
  2011-08-08  8:49 ` Johannes Weiner
@ 2011-08-08 23:13 ` Scott Wood
  2011-08-13 15:14 ` Benjamin Herrenschmidt
                   ` (3 subsequent siblings)
  17 siblings, 0 replies; 19+ messages in thread
From: Scott Wood @ 2011-08-08 23:13 UTC (permalink / raw)
  To: kvm-ppc

On 08/08/2011 03:49 AM, Johannes Weiner wrote:
> On Mon, Jul 25, 2011 at 11:50:50PM +0200, Alexander Graf wrote:
>>
>> Well, alternatively we could simply bail out if the memory is not
>> anonymous, right? Then the pinning on get_user_pages_fast should be
>> enough. Johannes, would there be any downside to this approach?
> 
> I don't see any correctness issues.  Maybe Andrea does?
> 
> While the userspace pages are never freed because of your reference,
> it does not prevent reclaim from writing them to swap and unmapping
> them from the user's page tables.

Being unmapped from the user's page tables isn't a problem, as long as
the mapping, if it is faulted back in before the I/O reference is
released, points at the same physical page.  Anything else seems like it
would break using get_user_pages() to implement read() -- you could be
swapping out the wrong data.  I hope that the "there may even be a
completely different page there in some cases (eg. if mmapped pagecache
has been invalidated and subsequently re faulted)" in the
__get_user_pages() comment is referring to the !FOLL_WRITE case (or an
explicit mapping change from userspace).

This usage of get_user_pages() is pretty similar to how the guest's
memory is dealt with.  When the guest adds a TLB entry,
get_user_pages_fast() gets called.  It also doesn't get marked dirty
until just before release, and userspace may access the memory before
then (for debugging the guest, emulated DMA, etc).  If that's not a
problem, it shouldn't be a problem here either.

-Scott


* Re: [PATCH v5 5/5] KVM: PPC: e500: MMU API
  2011-07-07 23:41 [PATCH v5 5/5] KVM: PPC: e500: MMU API Scott Wood
                   ` (13 preceding siblings ...)
  2011-08-08 23:13 ` Scott Wood
@ 2011-08-13 15:14 ` Benjamin Herrenschmidt
  2011-08-15 15:03 ` Scott Wood
                   ` (2 subsequent siblings)
  17 siblings, 0 replies; 19+ messages in thread
From: Benjamin Herrenschmidt @ 2011-08-13 15:14 UTC (permalink / raw)
  To: kvm-ppc

On Mon, 2011-07-18 at 11:18 -0500, Scott Wood wrote:

> > Does this work? Why do we need to set them dirty in the first place? If the shared tlb pages are on file backed storage, we're screwed under memory pressure either way and they'd just get evicted despite us writing to them. Or does vmap pin them? Either way, they're either pinned or not. And if they're not, dirtying them here shouldn't really buy us anything, no?
> 
> They're pinned by get_user_pages_fast().  We (potentially) write to them, so
> we should mark them dirty, because they are dirty.  It's up to the rest
> of Linux what to do with that.  Will being pinned stop updates from being
> written out if it is file-backed?  And eventually the vm will be destroyed
> (or the tlb reconfigured) and the pages will be unpinned.

Note that gup or gup_fast won't guarantee that the virtual->physical
mapping remains.

IE. the backing page itself will remain around, but it could be broken
off the mapping and another page can have taken its place in qemu
address space.

(Think page migration for example).

Cheers,
Ben.



* Re: [PATCH v5 5/5] KVM: PPC: e500: MMU API
  2011-07-07 23:41 [PATCH v5 5/5] KVM: PPC: e500: MMU API Scott Wood
                   ` (14 preceding siblings ...)
  2011-08-13 15:14 ` Benjamin Herrenschmidt
@ 2011-08-15 15:03 ` Scott Wood
  2011-08-15 15:15 ` Benjamin Herrenschmidt
  2011-08-15 20:55 ` Scott Wood
  17 siblings, 0 replies; 19+ messages in thread
From: Scott Wood @ 2011-08-15 15:03 UTC (permalink / raw)
  To: kvm-ppc

On 08/13/2011 10:14 AM, Benjamin Herrenschmidt wrote:
> On Mon, 2011-07-18 at 11:18 -0500, Scott Wood wrote:
> 
>>> Does this work? Why do we need to set them dirty in the first place? If the shared tlb pages are on file backed storage, we're screwed under memory pressure either way and they'd just get evicted despite us writing to them. Or does vmap pin them? Either way, they're either pinned or not. And if they're not, dirtying them here shouldn't really buy us anything, no?
>>
>> They're pinned by get_user_pages_fast().  We (potentially) write to them, so
>> we should mark them dirty, because they are dirty.  It's up to the rest
>> of Linux what to do with that.  Will being pinned stop updates from being
>> written out if it is file-backed?  And eventually the vm will be destroyed
>> (or the tlb reconfigured) and the pages will be unpinned.
> 
> Note that gup or gup_fast won't guarantee that the virtual->physical
> mapping remains.
> 
> IE. the backing page itself will remain around, but it could be broken
> off the mapping and another page can have taken its place in qemu
> address space.
> 
> (Think page migration for example).

How would that work if gup is being used to implement read()?  Wouldn't
the data be written to the wrong place?

-Scott



* Re: [PATCH v5 5/5] KVM: PPC: e500: MMU API
  2011-07-07 23:41 [PATCH v5 5/5] KVM: PPC: e500: MMU API Scott Wood
                   ` (15 preceding siblings ...)
  2011-08-15 15:03 ` Scott Wood
@ 2011-08-15 15:15 ` Benjamin Herrenschmidt
  2011-08-15 20:55 ` Scott Wood
  17 siblings, 0 replies; 19+ messages in thread
From: Benjamin Herrenschmidt @ 2011-08-15 15:15 UTC (permalink / raw)
  To: kvm-ppc

On Mon, 2011-08-15 at 10:03 -0500, Scott Wood wrote:
> On 08/13/2011 10:14 AM, Benjamin Herrenschmidt wrote:
> > On Mon, 2011-07-18 at 11:18 -0500, Scott Wood wrote:
> > 
> >>> Does this work? Why do we need to set them dirty in the first place? If the shared tlb pages are on file backed storage, we're screwed under memory pressure either way and they'd just get evicted despite us writing to them. Or does vmap pin them? Either way, they're either pinned or not. And if they're not, dirtying them here shouldn't really buy us anything, no?
> >>
> >> They're pinned by get_user_pages_fast().  We (potentially) write to them, so
> >> we should mark them dirty, because they are dirty.  It's up to the rest
> >> of Linux what to do with that.  Will being pinned stop updates from being
> >> written out if it is file-backed?  And eventually the vm will be destroyed
> >> (or the tlb reconfigured) and the pages will be unpinned.
> > 
> > Note that gup or gup_fast won't guarantee that the virtual->physical
> > mapping remains.
> > 
> > IE. the backing page itself will remain around, but it could be broken
> > off the mapping and another page can have taken its place in qemu
> > address space.
> > 
> > (Think page migration for example).
> 
> How would that work if gup is being used to implement read()?  Wouldn't
> the data be written to the wrong place?

If it drops the mmap_sem, I suppose so. You'll have to talk to the vm
folks, they are the ones who warned me against gup.

Cheers,
Ben.




* Re: [PATCH v5 5/5] KVM: PPC: e500: MMU API
  2011-07-07 23:41 [PATCH v5 5/5] KVM: PPC: e500: MMU API Scott Wood
                   ` (16 preceding siblings ...)
  2011-08-15 15:15 ` Benjamin Herrenschmidt
@ 2011-08-15 20:55 ` Scott Wood
  17 siblings, 0 replies; 19+ messages in thread
From: Scott Wood @ 2011-08-15 20:55 UTC (permalink / raw)
  To: kvm-ppc

On 08/15/2011 10:15 AM, Benjamin Herrenschmidt wrote:
> On Mon, 2011-08-15 at 10:03 -0500, Scott Wood wrote:
>> On 08/13/2011 10:14 AM, Benjamin Herrenschmidt wrote:
>>> On Mon, 2011-07-18 at 11:18 -0500, Scott Wood wrote:
>>>
>>>>> Does this work? Why do we need to set them dirty in the first place? If the shared tlb pages are on file backed storage, we're screwed under memory pressure either way and they'd just get evicted despite us writing to them. Or does vmap pin them? Either way, they're either pinned or not. And if they're not, dirtying them here shouldn't really buy us anything, no?
>>>>
>>>> They're pinned by get_user_pages_fast().  We (potentially) write to them, so
>>>> we should mark them dirty, because they are dirty.  It's up to the rest
>>>> of Linux what to do with that.  Will being pinned stop updates from being
>>>> written out if it is file-backed?  And eventually the vm will be destroyed
>>>> (or the tlb reconfigured) and the pages will be unpinned.
>>>
>>> Note that gup or gup_fast won't guarantee that the virtual->physical
>>> mapping remains.
>>>
>>> IE. the backing page itself will remain around, but it could be broken
>>> off the mapping and another page can have taken its place in qemu
>>> address space.
>>>
>>> (Think page migration for example).
>>
>> How would that work if gup is being used to implement read()?  Wouldn't
>> the data be written to the wrong place?
> 
> If it drops the mmap_sem, I suppose so. You'll have to talk to the vm
> folks, they are the ones who warned me against gup.

It seems that migration does check page_count and insist that there be
no unexpected references before migrating.  See step 7 in
Documentation/vm/page_migration.

Likewise with swap -- see is_page_cache_freeable() and __remove_mapping().

KVM is already using gup in a very similar way for the guest's memory pages.

-Scott
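
The reference-count argument above can be pictured with a toy model.
This is not kernel code: struct toy_page and the toy_*() helpers are
invented for illustration, and the real checks in the migration and
swap paths account for more reference holders than this.  The model
only assumes that a pin taken by something like get_user_pages_fast()
shows up as a reference beyond those held by page-table mappings.

```c
#include <stdbool.h>

/* Toy stand-in for struct page: only the counters the argument needs. */
struct toy_page {
	int refcount;	/* every reference, including pins */
	int mapcount;	/* references held by page-table mappings */
};

/* Model of a gup-style pin: one extra reference, no new mapping. */
void toy_pin(struct toy_page *p)
{
	p->refcount++;
}

/* Model of releasing that pin. */
void toy_unpin(struct toy_page *p)
{
	p->refcount--;
}

/*
 * Migration (or freeing the page after swap-out) goes ahead only when
 * every reference is accounted for by a mapping.  An outstanding pin
 * is an "unexpected" reference and blocks it.
 */
bool toy_can_migrate(const struct toy_page *p)
{
	return p->refcount == p->mapcount;
}
```

In this model, a page mapped once with no pins can be migrated; after
toy_pin() it cannot, and after the matching toy_unpin() it can again.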



end of thread, other threads:[~2011-08-15 20:55 UTC | newest]

Thread overview: 19+ messages
2011-07-07 23:41 [PATCH v5 5/5] KVM: PPC: e500: MMU API Scott Wood
2011-07-08 12:57 ` Alexander Graf
2011-07-18 10:09 ` Alexander Graf
2011-07-18 16:18 ` Scott Wood
2011-07-18 16:33 ` Alexander Graf
2011-07-18 18:08 ` Scott Wood
2011-07-18 21:44 ` Alexander Graf
2011-07-19  8:36 ` Johannes Weiner
2011-07-19  8:51 ` Alexander Graf
2011-07-19 11:20 ` Johannes Weiner
2011-07-24  9:16 ` Alexander Graf
2011-07-25 19:25 ` Scott Wood
2011-07-25 21:50 ` Alexander Graf
2011-08-08  8:49 ` Johannes Weiner
2011-08-08 23:13 ` Scott Wood
2011-08-13 15:14 ` Benjamin Herrenschmidt
2011-08-15 15:03 ` Scott Wood
2011-08-15 15:15 ` Benjamin Herrenschmidt
2011-08-15 20:55 ` Scott Wood
