* [Qemu-devel] [PATCH v7 0/3] Dynamic TLB sizing @ 2019-01-16 17:01 Emilio G. Cota 2019-01-16 17:01 ` [Qemu-devel] [PATCH v7 1/3] cputlb: do not evict empty entries to the vtlb Emilio G. Cota ` (3 more replies) 0 siblings, 4 replies; 9+ messages in thread From: Emilio G. Cota @ 2019-01-16 17:01 UTC (permalink / raw) To: qemu-devel; +Cc: Richard Henderson, Alex Bennée v6: https://lists.gnu.org/archive/html/qemu-devel/2019-01/msg02998.html Changes since v6: - Define TCG_TARGET_IMPLEMENTS_DYN_TLB for tcg/riscv (patch 2). - Fix --disable-tcg breakage (reported by Alex) by moving tlb_entry_is_empty to cputlb.c, since the function's only caller is in that file (patch 1). You can fetch this series from: https://github.com/cota/qemu/tree/tlb-dyn-v7 Thanks, Emilio ^ permalink raw reply [flat|nested] 9+ messages in thread
* [Qemu-devel] [PATCH v7 1/3] cputlb: do not evict empty entries to the vtlb 2019-01-16 17:01 [Qemu-devel] [PATCH v7 0/3] Dynamic TLB sizing Emilio G. Cota @ 2019-01-16 17:01 ` Emilio G. Cota 2019-01-16 17:01 ` [Qemu-devel] [PATCH v7 2/3] tcg: introduce dynamic TLB sizing Emilio G. Cota ` (2 subsequent siblings) 3 siblings, 0 replies; 9+ messages in thread From: Emilio G. Cota @ 2019-01-16 17:01 UTC (permalink / raw) To: qemu-devel; +Cc: Richard Henderson, Alex Bennée Currently we evict an entry to the victim TLB when it doesn't match the current address. But it could be that there's no match because the current entry is empty (i.e. all -1's, for instance via tlb_flush). Do not evict the entry to the vtlb in that case. This change will help us keep track of the TLB's use rate, which we'll use to implement a policy for dynamic TLB sizing. Reviewed-by: Alex Bennée <alex.bennee@linaro.org> Reviewed-by: Richard Henderson <richard.henderson@linaro.org> Signed-off-by: Emilio G. Cota <cota@braap.org> --- accel/tcg/cputlb.c | 11 ++++++++++- 1 file changed, 10 insertions(+), 1 deletion(-) diff --git a/accel/tcg/cputlb.c b/accel/tcg/cputlb.c index af6bd8ccf9..10f1150c62 100644 --- a/accel/tcg/cputlb.c +++ b/accel/tcg/cputlb.c @@ -224,6 +224,15 @@ static inline bool tlb_hit_page_anyprot(CPUTLBEntry *tlb_entry, tlb_hit_page(tlb_entry->addr_code, page); } +/** + * tlb_entry_is_empty - return true if the entry is not in use + * @te: pointer to CPUTLBEntry + */ +static inline bool tlb_entry_is_empty(const CPUTLBEntry *te) +{ + return te->addr_read == -1 && te->addr_write == -1 && te->addr_code == -1; +} + /* Called with tlb_c.lock held */ static inline void tlb_flush_entry_locked(CPUTLBEntry *tlb_entry, target_ulong page) @@ -591,7 +600,7 @@ void tlb_set_page_with_attrs(CPUState *cpu, target_ulong vaddr, * Only evict the old entry to the victim tlb if it's for a * different page; otherwise just overwrite the stale data. */ - if (!tlb_hit_page_anyprot(te, vaddr_page)) { + if (!tlb_hit_page_anyprot(te, vaddr_page) && !tlb_entry_is_empty(te)) { unsigned vidx = env->tlb_d[mmu_idx].vindex++ % CPU_VTLB_SIZE; CPUTLBEntry *tv = &env->tlb_v_table[mmu_idx][vidx]; -- 2.17.1 ^ permalink raw reply related [flat|nested] 9+ messages in thread
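As a side note for readers tracing the eviction logic: a flushed entry is filled with -1 bytes, so all three address fields compare equal to -1 and the new check lets tlb_set_page_with_attrs overwrite the slot in place instead of copying stale data into the victim TLB. A standalone sketch of that interaction (simplified, stand-in types; not the QEMU structures themselves):

#include <assert.h>
#include <stdbool.h>
#include <stdint.h>
#include <string.h>

typedef uint64_t target_ulong;   /* stand-in; the real width is target-dependent */

typedef struct {
    target_ulong addr_read;
    target_ulong addr_write;
    target_ulong addr_code;
} TLBEntryDemo;

static bool entry_is_empty(const TLBEntryDemo *te)
{
    return te->addr_read == (target_ulong)-1 &&
           te->addr_write == (target_ulong)-1 &&
           te->addr_code == (target_ulong)-1;
}

int main(void)
{
    TLBEntryDemo te;

    memset(&te, -1, sizeof(te));   /* what a TLB flush does to each entry */
    assert(entry_is_empty(&te));   /* overwrite in place, skip the vtlb copy */

    te.addr_write = 0x1000;        /* a live mapping for some other page */
    assert(!entry_is_empty(&te));  /* still worth evicting to the vtlb */
    return 0;
}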
* [Qemu-devel] [PATCH v7 2/3] tcg: introduce dynamic TLB sizing 2019-01-16 17:01 [Qemu-devel] [PATCH v7 0/3] Dynamic TLB sizing Emilio G. Cota 2019-01-16 17:01 ` [Qemu-devel] [PATCH v7 1/3] cputlb: do not evict empty entries to the vtlb Emilio G. Cota @ 2019-01-16 17:01 ` Emilio G. Cota 2019-01-18 15:01 ` Alex Bennée 2019-01-16 17:01 ` [Qemu-devel] [PATCH v7 3/3] tcg/i386: enable " Emilio G. Cota 2019-01-17 16:31 ` [Qemu-devel] [PATCH v7 0/3] Dynamic " Alex Bennée 3 siblings, 1 reply; 9+ messages in thread From: Emilio G. Cota @ 2019-01-16 17:01 UTC (permalink / raw) To: qemu-devel; +Cc: Richard Henderson, Alex Bennée Disabled in all TCG backends for now. Signed-off-by: Emilio G. Cota <cota@braap.org> --- include/exec/cpu-defs.h | 57 ++++++++++- include/exec/cpu_ldst.h | 21 ++++ tcg/aarch64/tcg-target.h | 1 + tcg/arm/tcg-target.h | 1 + tcg/i386/tcg-target.h | 1 + tcg/mips/tcg-target.h | 1 + tcg/ppc/tcg-target.h | 1 + tcg/riscv/tcg-target.h | 1 + tcg/s390/tcg-target.h | 1 + tcg/sparc/tcg-target.h | 1 + tcg/tci/tcg-target.h | 1 + accel/tcg/cputlb.c | 202 ++++++++++++++++++++++++++++++++++++++- 12 files changed, 282 insertions(+), 7 deletions(-) diff --git a/include/exec/cpu-defs.h b/include/exec/cpu-defs.h index 6a60f94a41..191a1e021f 100644 --- a/include/exec/cpu-defs.h +++ b/include/exec/cpu-defs.h @@ -67,6 +67,28 @@ typedef uint64_t target_ulong; #define CPU_TLB_ENTRY_BITS 5 #endif +#if TCG_TARGET_IMPLEMENTS_DYN_TLB +#define CPU_TLB_DYN_MIN_BITS 6 +#define CPU_TLB_DYN_DEFAULT_BITS 8 + + +# if HOST_LONG_BITS == 32 +/* Make sure we do not require a double-word shift for the TLB load */ +# define CPU_TLB_DYN_MAX_BITS (32 - TARGET_PAGE_BITS) +# else /* HOST_LONG_BITS == 64 */ +/* + * Assuming TARGET_PAGE_BITS==12, with 2**22 entries we can cover 2**(22+12) == + * 2**34 == 16G of address space. This is roughly what one would expect a + * TLB to cover in a modern (as of 2018) x86_64 CPU. For instance, Intel + * Skylake's Level-2 STLB has 16 1G entries. + * Also, make sure we do not size the TLB past the guest's address space. + */ +# define CPU_TLB_DYN_MAX_BITS \ + MIN(22, TARGET_VIRT_ADDR_SPACE_BITS - TARGET_PAGE_BITS) +# endif + +#else /* !TCG_TARGET_IMPLEMENTS_DYN_TLB */ + /* TCG_TARGET_TLB_DISPLACEMENT_BITS is used in CPU_TLB_BITS to ensure that * the TLB is not unnecessarily small, but still small enough for the * TLB lookup instruction sequence used by the TCG target. @@ -98,6 +120,7 @@ typedef uint64_t target_ulong; NB_MMU_MODES <= 8 ? 3 : 4)) #define CPU_TLB_SIZE (1 << CPU_TLB_BITS) +#endif /* TCG_TARGET_IMPLEMENTS_DYN_TLB */ typedef struct CPUTLBEntry { /* bit TARGET_LONG_BITS to TARGET_PAGE_BITS : virtual address @@ -141,6 +164,18 @@ typedef struct CPUIOTLBEntry { MemTxAttrs attrs; } CPUIOTLBEntry; +/** + * struct CPUTLBWindow + * @begin_ns: host time (in ns) at the beginning of the time window + * @max_entries: maximum number of entries observed in the window + * + * See also: tlb_mmu_resize_locked() + */ +typedef struct CPUTLBWindow { + int64_t begin_ns; + size_t max_entries; +} CPUTLBWindow; + typedef struct CPUTLBDesc { /* * Describe a region covering all of the large pages allocated @@ -152,6 +187,10 @@ typedef struct CPUTLBDesc { target_ulong large_page_mask; /* The next index to use in the tlb victim table. 
*/ size_t vindex; +#if TCG_TARGET_IMPLEMENTS_DYN_TLB + CPUTLBWindow window; + size_t n_used_entries; +#endif } CPUTLBDesc; /* @@ -176,6 +215,20 @@ typedef struct CPUTLBCommon { size_t elide_flush_count; } CPUTLBCommon; +#if TCG_TARGET_IMPLEMENTS_DYN_TLB +# define CPU_TLB \ + /* tlb_mask[i] contains (n_entries - 1) << CPU_TLB_ENTRY_BITS */ \ + uintptr_t tlb_mask[NB_MMU_MODES]; \ + CPUTLBEntry *tlb_table[NB_MMU_MODES]; +# define CPU_IOTLB \ + CPUIOTLBEntry *iotlb[NB_MMU_MODES]; +#else +# define CPU_TLB \ + CPUTLBEntry tlb_table[NB_MMU_MODES][CPU_TLB_SIZE]; +# define CPU_IOTLB \ + CPUIOTLBEntry iotlb[NB_MMU_MODES][CPU_TLB_SIZE]; +#endif + /* * The meaning of each of the MMU modes is defined in the target code. * Note that NB_MMU_MODES is not yet defined; we can only reference it @@ -184,9 +237,9 @@ typedef struct CPUTLBCommon { #define CPU_COMMON_TLB \ CPUTLBCommon tlb_c; \ CPUTLBDesc tlb_d[NB_MMU_MODES]; \ - CPUTLBEntry tlb_table[NB_MMU_MODES][CPU_TLB_SIZE]; \ + CPU_TLB \ CPUTLBEntry tlb_v_table[NB_MMU_MODES][CPU_VTLB_SIZE]; \ - CPUIOTLBEntry iotlb[NB_MMU_MODES][CPU_TLB_SIZE]; \ + CPU_IOTLB \ CPUIOTLBEntry iotlb_v[NB_MMU_MODES][CPU_VTLB_SIZE]; #else diff --git a/include/exec/cpu_ldst.h b/include/exec/cpu_ldst.h index 959068495a..83b2907d86 100644 --- a/include/exec/cpu_ldst.h +++ b/include/exec/cpu_ldst.h @@ -135,6 +135,21 @@ static inline target_ulong tlb_addr_write(const CPUTLBEntry *entry) #endif } +#if TCG_TARGET_IMPLEMENTS_DYN_TLB +/* Find the TLB index corresponding to the mmu_idx + address pair. */ +static inline uintptr_t tlb_index(CPUArchState *env, uintptr_t mmu_idx, + target_ulong addr) +{ + uintptr_t size_mask = env->tlb_mask[mmu_idx] >> CPU_TLB_ENTRY_BITS; + + return (addr >> TARGET_PAGE_BITS) & size_mask; +} + +static inline size_t tlb_n_entries(CPUArchState *env, uintptr_t mmu_idx) +{ + return (env->tlb_mask[mmu_idx] >> CPU_TLB_ENTRY_BITS) + 1; +} +#else /* Find the TLB index corresponding to the mmu_idx + address pair. */ static inline uintptr_t tlb_index(CPUArchState *env, uintptr_t mmu_idx, target_ulong addr) @@ -142,6 +157,12 @@ static inline uintptr_t tlb_index(CPUArchState *env, uintptr_t mmu_idx, return (addr >> TARGET_PAGE_BITS) & (CPU_TLB_SIZE - 1); } +static inline size_t tlb_n_entries(CPUArchState *env, uintptr_t mmu_idx) +{ + return CPU_TLB_SIZE; +} +#endif /* TCG_TARGET_IMPLEMENTS_DYN_TLB */ + /* Find the TLB entry corresponding to the mmu_idx + address pair. 
*/ static inline CPUTLBEntry *tlb_entry(CPUArchState *env, uintptr_t mmu_idx, target_ulong addr) diff --git a/tcg/aarch64/tcg-target.h b/tcg/aarch64/tcg-target.h index f966a4fcb3..bff91c5aa0 100644 --- a/tcg/aarch64/tcg-target.h +++ b/tcg/aarch64/tcg-target.h @@ -15,6 +15,7 @@ #define TCG_TARGET_INSN_UNIT_SIZE 4 #define TCG_TARGET_TLB_DISPLACEMENT_BITS 24 +#define TCG_TARGET_IMPLEMENTS_DYN_TLB 0 #undef TCG_TARGET_STACK_GROWSUP typedef enum { diff --git a/tcg/arm/tcg-target.h b/tcg/arm/tcg-target.h index 16172f73a3..c5a7064bdc 100644 --- a/tcg/arm/tcg-target.h +++ b/tcg/arm/tcg-target.h @@ -60,6 +60,7 @@ extern int arm_arch; #undef TCG_TARGET_STACK_GROWSUP #define TCG_TARGET_INSN_UNIT_SIZE 4 #define TCG_TARGET_TLB_DISPLACEMENT_BITS 16 +#define TCG_TARGET_IMPLEMENTS_DYN_TLB 0 typedef enum { TCG_REG_R0 = 0, diff --git a/tcg/i386/tcg-target.h b/tcg/i386/tcg-target.h index f378d29568..bd7d37c7ef 100644 --- a/tcg/i386/tcg-target.h +++ b/tcg/i386/tcg-target.h @@ -27,6 +27,7 @@ #define TCG_TARGET_INSN_UNIT_SIZE 1 #define TCG_TARGET_TLB_DISPLACEMENT_BITS 31 +#define TCG_TARGET_IMPLEMENTS_DYN_TLB 0 #ifdef __x86_64__ # define TCG_TARGET_REG_BITS 64 diff --git a/tcg/mips/tcg-target.h b/tcg/mips/tcg-target.h index 5cb8672470..8600eefd9a 100644 --- a/tcg/mips/tcg-target.h +++ b/tcg/mips/tcg-target.h @@ -37,6 +37,7 @@ #define TCG_TARGET_INSN_UNIT_SIZE 4 #define TCG_TARGET_TLB_DISPLACEMENT_BITS 16 +#define TCG_TARGET_IMPLEMENTS_DYN_TLB 0 #define TCG_TARGET_NB_REGS 32 typedef enum { diff --git a/tcg/ppc/tcg-target.h b/tcg/ppc/tcg-target.h index 52c1bb04b1..b51854b5cf 100644 --- a/tcg/ppc/tcg-target.h +++ b/tcg/ppc/tcg-target.h @@ -34,6 +34,7 @@ #define TCG_TARGET_NB_REGS 32 #define TCG_TARGET_INSN_UNIT_SIZE 4 #define TCG_TARGET_TLB_DISPLACEMENT_BITS 16 +#define TCG_TARGET_IMPLEMENTS_DYN_TLB 0 typedef enum { TCG_REG_R0, TCG_REG_R1, TCG_REG_R2, TCG_REG_R3, diff --git a/tcg/riscv/tcg-target.h b/tcg/riscv/tcg-target.h index 60918cacb4..1eb032626c 100644 --- a/tcg/riscv/tcg-target.h +++ b/tcg/riscv/tcg-target.h @@ -33,6 +33,7 @@ #define TCG_TARGET_INSN_UNIT_SIZE 4 #define TCG_TARGET_TLB_DISPLACEMENT_BITS 20 +#define TCG_TARGET_IMPLEMENTS_DYN_TLB 0 #define TCG_TARGET_NB_REGS 32 typedef enum { diff --git a/tcg/s390/tcg-target.h b/tcg/s390/tcg-target.h index 853ed6e7aa..394b545369 100644 --- a/tcg/s390/tcg-target.h +++ b/tcg/s390/tcg-target.h @@ -27,6 +27,7 @@ #define TCG_TARGET_INSN_UNIT_SIZE 2 #define TCG_TARGET_TLB_DISPLACEMENT_BITS 19 +#define TCG_TARGET_IMPLEMENTS_DYN_TLB 0 typedef enum TCGReg { TCG_REG_R0 = 0, diff --git a/tcg/sparc/tcg-target.h b/tcg/sparc/tcg-target.h index a0ed2a3342..dc0a227890 100644 --- a/tcg/sparc/tcg-target.h +++ b/tcg/sparc/tcg-target.h @@ -29,6 +29,7 @@ #define TCG_TARGET_INSN_UNIT_SIZE 4 #define TCG_TARGET_TLB_DISPLACEMENT_BITS 32 +#define TCG_TARGET_IMPLEMENTS_DYN_TLB 0 #define TCG_TARGET_NB_REGS 32 typedef enum { diff --git a/tcg/tci/tcg-target.h b/tcg/tci/tcg-target.h index 086f34e69a..816dc4697c 100644 --- a/tcg/tci/tcg-target.h +++ b/tcg/tci/tcg-target.h @@ -43,6 +43,7 @@ #define TCG_TARGET_INTERPRETER 1 #define TCG_TARGET_INSN_UNIT_SIZE 1 #define TCG_TARGET_TLB_DISPLACEMENT_BITS 32 +#define TCG_TARGET_IMPLEMENTS_DYN_TLB 0 #if UINTPTR_MAX == UINT32_MAX # define TCG_TARGET_REG_BITS 32 diff --git a/accel/tcg/cputlb.c b/accel/tcg/cputlb.c index 10f1150c62..a3a1614f0e 100644 --- a/accel/tcg/cputlb.c +++ b/accel/tcg/cputlb.c @@ -74,6 +74,187 @@ QEMU_BUILD_BUG_ON(sizeof(target_ulong) > sizeof(run_on_cpu_data)); QEMU_BUILD_BUG_ON(NB_MMU_MODES > 16); #define ALL_MMUIDX_BITS ((1 << 
NB_MMU_MODES) - 1) +#if TCG_TARGET_IMPLEMENTS_DYN_TLB +static inline size_t sizeof_tlb(CPUArchState *env, uintptr_t mmu_idx) +{ + return env->tlb_mask[mmu_idx] + (1 << CPU_TLB_ENTRY_BITS); +} + +static void tlb_window_reset(CPUTLBWindow *window, int64_t ns, + size_t max_entries) +{ + window->begin_ns = ns; + window->max_entries = max_entries; +} + +static void tlb_dyn_init(CPUArchState *env) +{ + int i; + + for (i = 0; i < NB_MMU_MODES; i++) { + CPUTLBDesc *desc = &env->tlb_d[i]; + size_t n_entries = 1 << CPU_TLB_DYN_DEFAULT_BITS; + + tlb_window_reset(&desc->window, get_clock_realtime(), 0); + desc->n_used_entries = 0; + env->tlb_mask[i] = (n_entries - 1) << CPU_TLB_ENTRY_BITS; + env->tlb_table[i] = g_new(CPUTLBEntry, n_entries); + env->iotlb[i] = g_new(CPUIOTLBEntry, n_entries); + } +} + +/** + * tlb_mmu_resize_locked() - perform TLB resize bookkeeping; resize if necessary + * @env: CPU that owns the TLB + * @mmu_idx: MMU index of the TLB + * + * Called with tlb_lock_held. + * + * We have two main constraints when resizing a TLB: (1) we only resize it + * on a TLB flush (otherwise we'd have to take a perf hit by either rehashing + * the array or unnecessarily flushing it), which means we do not control how + * frequently the resizing can occur; (2) we don't have access to the guest's + * future scheduling decisions, and therefore have to decide the magnitude of + * the resize based on past observations. + * + * In general, a memory-hungry process can benefit greatly from an appropriately + * sized TLB, since a guest TLB miss is very expensive. This doesn't mean that + * we just have to make the TLB as large as possible; while an oversized TLB + * results in minimal TLB miss rates, it also takes longer to be flushed + * (flushes can be _very_ frequent), and the reduced locality can also hurt + * performance. + * + * To achieve near-optimal performance for all kinds of workloads, we: + * + * 1. Aggressively increase the size of the TLB when the use rate of the + * TLB being flushed is high, since it is likely that in the near future this + * memory-hungry process will execute again, and its memory hungriness will + * probably be similar. + * + * 2. Slowly reduce the size of the TLB as the use rate declines over a + * reasonably large time window. The rationale is that if in such a time window + * we have not observed a high TLB use rate, it is likely that we won't observe + * it in the near future. In that case, once a time window expires we downsize + * the TLB to match the maximum use rate observed in the window. + * + * 3. Try to keep the maximum use rate in a time window in the 30-70% range, + * since in that range performance is likely near-optimal. Recall that the TLB + * is direct mapped, so we want the use rate to be low (or at least not too + * high), since otherwise we are likely to have a significant amount of + * conflict misses. 
+ */ +static void tlb_mmu_resize_locked(CPUArchState *env, int mmu_idx) +{ + CPUTLBDesc *desc = &env->tlb_d[mmu_idx]; + size_t old_size = tlb_n_entries(env, mmu_idx); + size_t rate; + size_t new_size = old_size; + int64_t now = get_clock_realtime(); + int64_t window_len_ms = 100; + int64_t window_len_ns = window_len_ms * 1000 * 1000; + bool window_expired = now > desc->window.begin_ns + window_len_ns; + + if (desc->n_used_entries > desc->window.max_entries) { + desc->window.max_entries = desc->n_used_entries; + } + rate = desc->window.max_entries * 100 / old_size; + + if (rate > 70) { + new_size = MIN(old_size << 1, 1 << CPU_TLB_DYN_MAX_BITS); + } else if (rate < 30 && window_expired) { + size_t ceil = pow2ceil(desc->window.max_entries); + size_t expected_rate = desc->window.max_entries * 100 / ceil; + + /* + * Avoid undersizing when the max number of entries seen is just below + * a pow2. For instance, if max_entries == 1025, the expected use rate + * would be 1025/2048==50%. However, if max_entries == 1023, we'd get + * 1023/1024==99.9% use rate, so we'd likely end up doubling the size + * later. Thus, make sure that the expected use rate remains below 70%. + * (and since we double the size, that means the lowest rate we'd + * expect to get is 35%, which is still in the 30-70% range where + * we consider that the size is appropriate.) + */ + if (expected_rate > 70) { + ceil *= 2; + } + new_size = MAX(ceil, 1 << CPU_TLB_DYN_MIN_BITS); + } + + if (new_size == old_size) { + if (window_expired) { + tlb_window_reset(&desc->window, now, desc->n_used_entries); + } + return; + } + + g_free(env->tlb_table[mmu_idx]); + g_free(env->iotlb[mmu_idx]); + + tlb_window_reset(&desc->window, now, 0); + /* desc->n_used_entries is cleared by the caller */ + env->tlb_mask[mmu_idx] = (new_size - 1) << CPU_TLB_ENTRY_BITS; + env->tlb_table[mmu_idx] = g_try_new(CPUTLBEntry, new_size); + env->iotlb[mmu_idx] = g_try_new(CPUIOTLBEntry, new_size); + /* + * If the allocations fail, try smaller sizes. We just freed some + * memory, so going back to half of new_size has a good chance of working. + * Increased memory pressure elsewhere in the system might cause the + * allocations to fail though, so we progressively reduce the allocation + * size, aborting if we cannot even allocate the smallest TLB we support. 
+ */ + while (env->tlb_table[mmu_idx] == NULL || env->iotlb[mmu_idx] == NULL) { + if (new_size == (1 << CPU_TLB_DYN_MIN_BITS)) { + error_report("%s: %s", __func__, strerror(errno)); + abort(); + } + new_size = MAX(new_size >> 1, 1 << CPU_TLB_DYN_MIN_BITS); + env->tlb_mask[mmu_idx] = (new_size - 1) << CPU_TLB_ENTRY_BITS; + + g_free(env->tlb_table[mmu_idx]); + g_free(env->iotlb[mmu_idx]); + env->tlb_table[mmu_idx] = g_try_new(CPUTLBEntry, new_size); + env->iotlb[mmu_idx] = g_try_new(CPUIOTLBEntry, new_size); + } +} + +static inline void tlb_table_flush_by_mmuidx(CPUArchState *env, int mmu_idx) +{ + tlb_mmu_resize_locked(env, mmu_idx); + memset(env->tlb_table[mmu_idx], -1, sizeof_tlb(env, mmu_idx)); + env->tlb_d[mmu_idx].n_used_entries = 0; +} + +static inline void tlb_n_used_entries_inc(CPUArchState *env, uintptr_t mmu_idx) +{ + env->tlb_d[mmu_idx].n_used_entries++; +} + +static inline void tlb_n_used_entries_dec(CPUArchState *env, uintptr_t mmu_idx) +{ + env->tlb_d[mmu_idx].n_used_entries--; +} + +#else /* !TCG_TARGET_IMPLEMENTS_DYN_TLB */ + +static inline void tlb_dyn_init(CPUArchState *env) +{ +} + +static inline void tlb_table_flush_by_mmuidx(CPUArchState *env, int mmu_idx) +{ + memset(env->tlb_table[mmu_idx], -1, sizeof(env->tlb_table[0])); +} + +static inline void tlb_n_used_entries_inc(CPUArchState *env, uintptr_t mmu_idx) +{ +} + +static inline void tlb_n_used_entries_dec(CPUArchState *env, uintptr_t mmu_idx) +{ +} +#endif /* TCG_TARGET_IMPLEMENTS_DYN_TLB */ + void tlb_init(CPUState *cpu) { CPUArchState *env = cpu->env_ptr; @@ -82,6 +263,8 @@ void tlb_init(CPUState *cpu) /* Ensure that cpu_reset performs a full flush. */ env->tlb_c.dirty = ALL_MMUIDX_BITS; + + tlb_dyn_init(env); } /* flush_all_helper: run fn across all cpus @@ -122,7 +305,7 @@ void tlb_flush_counts(size_t *pfull, size_t *ppart, size_t *pelide) static void tlb_flush_one_mmuidx_locked(CPUArchState *env, int mmu_idx) { - memset(env->tlb_table[mmu_idx], -1, sizeof(env->tlb_table[0])); + tlb_table_flush_by_mmuidx(env, mmu_idx); memset(env->tlb_v_table[mmu_idx], -1, sizeof(env->tlb_v_table[0])); env->tlb_d[mmu_idx].large_page_addr = -1; env->tlb_d[mmu_idx].large_page_mask = -1; @@ -234,12 +417,14 @@ static inline bool tlb_entry_is_empty(const CPUTLBEntry *te) } /* Called with tlb_c.lock held */ -static inline void tlb_flush_entry_locked(CPUTLBEntry *tlb_entry, +static inline bool tlb_flush_entry_locked(CPUTLBEntry *tlb_entry, target_ulong page) { if (tlb_hit_page_anyprot(tlb_entry, page)) { memset(tlb_entry, -1, sizeof(*tlb_entry)); + return true; } + return false; } /* Called with tlb_c.lock held */ @@ -250,7 +435,9 @@ static inline void tlb_flush_vtlb_page_locked(CPUArchState *env, int mmu_idx, assert_cpu_is_self(ENV_GET_CPU(env)); for (k = 0; k < CPU_VTLB_SIZE; k++) { - tlb_flush_entry_locked(&env->tlb_v_table[mmu_idx][k], page); + if (tlb_flush_entry_locked(&env->tlb_v_table[mmu_idx][k], page)) { + tlb_n_used_entries_dec(env, mmu_idx); + } } } @@ -267,7 +454,9 @@ static void tlb_flush_page_locked(CPUArchState *env, int midx, midx, lp_addr, lp_mask); tlb_flush_one_mmuidx_locked(env, midx); } else { - tlb_flush_entry_locked(tlb_entry(env, midx, page), page); + if (tlb_flush_entry_locked(tlb_entry(env, midx, page), page)) { + tlb_n_used_entries_dec(env, midx); + } tlb_flush_vtlb_page_locked(env, midx, page); } } @@ -444,8 +633,9 @@ void tlb_reset_dirty(CPUState *cpu, ram_addr_t start1, ram_addr_t length) qemu_spin_lock(&env->tlb_c.lock); for (mmu_idx = 0; mmu_idx < NB_MMU_MODES; mmu_idx++) { unsigned int i; + unsigned int 
n = tlb_n_entries(env, mmu_idx); - for (i = 0; i < CPU_TLB_SIZE; i++) { + for (i = 0; i < n; i++) { tlb_reset_dirty_range_locked(&env->tlb_table[mmu_idx][i], start1, length); } @@ -607,6 +797,7 @@ void tlb_set_page_with_attrs(CPUState *cpu, target_ulong vaddr, /* Evict the old entry into the victim tlb. */ copy_tlb_helper_locked(tv, te); env->iotlb_v[mmu_idx][vidx] = env->iotlb[mmu_idx][index]; + tlb_n_used_entries_dec(env, mmu_idx); } /* refill the tlb */ @@ -658,6 +849,7 @@ void tlb_set_page_with_attrs(CPUState *cpu, target_ulong vaddr, } copy_tlb_helper_locked(te, &tn); + tlb_n_used_entries_inc(env, mmu_idx); qemu_spin_unlock(&env->tlb_c.lock); } -- 2.17.1 ^ permalink raw reply related [flat|nested] 9+ messages in thread
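To make the sizing policy concrete: the per-flush decision in tlb_mmu_resize_locked() reduces to a few integer operations on the window statistics. Below is a minimal standalone model of that computation, with the thresholds and limits taken from the patch and a local pow2ceil_demo standing in for QEMU's pow2ceil; it is a sketch of the heuristic, not the QEMU code itself.

#include <stdbool.h>
#include <stddef.h>
#include <stdio.h>

#define DYN_MIN_ENTRIES (1u << 6)    /* CPU_TLB_DYN_MIN_BITS */
#define DYN_MAX_ENTRIES (1u << 22)   /* CPU_TLB_DYN_MAX_BITS, 64-bit host case */

#define MAX(a, b) ((a) > (b) ? (a) : (b))
#define MIN(a, b) ((a) < (b) ? (a) : (b))

static size_t pow2ceil_demo(size_t n)
{
    size_t p = 1;

    while (p < n) {
        p <<= 1;
    }
    return p;
}

/* Return the new TLB size given the current size and the window statistics. */
static size_t resize_decision(size_t old_size, size_t max_entries,
                              bool window_expired)
{
    size_t rate = max_entries * 100 / old_size;

    if (rate > 70) {
        return MIN(old_size << 1, DYN_MAX_ENTRIES);
    }
    if (rate < 30 && window_expired) {
        size_t ceil = pow2ceil_demo(max_entries);
        size_t expected_rate = max_entries * 100 / ceil;

        if (expected_rate > 70) {
            ceil *= 2;   /* avoid landing just below a power of two */
        }
        return MAX(ceil, DYN_MIN_ENTRIES);
    }
    return old_size;     /* keep the current size */
}

int main(void)
{
    /* 800 of 1024 entries used (78%): grow to 2048. */
    printf("%zu\n", resize_decision(1024, 800, false));
    /* Only 200 of 2048 entries used (9%) for a full window: shrink to 512. */
    printf("%zu\n", resize_decision(2048, 200, true));
    return 0;
}

Run as-is, the first call doubles a 1024-entry TLB whose window peaked at 800 used entries (78% use rate), while the second shrinks a 2048-entry TLB that peaked at 200 used entries (9%) over an expired window to 512 rather than 256, because 200/256 would already be a 78% expected use rate.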
* Re: [Qemu-devel] [PATCH v7 2/3] tcg: introduce dynamic TLB sizing 2019-01-16 17:01 ` [Qemu-devel] [PATCH v7 2/3] tcg: introduce dynamic TLB sizing Emilio G. Cota @ 2019-01-18 15:01 ` Alex Bennée 0 siblings, 0 replies; 9+ messages in thread From: Alex Bennée @ 2019-01-18 15:01 UTC (permalink / raw) To: Emilio G. Cota; +Cc: qemu-devel, Richard Henderson Emilio G. Cota <cota@braap.org> writes: > Disabled in all TCG backends for now. > > Signed-off-by: Emilio G. Cota <cota@braap.org> Reviewed-by: Alex Bennée <alex.bennee@linaro.org> > --- > include/exec/cpu-defs.h | 57 ++++++++++- > include/exec/cpu_ldst.h | 21 ++++ > tcg/aarch64/tcg-target.h | 1 + > tcg/arm/tcg-target.h | 1 + > tcg/i386/tcg-target.h | 1 + > tcg/mips/tcg-target.h | 1 + > tcg/ppc/tcg-target.h | 1 + > tcg/riscv/tcg-target.h | 1 + > tcg/s390/tcg-target.h | 1 + > tcg/sparc/tcg-target.h | 1 + > tcg/tci/tcg-target.h | 1 + > accel/tcg/cputlb.c | 202 ++++++++++++++++++++++++++++++++++++++- > 12 files changed, 282 insertions(+), 7 deletions(-) > > diff --git a/include/exec/cpu-defs.h b/include/exec/cpu-defs.h > index 6a60f94a41..191a1e021f 100644 > --- a/include/exec/cpu-defs.h > +++ b/include/exec/cpu-defs.h > @@ -67,6 +67,28 @@ typedef uint64_t target_ulong; > #define CPU_TLB_ENTRY_BITS 5 > #endif > > +#if TCG_TARGET_IMPLEMENTS_DYN_TLB > +#define CPU_TLB_DYN_MIN_BITS 6 > +#define CPU_TLB_DYN_DEFAULT_BITS 8 > + > + > +# if HOST_LONG_BITS == 32 > +/* Make sure we do not require a double-word shift for the TLB load */ > +# define CPU_TLB_DYN_MAX_BITS (32 - TARGET_PAGE_BITS) > +# else /* HOST_LONG_BITS == 64 */ > +/* > + * Assuming TARGET_PAGE_BITS==12, with 2**22 entries we can cover 2**(22+12) == > + * 2**34 == 16G of address space. This is roughly what one would expect a > + * TLB to cover in a modern (as of 2018) x86_64 CPU. For instance, Intel > + * Skylake's Level-2 STLB has 16 1G entries. > + * Also, make sure we do not size the TLB past the guest's address space. > + */ > +# define CPU_TLB_DYN_MAX_BITS \ > + MIN(22, TARGET_VIRT_ADDR_SPACE_BITS - TARGET_PAGE_BITS) > +# endif > + > +#else /* !TCG_TARGET_IMPLEMENTS_DYN_TLB */ > + > /* TCG_TARGET_TLB_DISPLACEMENT_BITS is used in CPU_TLB_BITS to ensure that > * the TLB is not unnecessarily small, but still small enough for the > * TLB lookup instruction sequence used by the TCG target. > @@ -98,6 +120,7 @@ typedef uint64_t target_ulong; > NB_MMU_MODES <= 8 ? 3 : 4)) > > #define CPU_TLB_SIZE (1 << CPU_TLB_BITS) > +#endif /* TCG_TARGET_IMPLEMENTS_DYN_TLB */ > > typedef struct CPUTLBEntry { > /* bit TARGET_LONG_BITS to TARGET_PAGE_BITS : virtual address > @@ -141,6 +164,18 @@ typedef struct CPUIOTLBEntry { > MemTxAttrs attrs; > } CPUIOTLBEntry; > > +/** > + * struct CPUTLBWindow > + * @begin_ns: host time (in ns) at the beginning of the time window > + * @max_entries: maximum number of entries observed in the window > + * > + * See also: tlb_mmu_resize_locked() > + */ > +typedef struct CPUTLBWindow { > + int64_t begin_ns; > + size_t max_entries; > +} CPUTLBWindow; > + > typedef struct CPUTLBDesc { > /* > * Describe a region covering all of the large pages allocated > @@ -152,6 +187,10 @@ typedef struct CPUTLBDesc { > target_ulong large_page_mask; > /* The next index to use in the tlb victim table. 
*/ > size_t vindex; > +#if TCG_TARGET_IMPLEMENTS_DYN_TLB > + CPUTLBWindow window; > + size_t n_used_entries; > +#endif > } CPUTLBDesc; > > /* > @@ -176,6 +215,20 @@ typedef struct CPUTLBCommon { > size_t elide_flush_count; > } CPUTLBCommon; > > +#if TCG_TARGET_IMPLEMENTS_DYN_TLB > +# define CPU_TLB \ > + /* tlb_mask[i] contains (n_entries - 1) << CPU_TLB_ENTRY_BITS */ \ > + uintptr_t tlb_mask[NB_MMU_MODES]; \ > + CPUTLBEntry *tlb_table[NB_MMU_MODES]; > +# define CPU_IOTLB \ > + CPUIOTLBEntry *iotlb[NB_MMU_MODES]; > +#else > +# define CPU_TLB \ > + CPUTLBEntry tlb_table[NB_MMU_MODES][CPU_TLB_SIZE]; > +# define CPU_IOTLB \ > + CPUIOTLBEntry iotlb[NB_MMU_MODES][CPU_TLB_SIZE]; > +#endif > + > /* > * The meaning of each of the MMU modes is defined in the target code. > * Note that NB_MMU_MODES is not yet defined; we can only reference it > @@ -184,9 +237,9 @@ typedef struct CPUTLBCommon { > #define CPU_COMMON_TLB \ > CPUTLBCommon tlb_c; \ > CPUTLBDesc tlb_d[NB_MMU_MODES]; \ > - CPUTLBEntry tlb_table[NB_MMU_MODES][CPU_TLB_SIZE]; \ > + CPU_TLB \ > CPUTLBEntry tlb_v_table[NB_MMU_MODES][CPU_VTLB_SIZE]; \ > - CPUIOTLBEntry iotlb[NB_MMU_MODES][CPU_TLB_SIZE]; \ > + CPU_IOTLB \ > CPUIOTLBEntry iotlb_v[NB_MMU_MODES][CPU_VTLB_SIZE]; > > #else > diff --git a/include/exec/cpu_ldst.h b/include/exec/cpu_ldst.h > index 959068495a..83b2907d86 100644 > --- a/include/exec/cpu_ldst.h > +++ b/include/exec/cpu_ldst.h > @@ -135,6 +135,21 @@ static inline target_ulong tlb_addr_write(const CPUTLBEntry *entry) > #endif > } > > +#if TCG_TARGET_IMPLEMENTS_DYN_TLB > +/* Find the TLB index corresponding to the mmu_idx + address pair. */ > +static inline uintptr_t tlb_index(CPUArchState *env, uintptr_t mmu_idx, > + target_ulong addr) > +{ > + uintptr_t size_mask = env->tlb_mask[mmu_idx] >> CPU_TLB_ENTRY_BITS; > + > + return (addr >> TARGET_PAGE_BITS) & size_mask; > +} > + > +static inline size_t tlb_n_entries(CPUArchState *env, uintptr_t mmu_idx) > +{ > + return (env->tlb_mask[mmu_idx] >> CPU_TLB_ENTRY_BITS) + 1; > +} > +#else > /* Find the TLB index corresponding to the mmu_idx + address pair. */ > static inline uintptr_t tlb_index(CPUArchState *env, uintptr_t mmu_idx, > target_ulong addr) > @@ -142,6 +157,12 @@ static inline uintptr_t tlb_index(CPUArchState *env, uintptr_t mmu_idx, > return (addr >> TARGET_PAGE_BITS) & (CPU_TLB_SIZE - 1); > } > > +static inline size_t tlb_n_entries(CPUArchState *env, uintptr_t mmu_idx) > +{ > + return CPU_TLB_SIZE; > +} > +#endif /* TCG_TARGET_IMPLEMENTS_DYN_TLB */ > + > /* Find the TLB entry corresponding to the mmu_idx + address pair. 
*/ > static inline CPUTLBEntry *tlb_entry(CPUArchState *env, uintptr_t mmu_idx, > target_ulong addr) > diff --git a/tcg/aarch64/tcg-target.h b/tcg/aarch64/tcg-target.h > index f966a4fcb3..bff91c5aa0 100644 > --- a/tcg/aarch64/tcg-target.h > +++ b/tcg/aarch64/tcg-target.h > @@ -15,6 +15,7 @@ > > #define TCG_TARGET_INSN_UNIT_SIZE 4 > #define TCG_TARGET_TLB_DISPLACEMENT_BITS 24 > +#define TCG_TARGET_IMPLEMENTS_DYN_TLB 0 > #undef TCG_TARGET_STACK_GROWSUP > > typedef enum { > diff --git a/tcg/arm/tcg-target.h b/tcg/arm/tcg-target.h > index 16172f73a3..c5a7064bdc 100644 > --- a/tcg/arm/tcg-target.h > +++ b/tcg/arm/tcg-target.h > @@ -60,6 +60,7 @@ extern int arm_arch; > #undef TCG_TARGET_STACK_GROWSUP > #define TCG_TARGET_INSN_UNIT_SIZE 4 > #define TCG_TARGET_TLB_DISPLACEMENT_BITS 16 > +#define TCG_TARGET_IMPLEMENTS_DYN_TLB 0 > > typedef enum { > TCG_REG_R0 = 0, > diff --git a/tcg/i386/tcg-target.h b/tcg/i386/tcg-target.h > index f378d29568..bd7d37c7ef 100644 > --- a/tcg/i386/tcg-target.h > +++ b/tcg/i386/tcg-target.h > @@ -27,6 +27,7 @@ > > #define TCG_TARGET_INSN_UNIT_SIZE 1 > #define TCG_TARGET_TLB_DISPLACEMENT_BITS 31 > +#define TCG_TARGET_IMPLEMENTS_DYN_TLB 0 > > #ifdef __x86_64__ > # define TCG_TARGET_REG_BITS 64 > diff --git a/tcg/mips/tcg-target.h b/tcg/mips/tcg-target.h > index 5cb8672470..8600eefd9a 100644 > --- a/tcg/mips/tcg-target.h > +++ b/tcg/mips/tcg-target.h > @@ -37,6 +37,7 @@ > > #define TCG_TARGET_INSN_UNIT_SIZE 4 > #define TCG_TARGET_TLB_DISPLACEMENT_BITS 16 > +#define TCG_TARGET_IMPLEMENTS_DYN_TLB 0 > #define TCG_TARGET_NB_REGS 32 > > typedef enum { > diff --git a/tcg/ppc/tcg-target.h b/tcg/ppc/tcg-target.h > index 52c1bb04b1..b51854b5cf 100644 > --- a/tcg/ppc/tcg-target.h > +++ b/tcg/ppc/tcg-target.h > @@ -34,6 +34,7 @@ > #define TCG_TARGET_NB_REGS 32 > #define TCG_TARGET_INSN_UNIT_SIZE 4 > #define TCG_TARGET_TLB_DISPLACEMENT_BITS 16 > +#define TCG_TARGET_IMPLEMENTS_DYN_TLB 0 > > typedef enum { > TCG_REG_R0, TCG_REG_R1, TCG_REG_R2, TCG_REG_R3, > diff --git a/tcg/riscv/tcg-target.h b/tcg/riscv/tcg-target.h > index 60918cacb4..1eb032626c 100644 > --- a/tcg/riscv/tcg-target.h > +++ b/tcg/riscv/tcg-target.h > @@ -33,6 +33,7 @@ > > #define TCG_TARGET_INSN_UNIT_SIZE 4 > #define TCG_TARGET_TLB_DISPLACEMENT_BITS 20 > +#define TCG_TARGET_IMPLEMENTS_DYN_TLB 0 > #define TCG_TARGET_NB_REGS 32 > > typedef enum { > diff --git a/tcg/s390/tcg-target.h b/tcg/s390/tcg-target.h > index 853ed6e7aa..394b545369 100644 > --- a/tcg/s390/tcg-target.h > +++ b/tcg/s390/tcg-target.h > @@ -27,6 +27,7 @@ > > #define TCG_TARGET_INSN_UNIT_SIZE 2 > #define TCG_TARGET_TLB_DISPLACEMENT_BITS 19 > +#define TCG_TARGET_IMPLEMENTS_DYN_TLB 0 > > typedef enum TCGReg { > TCG_REG_R0 = 0, > diff --git a/tcg/sparc/tcg-target.h b/tcg/sparc/tcg-target.h > index a0ed2a3342..dc0a227890 100644 > --- a/tcg/sparc/tcg-target.h > +++ b/tcg/sparc/tcg-target.h > @@ -29,6 +29,7 @@ > > #define TCG_TARGET_INSN_UNIT_SIZE 4 > #define TCG_TARGET_TLB_DISPLACEMENT_BITS 32 > +#define TCG_TARGET_IMPLEMENTS_DYN_TLB 0 > #define TCG_TARGET_NB_REGS 32 > > typedef enum { > diff --git a/tcg/tci/tcg-target.h b/tcg/tci/tcg-target.h > index 086f34e69a..816dc4697c 100644 > --- a/tcg/tci/tcg-target.h > +++ b/tcg/tci/tcg-target.h > @@ -43,6 +43,7 @@ > #define TCG_TARGET_INTERPRETER 1 > #define TCG_TARGET_INSN_UNIT_SIZE 1 > #define TCG_TARGET_TLB_DISPLACEMENT_BITS 32 > +#define TCG_TARGET_IMPLEMENTS_DYN_TLB 0 > > #if UINTPTR_MAX == UINT32_MAX > # define TCG_TARGET_REG_BITS 32 > diff --git a/accel/tcg/cputlb.c b/accel/tcg/cputlb.c > index 
10f1150c62..a3a1614f0e 100644 > --- a/accel/tcg/cputlb.c > +++ b/accel/tcg/cputlb.c > @@ -74,6 +74,187 @@ QEMU_BUILD_BUG_ON(sizeof(target_ulong) > sizeof(run_on_cpu_data)); > QEMU_BUILD_BUG_ON(NB_MMU_MODES > 16); > #define ALL_MMUIDX_BITS ((1 << NB_MMU_MODES) - 1) > > +#if TCG_TARGET_IMPLEMENTS_DYN_TLB > +static inline size_t sizeof_tlb(CPUArchState *env, uintptr_t mmu_idx) > +{ > + return env->tlb_mask[mmu_idx] + (1 << CPU_TLB_ENTRY_BITS); > +} > + > +static void tlb_window_reset(CPUTLBWindow *window, int64_t ns, > + size_t max_entries) > +{ > + window->begin_ns = ns; > + window->max_entries = max_entries; > +} > + > +static void tlb_dyn_init(CPUArchState *env) > +{ > + int i; > + > + for (i = 0; i < NB_MMU_MODES; i++) { > + CPUTLBDesc *desc = &env->tlb_d[i]; > + size_t n_entries = 1 << CPU_TLB_DYN_DEFAULT_BITS; > + > + tlb_window_reset(&desc->window, get_clock_realtime(), 0); > + desc->n_used_entries = 0; > + env->tlb_mask[i] = (n_entries - 1) << CPU_TLB_ENTRY_BITS; > + env->tlb_table[i] = g_new(CPUTLBEntry, n_entries); > + env->iotlb[i] = g_new(CPUIOTLBEntry, n_entries); > + } > +} > + > +/** > + * tlb_mmu_resize_locked() - perform TLB resize bookkeeping; resize if necessary > + * @env: CPU that owns the TLB > + * @mmu_idx: MMU index of the TLB > + * > + * Called with tlb_lock_held. > + * > + * We have two main constraints when resizing a TLB: (1) we only resize it > + * on a TLB flush (otherwise we'd have to take a perf hit by either rehashing > + * the array or unnecessarily flushing it), which means we do not control how > + * frequently the resizing can occur; (2) we don't have access to the guest's > + * future scheduling decisions, and therefore have to decide the magnitude of > + * the resize based on past observations. > + * > + * In general, a memory-hungry process can benefit greatly from an appropriately > + * sized TLB, since a guest TLB miss is very expensive. This doesn't mean that > + * we just have to make the TLB as large as possible; while an oversized TLB > + * results in minimal TLB miss rates, it also takes longer to be flushed > + * (flushes can be _very_ frequent), and the reduced locality can also hurt > + * performance. > + * > + * To achieve near-optimal performance for all kinds of workloads, we: > + * > + * 1. Aggressively increase the size of the TLB when the use rate of the > + * TLB being flushed is high, since it is likely that in the near future this > + * memory-hungry process will execute again, and its memory hungriness will > + * probably be similar. > + * > + * 2. Slowly reduce the size of the TLB as the use rate declines over a > + * reasonably large time window. The rationale is that if in such a time window > + * we have not observed a high TLB use rate, it is likely that we won't observe > + * it in the near future. In that case, once a time window expires we downsize > + * the TLB to match the maximum use rate observed in the window. > + * > + * 3. Try to keep the maximum use rate in a time window in the 30-70% range, > + * since in that range performance is likely near-optimal. Recall that the TLB > + * is direct mapped, so we want the use rate to be low (or at least not too > + * high), since otherwise we are likely to have a significant amount of > + * conflict misses. 
> + */ > +static void tlb_mmu_resize_locked(CPUArchState *env, int mmu_idx) > +{ > + CPUTLBDesc *desc = &env->tlb_d[mmu_idx]; > + size_t old_size = tlb_n_entries(env, mmu_idx); > + size_t rate; > + size_t new_size = old_size; > + int64_t now = get_clock_realtime(); > + int64_t window_len_ms = 100; > + int64_t window_len_ns = window_len_ms * 1000 * 1000; > + bool window_expired = now > desc->window.begin_ns + window_len_ns; > + > + if (desc->n_used_entries > desc->window.max_entries) { > + desc->window.max_entries = desc->n_used_entries; > + } > + rate = desc->window.max_entries * 100 / old_size; > + > + if (rate > 70) { > + new_size = MIN(old_size << 1, 1 << CPU_TLB_DYN_MAX_BITS); > + } else if (rate < 30 && window_expired) { > + size_t ceil = pow2ceil(desc->window.max_entries); > + size_t expected_rate = desc->window.max_entries * 100 / ceil; > + > + /* > + * Avoid undersizing when the max number of entries seen is just below > + * a pow2. For instance, if max_entries == 1025, the expected use rate > + * would be 1025/2048==50%. However, if max_entries == 1023, we'd get > + * 1023/1024==99.9% use rate, so we'd likely end up doubling the size > + * later. Thus, make sure that the expected use rate remains below 70%. > + * (and since we double the size, that means the lowest rate we'd > + * expect to get is 35%, which is still in the 30-70% range where > + * we consider that the size is appropriate.) > + */ > + if (expected_rate > 70) { > + ceil *= 2; > + } > + new_size = MAX(ceil, 1 << CPU_TLB_DYN_MIN_BITS); > + } > + > + if (new_size == old_size) { > + if (window_expired) { > + tlb_window_reset(&desc->window, now, desc->n_used_entries); > + } > + return; > + } > + > + g_free(env->tlb_table[mmu_idx]); > + g_free(env->iotlb[mmu_idx]); > + > + tlb_window_reset(&desc->window, now, 0); > + /* desc->n_used_entries is cleared by the caller */ > + env->tlb_mask[mmu_idx] = (new_size - 1) << CPU_TLB_ENTRY_BITS; > + env->tlb_table[mmu_idx] = g_try_new(CPUTLBEntry, new_size); > + env->iotlb[mmu_idx] = g_try_new(CPUIOTLBEntry, new_size); > + /* > + * If the allocations fail, try smaller sizes. We just freed some > + * memory, so going back to half of new_size has a good chance of working. > + * Increased memory pressure elsewhere in the system might cause the > + * allocations to fail though, so we progressively reduce the allocation > + * size, aborting if we cannot even allocate the smallest TLB we support. 
> + */ > + while (env->tlb_table[mmu_idx] == NULL || env->iotlb[mmu_idx] == NULL) { > + if (new_size == (1 << CPU_TLB_DYN_MIN_BITS)) { > + error_report("%s: %s", __func__, strerror(errno)); > + abort(); > + } > + new_size = MAX(new_size >> 1, 1 << CPU_TLB_DYN_MIN_BITS); > + env->tlb_mask[mmu_idx] = (new_size - 1) << CPU_TLB_ENTRY_BITS; > + > + g_free(env->tlb_table[mmu_idx]); > + g_free(env->iotlb[mmu_idx]); > + env->tlb_table[mmu_idx] = g_try_new(CPUTLBEntry, new_size); > + env->iotlb[mmu_idx] = g_try_new(CPUIOTLBEntry, new_size); > + } > +} > + > +static inline void tlb_table_flush_by_mmuidx(CPUArchState *env, int mmu_idx) > +{ > + tlb_mmu_resize_locked(env, mmu_idx); > + memset(env->tlb_table[mmu_idx], -1, sizeof_tlb(env, mmu_idx)); > + env->tlb_d[mmu_idx].n_used_entries = 0; > +} > + > +static inline void tlb_n_used_entries_inc(CPUArchState *env, uintptr_t mmu_idx) > +{ > + env->tlb_d[mmu_idx].n_used_entries++; > +} > + > +static inline void tlb_n_used_entries_dec(CPUArchState *env, uintptr_t mmu_idx) > +{ > + env->tlb_d[mmu_idx].n_used_entries--; > +} > + > +#else /* !TCG_TARGET_IMPLEMENTS_DYN_TLB */ > + > +static inline void tlb_dyn_init(CPUArchState *env) > +{ > +} > + > +static inline void tlb_table_flush_by_mmuidx(CPUArchState *env, int mmu_idx) > +{ > + memset(env->tlb_table[mmu_idx], -1, sizeof(env->tlb_table[0])); > +} > + > +static inline void tlb_n_used_entries_inc(CPUArchState *env, uintptr_t mmu_idx) > +{ > +} > + > +static inline void tlb_n_used_entries_dec(CPUArchState *env, uintptr_t mmu_idx) > +{ > +} > +#endif /* TCG_TARGET_IMPLEMENTS_DYN_TLB */ > + > void tlb_init(CPUState *cpu) > { > CPUArchState *env = cpu->env_ptr; > @@ -82,6 +263,8 @@ void tlb_init(CPUState *cpu) > > /* Ensure that cpu_reset performs a full flush. */ > env->tlb_c.dirty = ALL_MMUIDX_BITS; > + > + tlb_dyn_init(env); > } > > /* flush_all_helper: run fn across all cpus > @@ -122,7 +305,7 @@ void tlb_flush_counts(size_t *pfull, size_t *ppart, size_t *pelide) > > static void tlb_flush_one_mmuidx_locked(CPUArchState *env, int mmu_idx) > { > - memset(env->tlb_table[mmu_idx], -1, sizeof(env->tlb_table[0])); > + tlb_table_flush_by_mmuidx(env, mmu_idx); > memset(env->tlb_v_table[mmu_idx], -1, sizeof(env->tlb_v_table[0])); > env->tlb_d[mmu_idx].large_page_addr = -1; > env->tlb_d[mmu_idx].large_page_mask = -1; > @@ -234,12 +417,14 @@ static inline bool tlb_entry_is_empty(const CPUTLBEntry *te) > } > > /* Called with tlb_c.lock held */ > -static inline void tlb_flush_entry_locked(CPUTLBEntry *tlb_entry, > +static inline bool tlb_flush_entry_locked(CPUTLBEntry *tlb_entry, > target_ulong page) > { > if (tlb_hit_page_anyprot(tlb_entry, page)) { > memset(tlb_entry, -1, sizeof(*tlb_entry)); > + return true; > } > + return false; > } > > /* Called with tlb_c.lock held */ > @@ -250,7 +435,9 @@ static inline void tlb_flush_vtlb_page_locked(CPUArchState *env, int mmu_idx, > > assert_cpu_is_self(ENV_GET_CPU(env)); > for (k = 0; k < CPU_VTLB_SIZE; k++) { > - tlb_flush_entry_locked(&env->tlb_v_table[mmu_idx][k], page); > + if (tlb_flush_entry_locked(&env->tlb_v_table[mmu_idx][k], page)) { > + tlb_n_used_entries_dec(env, mmu_idx); > + } > } > } > > @@ -267,7 +454,9 @@ static void tlb_flush_page_locked(CPUArchState *env, int midx, > midx, lp_addr, lp_mask); > tlb_flush_one_mmuidx_locked(env, midx); > } else { > - tlb_flush_entry_locked(tlb_entry(env, midx, page), page); > + if (tlb_flush_entry_locked(tlb_entry(env, midx, page), page)) { > + tlb_n_used_entries_dec(env, midx); > + } > tlb_flush_vtlb_page_locked(env, midx, 
page); > } > } > @@ -444,8 +633,9 @@ void tlb_reset_dirty(CPUState *cpu, ram_addr_t start1, ram_addr_t length) > qemu_spin_lock(&env->tlb_c.lock); > for (mmu_idx = 0; mmu_idx < NB_MMU_MODES; mmu_idx++) { > unsigned int i; > + unsigned int n = tlb_n_entries(env, mmu_idx); > > - for (i = 0; i < CPU_TLB_SIZE; i++) { > + for (i = 0; i < n; i++) { > tlb_reset_dirty_range_locked(&env->tlb_table[mmu_idx][i], start1, > length); > } > @@ -607,6 +797,7 @@ void tlb_set_page_with_attrs(CPUState *cpu, target_ulong vaddr, > /* Evict the old entry into the victim tlb. */ > copy_tlb_helper_locked(tv, te); > env->iotlb_v[mmu_idx][vidx] = env->iotlb[mmu_idx][index]; > + tlb_n_used_entries_dec(env, mmu_idx); > } > > /* refill the tlb */ > @@ -658,6 +849,7 @@ void tlb_set_page_with_attrs(CPUState *cpu, target_ulong vaddr, > } > > copy_tlb_helper_locked(te, &tn); > + tlb_n_used_entries_inc(env, mmu_idx); > qemu_spin_unlock(&env->tlb_c.lock); > } -- Alex Bennée ^ permalink raw reply [flat|nested] 9+ messages in thread
* [Qemu-devel] [PATCH v7 3/3] tcg/i386: enable dynamic TLB sizing 2019-01-16 17:01 [Qemu-devel] [PATCH v7 0/3] Dynamic TLB sizing Emilio G. Cota 2019-01-16 17:01 ` [Qemu-devel] [PATCH v7 1/3] cputlb: do not evict empty entries to the vtlb Emilio G. Cota 2019-01-16 17:01 ` [Qemu-devel] [PATCH v7 2/3] tcg: introduce dynamic TLB sizing Emilio G. Cota @ 2019-01-16 17:01 ` Emilio G. Cota 2019-01-18 15:04 ` Alex Bennée 2019-01-17 16:31 ` [Qemu-devel] [PATCH v7 0/3] Dynamic " Alex Bennée 3 siblings, 1 reply; 9+ messages in thread From: Emilio G. Cota @ 2019-01-16 17:01 UTC (permalink / raw) To: qemu-devel; +Cc: Richard Henderson, Alex Bennée As the following experiments show, this series is a net perf gain, particularly for memory-heavy workloads. Experiments are run on an Intel(R) Xeon(R) Gold 6142 CPU @ 2.60GHz. 1. System boot + shudown, debian aarch64: - Before (v3.1.0): Performance counter stats for './die.sh v3.1.0' (10 runs): 9019.797015 task-clock (msec) # 0.993 CPUs utilized ( +- 0.23% ) 29,910,312,379 cycles # 3.316 GHz ( +- 0.14% ) 54,699,252,014 instructions # 1.83 insn per cycle ( +- 0.08% ) 10,061,951,686 branches # 1115.541 M/sec ( +- 0.08% ) 172,966,530 branch-misses # 1.72% of all branches ( +- 0.07% ) 9.084039051 seconds time elapsed ( +- 0.23% ) - After: Performance counter stats for './die.sh tlb-dyn-v5' (10 runs): 8624.084842 task-clock (msec) # 0.993 CPUs utilized ( +- 0.23% ) 28,556,123,404 cycles # 3.311 GHz ( +- 0.13% ) 51,755,089,512 instructions # 1.81 insn per cycle ( +- 0.05% ) 9,526,513,946 branches # 1104.641 M/sec ( +- 0.05% ) 166,578,509 branch-misses # 1.75% of all branches ( +- 0.19% ) 8.680540350 seconds time elapsed ( +- 0.24% ) That is, a 4.4% perf increase. 2. System boot + shutdown, ubuntu 18.04 x86_64: - Before (v3.1.0): 56100.574751 task-clock (msec) # 1.016 CPUs utilized ( +- 4.81% ) 200,745,466,128 cycles # 3.578 GHz ( +- 5.24% ) 431,949,100,608 instructions # 2.15 insn per cycle ( +- 5.65% ) 77,502,383,330 branches # 1381.490 M/sec ( +- 6.18% ) 844,681,191 branch-misses # 1.09% of all branches ( +- 3.82% ) 55.221556378 seconds time elapsed ( +- 5.01% ) - After: 56603.419540 task-clock (msec) # 1.019 CPUs utilized ( +- 10.19% ) 202,217,930,479 cycles # 3.573 GHz ( +- 10.69% ) 439,336,291,626 instructions # 2.17 insn per cycle ( +- 14.14% ) 80,538,357,447 branches # 1422.853 M/sec ( +- 16.09% ) 776,321,622 branch-misses # 0.96% of all branches ( +- 3.77% ) 55.549661409 seconds time elapsed ( +- 10.44% ) No improvement (within noise range). Note that for this workload, increasing the time window too much can lead to perf degradation, since it flushes the TLB *very* frequently. 3. x86_64 SPEC06int: x86_64-softmmu speedup vs. 
v3.1.0 for SPEC06int (test set) Host: Intel(R) Xeon(R) Gold 6142 CPU @ 2.60GHz (Skylake) 5.5 +------------------------------------------------------------------------+ | +-+ | 5 |-+.................+-+...............................tlb-dyn-v5.......+-| | * * | 4.5 |-+.................*.*................................................+-| | * * | 4 |-+.................*.*................................................+-| | * * | 3.5 |-+.................*.*................................................+-| | * * | 3 |-+......+-+*.......*.*................................................+-| | * * * * | 2.5 |-+......*..*.......*.*.................................+-+*...........+-| | * * * * * * | 2 |-+......*..*.......*.*.................................*..*...........+-| | * * * * * * +-+ | 1.5 |-+......*..*.......*.*.................................*..*.*+-+.*+-+.+-| | * * *+-+ * * +-+ *+-+ +-+ +-+ * * * * * * | 1 |++++-+*+*++*+*++*++*+*++*+*+++-+*+*+-++*+-++++-++++-+++*++*+*++*+*++*+++| | * * * * * * * * * * * * * * * * * * * * * * * * * * | 0.5 +------------------------------------------------------------------------+ 400.perlb401.bzip403.g429445.g456.hm462.libq464.h471.omn47483.xalancbgeomean png: https://imgur.com/YRF90f7 That is, a 1.51x average speedup over the baseline, with a max speedup of 5.17x. Here's a different look at the SPEC06int results, using KVM as the baseline: x86_64-softmmu slowdown vs. KVM for SPEC06int (test set) Host: Intel(R) Xeon(R) Gold 6142 CPU @ 2.60GHz (Skylake) 25 +---------------------------------------------------------------------------+ | +-+ +-+ | | * * +-+ v3.1.0 | | * * +-+ tlb-dyn-v5 | | * * * * +-+ | 20 |-+.................*.*.............................*.+-+......*.*........+-| | * * * # # * * | | +-+ * * * # # * * | | * * * * * # # * * | 15 |-+......*.*........*.*.............................*.#.#......*.+-+......+-| | * * * * * # # * #|# | | * * * * +-+ * # # * +-+ | | * * +-+ * * ++-+ +-+ * # # * # # +-+ | | * * +-+ * * * ## *| +-+ * # # * # # +-+ | 10 |-+......*.*..*.+-+.*.*........*.##.......++-+.*.+-+*.#.#......*.#.#.*.*..+-| | * * * +-+ * * * ## +-+ *# # * # #* # # +-+ * # # * * | | * * * # # * * +-+ * ## * +-+ *# # * # #* # # * * * # # *+-+ | | * * * # # * * * +-+ * ## * # # *# # * # #* # # * * * # # * ## | 5 |-+......*.+-+*.#.#.*.*..*.#.#.*.##.*.#.#.*#.#.*.#.#*.#.#.*.*..*.#.#.*.##.+-| | * # #* # # * +-+* # # * ## * # # *# # * # #* # # * * * # # * ## | | * # #* # # * # #* # # * ## * # # *# # * # #* # # * +-+* # # * ## | | ++-+ * # #* # # * # #* # # * ## * # # *# # * # #* # # * # #* # # * ## | |+++*#+#+*+#+#*+#+#+*+#+#*+#+#+*+##+*+#+#+*#+#+*+#+#*+#+#+*+#+#*+#+#+*+##+++| 0 +---------------------------------------------------------------------------+ 400.perlbe401.bzi403.gc429445.go456.h462.libqu464.h471.omne4483.xalancbmgeomean png: https://imgur.com/YzAMNEV After this series, we bring down the average SPEC06int slowdown vs KVM from 11.47x to 7.58x. Signed-off-by: Emilio G. 
Cota <cota@braap.org> --- tcg/i386/tcg-target.h | 2 +- tcg/i386/tcg-target.inc.c | 28 ++++++++++++++-------------- 2 files changed, 15 insertions(+), 15 deletions(-) diff --git a/tcg/i386/tcg-target.h b/tcg/i386/tcg-target.h index bd7d37c7ef..bdcf613f65 100644 --- a/tcg/i386/tcg-target.h +++ b/tcg/i386/tcg-target.h @@ -27,7 +27,7 @@ #define TCG_TARGET_INSN_UNIT_SIZE 1 #define TCG_TARGET_TLB_DISPLACEMENT_BITS 31 -#define TCG_TARGET_IMPLEMENTS_DYN_TLB 0 +#define TCG_TARGET_IMPLEMENTS_DYN_TLB 1 #ifdef __x86_64__ # define TCG_TARGET_REG_BITS 64 diff --git a/tcg/i386/tcg-target.inc.c b/tcg/i386/tcg-target.inc.c index 1b4e3b80e1..df8b20755c 100644 --- a/tcg/i386/tcg-target.inc.c +++ b/tcg/i386/tcg-target.inc.c @@ -329,6 +329,7 @@ static inline int tcg_target_const_match(tcg_target_long val, TCGType type, #define OPC_ARITH_GvEv (0x03) /* ... plus (ARITH_FOO << 3) */ #define OPC_ANDN (0xf2 | P_EXT38) #define OPC_ADD_GvEv (OPC_ARITH_GvEv | (ARITH_ADD << 3)) +#define OPC_AND_GvEv (OPC_ARITH_GvEv | (ARITH_AND << 3)) #define OPC_BLENDPS (0x0c | P_EXT3A | P_DATA16) #define OPC_BSF (0xbc | P_EXT) #define OPC_BSR (0xbd | P_EXT) @@ -1621,7 +1622,7 @@ static inline void tcg_out_tlb_load(TCGContext *s, TCGReg addrlo, TCGReg addrhi, } if (TCG_TYPE_PTR == TCG_TYPE_I64) { hrexw = P_REXW; - if (TARGET_PAGE_BITS + CPU_TLB_BITS > 32) { + if (TARGET_PAGE_BITS + CPU_TLB_DYN_MAX_BITS > 32) { tlbtype = TCG_TYPE_I64; tlbrexw = P_REXW; } @@ -1629,6 +1630,15 @@ static inline void tcg_out_tlb_load(TCGContext *s, TCGReg addrlo, TCGReg addrhi, } tcg_out_mov(s, tlbtype, r0, addrlo); + tcg_out_shifti(s, SHIFT_SHR + tlbrexw, r0, + TARGET_PAGE_BITS - CPU_TLB_ENTRY_BITS); + + tcg_out_modrm_offset(s, OPC_AND_GvEv + trexw, r0, TCG_AREG0, + offsetof(CPUArchState, tlb_mask[mem_index])); + + tcg_out_modrm_offset(s, OPC_ADD_GvEv + hrexw, r0, TCG_AREG0, + offsetof(CPUArchState, tlb_table[mem_index])); + /* If the required alignment is at least as large as the access, simply copy the address and mask. For lesser alignments, check that we don't cross pages for the complete access. */ @@ -1638,20 +1648,10 @@ static inline void tcg_out_tlb_load(TCGContext *s, TCGReg addrlo, TCGReg addrhi, tcg_out_modrm_offset(s, OPC_LEA + trexw, r1, addrlo, s_mask - a_mask); } tlb_mask = (target_ulong)TARGET_PAGE_MASK | a_mask; - - tcg_out_shifti(s, SHIFT_SHR + tlbrexw, r0, - TARGET_PAGE_BITS - CPU_TLB_ENTRY_BITS); - tgen_arithi(s, ARITH_AND + trexw, r1, tlb_mask, 0); - tgen_arithi(s, ARITH_AND + tlbrexw, r0, - (CPU_TLB_SIZE - 1) << CPU_TLB_ENTRY_BITS, 0); - - tcg_out_modrm_sib_offset(s, OPC_LEA + hrexw, r0, TCG_AREG0, r0, 0, - offsetof(CPUArchState, tlb_table[mem_index][0]) - + which); /* cmp 0(r0), r1 */ - tcg_out_modrm_offset(s, OPC_CMP_GvEv + trexw, r1, r0, 0); + tcg_out_modrm_offset(s, OPC_CMP_GvEv + trexw, r1, r0, which); /* Prepare for both the fast path add of the tlb addend, and the slow path function argument setup. 
*/ @@ -1664,7 +1664,7 @@ static inline void tcg_out_tlb_load(TCGContext *s, TCGReg addrlo, TCGReg addrhi, if (TARGET_LONG_BITS > TCG_TARGET_REG_BITS) { /* cmp 4(r0), addrhi */ - tcg_out_modrm_offset(s, OPC_CMP_GvEv, addrhi, r0, 4); + tcg_out_modrm_offset(s, OPC_CMP_GvEv, addrhi, r0, which + 4); /* jne slow_path */ tcg_out_opc(s, OPC_JCC_long + JCC_JNE, 0, 0, 0); @@ -1676,7 +1676,7 @@ static inline void tcg_out_tlb_load(TCGContext *s, TCGReg addrlo, TCGReg addrhi, /* add addend(r0), r1 */ tcg_out_modrm_offset(s, OPC_ADD_GvEv + hrexw, r1, r0, - offsetof(CPUTLBEntry, addend) - which); + offsetof(CPUTLBEntry, addend)); } /* -- 2.17.1 ^ permalink raw reply related [flat|nested] 9+ messages in thread
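For anyone comparing the old and new fast paths: with the per-mmu_idx mask and table pointer now living in env, the emitted shr/and/add sequence computes the entry address directly, the cmp reads the tag at offset `which` within that entry, and the addend is loaded at offsetof(CPUTLBEntry, addend) with no LEA or subtraction. A standalone model of the offset computation (stand-in values, not QEMU or generated code; it matches tlb_index()/tlb_entry() from patch 2 because tlb_mask already holds (n_entries - 1) << CPU_TLB_ENTRY_BITS):

#include <stdint.h>
#include <stdio.h>

#define TARGET_PAGE_BITS   12
#define CPU_TLB_ENTRY_BITS 5    /* entries are 1 << 5 bytes on 64-bit targets */

static uintptr_t fast_path_entry(uintptr_t table_base, uintptr_t tlb_mask,
                                 uint64_t addr)
{
    /* shr: bring the page number to entry-size granularity */
    uintptr_t ofs = addr >> (TARGET_PAGE_BITS - CPU_TLB_ENTRY_BITS);

    /* and: select the index and clear the low, sub-entry bits */
    ofs &= tlb_mask;

    /* add: per-mmu_idx table base, loaded from env->tlb_table[mem_index] */
    return table_base + ofs;
}

int main(void)
{
    uintptr_t mask = (uintptr_t)(256 - 1) << CPU_TLB_ENTRY_BITS; /* 256 entries */
    uint64_t addr = 0x12345678;
    uintptr_t index = (addr >> TARGET_PAGE_BITS) & (mask >> CPU_TLB_ENTRY_BITS);

    /* prints: generic index 69 -> byte offset 2208 */
    printf("generic index %lu -> byte offset %lu\n",
           (unsigned long)index,
           (unsigned long)fast_path_entry(0, mask, addr));
    return 0;
}

Loading both the mask and the table pointer from env keeps the fast path at one shift, one and, and one add regardless of the current TLB size, which is what allows the table to be resized at flush time without touching the generated code.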
* Re: [Qemu-devel] [PATCH v7 3/3] tcg/i386: enable dynamic TLB sizing 2019-01-16 17:01 ` [Qemu-devel] [PATCH v7 3/3] tcg/i386: enable " Emilio G. Cota @ 2019-01-18 15:04 ` Alex Bennée 2019-01-18 15:30 ` Emilio G. Cota 2019-01-18 23:11 ` Richard Henderson 0 siblings, 2 replies; 9+ messages in thread From: Alex Bennée @ 2019-01-18 15:04 UTC (permalink / raw) To: Emilio G. Cota; +Cc: qemu-devel, Richard Henderson Emilio G. Cota <cota@braap.org> writes: > As the following experiments show, this series is a net perf gain, > particularly for memory-heavy workloads. Experiments are run on an > Intel(R) Xeon(R) Gold 6142 CPU @ 2.60GHz. > > 1. System boot + shudown, debian aarch64: > > - Before (v3.1.0): > Performance counter stats for './die.sh v3.1.0' (10 runs): > > 9019.797015 task-clock (msec) # 0.993 CPUs utilized ( +- 0.23% ) > 29,910,312,379 cycles # 3.316 GHz ( +- 0.14% ) > 54,699,252,014 instructions # 1.83 insn per cycle ( +- 0.08% ) > 10,061,951,686 branches # 1115.541 M/sec ( +- 0.08% ) > 172,966,530 branch-misses # 1.72% of all branches ( +- 0.07% ) > > 9.084039051 seconds time elapsed ( +- 0.23% ) > > - After: > Performance counter stats for './die.sh tlb-dyn-v5' (10 runs): > > 8624.084842 task-clock (msec) # 0.993 CPUs utilized ( +- 0.23% ) > 28,556,123,404 cycles # 3.311 GHz ( +- 0.13% ) > 51,755,089,512 instructions # 1.81 insn per cycle ( +- 0.05% ) > 9,526,513,946 branches # 1104.641 M/sec ( +- 0.05% ) > 166,578,509 branch-misses # 1.75% of all branches ( +- 0.19% ) > > 8.680540350 seconds time elapsed ( +- 0.24% ) > > That is, a 4.4% perf increase. > > 2. System boot + shutdown, ubuntu 18.04 x86_64: > > - Before (v3.1.0): > 56100.574751 task-clock (msec) # 1.016 CPUs utilized ( +- 4.81% ) > 200,745,466,128 cycles # 3.578 GHz ( +- 5.24% ) > 431,949,100,608 instructions # 2.15 insn per cycle ( +- 5.65% ) > 77,502,383,330 branches # 1381.490 M/sec ( +- 6.18% ) > 844,681,191 branch-misses # 1.09% of all branches ( +- 3.82% ) > > 55.221556378 seconds time elapsed ( +- 5.01% ) > > - After: > 56603.419540 task-clock (msec) # 1.019 CPUs utilized ( +- 10.19% ) > 202,217,930,479 cycles # 3.573 GHz ( +- 10.69% ) > 439,336,291,626 instructions # 2.17 insn per cycle ( +- 14.14% ) > 80,538,357,447 branches # 1422.853 M/sec ( +- 16.09% ) > 776,321,622 branch-misses # 0.96% of all branches ( +- 3.77% ) > > 55.549661409 seconds time elapsed ( +- 10.44% ) > > No improvement (within noise range). Note that for this workload, > increasing the time window too much can lead to perf degradation, > since it flushes the TLB *very* frequently. I would expect this to be fairly minimal in the amount of memory that is retouched. We spend a bunch of time paging things in just to drop everything and die. However heavy memory operations like my build stress test do see a performance boost. Tested-by: Alex Bennée <alex.bennee@linaro.org> Reviewed-by: Alex Bennée <alex.bennee@linaro.org> Do you have access to any aarch64 hardware? It would be nice to see if we could support it there as well. -- Alex Bennée ^ permalink raw reply [flat|nested] 9+ messages in thread
* Re: [Qemu-devel] [PATCH v7 3/3] tcg/i386: enable dynamic TLB sizing 2019-01-18 15:04 ` Alex Bennée @ 2019-01-18 15:30 ` Emilio G. Cota 2019-01-18 23:11 ` Richard Henderson 1 sibling, 0 replies; 9+ messages in thread From: Emilio G. Cota @ 2019-01-18 15:30 UTC (permalink / raw) To: Alex Bennée; +Cc: qemu-devel, Richard Henderson On Fri, Jan 18, 2019 at 15:04:38 +0000, Alex Bennée wrote: (snip) > Tested-by: Alex Bennée <alex.bennee@linaro.org> > Reviewed-by: Alex Bennée <alex.bennee@linaro.org> Thanks! > Do you have access to any aarch64 hardware? It would be nice to see if > we could support it there as well. I don't have time to implement this for the aarch64 backend, but if you (or anyone else) want to do it, I can run benchmarks -- I do have access to an aarch64 host, and also have spec06 compiled for aarch64. E. ^ permalink raw reply [flat|nested] 9+ messages in thread
* Re: [Qemu-devel] [PATCH v7 3/3] tcg/i386: enable dynamic TLB sizing 2019-01-18 15:04 ` Alex Bennée 2019-01-18 15:30 ` Emilio G. Cota @ 2019-01-18 23:11 ` Richard Henderson 1 sibling, 0 replies; 9+ messages in thread From: Richard Henderson @ 2019-01-18 23:11 UTC (permalink / raw) To: Alex Bennée, Emilio G. Cota; +Cc: qemu-devel On 1/19/19 2:04 AM, Alex Bennée wrote: > > Emilio G. Cota <cota@braap.org> writes: > >> As the following experiments show, this series is a net perf gain, >> particularly for memory-heavy workloads. Experiments are run on an >> Intel(R) Xeon(R) Gold 6142 CPU @ 2.60GHz. >> >> 1. System boot + shudown, debian aarch64: >> >> - Before (v3.1.0): >> Performance counter stats for './die.sh v3.1.0' (10 runs): >> >> 9019.797015 task-clock (msec) # 0.993 CPUs utilized ( +- 0.23% ) >> 29,910,312,379 cycles # 3.316 GHz ( +- 0.14% ) >> 54,699,252,014 instructions # 1.83 insn per cycle ( +- 0.08% ) >> 10,061,951,686 branches # 1115.541 M/sec ( +- 0.08% ) >> 172,966,530 branch-misses # 1.72% of all branches ( +- 0.07% ) >> >> 9.084039051 seconds time elapsed ( +- 0.23% ) >> >> - After: >> Performance counter stats for './die.sh tlb-dyn-v5' (10 runs): >> >> 8624.084842 task-clock (msec) # 0.993 CPUs utilized ( +- 0.23% ) >> 28,556,123,404 cycles # 3.311 GHz ( +- 0.13% ) >> 51,755,089,512 instructions # 1.81 insn per cycle ( +- 0.05% ) >> 9,526,513,946 branches # 1104.641 M/sec ( +- 0.05% ) >> 166,578,509 branch-misses # 1.75% of all branches ( +- 0.19% ) >> >> 8.680540350 seconds time elapsed ( +- 0.24% ) >> >> That is, a 4.4% perf increase. >> >> 2. System boot + shutdown, ubuntu 18.04 x86_64: >> >> - Before (v3.1.0): >> 56100.574751 task-clock (msec) # 1.016 CPUs utilized ( +- 4.81% ) >> 200,745,466,128 cycles # 3.578 GHz ( +- 5.24% ) >> 431,949,100,608 instructions # 2.15 insn per cycle ( +- 5.65% ) >> 77,502,383,330 branches # 1381.490 M/sec ( +- 6.18% ) >> 844,681,191 branch-misses # 1.09% of all branches ( +- 3.82% ) >> >> 55.221556378 seconds time elapsed ( +- 5.01% ) >> >> - After: >> 56603.419540 task-clock (msec) # 1.019 CPUs utilized ( +- 10.19% ) >> 202,217,930,479 cycles # 3.573 GHz ( +- 10.69% ) >> 439,336,291,626 instructions # 2.17 insn per cycle ( +- 14.14% ) >> 80,538,357,447 branches # 1422.853 M/sec ( +- 16.09% ) >> 776,321,622 branch-misses # 0.96% of all branches ( +- 3.77% ) >> >> 55.549661409 seconds time elapsed ( +- 10.44% ) >> >> No improvement (within noise range). Note that for this workload, >> increasing the time window too much can lead to perf degradation, >> since it flushes the TLB *very* frequently. > > I would expect this to be fairly minimal in the amount of memory that is > retouched. We spend a bunch of time paging things in just to drop > everything and die. However heavy memory operations like my build stress > test do see a performance boost. > > Tested-by: Alex Bennée <alex.bennee@linaro.org> > Reviewed-by: Alex Bennée <alex.bennee@linaro.org> > > Do you have access to any aarch64 hardware? It would be nice to see if > we could support it there as well. I've already done some porting to other backends. You should be able to cherry-pick from https://github.com/rth7680/qemu.git cputlb-resize as I don't think the backend API has changed since v6. (Most of my feedback that went into v7 was due to issues I encountered porting to arm32). r~ ^ permalink raw reply [flat|nested] 9+ messages in thread
* Re: [Qemu-devel] [PATCH v7 0/3] Dynamic TLB sizing 2019-01-16 17:01 [Qemu-devel] [PATCH v7 0/3] Dynamic TLB sizing Emilio G. Cota ` (2 preceding siblings ...) 2019-01-16 17:01 ` [Qemu-devel] [PATCH v7 3/3] tcg/i386: enable " Emilio G. Cota @ 2019-01-17 16:31 ` Alex Bennée 3 siblings, 0 replies; 9+ messages in thread From: Alex Bennée @ 2019-01-17 16:31 UTC (permalink / raw) To: Emilio G. Cota; +Cc: qemu-devel, Richard Henderson Emilio G. Cota <cota@braap.org> writes: > v6: https://lists.gnu.org/archive/html/qemu-devel/2019-01/msg02998.html > > Changes since v6: > > - Define TCG_TARGET_IMPLEMENTS_DYN_TLB for tcg/riscv (patch 2). > > - Fix --disable-tcg breakage (reported by Alex) by moving > tlb_entry_is_empty to cputlb.c, since the function's only caller > is in that file (patch 1). > > You can fetch this series from: > https://github.com/cota/qemu/tree/tlb-dyn-v7 > Well so far this has proved to be stable whilst I've been stressing it out so have a: Tested-by: Alex Bennée <alex.bennee@linaro.org> I'm currently benchmarking on my build workload but it's taking a while to run. -- Alex Bennée ^ permalink raw reply [flat|nested] 9+ messages in thread