* [Qemu-devel] [PATCH v4 00/14] tb hash improvements
@ 2016-04-30 3:33 Emilio G. Cota
2016-04-30 3:33 ` [Qemu-devel] [PATCH v4 01/14] compiler.h: add QEMU_ALIGNED() to enforce struct alignment Emilio G. Cota
` (13 more replies)
0 siblings, 14 replies; 31+ messages in thread
From: Emilio G. Cota @ 2016-04-30 3:33 UTC (permalink / raw)
To: QEMU Developers, MTTCG Devel
Cc: Alex Bennée, Paolo Bonzini, Peter Crosthwaite,
Richard Henderson, Sergey Fedorov
Changes from v3:
- applied the Reviewed-by tags received on v3. I dropped the tags from
the 'qht' and 'info jit' patches because those have changed quite a bit
since v3.
- qdist: new module to print intuitive histograms, see 'info jit' below.
- qht:
+ bug fix: remove unnecessary requirement of hashes being !0; the
only requirement is that pointers must be !NULL.
* qht-test: hash the integers we insert with their own values instead
of with tb_hash_func5. This gives us better control over the hash
values under test; the inserted values are all unique anyway, so the
choice of hash function does not affect correctness.
+ bug fix: was not setting map->n_items to 0 in qht_reset().
+ Do not leave NULL holes after removals. Instead, fill the hole
with the last valid item in the chain (see the sketch at the end
of this changelog). Performance-wise this makes no difference when
resize is on; without resize, however, the gain is measurable.
* A consequence of this is a slight change in MRU promotion: the
last item in the head bucket is simply swapped with orig[pos],
without bringing orig to head->next.
* Added bucket corruption checks, enabled with #define QHT_DEBUG.
+ Do not set QHT_MODE_MRU_INSERT for the TB hash. With long chains it
causes quite a performance decrease; with short chains, such as what
we have with xxhash + auto-resize, it has no measurable performance
impact.
+ 'info jit' stats:
* Report the number of empty buckets
* Do not count empty buckets when reporting avg bucket chain length;
by doing this we get an idea of how many buckets the average lookup
is going through.
* Report the avg bucket chain length + a histogram for its distribution.
* Report avg bucket chain occupancy (in %) + its distribution's histogram.
+ qht-test: add a few more test cases
+ header guard: s/define QHT_H/define QEMU_QHT_H/
+ consistently use uint32_t for keeping the result of tb_hash_func()
+ avoid false leak reports from valgrind after calling call_rcu(map)
by placing the 'struct rcu_head' field at the top of struct qht_map.
- define QEMU_ALIGNED(X) instead of QEMU_ALIGNED(B)
- add copyright + license to include/processor.h
- add atomic_test_and_set to include/atomic.h, using __atomic_test_and_set
when available.
- spinlock:
+ use newly-added atomic_test_and_set instead of atomic_xchg
+ remove TATAS for spin_try_lock
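For illustration, here is a minimal sketch of the hole-filling removal
mentioned above. This is hypothetical, simplified code: the real qht
buckets also carry the hashes, a seqlock and a spinlock.

#include <stddef.h>

/* Remove slots[pos] without leaving a NULL hole: move the last valid
 * item of the chain into the vacated slot, so that lookups can stop
 * at the first NULL they encounter. */
static void chain_remove(void **slots, size_t *n_valid, size_t pos)
{
    size_t last = *n_valid - 1;

    if (pos != last) {
        slots[pos] = slots[last];
    }
    slots[last] = NULL;
    (*n_valid)--;
}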
Thanks,
Emilio
* [Qemu-devel] [PATCH v4 01/14] compiler.h: add QEMU_ALIGNED() to enforce struct alignment
2016-04-30 3:33 [Qemu-devel] [PATCH v4 00/14] tb hash improvements Emilio G. Cota
@ 2016-04-30 3:33 ` Emilio G. Cota
2016-04-30 3:33 ` [Qemu-devel] [PATCH v4 02/14] seqlock: remove optional mutex Emilio G. Cota
` (12 subsequent siblings)
13 siblings, 0 replies; 31+ messages in thread
From: Emilio G. Cota @ 2016-04-30 3:33 UTC (permalink / raw)
To: QEMU Developers, MTTCG Devel
Cc: Alex Bennée, Paolo Bonzini, Peter Crosthwaite,
Richard Henderson, Sergey Fedorov
Reviewed-by: Richard Henderson <rth@twiddle.net>
Reviewed-by: Alex Bennée <alex.bennee@linaro.org>
Signed-off-by: Emilio G. Cota <cota@braap.org>
---
include/qemu/compiler.h | 2 ++
1 file changed, 2 insertions(+)
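As a usage note (a hedged sketch; the qht patch later in this series
aligns its buckets exactly this way):

/* hypothetical example: force instances of struct foo onto 64-byte
 * boundaries, e.g. to give each one its own cache line */
struct foo {
    int x;
} QEMU_ALIGNED(64);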
diff --git a/include/qemu/compiler.h b/include/qemu/compiler.h
index 8f1cc7b..b64f899 100644
--- a/include/qemu/compiler.h
+++ b/include/qemu/compiler.h
@@ -41,6 +41,8 @@
# define QEMU_PACKED __attribute__((packed))
#endif
+#define QEMU_ALIGNED(X) __attribute__((aligned(X)))
+
#ifndef glue
#define xglue(x, y) x ## y
#define glue(x, y) xglue(x, y)
--
2.5.0
* [Qemu-devel] [PATCH v4 02/14] seqlock: remove optional mutex
2016-04-30 3:33 [Qemu-devel] [PATCH v4 00/14] tb hash improvements Emilio G. Cota
2016-04-30 3:33 ` [Qemu-devel] [PATCH v4 01/14] compiler.h: add QEMU_ALIGNED() to enforce struct alignment Emilio G. Cota
@ 2016-04-30 3:33 ` Emilio G. Cota
2016-04-30 3:33 ` [Qemu-devel] [PATCH v4 03/14] seqlock: rename write_lock/unlock to write_begin/end Emilio G. Cota
` (11 subsequent siblings)
13 siblings, 0 replies; 31+ messages in thread
From: Emilio G. Cota @ 2016-04-30 3:33 UTC (permalink / raw)
To: QEMU Developers, MTTCG Devel
Cc: Alex Bennée, Paolo Bonzini, Peter Crosthwaite,
Richard Henderson, Sergey Fedorov
This option is unused; besides, it bloats the struct when not needed.
Let's just let writers define their own locks elsewhere.
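For example (a hedged sketch with hypothetical names), a writer now
pairs the seqlock with an external lock of its own choosing:

static QemuMutex writer_lock; /* owned by the caller, not the seqlock */
static QemuSeqLock sl;

static void update_protected_fields(void)
{
    qemu_mutex_lock(&writer_lock); /* serialize writers externally */
    seqlock_write_lock(&sl);       /* bump sequence; readers retry */
    /* ... update the fields protected by the seqlock ... */
    seqlock_write_unlock(&sl);
    qemu_mutex_unlock(&writer_lock);
}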
Reviewed-by: Alex Bennée <alex.bennee@linaro.org>
Reviewed-by: Richard Henderson <rth@twiddle.net>
Signed-off-by: Emilio G. Cota <cota@braap.org>
---
cpus.c | 2 +-
include/qemu/seqlock.h | 10 +---------
2 files changed, 2 insertions(+), 10 deletions(-)
diff --git a/cpus.c b/cpus.c
index cbeb1f6..dd86da5 100644
--- a/cpus.c
+++ b/cpus.c
@@ -619,7 +619,7 @@ int cpu_throttle_get_percentage(void)
void cpu_ticks_init(void)
{
- seqlock_init(&timers_state.vm_clock_seqlock, NULL);
+ seqlock_init(&timers_state.vm_clock_seqlock);
vmstate_register(NULL, 0, &vmstate_timers, &timers_state);
throttle_timer = timer_new_ns(QEMU_CLOCK_VIRTUAL_RT,
cpu_throttle_timer_tick, NULL);
diff --git a/include/qemu/seqlock.h b/include/qemu/seqlock.h
index 70b01fd..e673482 100644
--- a/include/qemu/seqlock.h
+++ b/include/qemu/seqlock.h
@@ -19,22 +19,17 @@
typedef struct QemuSeqLock QemuSeqLock;
struct QemuSeqLock {
- QemuMutex *mutex;
unsigned sequence;
};
-static inline void seqlock_init(QemuSeqLock *sl, QemuMutex *mutex)
+static inline void seqlock_init(QemuSeqLock *sl)
{
- sl->mutex = mutex;
sl->sequence = 0;
}
/* Lock out other writers and update the count. */
static inline void seqlock_write_lock(QemuSeqLock *sl)
{
- if (sl->mutex) {
- qemu_mutex_lock(sl->mutex);
- }
++sl->sequence;
/* Write sequence before updating other fields. */
@@ -47,9 +42,6 @@ static inline void seqlock_write_unlock(QemuSeqLock *sl)
smp_wmb();
++sl->sequence;
- if (sl->mutex) {
- qemu_mutex_unlock(sl->mutex);
- }
}
static inline unsigned seqlock_read_begin(QemuSeqLock *sl)
--
2.5.0
* [Qemu-devel] [PATCH v4 03/14] seqlock: rename write_lock/unlock to write_begin/end
2016-04-30 3:33 [Qemu-devel] [PATCH v4 00/14] tb hash improvements Emilio G. Cota
2016-04-30 3:33 ` [Qemu-devel] [PATCH v4 01/14] compiler.h: add QEMU_ALIGNED() to enforce struct alignment Emilio G. Cota
2016-04-30 3:33 ` [Qemu-devel] [PATCH v4 02/14] seqlock: remove optional mutex Emilio G. Cota
@ 2016-04-30 3:33 ` Emilio G. Cota
2016-04-30 3:33 ` [Qemu-devel] [PATCH v4 04/14] include/processor.h: define cpu_relax() Emilio G. Cota
` (10 subsequent siblings)
13 siblings, 0 replies; 31+ messages in thread
From: Emilio G. Cota @ 2016-04-30 3:33 UTC (permalink / raw)
To: QEMU Developers, MTTCG Devel
Cc: Alex Bennée, Paolo Bonzini, Peter Crosthwaite,
Richard Henderson, Sergey Fedorov
These are more appropriate names, now that the mutex embedded
in the seqlock is gone.
Reviewed-by: Alex Bennée <alex.bennee@linaro.org>
Reviewed-by: Richard Henderson <rth@twiddle.net>
Signed-off-by: Emilio G. Cota <cota@braap.org>
---
cpus.c | 28 ++++++++++++++--------------
include/qemu/seqlock.h | 4 ++--
2 files changed, 16 insertions(+), 16 deletions(-)
diff --git a/cpus.c b/cpus.c
index dd86da5..735c9b2 100644
--- a/cpus.c
+++ b/cpus.c
@@ -247,13 +247,13 @@ int64_t cpu_get_clock(void)
void cpu_enable_ticks(void)
{
/* Here, the really thing protected by seqlock is cpu_clock_offset. */
- seqlock_write_lock(&timers_state.vm_clock_seqlock);
+ seqlock_write_begin(&timers_state.vm_clock_seqlock);
if (!timers_state.cpu_ticks_enabled) {
timers_state.cpu_ticks_offset -= cpu_get_host_ticks();
timers_state.cpu_clock_offset -= get_clock();
timers_state.cpu_ticks_enabled = 1;
}
- seqlock_write_unlock(&timers_state.vm_clock_seqlock);
+ seqlock_write_end(&timers_state.vm_clock_seqlock);
}
/* disable cpu_get_ticks() : the clock is stopped. You must not call
@@ -263,13 +263,13 @@ void cpu_enable_ticks(void)
void cpu_disable_ticks(void)
{
/* Here, the really thing protected by seqlock is cpu_clock_offset. */
- seqlock_write_lock(&timers_state.vm_clock_seqlock);
+ seqlock_write_begin(&timers_state.vm_clock_seqlock);
if (timers_state.cpu_ticks_enabled) {
timers_state.cpu_ticks_offset += cpu_get_host_ticks();
timers_state.cpu_clock_offset = cpu_get_clock_locked();
timers_state.cpu_ticks_enabled = 0;
}
- seqlock_write_unlock(&timers_state.vm_clock_seqlock);
+ seqlock_write_end(&timers_state.vm_clock_seqlock);
}
/* Correlation between real and virtual time is always going to be
@@ -292,7 +292,7 @@ static void icount_adjust(void)
return;
}
- seqlock_write_lock(&timers_state.vm_clock_seqlock);
+ seqlock_write_begin(&timers_state.vm_clock_seqlock);
cur_time = cpu_get_clock_locked();
cur_icount = cpu_get_icount_locked();
@@ -313,7 +313,7 @@ static void icount_adjust(void)
last_delta = delta;
timers_state.qemu_icount_bias = cur_icount
- (timers_state.qemu_icount << icount_time_shift);
- seqlock_write_unlock(&timers_state.vm_clock_seqlock);
+ seqlock_write_end(&timers_state.vm_clock_seqlock);
}
static void icount_adjust_rt(void *opaque)
@@ -353,7 +353,7 @@ static void icount_warp_rt(void)
return;
}
- seqlock_write_lock(&timers_state.vm_clock_seqlock);
+ seqlock_write_begin(&timers_state.vm_clock_seqlock);
if (runstate_is_running()) {
int64_t clock = REPLAY_CLOCK(REPLAY_CLOCK_VIRTUAL_RT,
cpu_get_clock_locked());
@@ -372,7 +372,7 @@ static void icount_warp_rt(void)
timers_state.qemu_icount_bias += warp_delta;
}
vm_clock_warp_start = -1;
- seqlock_write_unlock(&timers_state.vm_clock_seqlock);
+ seqlock_write_end(&timers_state.vm_clock_seqlock);
if (qemu_clock_expired(QEMU_CLOCK_VIRTUAL)) {
qemu_clock_notify(QEMU_CLOCK_VIRTUAL);
@@ -397,9 +397,9 @@ void qtest_clock_warp(int64_t dest)
int64_t deadline = qemu_clock_deadline_ns_all(QEMU_CLOCK_VIRTUAL);
int64_t warp = qemu_soonest_timeout(dest - clock, deadline);
- seqlock_write_lock(&timers_state.vm_clock_seqlock);
+ seqlock_write_begin(&timers_state.vm_clock_seqlock);
timers_state.qemu_icount_bias += warp;
- seqlock_write_unlock(&timers_state.vm_clock_seqlock);
+ seqlock_write_end(&timers_state.vm_clock_seqlock);
qemu_clock_run_timers(QEMU_CLOCK_VIRTUAL);
timerlist_run_timers(aio_context->tlg.tl[QEMU_CLOCK_VIRTUAL]);
@@ -466,9 +466,9 @@ void qemu_start_warp_timer(void)
* It is useful when we want a deterministic execution time,
* isolated from host latencies.
*/
- seqlock_write_lock(&timers_state.vm_clock_seqlock);
+ seqlock_write_begin(&timers_state.vm_clock_seqlock);
timers_state.qemu_icount_bias += deadline;
- seqlock_write_unlock(&timers_state.vm_clock_seqlock);
+ seqlock_write_end(&timers_state.vm_clock_seqlock);
qemu_clock_notify(QEMU_CLOCK_VIRTUAL);
} else {
/*
@@ -479,11 +479,11 @@ void qemu_start_warp_timer(void)
* you will not be sending network packets continuously instead of
* every 100ms.
*/
- seqlock_write_lock(&timers_state.vm_clock_seqlock);
+ seqlock_write_begin(&timers_state.vm_clock_seqlock);
if (vm_clock_warp_start == -1 || vm_clock_warp_start > clock) {
vm_clock_warp_start = clock;
}
- seqlock_write_unlock(&timers_state.vm_clock_seqlock);
+ seqlock_write_end(&timers_state.vm_clock_seqlock);
timer_mod_anticipate(icount_warp_timer, clock + deadline);
}
} else if (deadline == 0) {
diff --git a/include/qemu/seqlock.h b/include/qemu/seqlock.h
index e673482..4dfc055 100644
--- a/include/qemu/seqlock.h
+++ b/include/qemu/seqlock.h
@@ -28,7 +28,7 @@ static inline void seqlock_init(QemuSeqLock *sl)
}
/* Lock out other writers and update the count. */
-static inline void seqlock_write_lock(QemuSeqLock *sl)
+static inline void seqlock_write_begin(QemuSeqLock *sl)
{
++sl->sequence;
@@ -36,7 +36,7 @@ static inline void seqlock_write_lock(QemuSeqLock *sl)
smp_wmb();
}
-static inline void seqlock_write_unlock(QemuSeqLock *sl)
+static inline void seqlock_write_end(QemuSeqLock *sl)
{
/* Write other fields before finalizing sequence. */
smp_wmb();
--
2.5.0
* [Qemu-devel] [PATCH v4 04/14] include/processor.h: define cpu_relax()
2016-04-30 3:33 [Qemu-devel] [PATCH v4 00/14] tb hash improvements Emilio G. Cota
` (2 preceding siblings ...)
2016-04-30 3:33 ` [Qemu-devel] [PATCH v4 03/14] seqlock: rename write_lock/unlock to write_begin/end Emilio G. Cota
@ 2016-04-30 3:33 ` Emilio G. Cota
2016-04-30 3:33 ` [Qemu-devel] [PATCH v4 05/14] atomics: add atomic_test_and_set Emilio G. Cota
` (9 subsequent siblings)
13 siblings, 0 replies; 31+ messages in thread
From: Emilio G. Cota @ 2016-04-30 3:33 UTC (permalink / raw)
To: QEMU Developers, MTTCG Devel
Cc: Alex Bennée, Paolo Bonzini, Peter Crosthwaite,
Richard Henderson, Sergey Fedorov
Taken from the Linux kernel.
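cpu_relax() is meant for busy-wait loops, e.g. (a hedged sketch with a
hypothetical 'flag'; the spinlock added later in this series spins
exactly like this):

while (atomic_read(&flag)) {
    cpu_relax(); /* pause/yield hint: cheaper spinning, SMT-friendly */
}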
Reviewed-by: Richard Henderson <rth@twiddle.net>
Reviewed-by: Alex Bennée <alex.bennee@linaro.org>
Signed-off-by: Emilio G. Cota <cota@braap.org>
---
include/qemu/processor.h | 34 ++++++++++++++++++++++++++++++++++
1 file changed, 34 insertions(+)
create mode 100644 include/qemu/processor.h
diff --git a/include/qemu/processor.h b/include/qemu/processor.h
new file mode 100644
index 0000000..4e6a71f
--- /dev/null
+++ b/include/qemu/processor.h
@@ -0,0 +1,34 @@
+/*
+ * Copyright (C) 2016, Emilio G. Cota <cota@braap.org>
+ *
+ * License: GNU GPL, version 2.
+ * See the COPYING file in the top-level directory.
+ */
+#ifndef QEMU_PROCESSOR_H
+#define QEMU_PROCESSOR_H
+
+#include "qemu/atomic.h"
+
+#if defined(__i386__) || defined(__x86_64__)
+#define cpu_relax() asm volatile("rep; nop" ::: "memory")
+#endif
+
+#ifdef __ia64__
+#define cpu_relax() asm volatile("hint @pause" ::: "memory")
+#endif
+
+#ifdef __aarch64__
+#define cpu_relax() asm volatile("yield" ::: "memory")
+#endif
+
+#if defined(__powerpc64__)
+/* set Hardware Multi-Threading (HMT) priority to low; then back to medium */
+#define cpu_relax() asm volatile("or 1, 1, 1;" \
+                                 "or 2, 2, 2;" ::: "memory")
+#endif
+
+#ifndef cpu_relax
+#define cpu_relax() barrier()
+#endif
+
+#endif /* QEMU_PROCESSOR_H */
--
2.5.0
* [Qemu-devel] [PATCH v4 05/14] atomics: add atomic_test_and_set
2016-04-30 3:33 [Qemu-devel] [PATCH v4 00/14] tb hash improvements Emilio G. Cota
` (3 preceding siblings ...)
2016-04-30 3:33 ` [Qemu-devel] [PATCH v4 04/14] include/processor.h: define cpu_relax() Emilio G. Cota
@ 2016-04-30 3:33 ` Emilio G. Cota
2016-05-04 5:10 ` Richard Henderson
2016-04-30 3:33 ` [Qemu-devel] [PATCH v4 06/14] qemu-thread: add simple test-and-set spinlock Emilio G. Cota
` (8 subsequent siblings)
13 siblings, 1 reply; 31+ messages in thread
From: Emilio G. Cota @ 2016-04-30 3:33 UTC (permalink / raw)
To: QEMU Developers, MTTCG Devel
Cc: Alex Bennée, Paolo Bonzini, Peter Crosthwaite,
Richard Henderson, Sergey Fedorov
This new helper expands to __atomic_test_and_set where available;
otherwise it expands to atomic_xchg.
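Like the builtin, it returns the previous value; acquiring a flag thus
looks like this (a hedged sketch, 'flag' being hypothetical):

if (atomic_test_and_set(&flag)) {
    /* flag was already set: someone else owns it */
} else {
    /* flag was clear and is now set: we own it */
}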
Signed-off-by: Emilio G. Cota <cota@braap.org>
---
include/qemu/atomic.h | 2 ++
1 file changed, 2 insertions(+)
diff --git a/include/qemu/atomic.h b/include/qemu/atomic.h
index 5bc4d6c..6132dad 100644
--- a/include/qemu/atomic.h
+++ b/include/qemu/atomic.h
@@ -134,6 +134,7 @@
})
/* Provide shorter names for GCC atomic builtins, return old value */
+#define atomic_test_and_set(ptr) __atomic_test_and_set(ptr, __ATOMIC_SEQ_CST)
#define atomic_fetch_inc(ptr) __atomic_fetch_add(ptr, 1, __ATOMIC_SEQ_CST)
#define atomic_fetch_dec(ptr) __atomic_fetch_sub(ptr, 1, __ATOMIC_SEQ_CST)
#define atomic_fetch_add(ptr, n) __atomic_fetch_add(ptr, n, __ATOMIC_SEQ_CST)
@@ -328,6 +329,7 @@
#endif
/* Provide shorter names for GCC atomic builtins. */
+#define atomic_test_and_set(ptr) atomic_xchg(ptr, true)
#define atomic_fetch_inc(ptr) __sync_fetch_and_add(ptr, 1)
#define atomic_fetch_dec(ptr) __sync_fetch_and_add(ptr, -1)
#define atomic_fetch_add __sync_fetch_and_add
--
2.5.0
* [Qemu-devel] [PATCH v4 06/14] qemu-thread: add simple test-and-set spinlock
2016-04-30 3:33 [Qemu-devel] [PATCH v4 00/14] tb hash improvements Emilio G. Cota
` (4 preceding siblings ...)
2016-04-30 3:33 ` [Qemu-devel] [PATCH v4 05/14] atomics: add atomic_test_and_set Emilio G. Cota
@ 2016-04-30 3:33 ` Emilio G. Cota
2016-05-04 5:10 ` Richard Henderson
2016-04-30 3:33 ` [Qemu-devel] [PATCH v4 07/14] exec: add tb_hash_func5, derived from xxhash Emilio G. Cota
` (7 subsequent siblings)
13 siblings, 1 reply; 31+ messages in thread
From: Emilio G. Cota @ 2016-04-30 3:33 UTC (permalink / raw)
To: QEMU Developers, MTTCG Devel
Cc: Alex Bennée, Paolo Bonzini, Peter Crosthwaite,
Richard Henderson, Sergey Fedorov
From: Guillaume Delbergue <guillaume.delbergue@greensocs.com>
Signed-off-by: Guillaume Delbergue <guillaume.delbergue@greensocs.com>
[Rewritten. - Paolo]
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
[Emilio's additions: use atomic_test_and_set instead of atomic_xchg;
call cpu_relax() while spinning; optimize for uncontended locks by
acquiring the lock with TAS instead of TATAS.]
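Typical usage (a hedged sketch):

static QemuSpin lock;

qemu_spin_init(&lock);
...
qemu_spin_lock(&lock);
/* keep critical sections short: contending CPUs burn cycles spinning */
qemu_spin_unlock(&lock);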
Signed-off-by: Emilio G. Cota <cota@braap.org>
---
include/qemu/thread.h | 34 ++++++++++++++++++++++++++++++++++
1 file changed, 34 insertions(+)
diff --git a/include/qemu/thread.h b/include/qemu/thread.h
index bdae6df..39ff1ac 100644
--- a/include/qemu/thread.h
+++ b/include/qemu/thread.h
@@ -1,6 +1,9 @@
#ifndef __QEMU_THREAD_H
#define __QEMU_THREAD_H 1
+#include <errno.h>
+#include "qemu/processor.h"
+#include "qemu/atomic.h"
typedef struct QemuMutex QemuMutex;
typedef struct QemuCond QemuCond;
@@ -60,4 +63,35 @@ struct Notifier;
void qemu_thread_atexit_add(struct Notifier *notifier);
void qemu_thread_atexit_remove(struct Notifier *notifier);
+typedef struct QemuSpin {
+ int value;
+} QemuSpin;
+
+static inline void qemu_spin_init(QemuSpin *spin)
+{
+ spin->value = 0;
+}
+
+static inline void qemu_spin_lock(QemuSpin *spin)
+{
+ while (atomic_test_and_set(&spin->value)) {
+ while (atomic_read(&spin->value)) {
+ cpu_relax();
+ }
+ }
+}
+
+static inline int qemu_spin_trylock(QemuSpin *spin)
+{
+ if (atomic_test_and_set(&spin->value)) {
+ return -EBUSY;
+ }
+ return 0;
+}
+
+static inline void qemu_spin_unlock(QemuSpin *spin)
+{
+ atomic_mb_set(&spin->value, 0);
+}
+
#endif
--
2.5.0
* [Qemu-devel] [PATCH v4 07/14] exec: add tb_hash_func5, derived from xxhash
2016-04-30 3:33 [Qemu-devel] [PATCH v4 00/14] tb hash improvements Emilio G. Cota
` (5 preceding siblings ...)
2016-04-30 3:33 ` [Qemu-devel] [PATCH v4 06/14] qemu-thread: add simple test-and-set spinlock Emilio G. Cota
@ 2016-04-30 3:33 ` Emilio G. Cota
2016-04-30 3:33 ` [Qemu-devel] [PATCH v4 08/14] tb hash: hash phys_pc, pc, and flags with xxhash Emilio G. Cota
` (6 subsequent siblings)
13 siblings, 0 replies; 31+ messages in thread
From: Emilio G. Cota @ 2016-04-30 3:33 UTC (permalink / raw)
To: QEMU Developers, MTTCG Devel
Cc: Alex Bennée, Paolo Bonzini, Peter Crosthwaite,
Richard Henderson, Sergey Fedorov
This will be used by upcoming changes that rework the hashing of
the TB hash table.
Add it in a separate file so that we can include xxhash's
copyright notice.
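A hedged sketch of the intended use; the real caller arrives in the
next patch, which folds the hash into a bucket index:

uint32_t h = tb_hash_func5(phys_pc, pc, flags); /* full 32-bit hash */
size_t idx = h & (CODE_GEN_PHYS_HASH_SIZE - 1); /* bucket index */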
Reviewed-by: Richard Henderson <rth@twiddle.net>
Signed-off-by: Emilio G. Cota <cota@braap.org>
---
include/exec/tb-hash-xx.h | 94 +++++++++++++++++++++++++++++++++++++++++++++++
1 file changed, 94 insertions(+)
create mode 100644 include/exec/tb-hash-xx.h
diff --git a/include/exec/tb-hash-xx.h b/include/exec/tb-hash-xx.h
new file mode 100644
index 0000000..67f4e6f
--- /dev/null
+++ b/include/exec/tb-hash-xx.h
@@ -0,0 +1,94 @@
+/*
+ * xxHash - Fast Hash algorithm
+ * Copyright (C) 2012-2016, Yann Collet
+ *
+ * BSD 2-Clause License (http://www.opensource.org/licenses/bsd-license.php)
+ *
+ * Redistribution and use in source and binary forms, with or without
+ * modification, are permitted provided that the following conditions are
+ * met:
+ *
+ * + Redistributions of source code must retain the above copyright
+ * notice, this list of conditions and the following disclaimer.
+ * + Redistributions in binary form must reproduce the above
+ * copyright notice, this list of conditions and the following disclaimer
+ * in the documentation and/or other materials provided with the
+ * distribution.
+ *
+ * THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS
+ * "AS IS" AND ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT
+ * LIMITED TO, THE IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR
+ * A PARTICULAR PURPOSE ARE DISCLAIMED. IN NO EVENT SHALL THE COPYRIGHT
+ * OWNER OR CONTRIBUTORS BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL,
+ * SPECIAL, EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT
+ * LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE,
+ * DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY
+ * THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT
+ * (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE
+ * OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.
+ *
+ * You can contact the author at :
+ * - xxHash source repository : https://github.com/Cyan4973/xxHash
+ */
+#ifndef EXEC_TB_HASH_XX
+#define EXEC_TB_HASH_XX
+
+#include "qemu/bitops.h"
+
+#define PRIME32_1 2654435761U
+#define PRIME32_2 2246822519U
+#define PRIME32_3 3266489917U
+#define PRIME32_4 668265263U
+#define PRIME32_5 374761393U
+
+#define TB_HASH_XX_SEED 1
+
+/*
+ * xxhash32, customized for input variables that are not guaranteed to be
+ * contiguous in memory.
+ */
+static inline
+uint32_t tb_hash_func5(uint64_t a0, uint64_t b0, uint32_t e)
+{
+ uint32_t v1 = TB_HASH_XX_SEED + PRIME32_1 + PRIME32_2;
+ uint32_t v2 = TB_HASH_XX_SEED + PRIME32_2;
+ uint32_t v3 = TB_HASH_XX_SEED + 0;
+ uint32_t v4 = TB_HASH_XX_SEED - PRIME32_1;
+ uint32_t a = a0 >> 31 >> 1;
+ uint32_t b = a0;
+ uint32_t c = b0 >> 31 >> 1;
+ uint32_t d = b0;
+ uint32_t h32;
+
+ v1 += a * PRIME32_2;
+ v1 = rol32(v1, 13);
+ v1 *= PRIME32_1;
+
+ v2 += b * PRIME32_2;
+ v2 = rol32(v2, 13);
+ v2 *= PRIME32_1;
+
+ v3 += c * PRIME32_2;
+ v3 = rol32(v3, 13);
+ v3 *= PRIME32_1;
+
+ v4 += d * PRIME32_2;
+ v4 = rol32(v4, 13);
+ v4 *= PRIME32_1;
+
+ h32 = rol32(v1, 1) + rol32(v2, 7) + rol32(v3, 12) + rol32(v4, 18);
+ h32 += 20;
+
+ h32 += e * PRIME32_3;
+ h32 = rol32(h32, 17) * PRIME32_4;
+
+ h32 ^= h32 >> 15;
+ h32 *= PRIME32_2;
+ h32 ^= h32 >> 13;
+ h32 *= PRIME32_3;
+ h32 ^= h32 >> 16;
+
+ return h32;
+}
+
+#endif /* EXEC_TB_HASH_XX */
--
2.5.0
* [Qemu-devel] [PATCH v4 08/14] tb hash: hash phys_pc, pc, and flags with xxhash
2016-04-30 3:33 [Qemu-devel] [PATCH v4 00/14] tb hash improvements Emilio G. Cota
` (6 preceding siblings ...)
2016-04-30 3:33 ` [Qemu-devel] [PATCH v4 07/14] exec: add tb_hash_func5, derived from xxhash Emilio G. Cota
@ 2016-04-30 3:33 ` Emilio G. Cota
2016-04-30 3:33 ` [Qemu-devel] [PATCH v4 09/14] qdist: add module to represent frequency distributions of data Emilio G. Cota
` (5 subsequent siblings)
13 siblings, 0 replies; 31+ messages in thread
From: Emilio G. Cota @ 2016-04-30 3:33 UTC (permalink / raw)
To: QEMU Developers, MTTCG Devel
Cc: Alex Bennée, Paolo Bonzini, Peter Crosthwaite,
Richard Henderson, Sergey Fedorov
For some workloads, such as ARM bootup, tb_phys_hash is performance-critical.
This is due to the high frequency of accesses to the hash table, caused
by the (frequent) TLB flushes that wipe out the cpu-private tb_jmp_cache's.
More info:
https://lists.nongnu.org/archive/html/qemu-devel/2016-03/msg05098.html
To dig further into this, I modified an ARM image booting Debian Jessie to
shut down immediately after boot. Analysis revealed that quite a bit of time
is unnecessarily spent in tb_phys_hash: the cause is poor hashing that
results in very uneven loading of chains in the hash table's buckets;
the longest observed chain had ~550 elements.
The appended patch addresses this with two changes:
1) Use xxhash as the hash table's hash function. xxhash is a fast,
high-quality hashing function.
2) Feed the hash function not just tb_phys, but also pc and flags.
This improves performance over hashing tb_phys alone, which left some
hash buckets with many TBs while others got very few; with these changes,
the longest observed chain on a single hash bucket drops from ~550 to
~40 elements.
Tests show that the other element checked for in tb_find_physical,
cs_base, is always a match when tb_phys+pc+flags are a match,
so hashing cs_base is wasteful. It could be that this is an ARM-only
thing, though.
BTW, after this change the hash table should not be called "tb_phys_hash"
anymore; this is addressed later in this series.
This change gives consistent bootup time improvements. I tested two
host machines:
- Intel Xeon E5-2690: 11.6% less time
- Intel i7-4790K: 19.2% less time
Increasing the number of hash buckets yields further improvements. However,
using a larger, fixed number of buckets can degrade performance for other
workloads that do not translate as many blocks (600K+ for Debian Jessie ARM
bootup). This is dealt with later in this series.
Reviewed-by: Richard Henderson <rth@twiddle.net>
Reviewed-by: Alex Bennée <alex.bennee@linaro.org>
Signed-off-by: Emilio G. Cota <cota@braap.org>
---
cpu-exec.c | 4 ++--
include/exec/tb-hash.h | 8 ++++++--
translate-all.c | 11 ++++++-----
3 files changed, 14 insertions(+), 9 deletions(-)
diff --git a/cpu-exec.c b/cpu-exec.c
index debc65c..395b302 100644
--- a/cpu-exec.c
+++ b/cpu-exec.c
@@ -224,7 +224,7 @@ static TranslationBlock *tb_find_physical(CPUState *cpu,
{
CPUArchState *env = (CPUArchState *)cpu->env_ptr;
TranslationBlock *tb, **ptb1;
- unsigned int h;
+ uint32_t h;
tb_page_addr_t phys_pc, phys_page1;
target_ulong virt_page2;
@@ -233,7 +233,7 @@ static TranslationBlock *tb_find_physical(CPUState *cpu,
/* find translated block using physical mappings */
phys_pc = get_page_addr_code(env, pc);
phys_page1 = phys_pc & TARGET_PAGE_MASK;
- h = tb_phys_hash_func(phys_pc);
+ h = tb_hash_func(phys_pc, pc, flags);
ptb1 = &tcg_ctx.tb_ctx.tb_phys_hash[h];
for(;;) {
tb = *ptb1;
diff --git a/include/exec/tb-hash.h b/include/exec/tb-hash.h
index 0f4e8a0..4b9635a 100644
--- a/include/exec/tb-hash.h
+++ b/include/exec/tb-hash.h
@@ -20,6 +20,9 @@
#ifndef EXEC_TB_HASH
#define EXEC_TB_HASH
+#include "exec/exec-all.h"
+#include "exec/tb-hash-xx.h"
+
/* Only the bottom TB_JMP_PAGE_BITS of the jump cache hash bits vary for
addresses on the same page. The top bits are the same. This allows
TLB invalidation to quickly clear a subset of the hash table. */
@@ -43,9 +46,10 @@ static inline unsigned int tb_jmp_cache_hash_func(target_ulong pc)
| (tmp & TB_JMP_ADDR_MASK));
}
-static inline unsigned int tb_phys_hash_func(tb_page_addr_t pc)
+static inline
+uint32_t tb_hash_func(tb_page_addr_t phys_pc, target_ulong pc, int flags)
{
- return (pc >> 2) & (CODE_GEN_PHYS_HASH_SIZE - 1);
+ return tb_hash_func5(phys_pc, pc, flags) & (CODE_GEN_PHYS_HASH_SIZE - 1);
}
#endif
diff --git a/translate-all.c b/translate-all.c
index 05c0b50..0463efc 100644
--- a/translate-all.c
+++ b/translate-all.c
@@ -965,13 +965,14 @@ void tb_phys_invalidate(TranslationBlock *tb, tb_page_addr_t page_addr)
{
CPUState *cpu;
PageDesc *p;
- unsigned int h, n1;
+ unsigned int n1;
+ uint32_t h;
tb_page_addr_t phys_pc;
TranslationBlock *tb1, *tb2;
/* remove the TB from the hash list */
phys_pc = tb->page_addr[0] + (tb->pc & ~TARGET_PAGE_MASK);
- h = tb_phys_hash_func(phys_pc);
+ h = tb_hash_func(phys_pc, tb->pc, tb->flags);
tb_hash_remove(&tcg_ctx.tb_ctx.tb_phys_hash[h], tb);
/* remove the TB from the page list */
@@ -1470,11 +1471,11 @@ static inline void tb_alloc_page(TranslationBlock *tb,
static void tb_link_page(TranslationBlock *tb, tb_page_addr_t phys_pc,
tb_page_addr_t phys_page2)
{
- unsigned int h;
+ uint32_t h;
TranslationBlock **ptb;
- /* add in the physical hash table */
- h = tb_phys_hash_func(phys_pc);
+ /* add in the hash table */
+ h = tb_hash_func(phys_pc, tb->pc, tb->flags);
ptb = &tcg_ctx.tb_ctx.tb_phys_hash[h];
tb->phys_hash_next = *ptb;
*ptb = tb;
--
2.5.0
* [Qemu-devel] [PATCH v4 09/14] qdist: add module to represent frequency distributions of data
2016-04-30 3:33 [Qemu-devel] [PATCH v4 00/14] tb hash improvements Emilio G. Cota
` (7 preceding siblings ...)
2016-04-30 3:33 ` [Qemu-devel] [PATCH v4 08/14] tb hash: hash phys_pc, pc, and flags with xxhash Emilio G. Cota
@ 2016-04-30 3:33 ` Emilio G. Cota
2016-05-04 5:13 ` Richard Henderson
2016-04-30 3:33 ` [Qemu-devel] [PATCH v4 10/14] qdist: add test program Emilio G. Cota
` (4 subsequent siblings)
13 siblings, 1 reply; 31+ messages in thread
From: Emilio G. Cota @ 2016-04-30 3:33 UTC (permalink / raw)
To: QEMU Developers, MTTCG Devel
Cc: Alex Bennée, Paolo Bonzini, Peter Crosthwaite,
Richard Henderson, Sergey Fedorov
Sometimes it is useful to have a quick histogram to represent a certain
distribution -- for example, when investigating a performance regression
in a hash table due to inadequate hashing.
The appended patch allows us to easily represent a distribution using
Unicode characters. Further, the data structure keeping track of the
distribution is so simple that obtaining its values for off-line
processing is trivial.
For example, taking the last 10 commits to QEMU:
Characters in commit title Count
-----------------------------------
39 1
48 1
53 1
54 2
57 1
61 1
67 1
78 1
80 1
qdist_init(&dist);
qdist_inc(&dist, 39);
[...]
qdist_inc(&dist, 80);
char *str = qdist_pr(&dist, 9, QDIST_PR_LABELS);
// -> [39.0,43.6)▂▂ █▂ ▂ ▄[75.4,80.0]
g_free(str);
char *str = qdist_pr(&dist, 4, QDIST_PR_LABELS);
// -> [39.0,49.2)▁█▁▁[69.8,80.0]
g_free(str);
Signed-off-by: Emilio G. Cota <cota@braap.org>
---
include/qemu/qdist.h | 62 +++++++++
util/Makefile.objs | 2 +-
util/qdist.c | 386 +++++++++++++++++++++++++++++++++++++++++++++++++++
3 files changed, 449 insertions(+), 1 deletion(-)
create mode 100644 include/qemu/qdist.h
create mode 100644 util/qdist.c
diff --git a/include/qemu/qdist.h b/include/qemu/qdist.h
new file mode 100644
index 0000000..6d8b701
--- /dev/null
+++ b/include/qemu/qdist.h
@@ -0,0 +1,62 @@
+/*
+ * Copyright (C) 2016, Emilio G. Cota <cota@braap.org>
+ *
+ * License: GNU GPL, version 2 or later.
+ * See the COPYING file in the top-level directory.
+ */
+#ifndef QEMU_QDIST_H
+#define QEMU_QDIST_H
+
+#include "qemu/osdep.h"
+#include "qemu-common.h"
+#include "qemu/bitops.h"
+
+/*
+ * Samples with the same 'x value' end up in the same qdist_entry,
+ * e.g. inc(0.1) and inc(0.1) end up as {x=0.1, count=2}.
+ *
+ * Binning happens only at print time, so that we retain the flexibility to
+ * choose the binning. This might not be ideal for workloads that do not care
+ * much about precision and insert many samples all with different x values;
+ * in that case, pre-binning (e.g. entering both 0.115 and 0.097 as 0.1)
+ * should be considered.
+ */
+struct qdist_entry {
+ double x;
+ unsigned long count;
+};
+
+struct qdist {
+ struct qdist_entry *entries;
+ size_t n;
+};
+
+#define QDIST_PR_BORDER BIT(0)
+#define QDIST_PR_LABELS BIT(1)
+/* the remaining options only work if PR_LABELS is set */
+#define QDIST_PR_NODECIMAL BIT(2)
+#define QDIST_PR_PERCENT BIT(3)
+#define QDIST_PR_100X BIT(4)
+#define QDIST_PR_NOBINRANGE BIT(5)
+
+void qdist_init(struct qdist *dist);
+void qdist_destroy(struct qdist *dist);
+
+void qdist_add(struct qdist *dist, double x, long count);
+void qdist_inc(struct qdist *dist, double x);
+double qdist_xmin(const struct qdist *dist);
+double qdist_xmax(const struct qdist *dist);
+double qdist_avg(const struct qdist *dist);
+unsigned long qdist_sample_count(const struct qdist *dist);
+size_t qdist_unique_entries(const struct qdist *dist);
+
+/* callers must free the returned string with g_free() */
+char *qdist_pr_plain(const struct qdist *dist, size_t n_groups);
+
+/* callers must free the returned string with g_free() */
+char *qdist_pr(const struct qdist *dist, size_t n_groups, uint32_t opt);
+
+/* Only qdist code and test code should ever call this function */
+void qdist_bin__internal(struct qdist *to, const struct qdist *from, size_t n);
+
+#endif /* QEMU_QDIST_H */
diff --git a/util/Makefile.objs b/util/Makefile.objs
index a8a777e..5985a2e 100644
--- a/util/Makefile.objs
+++ b/util/Makefile.objs
@@ -1,4 +1,4 @@
-util-obj-y = osdep.o cutils.o unicode.o qemu-timer-common.o
+util-obj-y = osdep.o cutils.o unicode.o qemu-timer-common.o qdist.o
util-obj-$(CONFIG_POSIX) += compatfd.o
util-obj-$(CONFIG_POSIX) += event_notifier-posix.o
util-obj-$(CONFIG_POSIX) += mmap-alloc.o
diff --git a/util/qdist.c b/util/qdist.c
new file mode 100644
index 0000000..3343640
--- /dev/null
+++ b/util/qdist.c
@@ -0,0 +1,386 @@
+/*
+ * qdist.c - QEMU helpers for handling frequency distributions of data.
+ *
+ * Copyright (C) 2016, Emilio G. Cota <cota@braap.org>
+ *
+ * License: GNU GPL, version 2 or later.
+ * See the COPYING file in the top-level directory.
+ */
+#include "qemu/qdist.h"
+
+#include <math.h>
+#ifndef NAN
+#define NAN (0.0 / 0.0)
+#endif
+
+void qdist_init(struct qdist *dist)
+{
+ dist->entries = NULL;
+ dist->n = 0;
+}
+
+void qdist_destroy(struct qdist *dist)
+{
+ g_free(dist->entries);
+}
+
+static inline int qdist_cmp_double(double a, double b)
+{
+ if (a > b) {
+ return 1;
+ } else if (a < b) {
+ return -1;
+ }
+ return 0;
+}
+
+static int qdist_cmp(const void *ap, const void *bp)
+{
+ const struct qdist_entry *a = ap;
+ const struct qdist_entry *b = bp;
+
+ return qdist_cmp_double(a->x, b->x);
+}
+
+void qdist_add(struct qdist *dist, double x, long count)
+{
+ struct qdist_entry *entry = NULL;
+
+ if (dist->entries) {
+ struct qdist_entry e;
+
+ e.x = x;
+ entry = bsearch(&e, dist->entries, dist->n, sizeof(e), qdist_cmp);
+ }
+
+ if (entry) {
+ entry->count += count;
+ return;
+ }
+
+ dist->entries = g_realloc(dist->entries,
+ sizeof(*dist->entries) * (dist->n + 1));
+ dist->n++;
+ entry = &dist->entries[dist->n - 1];
+ entry->x = x;
+ entry->count = count;
+ qsort(dist->entries, dist->n, sizeof(*entry), qdist_cmp);
+}
+
+void qdist_inc(struct qdist *dist, double x)
+{
+ qdist_add(dist, x, 1);
+}
+
+/*
+ * Unicode for block elements. See:
+ * https://en.wikipedia.org/wiki/Block_Elements
+ */
+static const gunichar qdist_blocks[] = {
+ 0x2581,
+ 0x2582,
+ 0x2583,
+ 0x2584,
+ 0x2585,
+ 0x2586,
+ 0x2587,
+ 0x2588
+};
+
+#define QDIST_NR_BLOCK_CODES ARRAY_SIZE(qdist_blocks)
+
+/*
+ * Print a distribution into a string.
+ *
+ * This function assumes that appropriate binning has been done on the input;
+ * see qdist_bin__internal() and qdist_pr_plain().
+ *
+ * Callers must free the returned string with g_free().
+ */
+static char *qdist_pr_internal(const struct qdist *dist)
+{
+ double min, max, step;
+ GString *s = g_string_new("");
+ size_t i;
+
+ /* if only one entry, its printout will be either full or empty */
+ if (dist->n == 1) {
+ if (dist->entries[0].count) {
+ g_string_append_unichar(s, qdist_blocks[QDIST_NR_BLOCK_CODES - 1]);
+ } else {
+ g_string_append_c(s, ' ');
+ }
+ goto out;
+ }
+
+ /* get min and max counts */
+ min = dist->entries[0].count;
+ max = min;
+ for (i = 0; i < dist->n; i++) {
+ struct qdist_entry *e = &dist->entries[i];
+
+ if (e->count < min) {
+ min = e->count;
+ }
+ if (e->count > max) {
+ max = e->count;
+ }
+ }
+
+ /* floor((count - min) * step) will give us the block index */
+ step = (QDIST_NR_BLOCK_CODES - 1) / (max - min);
+
+ for (i = 0; i < dist->n; i++) {
+ struct qdist_entry *e = &dist->entries[i];
+ int index;
+
+ /* make an exception with 0; instead of using block[0], print a space */
+ if (e->count) {
+ index = (int)((e->count - min) * step);
+ g_string_append_unichar(s, qdist_blocks[index]);
+ } else {
+ g_string_append_c(s, ' ');
+ }
+ }
+ out:
+ return g_string_free(s, FALSE);
+}
+
+/*
+ * Bin the distribution in @from into @n bins of consecutive, non-overlapping
+ * intervals, copying the result to @to.
+ *
+ * This function is internal to qdist: only this file and test code should
+ * ever call it.
+ *
+ * Note: calling this function on an already-binned qdist is a bug.
+ *
+ * If @n == 0 or @from->n == 1, use @from->n.
+ */
+void qdist_bin__internal(struct qdist *to, const struct qdist *from, size_t n)
+{
+ double xmin, xmax;
+ double step;
+ size_t i, j, j_min;
+
+ qdist_init(to);
+
+ if (!from->entries) {
+ return;
+ }
+ if (!n || from->n == 1) {
+ n = from->n;
+ }
+
+ /* set equally-sized bins between @from's left and right */
+ xmin = qdist_xmin(from);
+ xmax = qdist_xmax(from);
+ step = (xmax - xmin) / n;
+
+ if (n == from->n) {
+ /* if @from's entries are equally spaced, no need to re-bin */
+ for (i = 0; i < from->n; i++) {
+ if (from->entries[i].x != xmin + i * step) {
+ goto rebin;
+ }
+ }
+ /* they're equally spaced, so copy the dist and bail out */
+ to->entries = g_malloc(sizeof(*to->entries) * from->n);
+ to->n = from->n;
+ memcpy(to->entries, from->entries, sizeof(*to->entries) * to->n);
+ return;
+ }
+
+ rebin:
+ j_min = 0;
+ for (i = 0; i < n; i++) {
+ double x;
+ double left, right;
+
+ left = xmin + i * step;
+ right = xmin + (i + 1) * step;
+
+ /* Add x, even if it might not get any counts later */
+ x = left;
+ qdist_add(to, x, 0);
+
+ /*
+ * To avoid double-counting we capture [left, right) ranges, except for
+ * the rightmost bin, which captures a [left, right] range.
+ */
+ for (j = j_min; j < from->n; j++) {
+ struct qdist_entry *o = &from->entries[j];
+
+ /* entries are ordered so do not check beyond right */
+ if (o->x > right) {
+ break;
+ }
+ if (o->x >= left && (o->x < right ||
+ (i == n - 1 && o->x == right))) {
+ qdist_add(to, x, o->count);
+ /* don't check this entry again */
+ j_min = j + 1;
+ }
+ }
+ }
+}
+
+/*
+ * Print @dist into a string, after re-binning it into @n bins of consecutive,
+ * non-overlapping intervals.
+ *
+ * If @n == 0, use @dist->n.
+ *
+ * Callers must free the returned string with g_free().
+ */
+char *qdist_pr_plain(const struct qdist *dist, size_t n)
+{
+ struct qdist binned;
+ char *ret;
+
+ if (!dist->entries) {
+ return NULL;
+ }
+ qdist_bin__internal(&binned, dist, n);
+ ret = qdist_pr_internal(&binned);
+ qdist_destroy(&binned);
+ return ret;
+}
+
+static char *qdist_pr_label(const struct qdist *dist, size_t n_bins,
+ uint32_t opt, bool is_left)
+{
+ const char *percent;
+ const char *lparen;
+ const char *rparen;
+ GString *s;
+ double x1, x2, step;
+ double x;
+ double n;
+ int dec;
+
+ s = g_string_new("");
+ if (!(opt & QDIST_PR_LABELS)) {
+ goto out;
+ }
+
+ dec = opt & QDIST_PR_NODECIMAL ? 0 : 1;
+ percent = opt & QDIST_PR_PERCENT ? "%" : "";
+
+ n = n_bins ? n_bins : dist->n;
+ x = is_left ? qdist_xmin(dist) : qdist_xmax(dist);
+ step = (qdist_xmax(dist) - qdist_xmin(dist)) / n;
+
+ if (opt & QDIST_PR_100X) {
+ x *= 100.0;
+ step *= 100.0;
+ }
+ if (opt & QDIST_PR_NOBINRANGE) {
+ lparen = rparen = "";
+ x1 = x;
+ x2 = x; /* unnecessary, but a dumb compiler might not figure it out */
+ } else {
+ lparen = "[";
+ rparen = is_left ? ")" : "]";
+ if (is_left) {
+ x1 = x;
+ x2 = x + step;
+ } else {
+ x1 = x - step;
+ x2 = x;
+ }
+ }
+ g_string_append_printf(s, "%s%.*f", lparen, dec, x1);
+ if (!(opt & QDIST_PR_NOBINRANGE)) {
+ g_string_append_printf(s, ",%.*f%s", dec, x2, rparen);
+ }
+ g_string_append(s, percent);
+ out:
+ return g_string_free(s, FALSE);
+}
+
+/*
+ * Print the distribution's histogram into a string.
+ *
+ * See also: qdist_pr_plain().
+ *
+ * Callers must free the returned string with g_free().
+ */
+char *qdist_pr(const struct qdist *dist, size_t n_bins, uint32_t opt)
+{
+ const char *border = opt & QDIST_PR_BORDER ? "|" : "";
+ char *llabel, *rlabel;
+ char *hgram;
+ GString *s;
+
+ if (dist->entries == NULL) {
+ return NULL;
+ }
+
+ s = g_string_new("");
+
+ llabel = qdist_pr_label(dist, n_bins, opt, true);
+ rlabel = qdist_pr_label(dist, n_bins, opt, false);
+ hgram = qdist_pr_plain(dist, n_bins);
+ g_string_append_printf(s, "%s%s%s%s%s",
+ llabel, border, hgram, border, rlabel);
+ g_free(llabel);
+ g_free(rlabel);
+ g_free(hgram);
+
+ return g_string_free(s, FALSE);
+}
+
+static inline double qdist_x(const struct qdist *dist, int index)
+{
+ if (dist->entries == NULL) {
+ return NAN;
+ }
+ return dist->entries[index].x;
+}
+
+double qdist_xmin(const struct qdist *dist)
+{
+ return qdist_x(dist, 0);
+}
+
+double qdist_xmax(const struct qdist *dist)
+{
+ return qdist_x(dist, dist->n - 1);
+}
+
+size_t qdist_unique_entries(const struct qdist *dist)
+{
+ return dist->n;
+}
+
+unsigned long qdist_sample_count(const struct qdist *dist)
+{
+ unsigned long count = 0;
+ size_t i;
+
+ for (i = 0; i < dist->n; i++) {
+ struct qdist_entry *e = &dist->entries[i];
+
+ count += e->count;
+ }
+ return count;
+}
+
+double qdist_avg(const struct qdist *dist)
+{
+ unsigned long count;
+ size_t i;
+ double ret = 0;
+
+ count = qdist_sample_count(dist);
+ if (!count) {
+ return NAN;
+ }
+ for (i = 0; i < dist->n; i++) {
+ struct qdist_entry *e = &dist->entries[i];
+
+ ret += e->x * e->count / count;
+ }
+ return ret;
+}
--
2.5.0
* [Qemu-devel] [PATCH v4 10/14] qdist: add test program
2016-04-30 3:33 [Qemu-devel] [PATCH v4 00/14] tb hash improvements Emilio G. Cota
` (8 preceding siblings ...)
2016-04-30 3:33 ` [Qemu-devel] [PATCH v4 09/14] qdist: add module to represent frequency distributions of data Emilio G. Cota
@ 2016-04-30 3:33 ` Emilio G. Cota
2016-05-04 5:23 ` Richard Henderson
2016-04-30 3:33 ` [Qemu-devel] [PATCH v4 11/14] qht: QEMU's fast, resizable and scalable Hash Table Emilio G. Cota
` (3 subsequent siblings)
13 siblings, 1 reply; 31+ messages in thread
From: Emilio G. Cota @ 2016-04-30 3:33 UTC (permalink / raw)
To: QEMU Developers, MTTCG Devel
Cc: Alex Bennée, Paolo Bonzini, Peter Crosthwaite,
Richard Henderson, Sergey Fedorov
Signed-off-by: Emilio G. Cota <cota@braap.org>
---
tests/.gitignore | 1 +
tests/Makefile | 6 +-
tests/test-qdist.c | 363 +++++++++++++++++++++++++++++++++++++++++++++++++++++
3 files changed, 369 insertions(+), 1 deletion(-)
create mode 100644 tests/test-qdist.c
diff --git a/tests/.gitignore b/tests/.gitignore
index 9eed229..384d2fd 100644
--- a/tests/.gitignore
+++ b/tests/.gitignore
@@ -47,6 +47,7 @@ test-qapi-types.[ch]
test-qapi-visit.[ch]
test-qdev-global-props
test-qemu-opts
+test-qdist
test-qga
test-qmp-commands
test-qmp-commands.h
diff --git a/tests/Makefile b/tests/Makefile
index 9194f18..d770bf8 100644
--- a/tests/Makefile
+++ b/tests/Makefile
@@ -68,6 +68,8 @@ check-unit-y += tests/rcutorture$(EXESUF)
gcov-files-rcutorture-y = util/rcu.c
check-unit-y += tests/test-rcu-list$(EXESUF)
gcov-files-test-rcu-list-y = util/rcu.c
+check-unit-y += tests/test-qdist$(EXESUF)
+gcov-files-test-qdist-y = util/qdist.c
check-unit-y += tests/test-bitops$(EXESUF)
check-unit-$(CONFIG_HAS_GLIB_SUBPROCESS_TESTS) += tests/test-qdev-global-props$(EXESUF)
check-unit-y += tests/check-qom-interface$(EXESUF)
@@ -389,7 +391,8 @@ test-obj-y = tests/check-qint.o tests/check-qstring.o tests/check-qdict.o \
tests/test-qmp-commands.o tests/test-visitor-serialization.o \
tests/test-x86-cpuid.o tests/test-mul64.o tests/test-int128.o \
tests/test-opts-visitor.o tests/test-qmp-event.o \
- tests/rcutorture.o tests/test-rcu-list.o
+ tests/rcutorture.o tests/test-rcu-list.o \
+ tests/test-qdist.o
$(test-obj-y): QEMU_INCLUDES += -Itests
QEMU_CFLAGS += -I$(SRC_PATH)/tests
@@ -427,6 +430,7 @@ tests/test-cutils$(EXESUF): tests/test-cutils.o util/cutils.o
tests/test-int128$(EXESUF): tests/test-int128.o
tests/rcutorture$(EXESUF): tests/rcutorture.o $(test-util-obj-y)
tests/test-rcu-list$(EXESUF): tests/test-rcu-list.o $(test-util-obj-y)
+tests/test-qdist$(EXESUF): tests/test-qdist.o $(test-util-obj-y)
tests/test-qdev-global-props$(EXESUF): tests/test-qdev-global-props.o \
hw/core/qdev.o hw/core/qdev-properties.o hw/core/hotplug.o\
diff --git a/tests/test-qdist.c b/tests/test-qdist.c
new file mode 100644
index 0000000..ecdd3f4
--- /dev/null
+++ b/tests/test-qdist.c
@@ -0,0 +1,363 @@
+#include "qemu/osdep.h"
+#include <glib.h>
+#include "qemu/qdist.h"
+
+#include <math.h>
+
+struct entry_desc {
+ double x;
+ unsigned long count;
+
+ /* 0 prints a space, 1-8 prints from qdist_blocks[] */
+ int fill_code;
+};
+
+/* See: https://en.wikipedia.org/wiki/Block_Elements */
+static const gunichar qdist_blocks[] = {
+ 0x2581,
+ 0x2582,
+ 0x2583,
+ 0x2584,
+ 0x2585,
+ 0x2586,
+ 0x2587,
+ 0x2588
+};
+
+#define QDIST_NR_BLOCK_CODES ARRAY_SIZE(qdist_blocks)
+
+static char *pr_hist(const struct entry_desc *darr, size_t n)
+{
+ GString *s = g_string_new("");
+ size_t i;
+
+ for (i = 0; i < n; i++) {
+ int fill = darr[i].fill_code;
+
+ if (fill) {
+ assert(fill <= QDIST_NR_BLOCK_CODES);
+ g_string_append_unichar(s, qdist_blocks[fill - 1]);
+ } else {
+ g_string_append_c(s, ' ');
+ }
+ }
+ return g_string_free(s, FALSE);
+}
+
+static void
+histogram_check(const struct qdist *dist, const struct entry_desc *darr,
+ size_t n, size_t n_bins)
+{
+ char *pr = qdist_pr_plain(dist, n_bins);
+ char *str = pr_hist(darr, n);
+
+ g_assert_cmpstr(pr, ==, str);
+ g_free(pr);
+ g_free(str);
+}
+
+static void histogram_check_single_full(const struct qdist *dist, size_t n_bins)
+{
+ struct entry_desc desc = { .fill_code = 8 };
+
+ histogram_check(dist, &desc, 1, n_bins);
+}
+
+static void
+entries_check(const struct qdist *dist, const struct entry_desc *darr, size_t n)
+{
+ size_t i;
+
+ for (i = 0; i < n; i++) {
+ struct qdist_entry *e = &dist->entries[i];
+
+ g_assert_cmpuint(e->count, ==, darr[i].count);
+ }
+}
+
+static void
+entries_insert(struct qdist *dist, const struct entry_desc *darr, size_t n)
+{
+ size_t i;
+
+ for (i = 0; i < n; i++) {
+ qdist_add(dist, darr[i].x, darr[i].count);
+ }
+}
+
+static void do_test_bin(const struct entry_desc *a, size_t n_a,
+ const struct entry_desc *b, size_t n_b)
+{
+ struct qdist qda;
+ struct qdist qdb;
+
+ qdist_init(&qda);
+
+ entries_insert(&qda, a, n_a);
+ qdist_inc(&qda, a[0].x);
+ qdist_add(&qda, a[0].x, -1);
+
+ g_assert_cmpuint(qdist_unique_entries(&qda), ==, n_a);
+ g_assert_cmpfloat(qdist_xmin(&qda), ==, a[0].x);
+ g_assert_cmpfloat(qdist_xmax(&qda), ==, a[n_a - 1].x);
+ histogram_check(&qda, a, n_a, 0);
+ histogram_check(&qda, a, n_a, n_a);
+
+ qdist_bin__internal(&qdb, &qda, n_b);
+ g_assert_cmpuint(qdb.n, ==, n_b);
+ entries_check(&qdb, b, n_b);
+ g_assert_cmpuint(qdist_sample_count(&qda), ==, qdist_sample_count(&qdb));
+ /*
+ * No histogram_check() for $qdb, since we'd rebin it and that is a bug.
+ * Instead, regenerate it from $qda.
+ */
+ histogram_check(&qda, b, n_b, n_b);
+
+ qdist_destroy(&qdb);
+ qdist_destroy(&qda);
+}
+
+static void do_test_pr(uint32_t opt)
+{
+ static const struct entry_desc desc[] = {
+ [0] = { 1, 900, 8 },
+ [1] = { 2, 1, 1 },
+ [2] = { 3, 2, 1 }
+ };
+ static const char border[] = "|";
+ const char *llabel = NULL;
+ const char *rlabel = NULL;
+ struct qdist dist;
+ GString *s;
+ char *str;
+ char *pr;
+ size_t n;
+
+ n = ARRAY_SIZE(desc);
+ qdist_init(&dist);
+
+ entries_insert(&dist, desc, n);
+ histogram_check(&dist, desc, n, 0);
+
+ s = g_string_new("");
+
+ if (opt & QDIST_PR_LABELS) {
+ unsigned int lopts = opt & (QDIST_PR_NODECIMAL |
+ QDIST_PR_PERCENT |
+ QDIST_PR_100X |
+ QDIST_PR_NOBINRANGE);
+
+ if (lopts == 0) {
+ llabel = "[1.0,1.7)";
+ rlabel = "[2.3,3.0]";
+ } else if (lopts == QDIST_PR_NODECIMAL) {
+ llabel = "[1,2)";
+ rlabel = "[2,3]";
+ } else if (lopts == (QDIST_PR_PERCENT | QDIST_PR_NODECIMAL)) {
+ llabel = "[1,2)%";
+ rlabel = "[2,3]%";
+ } else if (lopts == QDIST_PR_100X) {
+ llabel = "[100.0,166.7)";
+ rlabel = "[233.3,300.0]";
+ } else if (lopts == (QDIST_PR_NOBINRANGE | QDIST_PR_NODECIMAL)) {
+ llabel = "1";
+ rlabel = "3";
+ } else {
+ g_assert_cmpstr("BUG", ==, "This is not meant to be exhaustive");
+ }
+ }
+
+ if (llabel) {
+ g_string_append(s, llabel);
+ }
+ if (opt & QDIST_PR_BORDER) {
+ g_string_append(s, border);
+ }
+
+ str = pr_hist(desc, n);
+ g_string_append(s, str);
+ g_free(str);
+
+ if (opt & QDIST_PR_BORDER) {
+ g_string_append(s, border);
+ }
+ if (rlabel) {
+ g_string_append(s, rlabel);
+ }
+
+ str = g_string_free(s, FALSE);
+ pr = qdist_pr(&dist, n, opt);
+ g_assert_cmpstr(pr, ==, str);
+ g_free(pr);
+ g_free(str);
+
+ qdist_destroy(&dist);
+}
+
+static inline void do_test_pr_label(uint32_t opt)
+{
+ opt |= QDIST_PR_LABELS;
+ do_test_pr(opt);
+}
+
+static void test_pr(void)
+{
+ do_test_pr(0);
+
+ do_test_pr(QDIST_PR_BORDER);
+
+ /* 100X should be ignored because we're not setting LABELS */
+ do_test_pr(QDIST_PR_100X);
+
+ do_test_pr_label(0);
+ do_test_pr_label(QDIST_PR_NODECIMAL);
+ do_test_pr_label(QDIST_PR_PERCENT | QDIST_PR_NODECIMAL);
+ do_test_pr_label(QDIST_PR_100X);
+ do_test_pr_label(QDIST_PR_NOBINRANGE | QDIST_PR_NODECIMAL);
+}
+
+static void test_bin_shrink(void)
+{
+ static const struct entry_desc a[] = {
+ [0] = { 0.0, 42922, 7 },
+ [1] = { 0.25, 47834, 8 },
+ [2] = { 0.50, 26628, 0 },
+ [3] = { 0.625, 597, 4 },
+ [4] = { 0.75, 10298, 1 },
+ [5] = { 0.875, 22, 2 },
+ [6] = { 1.0, 2771, 1 }
+ };
+ static const struct entry_desc b[] = {
+ [0] = { 0.0, 42922, 7 },
+ [1] = { 0.25, 47834, 8 },
+ [2] = { 0.50, 27225, 3 },
+ [3] = { 0.75, 13091, 1 }
+ };
+
+ return do_test_bin(a, ARRAY_SIZE(a), b, ARRAY_SIZE(b));
+}
+
+static void test_bin_expand(void)
+{
+ static const struct entry_desc a[] = {
+ [0] = { 0.0, 11713, 5 },
+ [1] = { 0.25, 20294, 0 },
+ [2] = { 0.50, 17266, 8 },
+ [3] = { 0.625, 1506, 0 },
+ [4] = { 0.75, 10355, 6 },
+ [5] = { 0.833, 2, 1 },
+ [6] = { 0.875, 99, 4 },
+ [7] = { 1.0, 4301, 2 }
+ };
+ static const struct entry_desc b[] = {
+ [0] = { 0.0, 11713, 5 },
+ [1] = { 0.0, 0, 0 },
+ [2] = { 0.0, 20294, 8 },
+ [3] = { 0.0, 0, 0 },
+ [4] = { 0.0, 0, 0 },
+ [5] = { 0.0, 17266, 6 },
+ [6] = { 0.0, 1506, 1 },
+ [7] = { 0.0, 10355, 4 },
+ [8] = { 0.0, 101, 1 },
+ [9] = { 0.0, 4301, 2 }
+ };
+
+ return do_test_bin(a, ARRAY_SIZE(a), b, ARRAY_SIZE(b));
+}
+
+static void test_bin_simple(void)
+{
+ static const struct entry_desc a[] = {
+ [0] = { 10, 101, 8 },
+ [1] = { 11, 0, 0 },
+ [2] = { 12, 2, 1 }
+ };
+ static const struct entry_desc b[] = {
+ [0] = { 0, 101, 8 },
+ [1] = { 0, 0, 0 },
+ [2] = { 0, 0, 0 },
+ [3] = { 0, 0, 0 },
+ [4] = { 0, 2, 1 }
+ };
+
+ return do_test_bin(a, ARRAY_SIZE(a), b, ARRAY_SIZE(b));
+}
+
+static void test_single_full(void)
+{
+ struct qdist dist;
+
+ qdist_init(&dist);
+
+ qdist_add(&dist, 3, 102);
+ g_assert_cmpfloat(qdist_avg(&dist), ==, 3);
+ g_assert_cmpfloat(qdist_xmin(&dist), ==, 3);
+ g_assert_cmpfloat(qdist_xmax(&dist), ==, 3);
+
+ histogram_check_single_full(&dist, 0);
+ histogram_check_single_full(&dist, 1);
+ histogram_check_single_full(&dist, 10);
+
+ qdist_destroy(&dist);
+}
+
+static void test_single_empty(void)
+{
+ struct qdist dist;
+ char *pr;
+
+ qdist_init(&dist);
+
+ qdist_add(&dist, 3, 0);
+ g_assert_cmpuint(qdist_sample_count(&dist), ==, 0);
+ g_assert(isnan(qdist_avg(&dist)));
+ g_assert_cmpfloat(qdist_xmin(&dist), ==, 3);
+ g_assert_cmpfloat(qdist_xmax(&dist), ==, 3);
+
+ pr = qdist_pr_plain(&dist, 0);
+ g_assert_cmpstr(pr, ==, " ");
+ g_free(pr);
+
+ pr = qdist_pr_plain(&dist, 1);
+ g_assert_cmpstr(pr, ==, " ");
+ g_free(pr);
+
+ pr = qdist_pr_plain(&dist, 2);
+ g_assert_cmpstr(pr, ==, " ");
+ g_free(pr);
+
+ qdist_destroy(&dist);
+}
+
+static void test_none(void)
+{
+ struct qdist dist;
+ char *pr;
+
+ qdist_init(&dist);
+
+ g_assert(isnan(qdist_avg(&dist)));
+ g_assert(isnan(qdist_xmin(&dist)));
+ g_assert(isnan(qdist_xmax(&dist)));
+
+ pr = qdist_pr_plain(&dist, 0);
+ g_assert(pr == NULL);
+
+ pr = qdist_pr_plain(&dist, 2);
+ g_assert(pr == NULL);
+
+ qdist_destroy(&dist);
+}
+
+int main(int argc, char *argv[])
+{
+ g_test_init(&argc, &argv, NULL);
+ g_test_add_func("/qdist/none", test_none);
+ g_test_add_func("/qdist/single/empty", test_single_empty);
+ g_test_add_func("/qdist/single/full", test_single_full);
+ g_test_add_func("/qdist/binning/simple", test_bin_simple);
+ g_test_add_func("/qdist/binning/expand", test_bin_expand);
+ g_test_add_func("/qdist/binning/shrink", test_bin_shrink);
+ g_test_add_func("/qdist/pr", test_pr);
+ return g_test_run();
+}
--
2.5.0
* [Qemu-devel] [PATCH v4 11/14] qht: QEMU's fast, resizable and scalable Hash Table
2016-04-30 3:33 [Qemu-devel] [PATCH v4 00/14] tb hash improvements Emilio G. Cota
` (9 preceding siblings ...)
2016-04-30 3:33 ` [Qemu-devel] [PATCH v4 10/14] qdist: add test program Emilio G. Cota
@ 2016-04-30 3:33 ` Emilio G. Cota
2016-05-04 5:17 ` Richard Henderson
2016-04-30 3:33 ` [Qemu-devel] [PATCH v4 12/14] qht: add test program Emilio G. Cota
` (2 subsequent siblings)
13 siblings, 1 reply; 31+ messages in thread
From: Emilio G. Cota @ 2016-04-30 3:33 UTC (permalink / raw)
To: QEMU Developers, MTTCG Devel
Cc: Alex Bennée, Paolo Bonzini, Peter Crosthwaite,
Richard Henderson, Sergey Fedorov
This is a hash table with optional auto-resizing and MRU promotion for
reads and writes. Its implementation goal is to stay fast while
scaling for read-mostly workloads.
A hash table with these features will be necessary for the scalability
of the ongoing MTTCG work; before those changes arrive we can already
benefit from the single-threaded speedup that qht also provides.
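A usage sketch based on the API below (hedged: the comparison function
and hash are caller-provided, and writers must hold an external lock):

static bool pointer_is_equal(const void *obj, const void *userp)
{
    return obj == userp;
}

struct qht ht;
void *found;

qht_init(&ht, 1 << 10, QHT_MODE_AUTO_RESIZE); /* expect ~1024 entries */
qht_insert(&ht, p, hash); /* writer: external lock held */
found = qht_lookup(&ht, pointer_is_equal, p, hash); /* reader: lock-free */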
Signed-off-by: Emilio G. Cota <cota@braap.org>
---
include/qemu/qht.h | 67 +++++
util/Makefile.objs | 2 +-
util/qht.c | 722 +++++++++++++++++++++++++++++++++++++++++++++++++++++
3 files changed, 790 insertions(+), 1 deletion(-)
create mode 100644 include/qemu/qht.h
create mode 100644 util/qht.c
diff --git a/include/qemu/qht.h b/include/qemu/qht.h
new file mode 100644
index 0000000..8beea75
--- /dev/null
+++ b/include/qemu/qht.h
@@ -0,0 +1,67 @@
+/*
+ * Copyright (C) 2016, Emilio G. Cota <cota@braap.org>
+ *
+ * License: GNU GPL, version 2 or later.
+ * See the COPYING file in the top-level directory.
+ */
+#ifndef QEMU_QHT_H
+#define QEMU_QHT_H
+
+#include "qemu/osdep.h"
+#include "qemu-common.h"
+#include "qemu/seqlock.h"
+#include "qemu/qdist.h"
+#include "qemu/rcu.h"
+
+struct qht {
+ struct qht_map *map;
+ unsigned int mode;
+};
+
+struct qht_stats {
+ size_t head_buckets;
+ size_t used_head_buckets;
+ size_t entries;
+ struct qdist chain;
+ struct qdist occupancy;
+};
+
+typedef bool (*qht_lookup_func_t)(const void *obj, const void *userp);
+typedef void (*qht_iter_func_t)(struct qht *ht, void *p, uint32_t h, void *up);
+
+#define QHT_MODE_MRU_LOOKUP 0x1 /* move looked-up items to head */
+#define QHT_MODE_MRU_INSERT 0x2 /* insert new elements at the head */
+#define QHT_MODE_AUTO_RESIZE 0x4 /* auto-resize when heavily loaded */
+
+void qht_init(struct qht *ht, size_t n_elems, unsigned int mode);
+
+/* call only when there are no readers left */
+void qht_destroy(struct qht *ht);
+
+/* call with an external lock held */
+void qht_reset(struct qht *ht);
+
+/* call with an external lock held */
+void qht_reset_size(struct qht *ht, size_t n_elems);
+
+/* call with an external lock held */
+void qht_insert(struct qht *ht, void *p, uint32_t hash);
+
+/* call with an external lock held */
+bool qht_remove(struct qht *ht, const void *p, uint32_t hash);
+
+/* call with an external lock held */
+void qht_iter(struct qht *ht, qht_iter_func_t func, void *userp);
+
+/* call with an external lock held */
+void qht_grow(struct qht *ht);
+
+void *qht_lookup(struct qht *ht, qht_lookup_func_t func, const void *userp,
+ uint32_t hash);
+
+/* pass @stats to qht_statistics_destroy() when done */
+void qht_statistics_init(struct qht *ht, struct qht_stats *stats);
+
+void qht_statistics_destroy(struct qht_stats *stats);
+
+#endif /* QEMU_QHT_H */
diff --git a/util/Makefile.objs b/util/Makefile.objs
index 5985a2e..7952f6d 100644
--- a/util/Makefile.objs
+++ b/util/Makefile.objs
@@ -1,4 +1,4 @@
-util-obj-y = osdep.o cutils.o unicode.o qemu-timer-common.o qdist.o
+util-obj-y = osdep.o cutils.o unicode.o qemu-timer-common.o qdist.o qht.o
util-obj-$(CONFIG_POSIX) += compatfd.o
util-obj-$(CONFIG_POSIX) += event_notifier-posix.o
util-obj-$(CONFIG_POSIX) += mmap-alloc.o
diff --git a/util/qht.c b/util/qht.c
new file mode 100644
index 0000000..ef64510
--- /dev/null
+++ b/util/qht.c
@@ -0,0 +1,722 @@
+/*
+ * qht.c - QEMU Hash Table, designed to scale for read-mostly workloads.
+ *
+ * Copyright (C) 2016, Emilio G. Cota <cota@braap.org>
+ *
+ * License: GNU GPL, version 2 or later.
+ * See the COPYING file in the top-level directory.
+ *
+ * Assumptions:
+ * - Writers and iterators must take an external lock.
+ * - NULL cannot be inserted as a pointer value.
+ *
+ * Features:
+ * - Optional auto-resizing: the hash table resizes up if the load surpasses
+ * a certain threshold. Resizing is done concurrently with readers.
+ * - Optional bucket MRU promotion policy.
+ *
+ * The key structure is the bucket, which is cacheline-sized. Buckets
+ * contain a few hash values and pointers; the u32 hash values are stored in
+ * full so that resizing is fast. Having this structure instead of directly
+ * chaining items has three advantages:
+ * - Failed lookups fail fast, and touch a minimum number of cache lines.
+ * - Resizing the hash table with concurrent lookups is easy.
+ * - We can have a few Most-Recently-Used (MRU) hash-pointer pairs in the same
+ * head bucket. This helps scalability, since MRU promotions (i.e. writes to
+ * the bucket) become less common.
+ *
+ * For concurrent lookups we use a per-bucket seqlock; per-bucket spinlocks
+ * allow readers (lookups) to upgrade to writers and thus implement an MRU
+ * promotion policy; these MRU-induced writes do not touch the cache lines of
+ * other head buckets.
+ *
+ * There are two types of buckets:
+ * 1. "head" buckets are the ones allocated in the array of buckets in qht_map.
+ * 2. all "non-head" buckets (i.e. all others) are members of a chain that
+ * starts from a head bucket.
+ * Note that the seqlock and spinlock of a head bucket applies to all buckets
+ * chained to it; these two fields are unused in non-head buckets.
+ *
+ * On removals, we move the last valid item in the chain to the position of the
+ * just-removed entry. This makes lookups slightly faster, since the moment an
+ * invalid entry is found, the (failed) lookup is over.
+ *
+ * Resizing is done by taking all spinlocks (so that no readers-turned-writers
+ * can race with us) and then placing all elements into a new hash table. Last,
+ * the ht->map pointer is set, and the old map is freed once no RCU readers can
+ * see it anymore.
+ *
+ * Related Work:
+ * - Idea of cacheline-sized buckets with full hashes taken from:
+ * David, Guerraoui & Trigonakis, "Asynchronized Concurrency:
+ * The Secret to Scaling Concurrent Search Data Structures", ASPLOS'15.
+ * - Why not RCU-based hash tables? They would allow us to get rid of the
+ * seqlock, but resizing would take forever since RCU read critical
+ * sections in QEMU take quite a long time.
+ * More info on relativistic hash tables:
+ * + Triplett, McKenney & Walpole, "Resizable, Scalable, Concurrent Hash
+ * Tables via Relativistic Programming", USENIX ATC'11.
+ * + Corbet, "Relativistic hash tables, part 1: Algorithms", @ lwn.net, 2014.
+ * https://lwn.net/Articles/612021/
+ */
+#include "qemu/qht.h"
+#include "qemu/atomic.h"
+
+//#define QHT_DEBUG
+
+/*
+ * We want to avoid false sharing of cache lines. Most systems have 64-byte
+ * cache lines so we go with it for simplicity.
+ *
+ * Note that systems with smaller cache lines will be fine (the struct is
+ * almost 64 bytes); systems with larger cache lines might suffer from
+ * some false sharing.
+ */
+#define QHT_BUCKET_ALIGN 64
+
+/* define these to keep sizeof(qht_bucket) within QHT_BUCKET_ALIGN */
+#if HOST_LONG_BITS == 32
+#define QHT_BUCKET_ENTRIES 6
+#else /* 64-bit */
+#define QHT_BUCKET_ENTRIES 4
+#endif
+
+struct qht_bucket {
+ QemuSpin lock;
+ QemuSeqLock sequence;
+ uint32_t hashes[QHT_BUCKET_ENTRIES];
+ void *pointers[QHT_BUCKET_ENTRIES];
+ struct qht_bucket *next;
+} QEMU_ALIGNED(QHT_BUCKET_ALIGN);
+
+QEMU_BUILD_BUG_ON(sizeof(struct qht_bucket) > QHT_BUCKET_ALIGN);
+
+/* Note: keep @rcu as the top element to help valgrind find the whole struct */
+struct qht_map {
+ struct rcu_head rcu;
+ struct qht_bucket *buckets;
+ size_t n;
+ size_t n_items;
+ size_t n_items_threshold;
+};
+
+#ifdef QHT_DEBUG
+static void qht_bucket_debug(struct qht_bucket *b)
+{
+ bool seen_empty = false;
+ bool corrupt = false;
+ int i;
+
+ do {
+ for (i = 0; i < QHT_BUCKET_ENTRIES; i++) {
+ if (b->pointers[i] == NULL) {
+ seen_empty = true;
+ continue;
+ }
+ if (seen_empty) {
+ fprintf(stderr, "%s: b: %p, pos: %i, hash: 0x%x, p: %p\n",
+ __func__, b, i, b->hashes[i], b->pointers[i]);
+ corrupt = true;
+ }
+ }
+ b = b->next;
+ } while (b);
+ assert(!corrupt);
+}
+
+static void qht_map_debug(struct qht_map *map)
+{
+ int i;
+
+ for (i = 0; i < map->n; i++) {
+ qht_bucket_debug(&map->buckets[i]);
+ }
+}
+#else
+static inline void qht_bucket_debug(struct qht_bucket *b)
+{ }
+
+static inline void qht_map_debug(struct qht_map *map)
+{ }
+#endif /* QHT_DEBUG */
+
+static inline size_t qht_elems_to_buckets(size_t n_elems)
+{
+ /* never return 0: callers index buckets[hash & (n - 1)] */
+ return pow2ceil(MAX(1, n_elems / QHT_BUCKET_ENTRIES));
+}
+
+static inline void qht_head_init(struct qht_bucket *b)
+{
+ memset(b, 0, sizeof(*b));
+ qemu_spin_init(&b->lock);
+ seqlock_init(&b->sequence);
+}
+
+static inline
+struct qht_bucket *qht_map_to_bucket(struct qht_map *map, uint32_t hash)
+{
+ return &map->buckets[hash & (map->n - 1)];
+}
+
+static inline void qht_chain_destroy(struct qht_bucket *head)
+{
+ struct qht_bucket *curr = head->next;
+ struct qht_bucket *prev;
+
+ while (curr) {
+ prev = curr;
+ curr = curr->next;
+ qemu_vfree(prev);
+ }
+}
+
+/* pass only an orphan map */
+static void qht_map_destroy(struct qht_map *map)
+{
+ size_t i;
+
+ for (i = 0; i < map->n; i++) {
+ qht_chain_destroy(&map->buckets[i]);
+ }
+ qemu_vfree(map->buckets);
+ g_free(map);
+}
+
+static void qht_map_reclaim(struct rcu_head *rcu)
+{
+ struct qht_map *map = container_of(rcu, struct qht_map, rcu);
+
+ qht_map_destroy(map);
+}
+
+static struct qht_map *qht_map_create(size_t n)
+{
+ struct qht_map *map;
+ size_t i;
+
+ map = g_malloc(sizeof(*map));
+ map->n = n;
+ map->n_items = 0;
+ map->n_items_threshold = n * QHT_BUCKET_ENTRIES / 2;
+ map->buckets = qemu_memalign(QHT_BUCKET_ALIGN, sizeof(*map->buckets) * n);
+ for (i = 0; i < n; i++) {
+ qht_head_init(&map->buckets[i]);
+ }
+ return map;
+}
+
+static inline void qht_publish(struct qht *ht, struct qht_map *new)
+{
+ /* Readers should see a properly initialized map; pair with smp_rmb() */
+ smp_wmb();
+ atomic_set(&ht->map, new);
+}
+
+void qht_init(struct qht *ht, size_t n_elems, unsigned int mode)
+{
+ struct qht_map *map;
+ size_t n = qht_elems_to_buckets(n_elems);
+
+ map = qht_map_create(n);
+ ht->mode = mode;
+ qht_publish(ht, map);
+}
+
+/* call only when there are no readers left */
+void qht_destroy(struct qht *ht)
+{
+ qht_map_destroy(ht->map);
+ memset(ht, 0, sizeof(*ht));
+}
+
+static void qht_bucket_reset(struct qht_bucket *head)
+{
+ struct qht_bucket *b = head;
+ int i;
+
+ qemu_spin_lock(&head->lock);
+ seqlock_write_begin(&head->sequence);
+ do {
+ for (i = 0; i < QHT_BUCKET_ENTRIES; i++) {
+ if (b->pointers[i] == NULL) {
+ goto done;
+ }
+ atomic_set(&b->hashes[i], 0);
+ atomic_set(&b->pointers[i], NULL);
+ }
+ b = b->next;
+ } while (b);
+ done:
+ seqlock_write_end(&head->sequence);
+ qemu_spin_unlock(&head->lock);
+}
+
+/* call with an external lock held */
+void qht_reset(struct qht *ht)
+{
+ struct qht_map *map = ht->map;
+ size_t i;
+
+ for (i = 0; i < map->n; i++) {
+ qht_bucket_reset(&map->buckets[i]);
+ }
+ qht_map_debug(map);
+ map->n_items = 0;
+}
+
+/* call with an external lock held */
+void qht_reset_size(struct qht *ht, size_t n_elems)
+{
+ struct qht_map *old = ht->map;
+
+ qht_reset(ht);
+ if (old->n == qht_elems_to_buckets(n_elems)) {
+ return;
+ }
+ qht_init(ht, n_elems, ht->mode);
+ call_rcu1(&old->rcu, qht_map_reclaim);
+}
+
+/*
+ * Call with head->lock held.
+ *
+ * This function moves the item in @orig[@pos] to @head[0]. In order to
+ * do this, the item i in @head is moved to @head's i+1 position,
+ * with the last item being moved to @orig[@pos].
+ *
+ * Note: @orig == @head is a bug.
+ */
+static inline void qht_bucket_mru__locked(struct qht_bucket *head,
+ struct qht_bucket *orig, int pos)
+{
+ uint32_t *dest_hash;
+ void **dest_p;
+ void *p;
+ uint32_t hash;
+ int i;
+
+ assert(orig != head);
+ for (i = 0; i < QHT_BUCKET_ENTRIES; i++) {
+ if (i == QHT_BUCKET_ENTRIES - 1) {
+ dest_hash = &orig->hashes[pos];
+ dest_p = &orig->pointers[pos];
+ } else {
+ dest_hash = &head->hashes[i + 1];
+ dest_p = &head->pointers[i + 1];
+ }
+ hash = *dest_hash;
+ p = *dest_p;
+
+ atomic_set(dest_hash, head->hashes[i]);
+ atomic_set(dest_p, head->pointers[i]);
+
+ atomic_set(&head->hashes[i], hash);
+ atomic_set(&head->pointers[i], p);
+ }
+}
+
+static void qht_bucket_mru(struct qht_bucket *head, struct qht_bucket *orig,
+ const void *p, int pos)
+{
+ qemu_spin_lock(&head->lock);
+ if (unlikely(orig->pointers[pos] != p)) {
+ /* while we acquired the lock, the bucket was updated, so bail out */
+ goto out;
+ }
+ seqlock_write_begin(&head->sequence);
+ qht_bucket_mru__locked(head, orig, pos);
+ seqlock_write_end(&head->sequence);
+ out:
+ qemu_spin_unlock(&head->lock);
+}
+
+/*
+ * @far_bucket and @pos are only filled in if the match is found in a bucket
+ * that is not the head bucket.
+ */
+static inline
+void *qht_do_lookup(struct qht_bucket *head, struct qht_bucket **far_bucket,
+ int *pos, qht_lookup_func_t func, const void *userp,
+ uint32_t hash)
+{
+ struct qht_bucket *b = head;
+ int i;
+
+ do {
+ for (i = 0; i < QHT_BUCKET_ENTRIES; i++) {
+ void *p = atomic_read(&b->pointers[i]);
+
+ if (unlikely(p == NULL)) {
+ return NULL;
+ }
+ if (atomic_read(&b->hashes[i]) == hash) {
+ if (likely(func(p, userp))) {
+ if (unlikely(b != head)) {
+ *far_bucket = b;
+ *pos = i;
+ }
+ return p;
+ }
+ }
+ }
+ b = atomic_read(&b->next);
+ /*
+ * This barrier guarantees that we will read a properly initialized
+ * b->next; it is paired with an smp_wmb() before setting b->next.
+ */
+ smp_rmb();
+ } while (b);
+ return NULL;
+}
+
+void *qht_lookup(struct qht *ht, qht_lookup_func_t func, const void *userp,
+ uint32_t hash)
+{
+ struct qht_bucket *far_bucket = NULL;
+ struct qht_bucket *b;
+ struct qht_map *map;
+ uint32_t version;
+ int pos = 0;
+ void *ret;
+
+ map = atomic_read(&ht->map);
+ /* paired with smp_wmb() before setting ht->map */
+ smp_rmb();
+ b = qht_map_to_bucket(map, hash);
+
+ do {
+ version = seqlock_read_begin(&b->sequence);
+ ret = qht_do_lookup(b, &far_bucket, &pos, func, userp, hash);
+ } while (seqlock_read_retry(&b->sequence, version));
+ if ((ht->mode & QHT_MODE_MRU_LOOKUP) && unlikely(far_bucket)) {
+ qht_bucket_mru(b, far_bucket, ret, pos);
+ }
+ return ret;
+}
+
+/* call with head->lock held */
+static void qht_insert__locked(struct qht *ht, struct qht_map *map,
+ struct qht_bucket *head, void *p, uint32_t hash)
+{
+ struct qht_bucket *b = head;
+ struct qht_bucket *prev = NULL;
+ struct qht_bucket *new = NULL;
+ int i;
+
+ for (;;) {
+ if (b == NULL) {
+ b = qemu_memalign(QHT_BUCKET_ALIGN, sizeof(*b));
+ memset(b, 0, sizeof(*b));
+ new = b;
+ }
+ for (i = 0; i < QHT_BUCKET_ENTRIES; i++) {
+ if (b->pointers[i]) {
+ continue;
+ }
+ /* found an empty key: acquire the seqlock and write */
+ seqlock_write_begin(&head->sequence);
+ if (new) {
+ /*
+ * This barrier is paired with smp_rmb() after reading
+ * b->next when not holding b->lock.
+ */
+ smp_wmb();
+ atomic_set(&prev->next, b);
+ }
+ atomic_set(&b->hashes[i], hash);
+ atomic_set(&b->pointers[i], p);
+ if ((ht->mode & QHT_MODE_MRU_INSERT) && b != head) {
+ qht_bucket_mru__locked(head, b, i);
+ }
+ seqlock_write_end(&head->sequence);
+ map->n_items++;
+ return;
+ }
+ prev = b;
+ b = b->next;
+ }
+}
+
+/* call with an external lock held */
+void qht_insert(struct qht *ht, void *p, uint32_t hash)
+{
+ struct qht_map *map = ht->map;
+ struct qht_bucket *b = qht_map_to_bucket(map, hash);
+
+ /* NULL pointers are not supported */
+ assert(p);
+
+ qemu_spin_lock(&b->lock);
+ qht_insert__locked(ht, map, b, p, hash);
+ qht_bucket_debug(b);
+ qemu_spin_unlock(&b->lock);
+
+ if ((ht->mode & QHT_MODE_AUTO_RESIZE) &&
+ unlikely(map->n_items > map->n_items_threshold)) {
+ qht_grow(ht);
+ }
+}
+
+static inline bool qht_entry_is_last(struct qht_bucket *b, int pos)
+{
+ if (pos == QHT_BUCKET_ENTRIES - 1) {
+ if (b->next == NULL) {
+ return true;
+ }
+ return b->next->pointers[0] == NULL;
+ }
+ return b->pointers[pos + 1] == NULL;
+}
+
+static void
+qht_entry_move(struct qht_bucket *to, int i, struct qht_bucket *from, int j)
+{
+ assert(!(to == from && i == j));
+ assert(to->pointers[i] == NULL);
+ assert(from->pointers[j]);
+
+ atomic_set(&to->hashes[i], from->hashes[j]);
+ atomic_set(&to->pointers[i], from->pointers[j]);
+
+ atomic_set(&from->hashes[j], 0);
+ atomic_set(&from->pointers[j], NULL);
+}
+
+/*
+ * Find the last valid entry in @head, and swap it with @orig[pos], which has
+ * just been invalidated.
+ */
+static inline void qht_bucket_fill_hole(struct qht_bucket *orig, int pos)
+{
+ struct qht_bucket *b = orig;
+ struct qht_bucket *prev = NULL;
+ int i;
+
+ if (qht_entry_is_last(orig, pos)) {
+ return;
+ }
+ do {
+ for (i = 0; i < QHT_BUCKET_ENTRIES; i++) {
+ if (b->pointers[i] || (b == orig && i == pos)) {
+ continue;
+ }
+ if (i > 0) {
+ return qht_entry_move(orig, pos, b, i - 1);
+ }
+ assert(prev);
+ return qht_entry_move(orig, pos, prev, QHT_BUCKET_ENTRIES - 1);
+ }
+ prev = b;
+ b = b->next;
+ } while (b);
+ /* no free entries other than orig[pos], so swap it with the last one */
+ qht_entry_move(orig, pos, prev, QHT_BUCKET_ENTRIES - 1);
+}
+
+/* call with b->lock held */
+static inline bool qht_remove__locked(struct qht_map *map, struct qht_bucket *b,
+ const void *p, uint32_t hash)
+{
+ int i;
+
+ do {
+ for (i = 0; i < QHT_BUCKET_ENTRIES; i++) {
+ void *q = b->pointers[i];
+
+ if (unlikely(q == NULL)) {
+ return false;
+ }
+ if (q == p) {
+ assert(b->hashes[i] == hash);
+ atomic_set(&b->hashes[i], 0);
+ atomic_set(&b->pointers[i], NULL);
+ map->n_items--;
+ qht_bucket_fill_hole(b, i);
+ return true;
+ }
+ }
+ b = b->next;
+ } while (b);
+ return false;
+}
+
+/* call with an external lock held */
+bool qht_remove(struct qht *ht, const void *p, uint32_t hash)
+{
+ struct qht_map *map = ht->map;
+ struct qht_bucket *b = qht_map_to_bucket(map, hash);
+ bool ret;
+
+ qemu_spin_lock(&b->lock);
+ seqlock_write_begin(&b->sequence);
+ ret = qht_remove__locked(map, b, p, hash);
+ qht_bucket_debug(b);
+ seqlock_write_end(&b->sequence);
+ qemu_spin_unlock(&b->lock);
+ return ret;
+}
+
+/*
+ * acquire all spinlocks from a map, so that writers cannot race with
+ * readers-turned-writers.
+ */
+static void qht_lock(struct qht *ht)
+{
+ struct qht_map *map = ht->map;
+ size_t i;
+
+ /* if readers cannot upgrade, do nothing */
+ if (!(ht->mode & QHT_MODE_MRU_LOOKUP)) {
+ return;
+ }
+ for (i = 0; i < map->n; i++) {
+ struct qht_bucket *b = &map->buckets[i];
+
+ qemu_spin_lock(&b->lock);
+ }
+}
+
+static void qht_unlock(struct qht *ht)
+{
+ struct qht_map *map = ht->map;
+ size_t i;
+
+ if (!(ht->mode & QHT_MODE_MRU_LOOKUP)) {
+ return;
+ }
+ for (i = 0; i < map->n; i++) {
+ struct qht_bucket *b = &map->buckets[i];
+
+ qemu_spin_unlock(&b->lock);
+ }
+}
+
+static inline void qht_bucket_iter(struct qht *ht, struct qht_bucket *b,
+ qht_iter_func_t func, void *userp)
+{
+ int i;
+
+ do {
+ for (i = 0; i < QHT_BUCKET_ENTRIES; i++) {
+ if (b->pointers[i] == NULL) {
+ return;
+ }
+ func(ht, b->pointers[i], b->hashes[i], userp);
+ }
+ b = b->next;
+ } while (b);
+}
+
+/* external lock + all of the map's locks held (if !MRU_LOOKUP) */
+static inline void qht_map_iter__locked(struct qht *ht, struct qht_map *map,
+ qht_iter_func_t func, void *userp)
+{
+ size_t i;
+
+ for (i = 0; i < map->n; i++) {
+ qht_bucket_iter(ht, &map->buckets[i], func, userp);
+ }
+}
+
+/* call with an external lock held */
+void qht_iter(struct qht *ht, qht_iter_func_t func, void *userp)
+{
+ qht_lock(ht);
+ qht_map_iter__locked(ht, ht->map, func, userp);
+ qht_unlock(ht);
+}
+
+static void qht_map_copy(struct qht *ht, void *p, uint32_t hash, void *userp)
+{
+ struct qht_map *new = userp;
+ struct qht_bucket *b = qht_map_to_bucket(new, hash);
+
+ /* no need to acquire b->lock because no thread has seen this map yet */
+ qht_insert__locked(ht, new, b, p, hash);
+}
+
+/* call with an external lock held */
+void qht_grow(struct qht *ht)
+{
+ struct qht_map *old = ht->map;
+ struct qht_map *new;
+ size_t n = old->n * 2;
+
+ new = qht_map_create(n);
+ qht_iter(ht, qht_map_copy, new);
+ qht_map_debug(new);
+
+ qht_publish(ht, new);
+ call_rcu1(&old->rcu, qht_map_reclaim);
+}
+
+/* pass @stats to qht_statistics_destroy() when done */
+void qht_statistics_init(struct qht *ht, struct qht_stats *stats)
+{
+ struct qht_map *map;
+ int i;
+
+ map = atomic_read(&ht->map);
+ /* paired with smp_wmb() before setting ht->map */
+ smp_rmb();
+
+ /*
+ * Cannot read map->n_items because once we're done reading the map,
+ * the value might have changed. Instead, we'll count the elements we see.
+ *
+ * Reading map->n is fine though; it is constant for each map.
+ */
+ stats->head_buckets = map->n;
+ stats->used_head_buckets = 0;
+ stats->entries = 0;
+ qdist_init(&stats->chain);
+ qdist_init(&stats->occupancy);
+
+ for (i = 0; i < map->n; i++) {
+ struct qht_bucket *head = &map->buckets[i];
+ struct qht_bucket *b;
+ uint32_t version;
+ size_t buckets;
+ size_t entries;
+ int j;
+
+ do {
+ version = seqlock_read_begin(&head->sequence);
+ buckets = 0;
+ entries = 0;
+ b = head;
+ do {
+ for (j = 0; j < QHT_BUCKET_ENTRIES; j++) {
+ if (atomic_read(&b->pointers[j]) == NULL) {
+ break;
+ }
+ entries++;
+ }
+ buckets++;
+ b = atomic_read(&b->next);
+ /*
+ * This barrier guarantees that we will read a properly
+ * initialized b->next; it is paired with an smp_wmb()
+ * before setting b->next.
+ */
+ smp_rmb();
+ } while (b);
+ } while (seqlock_read_retry(&head->sequence, version));
+
+ if (entries) {
+ qdist_inc(&stats->chain, buckets);
+ qdist_inc(&stats->occupancy,
+ (double)entries / QHT_BUCKET_ENTRIES / buckets);
+ stats->used_head_buckets++;
+ stats->entries += entries;
+ } else {
+ qdist_inc(&stats->occupancy, 0);
+ }
+ }
+}
+
+void qht_statistics_destroy(struct qht_stats *stats)
+{
+ qdist_destroy(&stats->occupancy);
+ qdist_destroy(&stats->chain);
+}
--
2.5.0
^ permalink raw reply related [flat|nested] 31+ messages in thread
* [Qemu-devel] [PATCH v4 12/14] qht: add test program
2016-04-30 3:33 [Qemu-devel] [PATCH v4 00/14] tb hash improvements Emilio G. Cota
` (10 preceding siblings ...)
2016-04-30 3:33 ` [Qemu-devel] [PATCH v4 11/14] qht: QEMU's fast, resizable and scalable Hash Table Emilio G. Cota
@ 2016-04-30 3:33 ` Emilio G. Cota
2016-05-04 5:22 ` Richard Henderson
2016-04-30 3:33 ` [Qemu-devel] [PATCH v4 13/14] tb hash: track translated blocks with qht Emilio G. Cota
2016-04-30 3:33 ` [Qemu-devel] [PATCH v4 14/14] translate-all: add tb hash bucket info to 'info jit' dump Emilio G. Cota
13 siblings, 1 reply; 31+ messages in thread
From: Emilio G. Cota @ 2016-04-30 3:33 UTC (permalink / raw)
To: QEMU Developers, MTTCG Devel
Cc: Alex Bennée, Paolo Bonzini, Peter Crosthwaite,
Richard Henderson, Sergey Fedorov
Reviewed-by: Alex Bennée <alex.bennee@linaro.org>
Signed-off-by: Emilio G. Cota <cota@braap.org>
---
tests/.gitignore | 1 +
tests/Makefile | 5 +-
tests/test-qht.c | 177 +++++++++++++++++++++++++++++++++++++++++++++++++++++++
3 files changed, 182 insertions(+), 1 deletion(-)
create mode 100644 tests/test-qht.c
diff --git a/tests/.gitignore b/tests/.gitignore
index 384d2fd..c6e680d 100644
--- a/tests/.gitignore
+++ b/tests/.gitignore
@@ -49,6 +49,7 @@ test-qdev-global-props
test-qemu-opts
test-qdist
test-qga
+test-qht
test-qmp-commands
test-qmp-commands.h
test-qmp-event
diff --git a/tests/Makefile b/tests/Makefile
index d770bf8..0ab10af 100644
--- a/tests/Makefile
+++ b/tests/Makefile
@@ -70,6 +70,8 @@ check-unit-y += tests/test-rcu-list$(EXESUF)
gcov-files-test-rcu-list-y = util/rcu.c
check-unit-y += tests/test-qdist$(EXESUF)
gcov-files-test-qdist-y = util/qdist.c
+check-unit-y += tests/test-qht$(EXESUF)
+gcov-files-test-qht-y = util/qht.c
check-unit-y += tests/test-bitops$(EXESUF)
check-unit-$(CONFIG_HAS_GLIB_SUBPROCESS_TESTS) += tests/test-qdev-global-props$(EXESUF)
check-unit-y += tests/check-qom-interface$(EXESUF)
@@ -392,7 +394,7 @@ test-obj-y = tests/check-qint.o tests/check-qstring.o tests/check-qdict.o \
tests/test-x86-cpuid.o tests/test-mul64.o tests/test-int128.o \
tests/test-opts-visitor.o tests/test-qmp-event.o \
tests/rcutorture.o tests/test-rcu-list.o \
- tests/test-qdist.o
+ tests/test-qdist.o tests/test-qht.o
$(test-obj-y): QEMU_INCLUDES += -Itests
QEMU_CFLAGS += -I$(SRC_PATH)/tests
@@ -431,6 +433,7 @@ tests/test-int128$(EXESUF): tests/test-int128.o
tests/rcutorture$(EXESUF): tests/rcutorture.o $(test-util-obj-y)
tests/test-rcu-list$(EXESUF): tests/test-rcu-list.o $(test-util-obj-y)
tests/test-qdist$(EXESUF): tests/test-qdist.o $(test-util-obj-y)
+tests/test-qht$(EXESUF): tests/test-qht.o $(test-util-obj-y)
tests/test-qdev-global-props$(EXESUF): tests/test-qdev-global-props.o \
hw/core/qdev.o hw/core/qdev-properties.o hw/core/hotplug.o\
diff --git a/tests/test-qht.c b/tests/test-qht.c
new file mode 100644
index 0000000..c8866e8
--- /dev/null
+++ b/tests/test-qht.c
@@ -0,0 +1,177 @@
+#include "qemu/osdep.h"
+#include <glib.h>
+#include "qemu/qht.h"
+
+#define N 5000
+
+static struct qht ht;
+static int32_t arr[N * 2];
+
+static bool is_equal(const void *obj, const void *userp)
+{
+ const int32_t *a = obj;
+ const int32_t *b = userp;
+
+ return *a == *b;
+}
+
+static void insert(int a, int b)
+{
+ int i;
+
+ for (i = a; i < b; i++) {
+ uint32_t hash;
+
+ arr[i] = i;
+ hash = i;
+
+ qht_insert(&ht, &arr[i], hash);
+ }
+}
+
+static void rm(int init, int end)
+{
+ int i;
+
+ for (i = init; i < end; i++) {
+ uint32_t hash;
+
+ hash = arr[i];
+ g_assert_true(qht_remove(&ht, &arr[i], hash));
+ }
+}
+
+static void check(int a, int b, bool expected)
+{
+ struct qht_stats stats;
+ int i;
+
+ for (i = a; i < b; i++) {
+ void *p;
+ uint32_t hash;
+ int32_t val;
+
+ val = i;
+ hash = i;
+ p = qht_lookup(&ht, is_equal, &val, hash);
+ g_assert_true(!!p == expected);
+ }
+ qht_statistics_init(&ht, &stats);
+ if (stats.used_head_buckets) {
+ g_assert_cmpfloat(qdist_avg(&stats.chain), >=, 1.0);
+ }
+ g_assert_cmpuint(stats.head_buckets, >, 0);
+ qht_statistics_destroy(&stats);
+}
+
+static void count_func(struct qht *ht, void *p, uint32_t hash, void *userp)
+{
+ unsigned int *curr = userp;
+
+ (*curr)++;
+}
+
+static void check_n(size_t expected)
+{
+ struct qht_stats stats;
+
+ qht_statistics_init(&ht, &stats);
+ g_assert_cmpuint(stats.entries, ==, expected);
+ qht_statistics_destroy(&stats);
+}
+
+static void iter_check(unsigned int count)
+{
+ unsigned int curr = 0;
+
+ qht_iter(&ht, count_func, &curr);
+ g_assert_cmpuint(curr, ==, count);
+}
+
+static void qht_do_test(unsigned int mode, size_t init_entries)
+{
+ qht_init(&ht, init_entries, mode);
+
+ insert(0, N);
+ check(0, N, true);
+ check_n(N);
+ check(-N, -1, false);
+ iter_check(N);
+
+ rm(101, 102);
+ check_n(N - 1);
+ insert(N, N * 2);
+ check_n(N + N - 1);
+ rm(N, N * 2);
+ check_n(N - 1);
+ insert(101, 102);
+ check_n(N);
+
+ rm(10, 200);
+ check_n(N - 190);
+ insert(150, 200);
+ check_n(N - 190 + 50);
+ insert(10, 150);
+ check_n(N);
+
+ rm(1, 2);
+ check_n(N - 1);
+ qht_reset_size(&ht, 0);
+ check_n(0);
+ check(0, N, false);
+
+ qht_destroy(&ht);
+}
+
+static void qht_test(unsigned int mode)
+{
+ qht_do_test(mode, 0);
+ qht_do_test(mode, 1);
+ qht_do_test(mode, 2);
+ qht_do_test(mode, 8);
+ qht_do_test(mode, 16);
+ qht_do_test(mode, 8192);
+ qht_do_test(mode, 16384);
+}
+
+static void test_default(void)
+{
+ qht_test(0);
+}
+
+static void test_resize(void)
+{
+ qht_test(QHT_MODE_AUTO_RESIZE);
+}
+
+static void test_mru_lookup(void)
+{
+ qht_test(QHT_MODE_MRU_LOOKUP);
+}
+
+static void test_mru_all(void)
+{
+ qht_test(QHT_MODE_MRU_LOOKUP | QHT_MODE_MRU_INSERT);
+}
+
+static void test_mru_all_resize(void)
+{
+ qht_test(QHT_MODE_MRU_LOOKUP | QHT_MODE_MRU_INSERT | QHT_MODE_AUTO_RESIZE);
+}
+
+static void test_mru_insert_resize(void)
+{
+ qht_test(QHT_MODE_AUTO_RESIZE | QHT_MODE_MRU_INSERT);
+}
+
+int main(int argc, char *argv[])
+{
+ g_test_init(&argc, &argv, NULL);
+ g_test_add_func("/qht/mode/default", test_default);
+ g_test_add_func("/qht/mode/resize", test_resize);
+ g_test_add_func("/qht/mode/mru_lookup", test_mru_lookup);
+ g_test_add_func("/qht/mode/mru_all", test_mru_all);
+ g_test_add_func("/qht/mode/mru_all_with_resize", test_mru_all_resize);
+ g_test_add_func("/qht/mode/mru_insert_with_resize", test_mru_insert_resize);
+ return g_test_run();
+}
--
2.5.0
^ permalink raw reply related [flat|nested] 31+ messages in thread
* [Qemu-devel] [PATCH v4 13/14] tb hash: track translated blocks with qht
2016-04-30 3:33 [Qemu-devel] [PATCH v4 00/14] tb hash improvements Emilio G. Cota
` (11 preceding siblings ...)
2016-04-30 3:33 ` [Qemu-devel] [PATCH v4 12/14] qht: add test program Emilio G. Cota
@ 2016-04-30 3:33 ` Emilio G. Cota
2016-05-03 7:36 ` Alex Bennée
2016-05-04 5:19 ` Richard Henderson
2016-04-30 3:33 ` [Qemu-devel] [PATCH v4 14/14] translate-all: add tb hash bucket info to 'info jit' dump Emilio G. Cota
13 siblings, 2 replies; 31+ messages in thread
From: Emilio G. Cota @ 2016-04-30 3:33 UTC (permalink / raw)
To: QEMU Developers, MTTCG Devel
Cc: Alex Bennée, Paolo Bonzini, Peter Crosthwaite,
Richard Henderson, Sergey Fedorov
Having a fixed-size hash table for keeping track of all translation blocks
is suboptimal: some workloads are just too big or too small to get maximum
performance from the hash table. The MRU promotion policy helps improve
performance when the hash table is a little undersized, but it cannot
make up for severely undersized hash tables.
Furthermore, frequent MRU promotions result in writes that are a scalability
bottleneck. For scalability, lookups should only perform reads, not writes.
This is not a big deal for now, but it will become one once MTTCG matures.
The appended patch fixes these issues by using qht as the implementation of
the TB hash table. This solution is superior to the other alternatives
considered, namely:
- master: implementation in QEMU before this patchset
- xxhash: before this patch, i.e. fixed buckets + xxhash hashing + MRU.
- xxhash-rcu: fixed buckets + xxhash + RCU list + MRU.
MRU is implemented here by adding an intermediate struct
that contains the u32 hash and a pointer to the TB; this
allows us, on an MRU promotion, to copy said struct (that is not
at the head), and put this new copy at the head. After a grace
period, the original non-head struct can be eliminated, and
after another grace period, freed.
- qht-fixed-nomru: fixed buckets + xxhash + qht without auto-resize +
no MRU for lookups; MRU for inserts.
- qht-dyn-mru: dynamic number of buckets + xxhash + qht w/ auto-resize +
MRU for lookups; MRU for inserts.
The appended solution is the following:
- qht-dyn-nomru: dynamic number of buckets + xxhash + qht w/ auto-resize +
no MRU for lookups; MRU for inserts.
The plots below compare the considered solutions. The Y axis shows the
boot time (in seconds) of a debian jessie image with arm-softmmu; the X axis
sweeps the number of buckets (or initial number of buckets for qht-autoresize).
The plots in PNG format (and with errorbars) can be seen here:
http://imgur.com/a/Awgnq
Each test runs 5 times, and the entire QEMU process is pinned to a
single core for repeatability of results.
Host: Intel Xeon E5-2690
28 ++------------+-------------+-------------+-------------+------------++
A***** + + + master **A*** +
27 ++ * xxhash ##B###++
| A******A****** xxhash-rcu $$C$$$ |
26 C$$ A******A****** qht-fixed-nomru*%%D%%%++
D%%$$ A******A******A*qht-dyn-mru A*E****A
25 ++ %%$$ qht-dyn-nomru &&F&&&++
B#####% |
24 ++ #C$$$$$ ++
| B### $ |
| ## C$$$$$$ |
23 ++ # C$$$$$$ ++
| B###### C$$$$$$ %%%D
22 ++ %B###### C$$$$$$C$$$$$$C$$$$$$C$$$$$$C$$$$$$C
| D%%%%%%B###### @E@@@@@@ %%%D%%%@@@E@@@@@@E
21 E@@@@@@E@@@@@@F&&&@@@E@@@&&&D%%%%%%B######B######B######B######B######B
+ E@@@ F&&& + E@ + F&&& + +
20 ++------------+-------------+-------------+-------------+------------++
14 16 18 20 22 24
log2 number of buckets
Host: Intel i7-4790K
14.5 ++------------+------------+-------------+------------+------------++
A** + + + master **A*** +
14 ++ ** xxhash ##B###++
13.5 ++ ** xxhash-rcu $$C$$$++
| qht-fixed-nomru %%D%%% |
13 ++ A****** qht-dyn-mru @@E@@@++
| A*****A******A****** qht-dyn-nomru &&F&&& |
12.5 C$$ A******A******A*****A****** ***A
12 ++ $$ A*** ++
D%%% $$ |
11.5 ++ %% ++
B### %C$$$$$$ |
11 ++ ## D%%%%% C$$$$$ ++
| # % C$$$$$$ |
10.5 F&&&&&&B######D%%%%% C$$$$$$C$$$$$$C$$$$$$C$$$$$C$$$$$$ $$$C
10 E@@@@@@E@@@@@@B#####B######B######E@@@@@@E@@@%%%D%%%%%D%%%###B######B
+ F&& D%%%%%%B######B######B#####B###@@@D%%% +
9.5 ++------------+------------+-------------+------------+------------++
14 16 18 20 22 24
log2 number of buckets
Note that the original point before this patch series is X=15 for "master";
master's low sensitivity to the increased number of buckets is due to its
poor hashing function.
xxhash-rcu has significant overhead due to the constant churn of allocating
and deallocating intermediate structs for implementing MRU. An alternative
would be to consider failed lookups as "maybe not there", and then
acquire the external lock (tb_lock in this case) to really confirm that
there was indeed a failed lookup. This, however, would not be enough
to implement dynamic resizing--this is more complex: see
"Resizable, Scalable, Concurrent Hash Tables via Relativistic
Programming" by Triplett, McKenney and Walpole. This solution was
discarded due to the very coarse RCU read critical sections that we have
in MTTCG; resizing requires waiting for readers after every pointer update,
and resizes require many pointer updates, so this would quickly become
prohibitive.
qht-fixed-nomru shows that MRU promotion is advisable for undersized
hash tables.
However, qht-dyn-mru shows that MRU promotion is not important if the
hash table is properly sized: there is virtually no difference in
performance between qht-dyn-nomru and qht-dyn-mru.
Before this patch, we're at X=15 on "xxhash"; after this patch, we're at
X=15 @ qht-dyn-nomru. This patch thus matches the best performance that we
can achieve with optimum sizing of the hash table, while keeping the hash
table scalable for readers.
The improvement we get before and after this patch for booting debian jessie
with arm-softmmu is:
- Intel Xeon E5-2690: 10.5% less time
- Intel i7-4790K: 5.2% less time
We could get this same improvement _for this particular workload_ by
statically increasing the size of the hash table. But this would hurt
workloads that do not need a large hash table. The dynamic (upward)
resizing allows us to start small and enlarge the hash table as needed.
A quick note on downsizing: the table is resized back to 2**15 buckets
on every tb_flush; this makes sense because it is not guaranteed that the
table will reach the same number of TBs later on (e.g. most bootup code is
thrown away after boot); growing the hash table again as more code blocks
are translated is cheap. This also avoids the complication of building
downsizing hysteresis logic into qht.
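To make the resizing behaviour concrete, here is a back-of-the-envelope
sketch of how many doublings a jessie boot triggers. The constants come
from this series (2**15 initial buckets, 4 entries per bucket on 64-bit
hosts, growth once the table is half full); the TB count is the one
reported by 'info jit' in the next patch:

    #include <stdio.h>

    int main(void)
    {
        size_t n_buckets = 1UL << 15;  /* CODE_GEN_HTABLE_SIZE */
        size_t entries = 4;            /* QHT_BUCKET_ENTRIES, 64-bit host */
        size_t n_items = 715135;       /* TBs after booting debian jessie */
        int resizes = 0;

        /* qht_insert() grows the map once n_items exceeds n * entries / 2 */
        while (n_items > n_buckets * entries / 2) {
            n_buckets *= 2;
            resizes++;
        }
        printf("%d resizes -> %zu head buckets\n", resizes, n_buckets);
        return 0;
    }

This prints "4 resizes -> 524288 head buckets", which matches the
524288 head buckets shown in the 'info jit' examples of the last patch.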
Reviewed-by: Alex Bennée <alex.bennee@linaro.org>
Signed-off-by: Emilio G. Cota <cota@braap.org>
---
cpu-exec.c | 82 ++++++++++++++++++++++++-----------------------
include/exec/exec-all.h | 9 +++---
include/exec/tb-hash.h | 3 +-
translate-all.c | 85 ++++++++++++++++++++++---------------------------
4 files changed, 86 insertions(+), 93 deletions(-)
diff --git a/cpu-exec.c b/cpu-exec.c
index 395b302..daefe49 100644
--- a/cpu-exec.c
+++ b/cpu-exec.c
@@ -217,55 +217,59 @@ static void cpu_exec_nocache(CPUState *cpu, int max_cycles,
tb_free(tb);
}
+struct tb_desc {
+ target_ulong pc;
+ target_ulong cs_base;
+ uint64_t flags;
+ tb_page_addr_t phys_page1;
+ CPUArchState *env;
+};
+
+static bool tb_cmp(const void *p, const void *d)
+{
+ const TranslationBlock *tb = p;
+ const struct tb_desc *desc = d;
+
+ if (tb->pc == desc->pc &&
+ tb->page_addr[0] == desc->phys_page1 &&
+ tb->cs_base == desc->cs_base &&
+ tb->flags == desc->flags) {
+ /* check next page if needed */
+ if (tb->page_addr[1] == -1) {
+ return true;
+ } else {
+ tb_page_addr_t phys_page2;
+ target_ulong virt_page2;
+
+ virt_page2 = (desc->pc & TARGET_PAGE_MASK) + TARGET_PAGE_SIZE;
+ phys_page2 = get_page_addr_code(desc->env, virt_page2);
+ if (tb->page_addr[1] == phys_page2) {
+ return true;
+ }
+ }
+ }
+ return false;
+}
+
static TranslationBlock *tb_find_physical(CPUState *cpu,
target_ulong pc,
target_ulong cs_base,
uint32_t flags)
{
- CPUArchState *env = (CPUArchState *)cpu->env_ptr;
- TranslationBlock *tb, **ptb1;
+ tb_page_addr_t phys_pc;
+ struct tb_desc desc;
uint32_t h;
- tb_page_addr_t phys_pc, phys_page1;
- target_ulong virt_page2;
tcg_ctx.tb_ctx.tb_invalidated_flag = 0;
- /* find translated block using physical mappings */
- phys_pc = get_page_addr_code(env, pc);
- phys_page1 = phys_pc & TARGET_PAGE_MASK;
+ desc.env = (CPUArchState *)cpu->env_ptr;
+ desc.cs_base = cs_base;
+ desc.flags = flags;
+ desc.pc = pc;
+ phys_pc = get_page_addr_code(desc.env, pc);
+ desc.phys_page1 = phys_pc & TARGET_PAGE_MASK;
h = tb_hash_func(phys_pc, pc, flags);
- ptb1 = &tcg_ctx.tb_ctx.tb_phys_hash[h];
- for(;;) {
- tb = *ptb1;
- if (!tb) {
- return NULL;
- }
- if (tb->pc == pc &&
- tb->page_addr[0] == phys_page1 &&
- tb->cs_base == cs_base &&
- tb->flags == flags) {
- /* check next page if needed */
- if (tb->page_addr[1] != -1) {
- tb_page_addr_t phys_page2;
-
- virt_page2 = (pc & TARGET_PAGE_MASK) +
- TARGET_PAGE_SIZE;
- phys_page2 = get_page_addr_code(env, virt_page2);
- if (tb->page_addr[1] == phys_page2) {
- break;
- }
- } else {
- break;
- }
- }
- ptb1 = &tb->phys_hash_next;
- }
-
- /* Move the TB to the head of the list */
- *ptb1 = tb->phys_hash_next;
- tb->phys_hash_next = tcg_ctx.tb_ctx.tb_phys_hash[h];
- tcg_ctx.tb_ctx.tb_phys_hash[h] = tb;
- return tb;
+ return qht_lookup(&tcg_ctx.tb_ctx.htable, tb_cmp, &desc, h);
}
static TranslationBlock *tb_find_slow(CPUState *cpu,
diff --git a/include/exec/exec-all.h b/include/exec/exec-all.h
index c75fb3a..27c25da 100644
--- a/include/exec/exec-all.h
+++ b/include/exec/exec-all.h
@@ -21,6 +21,7 @@
#define _EXEC_ALL_H_
#include "qemu-common.h"
+#include "qemu/qht.h"
/* allow to see translation results - the slowdown should be negligible, so we leave it */
#define DEBUG_DISAS
@@ -212,8 +213,8 @@ static inline void tlb_flush_by_mmuidx(CPUState *cpu, ...)
#define CODE_GEN_ALIGN 16 /* must be >= of the size of a icache line */
-#define CODE_GEN_PHYS_HASH_BITS 15
-#define CODE_GEN_PHYS_HASH_SIZE (1 << CODE_GEN_PHYS_HASH_BITS)
+#define CODE_GEN_HTABLE_BITS 15
+#define CODE_GEN_HTABLE_SIZE (1 << CODE_GEN_HTABLE_BITS)
/* Estimated block size for TB allocation. */
/* ??? The following is based on a 2015 survey of x86_64 host output.
@@ -249,8 +250,6 @@ struct TranslationBlock {
void *tc_ptr; /* pointer to the translated code */
uint8_t *tc_search; /* pointer to search data */
- /* next matching tb for physical address. */
- struct TranslationBlock *phys_hash_next;
/* original tb when cflags has CF_NOCACHE */
struct TranslationBlock *orig_tb;
/* first and second physical page containing code. The lower bit
@@ -281,7 +280,7 @@ typedef struct TBContext TBContext;
struct TBContext {
TranslationBlock *tbs;
- TranslationBlock *tb_phys_hash[CODE_GEN_PHYS_HASH_SIZE];
+ struct qht htable;
int nb_tbs;
/* any access to the tbs or the page table must use this lock */
QemuMutex tb_lock;
diff --git a/include/exec/tb-hash.h b/include/exec/tb-hash.h
index 4b9635a..d274357 100644
--- a/include/exec/tb-hash.h
+++ b/include/exec/tb-hash.h
@@ -20,7 +20,6 @@
#ifndef EXEC_TB_HASH
#define EXEC_TB_HASH
-#include "exec/exec-all.h"
#include "exec/tb-hash-xx.h"
/* Only the bottom TB_JMP_PAGE_BITS of the jump cache hash bits vary for
@@ -49,7 +48,7 @@ static inline unsigned int tb_jmp_cache_hash_func(target_ulong pc)
static inline
uint32_t tb_hash_func(tb_page_addr_t phys_pc, target_ulong pc, int flags)
{
- return tb_hash_func5(phys_pc, pc, flags) & (CODE_GEN_PHYS_HASH_SIZE - 1);
+ return tb_hash_func5(phys_pc, pc, flags);
}
#endif
diff --git a/translate-all.c b/translate-all.c
index 0463efc..0bf76d7 100644
--- a/translate-all.c
+++ b/translate-all.c
@@ -734,6 +734,13 @@ static inline void code_gen_alloc(size_t tb_size)
qemu_mutex_init(&tcg_ctx.tb_ctx.tb_lock);
}
+static void tb_htable_init(void)
+{
+ unsigned int mode = QHT_MODE_AUTO_RESIZE;
+
+ qht_init(&tcg_ctx.tb_ctx.htable, CODE_GEN_HTABLE_SIZE, mode);
+}
+
/* Must be called before using the QEMU cpus. 'tb_size' is the size
(in bytes) allocated to the translation buffer. Zero means default
size. */
@@ -741,6 +748,7 @@ void tcg_exec_init(unsigned long tb_size)
{
cpu_gen_init();
page_init();
+ tb_htable_init();
code_gen_alloc(tb_size);
#if defined(CONFIG_SOFTMMU)
/* There's no guest base to take into account, so go ahead and
@@ -842,7 +850,7 @@ void tb_flush(CPUState *cpu)
memset(cpu->tb_jmp_cache, 0, sizeof(cpu->tb_jmp_cache));
}
- memset(tcg_ctx.tb_ctx.tb_phys_hash, 0, sizeof(tcg_ctx.tb_ctx.tb_phys_hash));
+ qht_reset_size(&tcg_ctx.tb_ctx.htable, CODE_GEN_HTABLE_SIZE);
page_flush_tb();
tcg_ctx.code_gen_ptr = tcg_ctx.code_gen_buffer;
@@ -853,60 +861,46 @@ void tb_flush(CPUState *cpu)
#ifdef DEBUG_TB_CHECK
-static void tb_invalidate_check(target_ulong address)
+static void
+do_tb_invalidate_check(struct qht *ht, void *p, uint32_t hash, void *userp)
{
- TranslationBlock *tb;
- int i;
+ TranslationBlock *tb = p;
+ target_ulong addr = *(target_ulong *)userp;
- address &= TARGET_PAGE_MASK;
- for (i = 0; i < CODE_GEN_PHYS_HASH_SIZE; i++) {
- for (tb = tcg_ctx.tb_ctx.tb_phys_hash[i]; tb != NULL;
- tb = tb->phys_hash_next) {
- if (!(address + TARGET_PAGE_SIZE <= tb->pc ||
- address >= tb->pc + tb->size)) {
- printf("ERROR invalidate: address=" TARGET_FMT_lx
- " PC=%08lx size=%04x\n",
- address, (long)tb->pc, tb->size);
- }
- }
+ if (!(addr + TARGET_PAGE_SIZE <= tb->pc || addr >= tb->pc + tb->size)) {
+ printf("ERROR invalidate: address=" TARGET_FMT_lx
+ " PC=%08lx size=%04x\n", addr, (long)tb->pc, tb->size);
}
}
-/* verify that all the pages have correct rights for code */
-static void tb_page_check(void)
+static void tb_invalidate_check(target_ulong address)
{
- TranslationBlock *tb;
- int i, flags1, flags2;
-
- for (i = 0; i < CODE_GEN_PHYS_HASH_SIZE; i++) {
- for (tb = tcg_ctx.tb_ctx.tb_phys_hash[i]; tb != NULL;
- tb = tb->phys_hash_next) {
- flags1 = page_get_flags(tb->pc);
- flags2 = page_get_flags(tb->pc + tb->size - 1);
- if ((flags1 & PAGE_WRITE) || (flags2 & PAGE_WRITE)) {
- printf("ERROR page flags: PC=%08lx size=%04x f1=%x f2=%x\n",
- (long)tb->pc, tb->size, flags1, flags2);
- }
- }
- }
+ address &= TARGET_PAGE_MASK;
+ qht_iter(&tcg_ctx.tb_ctx.htable, do_tb_invalidate_check, &address);
}
-#endif
-
-static inline void tb_hash_remove(TranslationBlock **ptb, TranslationBlock *tb)
+static void
+do_tb_page_check(struct qht *ht, void *p, uint32_t hash, void *userp)
{
- TranslationBlock *tb1;
+ TranslationBlock *tb = p;
+ int flags1, flags2;
- for (;;) {
- tb1 = *ptb;
- if (tb1 == tb) {
- *ptb = tb1->phys_hash_next;
- break;
- }
- ptb = &tb1->phys_hash_next;
+ flags1 = page_get_flags(tb->pc);
+ flags2 = page_get_flags(tb->pc + tb->size - 1);
+ if ((flags1 & PAGE_WRITE) || (flags2 & PAGE_WRITE)) {
+ printf("ERROR page flags: PC=%08lx size=%04x f1=%x f2=%x\n",
+ (long)tb->pc, tb->size, flags1, flags2);
}
}
+/* verify that all the pages have correct rights for code */
+static void tb_page_check(void)
+{
+ qht_iter(&tcg_ctx.tb_ctx.htable, do_tb_page_check, NULL);
+}
+
+#endif
+
static inline void tb_page_remove(TranslationBlock **ptb, TranslationBlock *tb)
{
TranslationBlock *tb1;
@@ -973,7 +967,7 @@ void tb_phys_invalidate(TranslationBlock *tb, tb_page_addr_t page_addr)
/* remove the TB from the hash list */
phys_pc = tb->page_addr[0] + (tb->pc & ~TARGET_PAGE_MASK);
h = tb_hash_func(phys_pc, tb->pc, tb->flags);
- tb_hash_remove(&tcg_ctx.tb_ctx.tb_phys_hash[h], tb);
+ qht_remove(&tcg_ctx.tb_ctx.htable, tb, h);
/* remove the TB from the page list */
if (tb->page_addr[0] != page_addr) {
@@ -1472,13 +1466,10 @@ static void tb_link_page(TranslationBlock *tb, tb_page_addr_t phys_pc,
tb_page_addr_t phys_page2)
{
uint32_t h;
- TranslationBlock **ptb;
/* add in the hash table */
h = tb_hash_func(phys_pc, tb->pc, tb->flags);
- ptb = &tcg_ctx.tb_ctx.tb_phys_hash[h];
- tb->phys_hash_next = *ptb;
- *ptb = tb;
+ qht_insert(&tcg_ctx.tb_ctx.htable, tb, h);
/* add in the page list */
tb_alloc_page(tb, 0, phys_pc & TARGET_PAGE_MASK);
--
2.5.0
^ permalink raw reply related [flat|nested] 31+ messages in thread
* [Qemu-devel] [PATCH v4 14/14] translate-all: add tb hash bucket info to 'info jit' dump
2016-04-30 3:33 [Qemu-devel] [PATCH v4 00/14] tb hash improvements Emilio G. Cota
` (12 preceding siblings ...)
2016-04-30 3:33 ` [Qemu-devel] [PATCH v4 13/14] tb hash: track translated blocks with qht Emilio G. Cota
@ 2016-04-30 3:33 ` Emilio G. Cota
2016-05-04 5:21 ` Richard Henderson
13 siblings, 1 reply; 31+ messages in thread
From: Emilio G. Cota @ 2016-04-30 3:33 UTC (permalink / raw)
To: QEMU Developers, MTTCG Devel
Cc: Alex Bennée, Paolo Bonzini, Peter Crosthwaite,
Richard Henderson, Sergey Fedorov
Examples:
- Good hashing, i.e. tb_hash_func5(phys_pc, pc, flags):
TB count 715135/2684354
[...]
TB hash buckets 388775/524288 (74.15% head buckets used)
TB hash occupancy 33.04% avg chain occ. Histogram: [0,10)%|▆ █ ▅▁▃▁▁|[90,100]%
TB hash avg chain 1.017 buckets. Histogram: 1|█▁▁|3
- Not-so-good hashing, i.e. tb_hash_func5(phys_pc, pc, 0):
TB count 712636/2684354
[...]
TB hash buckets 344924/524288 (65.79% head buckets used)
TB hash occupancy 31.64% avg chain occ. Histogram: [0,10)%|█ ▆ ▅▁▃▁▂|[90,100]%
TB hash avg chain 1.047 buckets. Histogram: 1|█▁▁▁|4
- Bad hashing, i.e. tb_hash_func5(phys_pc, 0, 0):
TB count 702818/2684354
[...]
TB hash buckets 112741/524288 (21.50% head buckets used)
TB hash occupancy 10.15% avg chain occ. Histogram: [0,10)%|█ ▁ ▁▁▁▁▁|[90,100]%
TB hash avg chain 2.107 buckets. Histogram: [1.0,10.2)|█▁▁▁▁▁▁▁▁▁|[83.8,93.0]
- Good hashing, but no auto-resize:
TB count 715634/2684354
TB hash buckets 8192/8192 (100.00% head buckets used)
TB hash occupancy 98.30% avg chain occ. Histogram: [95.3,95.8)%|▁▁▃▄▃▄▁▇▁█|[99.5,100.0]%
TB hash avg chain 22.070 buckets. Histogram: [15.0,16.7)|▁▂▅▄█▅▁▁▁▁|[30.3,32.0]
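The numbers above are mutually consistent; e.g. for the no-auto-resize
case, average occupancy should be roughly entries / (used head buckets *
avg chain length * entries per bucket). A quick sketch of that check
(approximate, since qht averages per head bucket rather than over all
entries):

    #include <stdio.h>

    int main(void)
    {
        double entries = 715634;   /* TB count in the last example */
        double heads   = 8192;     /* all head buckets are used */
        double chain   = 22.070;   /* avg buckets per chain */
        double slots   = 4;        /* QHT_BUCKET_ENTRIES, 64-bit host */

        printf("occupancy ~ %.1f%%\n",
               entries / (heads * chain * slots) * 100);
        return 0;
    }

This prints an occupancy of ~99.0%, in line with the reported 98.30%
average chain occupancy.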
Suggested-by: Richard Henderson <rth@twiddle.net>
Signed-off-by: Emilio G. Cota <cota@braap.org>
---
translate-all.c | 36 ++++++++++++++++++++++++++++++++++++
1 file changed, 36 insertions(+)
diff --git a/translate-all.c b/translate-all.c
index 0bf76d7..775ea79 100644
--- a/translate-all.c
+++ b/translate-all.c
@@ -1665,6 +1665,10 @@ void dump_exec_info(FILE *f, fprintf_function cpu_fprintf)
int i, target_code_size, max_target_code_size;
int direct_jmp_count, direct_jmp2_count, cross_page;
TranslationBlock *tb;
+ struct qht_stats hst;
+ uint32_t hgram_opts;
+ size_t hgram_bins;
+ char *hgram;
target_code_size = 0;
max_target_code_size = 0;
@@ -1715,6 +1719,38 @@ void dump_exec_info(FILE *f, fprintf_function cpu_fprintf)
direct_jmp2_count,
tcg_ctx.tb_ctx.nb_tbs ? (direct_jmp2_count * 100) /
tcg_ctx.tb_ctx.nb_tbs : 0);
+
+ qht_statistics_init(&tcg_ctx.tb_ctx.htable, &hst);
+
+ cpu_fprintf(f, "TB hash buckets %zu/%zu (%0.2f%% head buckets used)\n",
+ hst.used_head_buckets, hst.head_buckets,
+ (double)hst.used_head_buckets / hst.head_buckets * 100);
+
+ hgram_opts = QDIST_PR_BORDER | QDIST_PR_LABELS;
+ hgram_opts |= QDIST_PR_100X | QDIST_PR_PERCENT;
+ if (qdist_xmax(&hst.occupancy) - qdist_xmin(&hst.occupancy) == 1) {
+ hgram_opts |= QDIST_PR_NODECIMAL;
+ }
+ hgram = qdist_pr(&hst.occupancy, 10, hgram_opts);
+ cpu_fprintf(f, "TB hash occupancy %0.2f%% avg chain occ. Histogram: %s\n",
+ qdist_avg(&hst.occupancy) * 100, hgram);
+ g_free(hgram);
+
+ hgram_opts = QDIST_PR_BORDER | QDIST_PR_LABELS;
+ hgram_bins = qdist_xmax(&hst.chain) - qdist_xmin(&hst.chain);
+ if (hgram_bins > 10) {
+ hgram_bins = 10;
+ } else {
+ hgram_bins = 0;
+ hgram_opts |= QDIST_PR_NODECIMAL | QDIST_PR_NOBINRANGE;
+ }
+ hgram = qdist_pr(&hst.chain, hgram_bins, hgram_opts);
+ cpu_fprintf(f, "TB hash avg chain %0.3f buckets. Histogram: %s\n",
+ qdist_avg(&hst.chain), hgram);
+ g_free(hgram);
+
+ qht_statistics_destroy(&hst);
+
cpu_fprintf(f, "\nStatistics:\n");
cpu_fprintf(f, "TB flush count %d\n", tcg_ctx.tb_ctx.tb_flush_count);
cpu_fprintf(f, "TB invalidate count %d\n",
--
2.5.0
^ permalink raw reply related [flat|nested] 31+ messages in thread
* Re: [Qemu-devel] [PATCH v4 13/14] tb hash: track translated blocks with qht
2016-04-30 3:33 ` [Qemu-devel] [PATCH v4 13/14] tb hash: track translated blocks with qht Emilio G. Cota
@ 2016-05-03 7:36 ` Alex Bennée
2016-05-03 17:26 ` Emilio G. Cota
2016-05-04 5:19 ` Richard Henderson
1 sibling, 1 reply; 31+ messages in thread
From: Alex Bennée @ 2016-05-03 7:36 UTC (permalink / raw)
To: Emilio G. Cota
Cc: QEMU Developers, MTTCG Devel, Paolo Bonzini, Peter Crosthwaite,
Richard Henderson, Sergey Fedorov
Emilio G. Cota <cota@braap.org> writes:
> Having a fixed-size hash table for keeping track of all translation blocks
> is suboptimal: some workloads are just too big or too small to get maximum
> performance from the hash table. The MRU promotion policy helps improve
> performance when the hash table is a little undersized, but it cannot
> make up for severely undersized hash tables.
>
> Furthermore, frequent MRU promotions result in writes that are a scalability
> bottleneck. For scalability, lookups should only perform reads, not writes.
> This is not a big deal for now, but it will become one once MTTCG matures.
>
> The appended patch fixes these issues by using qht as the implementation of
> the TB hash table. This solution is superior to the other alternatives
> considered, namely:
>
> - master: implementation in QEMU before this patchset
> - xxhash: before this patch, i.e. fixed buckets + xxhash hashing + MRU.
> - xxhash-rcu: fixed buckets + xxhash + RCU list + MRU.
> MRU is implemented here by adding an intermediate struct
> that contains the u32 hash and a pointer to the TB; this
> allows us, on an MRU promotion, to copy said struct (that is not
> at the head), and put this new copy at the head. After a grace
> period, the original non-head struct can be eliminated, and
> after another grace period, freed.
> - qht-fixed-nomru: fixed buckets + xxhash + qht without auto-resize +
> no MRU for lookups; MRU for inserts.
> - qht-dyn-mru: dynamic number of buckets + xxhash + qht w/ auto-resize +
> MRU for lookups; MRU for inserts.
> The appended solution is the following:
> - qht-dyn-nomru: dynamic number of buckets + xxhash + qht w/ auto-resize +
> no MRU for lookups; MRU for inserts.
>
> The plots below compare the considered solutions. The Y axis shows the
> boot time (in seconds) of a debian jessie image with arm-softmmu; the X axis
> sweeps the number of buckets (or initial number of buckets for qht-autoresize).
> The plots in PNG format (and with errorbars) can be seen here:
> http://imgur.com/a/Awgnq
>
> Each test runs 5 times, and the entire QEMU process is pinned to a
> single core for repeatability of results.
>
> Host: Intel Xeon E5-2690
>
> 28 ++------------+-------------+-------------+-------------+------------++
> A***** + + + master **A*** +
> 27 ++ * xxhash ##B###++
> | A******A****** xxhash-rcu $$C$$$ |
> 26 C$$ A******A****** qht-fixed-nomru*%%D%%%++
> D%%$$ A******A******A*qht-dyn-mru A*E****A
> 25 ++ %%$$ qht-dyn-nomru &&F&&&++
> B#####% |
> 24 ++ #C$$$$$ ++
> | B### $ |
> | ## C$$$$$$ |
> 23 ++ # C$$$$$$ ++
> | B###### C$$$$$$ %%%D
> 22 ++ %B###### C$$$$$$C$$$$$$C$$$$$$C$$$$$$C$$$$$$C
> | D%%%%%%B###### @E@@@@@@ %%%D%%%@@@E@@@@@@E
> 21 E@@@@@@E@@@@@@F&&&@@@E@@@&&&D%%%%%%B######B######B######B######B######B
> + E@@@ F&&& + E@ + F&&& + +
> 20 ++------------+-------------+-------------+-------------+------------++
> 14 16 18 20 22 24
> log2 number of buckets
>
> Host: Intel i7-4790K
>
> 14.5 ++------------+------------+-------------+------------+------------++
> A** + + + master **A*** +
> 14 ++ ** xxhash ##B###++
> 13.5 ++ ** xxhash-rcu $$C$$$++
> | qht-fixed-nomru %%D%%% |
> 13 ++ A****** qht-dyn-mru @@E@@@++
> | A*****A******A****** qht-dyn-nomru &&F&&& |
> 12.5 C$$ A******A******A*****A****** ***A
> 12 ++ $$ A*** ++
> D%%% $$ |
> 11.5 ++ %% ++
> B### %C$$$$$$ |
> 11 ++ ## D%%%%% C$$$$$ ++
> | # % C$$$$$$ |
> 10.5 F&&&&&&B######D%%%%% C$$$$$$C$$$$$$C$$$$$$C$$$$$C$$$$$$ $$$C
> 10 E@@@@@@E@@@@@@B#####B######B######E@@@@@@E@@@%%%D%%%%%D%%%###B######B
> + F&& D%%%%%%B######B######B#####B###@@@D%%% +
> 9.5 ++------------+------------+-------------+------------+------------++
> 14 16 18 20 22 24
> log2 number of buckets
>
> Note that the original point before this patch series is X=15 for "master";
> master's low sensitivity to the increased number of buckets is due to its
> poor hashing function.
>
> xxhash-rcu has significant overhead due to the constant churn of allocating
> and deallocating intermediate structs for implementing MRU. An alternative
> would be to consider failed lookups as "maybe not there", and then
> acquire the external lock (tb_lock in this case) to really confirm that
> there was indeed a failed lookup. This, however, would not be enough
> to implement dynamic resizing--this is more complex: see
> "Resizable, Scalable, Concurrent Hash Tables via Relativistic
> Programming" by Triplett, McKenney and Walpole. This solution was
> discarded due to the very coarse RCU read critical sections that we have
> in MTTCG; resizing requires waiting for readers after every pointer update,
> and resizes require many pointer updates, so this would quickly become
> prohibitive.
>
> qht-fixed-nomru shows that MRU promotion is advisable for undersized
> hash tables.
>
> However, qht-dyn-mru shows that MRU promotion is not important if the
> hash table is properly sized: there is virtually no difference in
> performance between qht-dyn-nomru and qht-dyn-mru.
>
> Before this patch, we're at X=15 on "xxhash"; after this patch, we're at
> X=15 @ qht-dyn-nomru. This patch thus matches the best performance that we
> can achieve with optimum sizing of the hash table, while keeping the hash
> table scalable for readers.
>
> The improvement we get before and after this patch for booting debian jessie
> with arm-softmmu is:
>
> - Intel Xeon E5-2690: 10.5% less time
> - Intel i7-4790K: 5.2% less time
>
> We could get this same improvement _for this particular workload_ by
> statically increasing the size of the hash table. But this would hurt
> workloads that do not need a large hash table. The dynamic (upward)
> resizing allows us to start small and enlarge the hash table as needed.
>
> A quick note on downsizing: the table is resized back to 2**15 buckets
> on every tb_flush; this makes sense because it is not guaranteed that the
> table will reach the same number of TBs later on (e.g. most bootup code is
> thrown away after boot); growing the hash table again as more code blocks
> are translated is cheap. This also avoids the complication of building
> downsizing hysteresis logic into qht.
>
> Reviewed-by: Alex Bennée <alex.bennee@linaro.org>
> Signed-off-by: Emilio G. Cota <cota@braap.org>
> ---
> cpu-exec.c | 82 ++++++++++++++++++++++++-----------------------
> include/exec/exec-all.h | 9 +++---
> include/exec/tb-hash.h | 3 +-
> translate-all.c | 85 ++++++++++++++++++++++---------------------------
> 4 files changed, 86 insertions(+), 93 deletions(-)
>
> diff --git a/cpu-exec.c b/cpu-exec.c
> index 395b302..daefe49 100644
> --- a/cpu-exec.c
> +++ b/cpu-exec.c
> @@ -217,55 +217,59 @@ static void cpu_exec_nocache(CPUState *cpu, int max_cycles,
> tb_free(tb);
> }
>
> +struct tb_desc {
> + target_ulong pc;
> + target_ulong cs_base;
> + uint64_t flags;
> + tb_page_addr_t phys_page1;
> + CPUArchState *env;
> +};
> +
> +static bool tb_cmp(const void *p, const void *d)
> +{
> + const TranslationBlock *tb = p;
> + const struct tb_desc *desc = d;
> +
> + if (tb->pc == desc->pc &&
> + tb->page_addr[0] == desc->phys_page1 &&
> + tb->cs_base == desc->cs_base &&
> + tb->flags == desc->flags) {
> + /* check next page if needed */
> + if (tb->page_addr[1] == -1) {
> + return true;
> + } else {
> + tb_page_addr_t phys_page2;
> + target_ulong virt_page2;
> +
> + virt_page2 = (desc->pc & TARGET_PAGE_MASK) + TARGET_PAGE_SIZE;
> + phys_page2 = get_page_addr_code(desc->env, virt_page2);
> + if (tb->page_addr[1] == phys_page2) {
> + return true;
> + }
> + }
> + }
> + return false;
> +}
> +
> static TranslationBlock *tb_find_physical(CPUState *cpu,
> target_ulong pc,
> target_ulong cs_base,
> uint32_t flags)
This does not apply cleanly to master due to the flag change. You should
either reference the clean-up patch or include it in the series.
> {
> - CPUArchState *env = (CPUArchState *)cpu->env_ptr;
> - TranslationBlock *tb, **ptb1;
> + tb_page_addr_t phys_pc;
> + struct tb_desc desc;
> uint32_t h;
> - tb_page_addr_t phys_pc, phys_page1;
> - target_ulong virt_page2;
>
> tcg_ctx.tb_ctx.tb_invalidated_flag = 0;
>
> - /* find translated block using physical mappings */
> - phys_pc = get_page_addr_code(env, pc);
> - phys_page1 = phys_pc & TARGET_PAGE_MASK;
> + desc.env = (CPUArchState *)cpu->env_ptr;
> + desc.cs_base = cs_base;
> + desc.flags = flags;
> + desc.pc = pc;
> + phys_pc = get_page_addr_code(desc.env, pc);
> + desc.phys_page1 = phys_pc & TARGET_PAGE_MASK;
> h = tb_hash_func(phys_pc, pc, flags);
> - ptb1 = &tcg_ctx.tb_ctx.tb_phys_hash[h];
> - for(;;) {
> - tb = *ptb1;
> - if (!tb) {
> - return NULL;
> - }
> - if (tb->pc == pc &&
> - tb->page_addr[0] == phys_page1 &&
> - tb->cs_base == cs_base &&
> - tb->flags == flags) {
> - /* check next page if needed */
> - if (tb->page_addr[1] != -1) {
> - tb_page_addr_t phys_page2;
> -
> - virt_page2 = (pc & TARGET_PAGE_MASK) +
> - TARGET_PAGE_SIZE;
> - phys_page2 = get_page_addr_code(env, virt_page2);
> - if (tb->page_addr[1] == phys_page2) {
> - break;
> - }
> - } else {
> - break;
> - }
> - }
> - ptb1 = &tb->phys_hash_next;
> - }
> -
> - /* Move the TB to the head of the list */
> - *ptb1 = tb->phys_hash_next;
> - tb->phys_hash_next = tcg_ctx.tb_ctx.tb_phys_hash[h];
> - tcg_ctx.tb_ctx.tb_phys_hash[h] = tb;
> - return tb;
> + return qht_lookup(&tcg_ctx.tb_ctx.htable, tb_cmp, &desc, h);
> }
>
> static TranslationBlock *tb_find_slow(CPUState *cpu,
> diff --git a/include/exec/exec-all.h b/include/exec/exec-all.h
> index c75fb3a..27c25da 100644
> --- a/include/exec/exec-all.h
> +++ b/include/exec/exec-all.h
> @@ -21,6 +21,7 @@
> #define _EXEC_ALL_H_
>
> #include "qemu-common.h"
> +#include "qemu/qht.h"
>
> /* allow to see translation results - the slowdown should be negligible, so we leave it */
> #define DEBUG_DISAS
> @@ -212,8 +213,8 @@ static inline void tlb_flush_by_mmuidx(CPUState *cpu, ...)
>
> #define CODE_GEN_ALIGN 16 /* must be >= of the size of a icache line */
>
> -#define CODE_GEN_PHYS_HASH_BITS 15
> -#define CODE_GEN_PHYS_HASH_SIZE (1 << CODE_GEN_PHYS_HASH_BITS)
> +#define CODE_GEN_HTABLE_BITS 15
> +#define CODE_GEN_HTABLE_SIZE (1 << CODE_GEN_HTABLE_BITS)
>
> /* Estimated block size for TB allocation. */
> /* ??? The following is based on a 2015 survey of x86_64 host output.
> @@ -249,8 +250,6 @@ struct TranslationBlock {
>
> void *tc_ptr; /* pointer to the translated code */
> uint8_t *tc_search; /* pointer to search data */
> - /* next matching tb for physical address. */
> - struct TranslationBlock *phys_hash_next;
> /* original tb when cflags has CF_NOCACHE */
> struct TranslationBlock *orig_tb;
> /* first and second physical page containing code. The lower bit
> @@ -281,7 +280,7 @@ typedef struct TBContext TBContext;
> struct TBContext {
>
> TranslationBlock *tbs;
> - TranslationBlock *tb_phys_hash[CODE_GEN_PHYS_HASH_SIZE];
> + struct qht htable;
> int nb_tbs;
> /* any access to the tbs or the page table must use this lock */
> QemuMutex tb_lock;
> diff --git a/include/exec/tb-hash.h b/include/exec/tb-hash.h
> index 4b9635a..d274357 100644
> --- a/include/exec/tb-hash.h
> +++ b/include/exec/tb-hash.h
> @@ -20,7 +20,6 @@
> #ifndef EXEC_TB_HASH
> #define EXEC_TB_HASH
>
> -#include "exec/exec-all.h"
> #include "exec/tb-hash-xx.h"
>
> /* Only the bottom TB_JMP_PAGE_BITS of the jump cache hash bits vary for
> @@ -49,7 +48,7 @@ static inline unsigned int tb_jmp_cache_hash_func(target_ulong pc)
> static inline
> uint32_t tb_hash_func(tb_page_addr_t phys_pc, target_ulong pc, int flags)
> {
> - return tb_hash_func5(phys_pc, pc, flags) & (CODE_GEN_PHYS_HASH_SIZE - 1);
> + return tb_hash_func5(phys_pc, pc, flags);
> }
>
> #endif
> diff --git a/translate-all.c b/translate-all.c
> index 0463efc..0bf76d7 100644
> --- a/translate-all.c
> +++ b/translate-all.c
> @@ -734,6 +734,13 @@ static inline void code_gen_alloc(size_t tb_size)
> qemu_mutex_init(&tcg_ctx.tb_ctx.tb_lock);
> }
>
> +static void tb_htable_init(void)
> +{
> + unsigned int mode = QHT_MODE_AUTO_RESIZE;
> +
> + qht_init(&tcg_ctx.tb_ctx.htable, CODE_GEN_HTABLE_SIZE, mode);
> +}
> +
> /* Must be called before using the QEMU cpus. 'tb_size' is the size
> (in bytes) allocated to the translation buffer. Zero means default
> size. */
> @@ -741,6 +748,7 @@ void tcg_exec_init(unsigned long tb_size)
> {
> cpu_gen_init();
> page_init();
> + tb_htable_init();
> code_gen_alloc(tb_size);
> #if defined(CONFIG_SOFTMMU)
> /* There's no guest base to take into account, so go ahead and
> @@ -842,7 +850,7 @@ void tb_flush(CPUState *cpu)
> memset(cpu->tb_jmp_cache, 0, sizeof(cpu->tb_jmp_cache));
> }
>
> - memset(tcg_ctx.tb_ctx.tb_phys_hash, 0, sizeof(tcg_ctx.tb_ctx.tb_phys_hash));
> + qht_reset_size(&tcg_ctx.tb_ctx.htable, CODE_GEN_HTABLE_SIZE);
> page_flush_tb();
>
> tcg_ctx.code_gen_ptr = tcg_ctx.code_gen_buffer;
> @@ -853,60 +861,46 @@ void tb_flush(CPUState *cpu)
>
> #ifdef DEBUG_TB_CHECK
>
> -static void tb_invalidate_check(target_ulong address)
> +static void
> +do_tb_invalidate_check(struct qht *ht, void *p, uint32_t hash, void *userp)
> {
> - TranslationBlock *tb;
> - int i;
> + TranslationBlock *tb = p;
> + target_ulong addr = *(target_ulong *)userp;
>
> - address &= TARGET_PAGE_MASK;
> - for (i = 0; i < CODE_GEN_PHYS_HASH_SIZE; i++) {
> - for (tb = tcg_ctx.tb_ctx.tb_phys_hash[i]; tb != NULL;
> - tb = tb->phys_hash_next) {
> - if (!(address + TARGET_PAGE_SIZE <= tb->pc ||
> - address >= tb->pc + tb->size)) {
> - printf("ERROR invalidate: address=" TARGET_FMT_lx
> - " PC=%08lx size=%04x\n",
> - address, (long)tb->pc, tb->size);
> - }
> - }
> + if (!(addr + TARGET_PAGE_SIZE <= tb->pc || addr >= tb->pc + tb->size)) {
> + printf("ERROR invalidate: address=" TARGET_FMT_lx
> + " PC=%08lx size=%04x\n", addr, (long)tb->pc, tb->size);
> }
> }
>
> -/* verify that all the pages have correct rights for code */
> -static void tb_page_check(void)
> +static void tb_invalidate_check(target_ulong address)
> {
> - TranslationBlock *tb;
> - int i, flags1, flags2;
> -
> - for (i = 0; i < CODE_GEN_PHYS_HASH_SIZE; i++) {
> - for (tb = tcg_ctx.tb_ctx.tb_phys_hash[i]; tb != NULL;
> - tb = tb->phys_hash_next) {
> - flags1 = page_get_flags(tb->pc);
> - flags2 = page_get_flags(tb->pc + tb->size - 1);
> - if ((flags1 & PAGE_WRITE) || (flags2 & PAGE_WRITE)) {
> - printf("ERROR page flags: PC=%08lx size=%04x f1=%x f2=%x\n",
> - (long)tb->pc, tb->size, flags1, flags2);
> - }
> - }
> - }
> + address &= TARGET_PAGE_MASK;
> + qht_iter(&tcg_ctx.tb_ctx.htable, do_tb_invalidate_check, &address);
> }
>
> -#endif
> -
> -static inline void tb_hash_remove(TranslationBlock **ptb, TranslationBlock *tb)
> +static void
> +do_tb_page_check(struct qht *ht, void *p, uint32_t hash, void *userp)
> {
> - TranslationBlock *tb1;
> + TranslationBlock *tb = p;
> + int flags1, flags2;
>
> - for (;;) {
> - tb1 = *ptb;
> - if (tb1 == tb) {
> - *ptb = tb1->phys_hash_next;
> - break;
> - }
> - ptb = &tb1->phys_hash_next;
> + flags1 = page_get_flags(tb->pc);
> + flags2 = page_get_flags(tb->pc + tb->size - 1);
> + if ((flags1 & PAGE_WRITE) || (flags2 & PAGE_WRITE)) {
> + printf("ERROR page flags: PC=%08lx size=%04x f1=%x f2=%x\n",
> + (long)tb->pc, tb->size, flags1, flags2);
> }
> }
>
> +/* verify that all the pages have correct rights for code */
> +static void tb_page_check(void)
> +{
> + qht_iter(&tcg_ctx.tb_ctx.htable, do_tb_page_check, NULL);
> +}
> +
> +#endif
> +
> static inline void tb_page_remove(TranslationBlock **ptb, TranslationBlock *tb)
> {
> TranslationBlock *tb1;
> @@ -973,7 +967,7 @@ void tb_phys_invalidate(TranslationBlock *tb, tb_page_addr_t page_addr)
> /* remove the TB from the hash list */
> phys_pc = tb->page_addr[0] + (tb->pc & ~TARGET_PAGE_MASK);
> h = tb_hash_func(phys_pc, tb->pc, tb->flags);
> - tb_hash_remove(&tcg_ctx.tb_ctx.tb_phys_hash[h], tb);
> + qht_remove(&tcg_ctx.tb_ctx.htable, tb, h);
>
> /* remove the TB from the page list */
> if (tb->page_addr[0] != page_addr) {
> @@ -1472,13 +1466,10 @@ static void tb_link_page(TranslationBlock *tb, tb_page_addr_t phys_pc,
> tb_page_addr_t phys_page2)
> {
> uint32_t h;
> - TranslationBlock **ptb;
>
> /* add in the hash table */
> h = tb_hash_func(phys_pc, tb->pc, tb->flags);
> - ptb = &tcg_ctx.tb_ctx.tb_phys_hash[h];
> - tb->phys_hash_next = *ptb;
> - *ptb = tb;
> + qht_insert(&tcg_ctx.tb_ctx.htable, tb, h);
>
> /* add in the page list */
> tb_alloc_page(tb, 0, phys_pc & TARGET_PAGE_MASK);
--
Alex Bennée
* Re: [Qemu-devel] [PATCH v4 13/14] tb hash: track translated blocks with qht
2016-05-03 7:36 ` Alex Bennée
@ 2016-05-03 17:26 ` Emilio G. Cota
2016-05-04 5:24 ` Richard Henderson
0 siblings, 1 reply; 31+ messages in thread
From: Emilio G. Cota @ 2016-05-03 17:26 UTC (permalink / raw)
To: Alex Bennée
Cc: QEMU Developers, MTTCG Devel, Paolo Bonzini, Peter Crosthwaite,
Richard Henderson, Sergey Fedorov
On Tue, May 03, 2016 at 08:36:35 +0100, Alex Bennée wrote:
> Emilio G. Cota <cota@braap.org> writes:
> > static TranslationBlock *tb_find_physical(CPUState *cpu,
> > target_ulong pc,
> > target_ulong cs_base,
> > uint32_t flags)
>
> This does not apply cleanly to master due to the flag change. You should
> either reference the clean-up patch or include it in the series.
Ouch, sorry. Won't happen again.
Grab the missing pre-requisite patch from:
http://patchwork.ozlabs.org/patch/607662/mbox/
Thanks,
Emilio
* Re: [Qemu-devel] [PATCH v4 05/14] atomics: add atomic_test_and_set
2016-04-30 3:33 ` [Qemu-devel] [PATCH v4 05/14] atomics: add atomic_test_and_set Emilio G. Cota
@ 2016-05-04 5:10 ` Richard Henderson
0 siblings, 0 replies; 31+ messages in thread
From: Richard Henderson @ 2016-05-04 5:10 UTC (permalink / raw)
To: Emilio G. Cota, QEMU Developers, MTTCG Devel
Cc: Alex Bennée, Paolo Bonzini, Peter Crosthwaite,
Sergey Fedorov
On 04/29/2016 05:33 PM, Emilio G. Cota wrote:
> This new helper expands to __atomic_test_and_set where available;
> otherwise it expands to atomic_xchg.
>
> Signed-off-by: Emilio G. Cota <cota@braap.org>
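For archive readers, the fallback pattern described above amounts to
roughly the following sketch; the feature-test guard and exact macro
spelling here are illustrative, not the patch's actual code:

    /* __ATOMIC_SEQ_CST is predefined whenever the __atomic builtins
     * are available (gcc >= 4.7, clang) -- illustrative guard only */
    #ifdef __ATOMIC_SEQ_CST
    #define atomic_test_and_set(ptr) \
        __atomic_test_and_set(ptr, __ATOMIC_SEQ_CST)
    #else
    /* atomic_xchg returns the previous value, so the semantics match:
     * true means the flag was already set */
    #define atomic_test_and_set(ptr) ((bool)atomic_xchg(ptr, true))
    #endif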
Reviewed-by: Richard Henderson <rth@twiddle.net>
r~
* Re: [Qemu-devel] [PATCH v4 06/14] qemu-thread: add simple test-and-set spinlock
2016-04-30 3:33 ` [Qemu-devel] [PATCH v4 06/14] qemu-thread: add simple test-and-set spinlock Emilio G. Cota
@ 2016-05-04 5:10 ` Richard Henderson
0 siblings, 0 replies; 31+ messages in thread
From: Richard Henderson @ 2016-05-04 5:10 UTC (permalink / raw)
To: Emilio G. Cota, QEMU Developers, MTTCG Devel
Cc: Alex Bennée, Paolo Bonzini, Peter Crosthwaite,
Sergey Fedorov
On 04/29/2016 05:33 PM, Emilio G. Cota wrote:
> From: Guillaume Delbergue <guillaume.delbergue@greensocs.com>
>
> Signed-off-by: Guillaume Delbergue <guillaume.delbergue@greensocs.com>
> [Rewritten. - Paolo]
> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
> [Emilio's additions: use atomic_test_and_set instead of atomic_xchg;
> call cpu_relax() while spinning; optimize for uncontended locks by
> acquiring the lock with TAS instead of TATAS.]
> Signed-off-by: Emilio G. Cota <cota@braap.org>
> ---
> include/qemu/thread.h | 34 ++++++++++++++++++++++++++++++++++
> 1 file changed, 34 insertions(+)
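The TAS-vs-TATAS note above is the interesting part: a test-and-set
(TAS) lock goes straight to the atomic operation, which is the cheap
path when the lock is uncontended, whereas test-and-test-and-set
(TATAS) first spins on a plain load before attempting the atomic op.
A minimal sketch of the scheme as described, written against the raw
gcc builtins rather than the patch's actual helpers (cpu_relax() is
the helper added earlier in this series):

    #include <stdbool.h>

    typedef struct QemuSpin {
        bool value;
    } QemuSpin;

    static inline void qemu_spin_init(QemuSpin *spin)
    {
        spin->value = false;
    }

    static inline void qemu_spin_lock(QemuSpin *spin)
    {
        /* TAS, not TATAS: the first attempt is the atomic op itself,
         * so an uncontended lock costs a single atomic instruction */
        while (__atomic_test_and_set(&spin->value, __ATOMIC_ACQUIRE)) {
            cpu_relax(); /* e.g. the PAUSE instruction on x86 */
        }
    }

    static inline void qemu_spin_unlock(QemuSpin *spin)
    {
        __atomic_clear(&spin->value, __ATOMIC_RELEASE);
    }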
Reviewed-by: Richard Henderson <rth@twiddle.net>
r~
* Re: [Qemu-devel] [PATCH v4 09/14] qdist: add module to represent frequency distributions of data
2016-04-30 3:33 ` [Qemu-devel] [PATCH v4 09/14] qdist: add module to represent frequency distributions of data Emilio G. Cota
@ 2016-05-04 5:13 ` Richard Henderson
0 siblings, 0 replies; 31+ messages in thread
From: Richard Henderson @ 2016-05-04 5:13 UTC (permalink / raw)
To: Emilio G. Cota, QEMU Developers, MTTCG Devel
Cc: Alex Bennée, Paolo Bonzini, Peter Crosthwaite,
Sergey Fedorov
On 04/29/2016 05:33 PM, Emilio G. Cota wrote:
> Signed-off-by: Emilio G. Cota <cota@braap.org>
> ---
> include/qemu/qdist.h | 62 +++++++++
> util/Makefile.objs | 2 +-
> util/qdist.c | 386 +++++++++++++++++++++++++++++++++++++++++++++++++++
> 3 files changed, 449 insertions(+), 1 deletion(-)
> create mode 100644 include/qemu/qdist.h
> create mode 100644 util/qdist.c
Reviewed-by: Richard Henderson <rth@twiddle.net>
r~
* Re: [Qemu-devel] [PATCH v4 11/14] qht: QEMU's fast, resizable and scalable Hash Table
2016-04-30 3:33 ` [Qemu-devel] [PATCH v4 11/14] qht: QEMU's fast, resizable and scalable Hash Table Emilio G. Cota
@ 2016-05-04 5:17 ` Richard Henderson
0 siblings, 0 replies; 31+ messages in thread
From: Richard Henderson @ 2016-05-04 5:17 UTC (permalink / raw)
To: Emilio G. Cota, QEMU Developers, MTTCG Devel
Cc: Alex Bennée, Paolo Bonzini, Peter Crosthwaite,
Sergey Fedorov
On 04/29/2016 05:33 PM, Emilio G. Cota wrote:
> This is a hash table with optional auto-resizing and MRU promotion for
> reads and writes. Its implementation goal is to stay fast while
> scaling for read-mostly workloads.
>
> A hash table with these features will be necessary for the scalability
> of the ongoing MTTCG work; before those changes arrive we can already
> benefit from the single-threaded speedup that qht also provides.
>
> Signed-off-by: Emilio G. Cota <cota@braap.org>
> ---
> include/qemu/qht.h | 67 +++++
> util/Makefile.objs | 2 +-
> util/qht.c | 722 +++++++++++++++++++++++++++++++++++++++++++++++++++++
> 3 files changed, 790 insertions(+), 1 deletion(-)
> create mode 100644 include/qemu/qht.h
> create mode 100644 util/qht.c
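The commit message does not show the API itself, but the call
signatures can be inferred from the tb-hash conversion in patch 13
(qht_init/qht_insert/qht_lookup/qht_remove). A usage sketch under that
assumption -- the descriptor type and compare callback below are made
up for illustration:

    struct tb_desc {            /* hypothetical lookup descriptor */
        target_ulong pc;
        uint32_t flags;
    };

    /* called by qht_lookup() for each hash match; 'userp' is the
     * descriptor the caller passed in */
    static bool tb_cmp_sketch(const void *obj, const void *userp)
    {
        const TranslationBlock *tb = obj;
        const struct tb_desc *desc = userp;

        return tb->pc == desc->pc && tb->flags == desc->flags;
    }

    static TranslationBlock *demo(struct qht *ht, TranslationBlock *tb,
                                  tb_page_addr_t phys_pc)
    {
        struct tb_desc desc = { .pc = tb->pc, .flags = tb->flags };
        uint32_t h = tb_hash_func(phys_pc, tb->pc, tb->flags);

        qht_insert(ht, tb, h);  /* caller provides the 32-bit hash */
        return qht_lookup(ht, tb_cmp_sketch, &desc, h);
    }

with the table itself set up once via
qht_init(&ht, CODE_GEN_HTABLE_SIZE, QHT_MODE_AUTO_RESIZE), as patch 13
does for the TB hash.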
Reviewed-by: Richard Henderson <rth@twiddle.net>
r~
* Re: [Qemu-devel] [PATCH v4 13/14] tb hash: track translated blocks with qht
2016-04-30 3:33 ` [Qemu-devel] [PATCH v4 13/14] tb hash: track translated blocks with qht Emilio G. Cota
2016-05-03 7:36 ` Alex Bennée
@ 2016-05-04 5:19 ` Richard Henderson
1 sibling, 0 replies; 31+ messages in thread
From: Richard Henderson @ 2016-05-04 5:19 UTC (permalink / raw)
To: Emilio G. Cota, QEMU Developers, MTTCG Devel
Cc: Alex Bennée, Paolo Bonzini, Peter Crosthwaite,
Sergey Fedorov
On 04/29/2016 05:33 PM, Emilio G. Cota wrote:
> Reviewed-by: Alex Bennée <alex.bennee@linaro.org>
> Signed-off-by: Emilio G. Cota <cota@braap.org>
> ---
> cpu-exec.c | 82 ++++++++++++++++++++++++-----------------------
> include/exec/exec-all.h | 9 +++---
> include/exec/tb-hash.h | 3 +-
> translate-all.c | 85 ++++++++++++++++++++++---------------------------
> 4 files changed, 86 insertions(+), 93 deletions(-)
Reviewed-by: Richard Henderson <rth@twiddle.net>
r~
* Re: [Qemu-devel] [PATCH v4 14/14] translate-all: add tb hash bucket info to 'info jit' dump
2016-04-30 3:33 ` [Qemu-devel] [PATCH v4 14/14] translate-all: add tb hash bucket info to 'info jit' dump Emilio G. Cota
@ 2016-05-04 5:21 ` Richard Henderson
0 siblings, 0 replies; 31+ messages in thread
From: Richard Henderson @ 2016-05-04 5:21 UTC (permalink / raw)
To: Emilio G. Cota, QEMU Developers, MTTCG Devel
Cc: Alex Bennée, Paolo Bonzini, Peter Crosthwaite,
Sergey Fedorov
On 04/29/2016 05:33 PM, Emilio G. Cota wrote:
> Suggested-by: Richard Henderson <rth@twiddle.net>
> Signed-off-by: Emilio G. Cota <cota@braap.org>
> ---
> translate-all.c | 36 ++++++++++++++++++++++++++++++++++++
> 1 file changed, 36 insertions(+)
Reviewed-by: Richard Henderson <rth@twiddle.net>
r~
* Re: [Qemu-devel] [PATCH v4 12/14] qht: add test program
2016-04-30 3:33 ` [Qemu-devel] [PATCH v4 12/14] qht: add test program Emilio G. Cota
@ 2016-05-04 5:22 ` Richard Henderson
0 siblings, 0 replies; 31+ messages in thread
From: Richard Henderson @ 2016-05-04 5:22 UTC (permalink / raw)
To: Emilio G. Cota, QEMU Developers, MTTCG Devel
Cc: Alex Bennée, Paolo Bonzini, Peter Crosthwaite,
Sergey Fedorov
On 04/29/2016 05:33 PM, Emilio G. Cota wrote:
> Reviewed-by: Alex Bennée <alex.bennee@linaro.org>
> Signed-off-by: Emilio G. Cota <cota@braap.org>
> ---
> tests/.gitignore | 1 +
> tests/Makefile | 5 +-
> tests/test-qht.c | 177 +++++++++++++++++++++++++++++++++++++++++++++++++++++++
> 3 files changed, 182 insertions(+), 1 deletion(-)
> create mode 100644 tests/test-qht.c
Reviewed-by: Richard Henderson <rth@twiddle.net>
r~
* Re: [Qemu-devel] [PATCH v4 10/14] qdist: add test program
2016-04-30 3:33 ` [Qemu-devel] [PATCH v4 10/14] qdist: add test program Emilio G. Cota
@ 2016-05-04 5:23 ` Richard Henderson
0 siblings, 0 replies; 31+ messages in thread
From: Richard Henderson @ 2016-05-04 5:23 UTC (permalink / raw)
To: Emilio G. Cota, QEMU Developers, MTTCG Devel
Cc: Alex Bennée, Paolo Bonzini, Peter Crosthwaite,
Sergey Fedorov
On 04/29/2016 05:33 PM, Emilio G. Cota wrote:
> Signed-off-by: Emilio G. Cota <cota@braap.org>
> ---
> tests/.gitignore | 1 +
> tests/Makefile | 6 +-
> tests/test-qdist.c | 363 +++++++++++++++++++++++++++++++++++++++++++++++++++++
> 3 files changed, 369 insertions(+), 1 deletion(-)
> create mode 100644 tests/test-qdist.c
Reviewed-by: Richard Henderson <rth@twiddle.net>
r~
* Re: [Qemu-devel] [PATCH v4 13/14] tb hash: track translated blocks with qht
2016-05-03 17:26 ` Emilio G. Cota
@ 2016-05-04 5:24 ` Richard Henderson
2016-05-04 9:31 ` Alex Bennée
2016-05-04 15:36 ` Emilio G. Cota
0 siblings, 2 replies; 31+ messages in thread
From: Richard Henderson @ 2016-05-04 5:24 UTC (permalink / raw)
To: Emilio G. Cota, Alex Bennée
Cc: QEMU Developers, MTTCG Devel, Paolo Bonzini, Peter Crosthwaite,
Sergey Fedorov
On 05/03/2016 07:26 AM, Emilio G. Cota wrote:
> On Tue, May 03, 2016 at 08:36:35 +0100, Alex Bennée wrote:
>> Emilio G. Cota <cota@braap.org> writes:
>>> static TranslationBlock *tb_find_physical(CPUState *cpu,
>>> target_ulong pc,
>>> target_ulong cs_base,
>>> uint32_t flags)
>>
>> This does not apply cleanly to master due to the flag change. You should
>> either reference the clean-up patch or include it in the series.
>
> Ouch, sorry. Won't happen again.
>
> Grab the missing pre-requisite patch from:
> http://patchwork.ozlabs.org/patch/607662/mbox/
Well, that said, there are further conflicts within my tcg-next tree.
If you'll rebase on
git://github.com/rth7680/qemu.git tcg-next
I'll queue the patchset for 2.7.
r~
* Re: [Qemu-devel] [PATCH v4 13/14] tb hash: track translated blocks with qht
2016-05-04 5:24 ` Richard Henderson
@ 2016-05-04 9:31 ` Alex Bennée
2016-05-04 15:36 ` Emilio G. Cota
1 sibling, 0 replies; 31+ messages in thread
From: Alex Bennée @ 2016-05-04 9:31 UTC (permalink / raw)
To: Richard Henderson
Cc: Emilio G. Cota, QEMU Developers, MTTCG Devel, Paolo Bonzini,
Peter Crosthwaite, Sergey Fedorov
Richard Henderson <rth@twiddle.net> writes:
> On 05/03/2016 07:26 AM, Emilio G. Cota wrote:
>> On Tue, May 03, 2016 at 08:36:35 +0100, Alex Bennée wrote:
>>> Emilio G. Cota <cota@braap.org> writes:
>>>> static TranslationBlock *tb_find_physical(CPUState *cpu,
>>>> target_ulong pc,
>>>> target_ulong cs_base,
>>>> uint32_t flags)
>>>
>>> This does not apply cleanly to master due to the flag change. You should
>>> either reference the clean-up patch or include it in the series.
>>
>> Ouch, sorry. Won't happen again.
>>
>> Grab the missing pre-requisite patch from:
>> http://patchwork.ozlabs.org/patch/607662/mbox/
>
> Well, that said, there are further conflicts within my tcg-next tree.
> If you'll rebase on
>
> git://github.com/rth7680/qemu.git tcg-next
FYI I did build a tree on which to base the rest of the MTTCG work:
https://github.com/stsquad/qemu/commits/mttcg/tcg-next-plus-qht
>
> I'll queue the patchset for 2.7.
>
>
> r~
--
Alex Bennée
* Re: [Qemu-devel] [PATCH v4 13/14] tb hash: track translated blocks with qht
2016-05-04 5:24 ` Richard Henderson
2016-05-04 9:31 ` Alex Bennée
@ 2016-05-04 15:36 ` Emilio G. Cota
2016-05-04 17:22 ` Richard Henderson
1 sibling, 1 reply; 31+ messages in thread
From: Emilio G. Cota @ 2016-05-04 15:36 UTC (permalink / raw)
To: Richard Henderson
Cc: Alex Bennée, QEMU Developers, MTTCG Devel, Paolo Bonzini,
Peter Crosthwaite, Sergey Fedorov
On Tue, May 03, 2016 at 19:24:31 -1000, Richard Henderson wrote:
> On 05/03/2016 07:26 AM, Emilio G. Cota wrote:
> >Ouch, sorry. Won't happen again.
> >
> >Grab the missing pre-requisite patch from:
> > http://patchwork.ozlabs.org/patch/607662/mbox/
>
> Well, that said, there are further conflicts within my tcg-next tree.
> If you'll rebase on
>
> git://github.com/rth7680/qemu.git tcg-next
>
> I'll queue the patchset for 2.7.
Great, will do.
BTW in the last couple of days I did some more work beyond v4:
- Added a benchmark (not a correctness test) to measure parallel
performance of QHT (recall that test/qht-test is sequential.)
- Added support for concurrent insertions as long as they're not to the
same bucket, thus getting rid of the "external lock" requirement.
This is not really needed for MTTCG because all insertions are supposed
to be serialized by tb_lock; however, the feature (1) has no negative
performance impact (just adds an unlikely() branch after lock acquisition
on insertions/removals) and (2) could be useful for future (parallel)
users of qht.
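For the curious, that branch plausibly amounts to a stale-map recheck
after taking the per-bucket lock -- the field names below are made up
for illustration and are not the actual qht internals:

    for (;;) {
        map = atomic_rcu_read(&ht->map);
        b = &map->buckets[hash & (map->n_buckets - 1)];
        qemu_spin_lock(&b->lock);
        if (unlikely(map != atomic_rcu_read(&ht->map))) {
            /* a concurrent resize replaced the map between the read
             * and the lock; drop the stale bucket lock and retry */
            qemu_spin_unlock(&b->lock);
            continue;
        }
        break; /* common case: map unchanged; safe to insert/remove */
    }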
Should I send this work as follow-up patches to v4 to ease review, or
should I send a v5 with them merged in?
Thanks,
Emilio
* Re: [Qemu-devel] [PATCH v4 13/14] tb hash: track translated blocks with qht
2016-05-04 15:36 ` Emilio G. Cota
@ 2016-05-04 17:22 ` Richard Henderson
2016-05-05 21:41 ` Emilio G. Cota
0 siblings, 1 reply; 31+ messages in thread
From: Richard Henderson @ 2016-05-04 17:22 UTC (permalink / raw)
To: Emilio G. Cota
Cc: Alex Bennée, QEMU Developers, MTTCG Devel, Paolo Bonzini,
Peter Crosthwaite, Sergey Fedorov
On 05/04/2016 05:36 AM, Emilio G. Cota wrote:
> BTW in the last couple of days I did some more work beyond v4:
>
> - Added a benchmark (not a correctness test) to measure parallel
> performance of QHT (recall that test/qht-test is sequential.)
>
> - Added support for concurrent insertions as long as they're not to the
> same bucket, thus getting rid of the "external lock" requirement.
> This is not really needed for MTTCG because all insertions are supposed
> to be serialized by tb_lock; however, the feature (1) has no negative
> performance impact (just adds an unlikely() branch after lock acquisition
> on insertions/removals) and (2) could be useful for future (parallel)
> users of qht.
>
> Should I send this work as follow-up patches to v4 to ease review, or
> should I send a v5 with them merged in?
Let's handle these as follow-on, since we've already got multiple R-b tags for v4.
r~
* Re: [Qemu-devel] [PATCH v4 13/14] tb hash: track translated blocks with qht
2016-05-04 17:22 ` Richard Henderson
@ 2016-05-05 21:41 ` Emilio G. Cota
2016-05-06 22:14 ` Richard Henderson
0 siblings, 1 reply; 31+ messages in thread
From: Emilio G. Cota @ 2016-05-05 21:41 UTC (permalink / raw)
To: Richard Henderson
Cc: Alex Bennée, QEMU Developers, MTTCG Devel, Paolo Bonzini,
Peter Crosthwaite, Sergey Fedorov
On Wed, May 04, 2016 at 07:22:16 -1000, Richard Henderson wrote:
> On 05/04/2016 05:36 AM, Emilio G. Cota wrote:
> >BTW in the last couple of days I did some more work beyond v4:
> >
> >- Added a benchmark (not a correctness test) to measure parallel
> > performance of QHT (recall that test/qht-test is sequential.)
> >
> >- Added support for concurrent insertions as long as they're not to the
> > same bucket, thus getting rid of the "external lock" requirement.
> > This is not really needed for MTTCG because all insertions are supposed
> > to be serialized by tb_lock; however, the feature (1) has no negative
> > performance impact (just adds an unlikely() branch after lock acquisition
> > on insertions/removals) and (2) could be useful for future (parallel)
> > users of qht.
> >
> >Should I send this work as follow-up patches to v4 to ease review, or
> >should I send a v5 with them merged in?
>
> Let's handle these as follow-on, since we've already got multiple R-b tags for v4.
OK, will submit the modifications next week.
BTW, benchmarking with the new test is giving me some interesting
results.
For instance, I'm measuring a 5% lookup latency reduction
(single-threaded throughput goes from 45.84 to 48.41 M lookups/s)
if I remove support in qht for MRU promotions.
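(Sanity-checking those numbers: per-lookup latency goes from
1/45.84M ~= 21.8 ns to 1/48.41M ~= 20.7 ns, a ~5.3% reduction, which
matches the ~5% claim.)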
This is tempting me to just kill the feature, since resizing
works very well. And it would save ~90 lines of code. Is anyone
against this?
Thanks,
Emilio
* Re: [Qemu-devel] [PATCH v4 13/14] tb hash: track translated blocks with qht
2016-05-05 21:41 ` Emilio G. Cota
@ 2016-05-06 22:14 ` Richard Henderson
0 siblings, 0 replies; 31+ messages in thread
From: Richard Henderson @ 2016-05-06 22:14 UTC (permalink / raw)
To: Emilio G. Cota
Cc: MTTCG Devel, Peter Crosthwaite, QEMU Developers, Sergey Fedorov,
Paolo Bonzini, Alex Bennée
On 05/05/2016 11:41 AM, Emilio G. Cota wrote:
> This is tempting me to just kill the feature, since resizing
> works very well. And it would save ~90 lines of code. Is anyone
> against this?
Not I. I think it'll nicely simplify things.
r~