From mboxrd@z Thu Jan 1 00:00:00 1970 Received: from eggs.gnu.org ([2001:4830:134:3::10]:46103) by lists.gnu.org with esmtp (Exim 4.71) (envelope-from ) id 1bXTh0-00039m-8H for qemu-devel@nongnu.org; Wed, 10 Aug 2016 09:37:12 -0400 Received: from Debian-exim by eggs.gnu.org with spam-scanned (Exim 4.71) (envelope-from ) id 1bXTgv-0001co-OU for qemu-devel@nongnu.org; Wed, 10 Aug 2016 09:37:09 -0400 Received: from mx1.redhat.com ([209.132.183.28]:40930) by eggs.gnu.org with esmtp (Exim 4.71) (envelope-from ) id 1bXTgv-0001cQ-Fb for qemu-devel@nongnu.org; Wed, 10 Aug 2016 09:37:05 -0400 Date: Wed, 10 Aug 2016 15:36:59 +0200 From: Igor Mammedov Message-ID: <20160810153659.4b0172bd@nial.brq.redhat.com> In-Reply-To: <1465568813-19771-15-git-send-email-rth@twiddle.net> References: <1465568813-19771-1-git-send-email-rth@twiddle.net> <1465568813-19771-15-git-send-email-rth@twiddle.net> MIME-Version: 1.0 Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: quoted-printable Subject: Re: [Qemu-devel] [PULL 14/15] tb hash: track translated blocks with qht List-Id: List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , To: Richard Henderson Cc: qemu-devel@nongnu.org, peter.maydell@linaro.org, "Emilio G. Cota" , Paolo Bonzini , Peter Crosthwaite , MTTCG Devel On Fri, 10 Jun 2016 07:26:52 -0700 Richard Henderson wrote: This patch make SIGSEGVs QEMU when debugging KVM guest with gdb Steps to reproduce: Seabios: clone and build current master with defconfig plus CONFIG_RELOCATE_INIT=3Dn on top of it QEMU: ./configure --enable-debug --target-list=3Dx86_64-softmmu ./x86_64-softmmu/qemu-system-x86_64 -enable-kvm -bios ../seabios/out/bio= s.bin -s -S GDB: gdb ../seabios/out/rom.o (gdb) b smp_setup (gdb) target remote localhost:1234 (gdb) c Continuing. Remote connection closed QEMU's backtrace: Program received signal SIGSEGV, Segmentation fault. 0x0000555555c19c6f in qht_reset_size (ht=3D0x555556240058 , n_= elems=3D0x8000) at util/qht.c:422 422 if (n_buckets !=3D map->n_buckets) { (gdb) bt #0 0x0000555555c19c6f in qht_reset_size (ht=3D0x555556240058 = , n_elems=3D0x8000) at util/qht.c:422 #1 0x0000555555754bdd in tb_flush (cpu=3D0x0) at /home/imammedo/builds/qem= u/translate-all.c:855 #2 0x000055555579026e in gdb_vm_state_change (opaque=3D0x0, running=3D0x0,= state=3DRUN_STATE_DEBUG) at /home/imammedo/builds/qemu/gdbstub.c:1276 #3 0x00005555558be59f in vm_state_notify (running=3D0x0, state=3DRUN_STATE= _DEBUG) at vl.c:1585 #4 0x000055555578137e in do_vm_stop (state=3DRUN_STATE_DEBUG) at /home/ima= mmedo/builds/qemu/cpus.c:743 #5 0x0000555555782b5e in vm_stop (state=3DRUN_STATE_DEBUG) at /home/imamme= do/builds/qemu/cpus.c:1476 #6 0x00005555558bebdf in main_loop_should_exit () at vl.c:1856 #7 0x00005555558bed57 in main_loop () at vl.c:1912 #8 0x00005555558c678a in main (argc=3D0x6, argv=3D0x7fffffffe0c8, envp=3D0= x7fffffffe100) at vl.c:4603 > From: "Emilio G. Cota" >=20 > Having a fixed-size hash table for keeping track of all translation blocks > is suboptimal: some workloads are just too big or too small to get maximum > performance from the hash table. The MRU promotion policy helps improve > performance when the hash table is a little undersized, but it cannot > make up for severely undersized hash tables. >=20 > Furthermore, frequent MRU promotions result in writes that are a scalabil= ity > bottleneck. For scalability, lookups should only perform reads, not write= s. > This is not a big deal for now, but it will become one once MTTCG matures. >=20 > The appended fixes these issues by using qht as the implementation of > the TB hash table. This solution is superior to other alternatives consid= ered, > namely: >=20 > - master: implementation in QEMU before this patchset > - xxhash: before this patch, i.e. fixed buckets + xxhash hashing + MRU. > - xxhash-rcu: fixed buckets + xxhash + RCU list + MRU. > MRU is implemented here by adding an intermediate struct > that contains the u32 hash and a pointer to the TB; this > allows us, on an MRU promotion, to copy said struct (that i= s not > at the head), and put this new copy at the head. After a gr= ace > period, the original non-head struct can be eliminated, and > after another grace period, freed. > - qht-fixed-nomru: fixed buckets + xxhash + qht without auto-resize + > no MRU for lookups; MRU for inserts. > The appended solution is the following: > - qht-dyn-nomru: dynamic number of buckets + xxhash + qht w/ auto-resize + > no MRU for lookups; MRU for inserts. >=20 > The plots below compare the considered solutions. The Y axis shows the > boot time (in seconds) of a debian jessie image with arm-softmmu; the X a= xis > sweeps the number of buckets (or initial number of buckets for qht-autore= size). > The plots in PNG format (and with errorbars) can be seen here: > http://imgur.com/a/Awgnq >=20 > Each test runs 5 times, and the entire QEMU process is pinned to a > single core for repeatability of results. >=20 > Host: Intel Xeon E5-2690 >=20 > 28 ++------------+-------------+-------------+-------------+-----------= -++ > A***** + + + master **A**= * + > 27 ++ * xxhash ##B##= #++ > | A******A****** xxhash-rcu $$C$$= $ | > 26 C$$ A******A****** qht-fixed-nomru*%%D%%= %++ > D%%$$ A******A******A*qht-dyn-mru A*E**= **A > 25 ++ %%$$ qht-dyn-nomru &&F&&= &++ > B#####% = | > 24 ++ #C$$$$$ = ++ > | B### $ = | > | ## C$$$$$$ = | > 23 ++ # C$$$$$$ = ++ > | B###### C$$$$$$ %= %%D > 22 ++ %B###### C$$$$$$C$$$$$$C$$$$$$C$$$$$$C$$$$= $$C > | D%%%%%%B###### @E@@@@@@ %%%D%%%@@@E@@@@= @@E > 21 E@@@@@@E@@@@@@F&&&@@@E@@@&&&D%%%%%%B######B######B######B######B####= ##B > + E@@@ F&&& + E@ + F&&& + = + > 20 ++------------+-------------+-------------+-------------+-----------= -++ > 14 16 18 20 22 = 24 > log2 number of buckets >=20 > Host: Intel i7-4790K >=20 > 14.5 ++------------+------------+-------------+------------+-----------= -++ > A** + + + master **A**= * + > 14 ++ ** xxhash ##B##= #++ > 13.5 ++ ** xxhash-rcu $$C$$= $++ > | qht-fixed-nomru %%D%%= % | > 13 ++ A****** qht-dyn-mru @@E@@= @++ > | A*****A******A****** qht-dyn-nomru &&F&&= & | > 12.5 C$$ A******A******A*****A****** *= **A > 12 ++ $$ A*** = ++ > D%%% $$ = | > 11.5 ++ %% = ++ > B### %C$$$$$$ = | > 11 ++ ## D%%%%% C$$$$$ = ++ > | # % C$$$$$$ = | > 10.5 F&&&&&&B######D%%%%% C$$$$$$C$$$$$$C$$$$$$C$$$$$C$$$$$$ $= $$C > 10 E@@@@@@E@@@@@@B#####B######B######E@@@@@@E@@@%%%D%%%%%D%%%###B####= ##B > + F&& D%%%%%%B######B######B#####B###@@@D%%% = + > 9.5 ++------------+------------+-------------+------------+-----------= -++ > 14 16 18 20 22 = 24 > log2 number of buckets >=20 > Note that the original point before this patch series is X=3D15 for "mast= er"; > the little sensitivity to the increased number of buckets is due to the > poor hashing function in master. >=20 > xxhash-rcu has significant overhead due to the constant churn of allocati= ng > and deallocating intermediate structs for implementing MRU. An alternative > would be do consider failed lookups as "maybe not there", and then > acquire the external lock (tb_lock in this case) to really confirm that > there was indeed a failed lookup. This, however, would not be enough > to implement dynamic resizing--this is more complex: see > "Resizable, Scalable, Concurrent Hash Tables via Relativistic > Programming" by Triplett, McKenney and Walpole. This solution was > discarded due to the very coarse RCU read critical sections that we have > in MTTCG; resizing requires waiting for readers after every pointer updat= e, > and resizes require many pointer updates, so this would quickly become > prohibitive. >=20 > qht-fixed-nomru shows that MRU promotion is advisable for undersized > hash tables. >=20 > However, qht-dyn-mru shows that MRU promotion is not important if the > hash table is properly sized: there is virtually no difference in > performance between qht-dyn-nomru and qht-dyn-mru. >=20 > Before this patch, we're at X=3D15 on "xxhash"; after this patch, we're at > X=3D15 @ qht-dyn-nomru. This patch thus matches the best performance that= we > can achieve with optimum sizing of the hash table, while keeping the hash > table scalable for readers. >=20 > The improvement we get before and after this patch for booting debian jes= sie > with arm-softmmu is: >=20 > - Intel Xeon E5-2690: 10.5% less time > - Intel i7-4790K: 5.2% less time >=20 > We could get this same improvement _for this particular workload_ by > statically increasing the size of the hash table. But this would hurt > workloads that do not need a large hash table. The dynamic (upward) > resizing allows us to start small and enlarge the hash table as needed. >=20 > A quick note on downsizing: the table is resized back to 2**15 buckets > on every tb_flush; this makes sense because it is not guaranteed that the > table will reach the same number of TBs later on (e.g. most bootup code is > thrown away after boot); it makes sense to grow the hash table as > more code blocks are translated. This also avoids the complication of > having to build downsizing hysteresis logic into qht. >=20 > Reviewed-by: Sergey Fedorov > Reviewed-by: Alex Benn=C3=A9e > Reviewed-by: Richard Henderson > Signed-off-by: Emilio G. Cota > Message-Id: <1465412133-3029-15-git-send-email-cota@braap.org> > Signed-off-by: Richard Henderson > --- > cpu-exec.c | 86 +++++++++++++++++++++++------------------= ------ > include/exec/exec-all.h | 2 -- > include/exec/tb-context.h | 7 ++-- > include/exec/tb-hash.h | 3 +- > translate-all.c | 85 +++++++++++++++++++++--------------------= ----- > 5 files changed, 86 insertions(+), 97 deletions(-) >=20 > diff --git a/cpu-exec.c b/cpu-exec.c > index b9e294c..b840e1d 100644 > --- a/cpu-exec.c > +++ b/cpu-exec.c > @@ -225,57 +225,57 @@ static void cpu_exec_nocache(CPUState *cpu, int max= _cycles, > } > #endif > =20 > +struct tb_desc { > + target_ulong pc; > + target_ulong cs_base; > + CPUArchState *env; > + tb_page_addr_t phys_page1; > + uint32_t flags; > +}; > + > +static bool tb_cmp(const void *p, const void *d) > +{ > + const TranslationBlock *tb =3D p; > + const struct tb_desc *desc =3D d; > + > + if (tb->pc =3D=3D desc->pc && > + tb->page_addr[0] =3D=3D desc->phys_page1 && > + tb->cs_base =3D=3D desc->cs_base && > + tb->flags =3D=3D desc->flags) { > + /* check next page if needed */ > + if (tb->page_addr[1] =3D=3D -1) { > + return true; > + } else { > + tb_page_addr_t phys_page2; > + target_ulong virt_page2; > + > + virt_page2 =3D (desc->pc & TARGET_PAGE_MASK) + TARGET_PAGE_S= IZE; > + phys_page2 =3D get_page_addr_code(desc->env, virt_page2); > + if (tb->page_addr[1] =3D=3D phys_page2) { > + return true; > + } > + } > + } > + return false; > +} > + > static TranslationBlock *tb_find_physical(CPUState *cpu, > target_ulong pc, > target_ulong cs_base, > uint32_t flags) > { > - CPUArchState *env =3D (CPUArchState *)cpu->env_ptr; > - TranslationBlock *tb, **tb_hash_head, **ptb1; > + tb_page_addr_t phys_pc; > + struct tb_desc desc; > uint32_t h; > - tb_page_addr_t phys_pc, phys_page1; > =20 > - /* find translated block using physical mappings */ > - phys_pc =3D get_page_addr_code(env, pc); > - phys_page1 =3D phys_pc & TARGET_PAGE_MASK; > + desc.env =3D (CPUArchState *)cpu->env_ptr; > + desc.cs_base =3D cs_base; > + desc.flags =3D flags; > + desc.pc =3D pc; > + phys_pc =3D get_page_addr_code(desc.env, pc); > + desc.phys_page1 =3D phys_pc & TARGET_PAGE_MASK; > h =3D tb_hash_func(phys_pc, pc, flags); > - > - /* Start at head of the hash entry */ > - ptb1 =3D tb_hash_head =3D &tcg_ctx.tb_ctx.tb_phys_hash[h]; > - tb =3D *ptb1; > - > - while (tb) { > - if (tb->pc =3D=3D pc && > - tb->page_addr[0] =3D=3D phys_page1 && > - tb->cs_base =3D=3D cs_base && > - tb->flags =3D=3D flags) { > - > - if (tb->page_addr[1] =3D=3D -1) { > - /* done, we have a match */ > - break; > - } else { > - /* check next page if needed */ > - target_ulong virt_page2 =3D (pc & TARGET_PAGE_MASK) + > - TARGET_PAGE_SIZE; > - tb_page_addr_t phys_page2 =3D get_page_addr_code(env, vi= rt_page2); > - > - if (tb->page_addr[1] =3D=3D phys_page2) { > - break; > - } > - } > - } > - > - ptb1 =3D &tb->phys_hash_next; > - tb =3D *ptb1; > - } > - > - if (tb) { > - /* Move the TB to the head of the list */ > - *ptb1 =3D tb->phys_hash_next; > - tb->phys_hash_next =3D *tb_hash_head; > - *tb_hash_head =3D tb; > - } > - return tb; > + return qht_lookup(&tcg_ctx.tb_ctx.htable, tb_cmp, &desc, h); > } > =20 > static TranslationBlock *tb_find_slow(CPUState *cpu, > diff --git a/include/exec/exec-all.h b/include/exec/exec-all.h > index e076397..c1f59fa 100644 > --- a/include/exec/exec-all.h > +++ b/include/exec/exec-all.h > @@ -215,8 +215,6 @@ struct TranslationBlock { > =20 > void *tc_ptr; /* pointer to the translated code */ > uint8_t *tc_search; /* pointer to search data */ > - /* next matching tb for physical address. */ > - struct TranslationBlock *phys_hash_next; > /* original tb when cflags has CF_NOCACHE */ > struct TranslationBlock *orig_tb; > /* first and second physical page containing code. The lower bit > diff --git a/include/exec/tb-context.h b/include/exec/tb-context.h > index 5efe3d9..e209c1c 100644 > --- a/include/exec/tb-context.h > +++ b/include/exec/tb-context.h > @@ -21,9 +21,10 @@ > #define QEMU_TB_CONTEXT_H_ > =20 > #include "qemu/thread.h" > +#include "qemu/qht.h" > =20 > -#define CODE_GEN_PHYS_HASH_BITS 15 > -#define CODE_GEN_PHYS_HASH_SIZE (1 << CODE_GEN_PHYS_HASH_BITS) > +#define CODE_GEN_HTABLE_BITS 15 > +#define CODE_GEN_HTABLE_SIZE (1 << CODE_GEN_HTABLE_BITS) > =20 > typedef struct TranslationBlock TranslationBlock; > typedef struct TBContext TBContext; > @@ -31,7 +32,7 @@ typedef struct TBContext TBContext; > struct TBContext { > =20 > TranslationBlock *tbs; > - TranslationBlock *tb_phys_hash[CODE_GEN_PHYS_HASH_SIZE]; > + struct qht htable; > int nb_tbs; > /* any access to the tbs or the page table must use this lock */ > QemuMutex tb_lock; > diff --git a/include/exec/tb-hash.h b/include/exec/tb-hash.h > index 88ccfd1..1d0200b 100644 > --- a/include/exec/tb-hash.h > +++ b/include/exec/tb-hash.h > @@ -20,7 +20,6 @@ > #ifndef EXEC_TB_HASH > #define EXEC_TB_HASH > =20 > -#include "exec/exec-all.h" > #include "exec/tb-hash-xx.h" > =20 > /* Only the bottom TB_JMP_PAGE_BITS of the jump cache hash bits vary for > @@ -49,7 +48,7 @@ static inline unsigned int tb_jmp_cache_hash_func(targe= t_ulong pc) > static inline > uint32_t tb_hash_func(tb_page_addr_t phys_pc, target_ulong pc, uint32_t = flags) > { > - return tb_hash_func5(phys_pc, pc, flags) & (CODE_GEN_PHYS_HASH_SIZE = - 1); > + return tb_hash_func5(phys_pc, pc, flags); > } > =20 > #endif > diff --git a/translate-all.c b/translate-all.c > index d75737c..b620fcc 100644 > --- a/translate-all.c > +++ b/translate-all.c > @@ -735,6 +735,13 @@ static inline void code_gen_alloc(size_t tb_size) > qemu_mutex_init(&tcg_ctx.tb_ctx.tb_lock); > } > =20 > +static void tb_htable_init(void) > +{ > + unsigned int mode =3D QHT_MODE_AUTO_RESIZE; > + > + qht_init(&tcg_ctx.tb_ctx.htable, CODE_GEN_HTABLE_SIZE, mode); > +} > + > /* Must be called before using the QEMU cpus. 'tb_size' is the size > (in bytes) allocated to the translation buffer. Zero means default > size. */ > @@ -742,6 +749,7 @@ void tcg_exec_init(unsigned long tb_size) > { > cpu_gen_init(); > page_init(); > + tb_htable_init(); > code_gen_alloc(tb_size); > #if defined(CONFIG_SOFTMMU) > /* There's no guest base to take into account, so go ahead and > @@ -846,7 +854,7 @@ void tb_flush(CPUState *cpu) > cpu->tb_flushed =3D true; > } > =20 > - memset(tcg_ctx.tb_ctx.tb_phys_hash, 0, sizeof(tcg_ctx.tb_ctx.tb_phys= _hash)); > + qht_reset_size(&tcg_ctx.tb_ctx.htable, CODE_GEN_HTABLE_SIZE); > page_flush_tb(); > =20 > tcg_ctx.code_gen_ptr =3D tcg_ctx.code_gen_buffer; > @@ -857,60 +865,46 @@ void tb_flush(CPUState *cpu) > =20 > #ifdef DEBUG_TB_CHECK > =20 > -static void tb_invalidate_check(target_ulong address) > +static void > +do_tb_invalidate_check(struct qht *ht, void *p, uint32_t hash, void *use= rp) > { > - TranslationBlock *tb; > - int i; > + TranslationBlock *tb =3D p; > + target_ulong addr =3D *(target_ulong *)userp; > =20 > - address &=3D TARGET_PAGE_MASK; > - for (i =3D 0; i < CODE_GEN_PHYS_HASH_SIZE; i++) { > - for (tb =3D tcg_ctx.tb_ctx.tb_phys_hash[i]; tb !=3D NULL; > - tb =3D tb->phys_hash_next) { > - if (!(address + TARGET_PAGE_SIZE <=3D tb->pc || > - address >=3D tb->pc + tb->size)) { > - printf("ERROR invalidate: address=3D" TARGET_FMT_lx > - " PC=3D%08lx size=3D%04x\n", > - address, (long)tb->pc, tb->size); > - } > - } > + if (!(addr + TARGET_PAGE_SIZE <=3D tb->pc || addr >=3D tb->pc + tb->= size)) { > + printf("ERROR invalidate: address=3D" TARGET_FMT_lx > + " PC=3D%08lx size=3D%04x\n", addr, (long)tb->pc, tb->size= ); > } > } > =20 > -/* verify that all the pages have correct rights for code */ > -static void tb_page_check(void) > +static void tb_invalidate_check(target_ulong address) > { > - TranslationBlock *tb; > - int i, flags1, flags2; > - > - for (i =3D 0; i < CODE_GEN_PHYS_HASH_SIZE; i++) { > - for (tb =3D tcg_ctx.tb_ctx.tb_phys_hash[i]; tb !=3D NULL; > - tb =3D tb->phys_hash_next) { > - flags1 =3D page_get_flags(tb->pc); > - flags2 =3D page_get_flags(tb->pc + tb->size - 1); > - if ((flags1 & PAGE_WRITE) || (flags2 & PAGE_WRITE)) { > - printf("ERROR page flags: PC=3D%08lx size=3D%04x f1=3D%x= f2=3D%x\n", > - (long)tb->pc, tb->size, flags1, flags2); > - } > - } > - } > + address &=3D TARGET_PAGE_MASK; > + qht_iter(&tcg_ctx.tb_ctx.htable, do_tb_invalidate_check, &address); > } > =20 > -#endif > - > -static inline void tb_hash_remove(TranslationBlock **ptb, TranslationBlo= ck *tb) > +static void > +do_tb_page_check(struct qht *ht, void *p, uint32_t hash, void *userp) > { > - TranslationBlock *tb1; > + TranslationBlock *tb =3D p; > + int flags1, flags2; > =20 > - for (;;) { > - tb1 =3D *ptb; > - if (tb1 =3D=3D tb) { > - *ptb =3D tb1->phys_hash_next; > - break; > - } > - ptb =3D &tb1->phys_hash_next; > + flags1 =3D page_get_flags(tb->pc); > + flags2 =3D page_get_flags(tb->pc + tb->size - 1); > + if ((flags1 & PAGE_WRITE) || (flags2 & PAGE_WRITE)) { > + printf("ERROR page flags: PC=3D%08lx size=3D%04x f1=3D%x f2=3D%x= \n", > + (long)tb->pc, tb->size, flags1, flags2); > } > } > =20 > +/* verify that all the pages have correct rights for code */ > +static void tb_page_check(void) > +{ > + qht_iter(&tcg_ctx.tb_ctx.htable, do_tb_page_check, NULL); > +} > + > +#endif > + > static inline void tb_page_remove(TranslationBlock **ptb, TranslationBlo= ck *tb) > { > TranslationBlock *tb1; > @@ -998,7 +992,7 @@ void tb_phys_invalidate(TranslationBlock *tb, tb_page= _addr_t page_addr) > /* remove the TB from the hash list */ > phys_pc =3D tb->page_addr[0] + (tb->pc & ~TARGET_PAGE_MASK); > h =3D tb_hash_func(phys_pc, tb->pc, tb->flags); > - tb_hash_remove(&tcg_ctx.tb_ctx.tb_phys_hash[h], tb); > + qht_remove(&tcg_ctx.tb_ctx.htable, tb, h); > =20 > /* remove the TB from the page list */ > if (tb->page_addr[0] !=3D page_addr) { > @@ -1128,13 +1122,10 @@ static void tb_link_page(TranslationBlock *tb, tb= _page_addr_t phys_pc, > tb_page_addr_t phys_page2) > { > uint32_t h; > - TranslationBlock **ptb; > =20 > /* add in the hash table */ > h =3D tb_hash_func(phys_pc, tb->pc, tb->flags); > - ptb =3D &tcg_ctx.tb_ctx.tb_phys_hash[h]; > - tb->phys_hash_next =3D *ptb; > - *ptb =3D tb; > + qht_insert(&tcg_ctx.tb_ctx.htable, tb, h); > =20 > /* add in the page list */ > tb_alloc_page(tb, 0, phys_pc & TARGET_PAGE_MASK);