* [Qemu-devel] [RFC v2 0/5] Dynamic TLB sizing
@ 2018-10-08 23:27 Emilio G. Cota
  2018-10-08 23:27 ` [Qemu-devel] [RFC v2 1/5] tcg: Add tlb_index and tlb_entry helpers Emilio G. Cota
  ` (5 more replies)
  0 siblings, 6 replies; 18+ messages in thread
From: Emilio G. Cota @ 2018-10-08 23:27 UTC (permalink / raw)
To: qemu-devel; +Cc: Richard Henderson, Alex Bennée

v1: https://lists.gnu.org/archive/html/qemu-devel/2018-10/msg01146.html

Changes since v1:

- Add tlb_index and tlb_entry helpers from Richard
- Introduce sizeof_tlb() and tlb_n_entries()
- Extract tlb_mask as its own array in CPUArchState, as suggested by Richard.
  For the associated helpers (tlb_index etc) I tried several approaches, and
  performance-wise they're all the same, so went for the simplest
  implementation.
- Use uintptr_t for tlb_mask, as done in Richard's patch
  + tcg/i386: use hrexw when reading tlb_mask
  + Define tlbtype and tlbrexw solely based on TARGET_PAGE_BITS
- Rename tlb_is_invalid to tlb_entry_is_empty, comparing all fields
  (except .addend) against -1.
- Rename CPUTLBDesc.used to .n_used_entries.
- Drop the MIN/MAX CPU_TLB_BITS patches, defining instead some values for
  MIN/MAX as well as a default.
- Use new_size and old_size consistently in the resizing function, as
  suggested by Richard.
- Add an additional chart to the last patch, where softmmu performance is
  compared against user-mode: https://imgur.com/a/eXkjMCE

You can fetch this series from:
  https://github.com/cota/qemu/tree/tlb-dyn-v2

Note that it applies on top of my tlb-lock-v4 series:
  https://lists.gnu.org/archive/html/qemu-devel/2018-10/msg01421.html

Thanks,

		Emilio

^ permalink raw reply [flat|nested] 18+ messages in thread
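For readers skimming the thread: the crux of the series is that the per-MMU-mode
TLB size stops being the compile-time constant CPU_TLB_SIZE and is instead
encoded in a per-mode mask stored next to a heap-allocated table. Condensed from
patches 1 and 2 below (the real definitions end up in include/exec/cpu_ldst.h;
the concrete numbers in the comment are only an example for a 64-bit host with
32-byte TLB entries), the lookup path becomes:

    /* tlb_mask[mmu_idx] holds (n_entries - 1) << CPU_TLB_ENTRY_BITS.
     * Example: 256 entries with 32-byte entries (CPU_TLB_ENTRY_BITS == 5)
     * give tlb_mask = 255 << 5 = 0x1fe0.
     */
    static inline uintptr_t tlb_index(CPUArchState *env, uintptr_t mmu_idx,
                                      target_ulong addr)
    {
        uintptr_t size_mask = env->tlb_mask[mmu_idx] >> CPU_TLB_ENTRY_BITS;

        return (addr >> TARGET_PAGE_BITS) & size_mask;
    }

    static inline CPUTLBEntry *tlb_entry(CPUArchState *env, uintptr_t mmu_idx,
                                         target_ulong addr)
    {
        return &env->tlb_table[mmu_idx][tlb_index(env, mmu_idx, addr)];
    }

Resizing then only has to reallocate env->tlb_table[mmu_idx] and update the
mask; every lookup picks up the new size automatically.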
* [Qemu-devel] [RFC v2 1/5] tcg: Add tlb_index and tlb_entry helpers 2018-10-08 23:27 [Qemu-devel] [RFC v2 0/5] Dynamic TLB sizing Emilio G. Cota @ 2018-10-08 23:27 ` Emilio G. Cota 2018-10-09 14:43 ` Alex Bennée 2018-10-08 23:27 ` [Qemu-devel] [RFC v2 2/5] (XXX) cputlb: introduce indirection for TLB size Emilio G. Cota ` (4 subsequent siblings) 5 siblings, 1 reply; 18+ messages in thread From: Emilio G. Cota @ 2018-10-08 23:27 UTC (permalink / raw) To: qemu-devel; +Cc: Richard Henderson, Alex Bennée From: Richard Henderson <richard.henderson@linaro.org> Isolate the computation of an index from an address into a helper before we change that function. Signed-off-by: Richard Henderson <richard.henderson@linaro.org> [ cota: convert tlb_vaddr_to_host; use atomic_read on addr_write ] Signed-off-by: Emilio G. Cota <cota@braap.org> --- accel/tcg/softmmu_template.h | 72 ++++++++++++++++---------------- include/exec/cpu_ldst.h | 19 +++++++-- include/exec/cpu_ldst_template.h | 25 +++++------ accel/tcg/cputlb.c | 61 ++++++++++++--------------- 4 files changed, 92 insertions(+), 85 deletions(-) diff --git a/accel/tcg/softmmu_template.h b/accel/tcg/softmmu_template.h index 1e50263871..aa34c692ee 100644 --- a/accel/tcg/softmmu_template.h +++ b/accel/tcg/softmmu_template.h @@ -111,9 +111,10 @@ static inline DATA_TYPE glue(io_read, SUFFIX)(CPUArchState *env, WORD_TYPE helper_le_ld_name(CPUArchState *env, target_ulong addr, TCGMemOpIdx oi, uintptr_t retaddr) { - unsigned mmu_idx = get_mmuidx(oi); - int index = (addr >> TARGET_PAGE_BITS) & (CPU_TLB_SIZE - 1); - target_ulong tlb_addr = env->tlb_table[mmu_idx][index].ADDR_READ; + uintptr_t mmu_idx = get_mmuidx(oi); + uintptr_t index = tlb_index(env, mmu_idx, addr); + CPUTLBEntry *entry = tlb_entry(env, mmu_idx, addr); + target_ulong tlb_addr = entry->ADDR_READ; unsigned a_bits = get_alignment_bits(get_memop(oi)); uintptr_t haddr; DATA_TYPE res; @@ -129,7 +130,7 @@ WORD_TYPE helper_le_ld_name(CPUArchState *env, target_ulong addr, tlb_fill(ENV_GET_CPU(env), addr, DATA_SIZE, READ_ACCESS_TYPE, mmu_idx, retaddr); } - tlb_addr = env->tlb_table[mmu_idx][index].ADDR_READ; + tlb_addr = entry->ADDR_READ; } /* Handle an IO access. */ @@ -166,7 +167,7 @@ WORD_TYPE helper_le_ld_name(CPUArchState *env, target_ulong addr, return res; } - haddr = addr + env->tlb_table[mmu_idx][index].addend; + haddr = addr + entry->addend; #if DATA_SIZE == 1 res = glue(glue(ld, LSUFFIX), _p)((uint8_t *)haddr); #else @@ -179,9 +180,10 @@ WORD_TYPE helper_le_ld_name(CPUArchState *env, target_ulong addr, WORD_TYPE helper_be_ld_name(CPUArchState *env, target_ulong addr, TCGMemOpIdx oi, uintptr_t retaddr) { - unsigned mmu_idx = get_mmuidx(oi); - int index = (addr >> TARGET_PAGE_BITS) & (CPU_TLB_SIZE - 1); - target_ulong tlb_addr = env->tlb_table[mmu_idx][index].ADDR_READ; + uintptr_t mmu_idx = get_mmuidx(oi); + uintptr_t index = tlb_index(env, mmu_idx, addr); + CPUTLBEntry *entry = tlb_entry(env, mmu_idx, addr); + target_ulong tlb_addr = entry->ADDR_READ; unsigned a_bits = get_alignment_bits(get_memop(oi)); uintptr_t haddr; DATA_TYPE res; @@ -197,7 +199,7 @@ WORD_TYPE helper_be_ld_name(CPUArchState *env, target_ulong addr, tlb_fill(ENV_GET_CPU(env), addr, DATA_SIZE, READ_ACCESS_TYPE, mmu_idx, retaddr); } - tlb_addr = env->tlb_table[mmu_idx][index].ADDR_READ; + tlb_addr = entry->ADDR_READ; } /* Handle an IO access. 
*/ @@ -234,7 +236,7 @@ WORD_TYPE helper_be_ld_name(CPUArchState *env, target_ulong addr, return res; } - haddr = addr + env->tlb_table[mmu_idx][index].addend; + haddr = addr + entry->addend; res = glue(glue(ld, LSUFFIX), _be_p)((uint8_t *)haddr); return res; } @@ -275,10 +277,10 @@ static inline void glue(io_write, SUFFIX)(CPUArchState *env, void helper_le_st_name(CPUArchState *env, target_ulong addr, DATA_TYPE val, TCGMemOpIdx oi, uintptr_t retaddr) { - unsigned mmu_idx = get_mmuidx(oi); - int index = (addr >> TARGET_PAGE_BITS) & (CPU_TLB_SIZE - 1); - target_ulong tlb_addr = - atomic_read(&env->tlb_table[mmu_idx][index].addr_write); + uintptr_t mmu_idx = get_mmuidx(oi); + uintptr_t index = tlb_index(env, mmu_idx, addr); + CPUTLBEntry *entry = tlb_entry(env, mmu_idx, addr); + target_ulong tlb_addr = atomic_read(&entry->addr_write); unsigned a_bits = get_alignment_bits(get_memop(oi)); uintptr_t haddr; @@ -293,8 +295,7 @@ void helper_le_st_name(CPUArchState *env, target_ulong addr, DATA_TYPE val, tlb_fill(ENV_GET_CPU(env), addr, DATA_SIZE, MMU_DATA_STORE, mmu_idx, retaddr); } - tlb_addr = atomic_read(&env->tlb_table[mmu_idx][index].addr_write) & - ~TLB_INVALID_MASK; + tlb_addr = atomic_read(&entry->addr_write) & ~TLB_INVALID_MASK; } /* Handle an IO access. */ @@ -315,16 +316,16 @@ void helper_le_st_name(CPUArchState *env, target_ulong addr, DATA_TYPE val, if (DATA_SIZE > 1 && unlikely((addr & ~TARGET_PAGE_MASK) + DATA_SIZE - 1 >= TARGET_PAGE_SIZE)) { - int i, index2; - target_ulong page2, tlb_addr2; + int i; + target_ulong page2; + CPUTLBEntry *entry2; do_unaligned_access: /* Ensure the second page is in the TLB. Note that the first page is already guaranteed to be filled, and that the second page cannot evict the first. */ page2 = (addr + DATA_SIZE) & TARGET_PAGE_MASK; - index2 = (page2 >> TARGET_PAGE_BITS) & (CPU_TLB_SIZE - 1); - tlb_addr2 = atomic_read(&env->tlb_table[mmu_idx][index2].addr_write); - if (!tlb_hit_page(tlb_addr2, page2) + entry2 = tlb_entry(env, mmu_idx, page2); + if (!tlb_hit_page(atomic_read(&entry2->addr_write), page2) && !VICTIM_TLB_HIT(addr_write, page2)) { tlb_fill(ENV_GET_CPU(env), page2, DATA_SIZE, MMU_DATA_STORE, mmu_idx, retaddr); @@ -342,7 +343,7 @@ void helper_le_st_name(CPUArchState *env, target_ulong addr, DATA_TYPE val, return; } - haddr = addr + env->tlb_table[mmu_idx][index].addend; + haddr = addr + entry->addend; #if DATA_SIZE == 1 glue(glue(st, SUFFIX), _p)((uint8_t *)haddr, val); #else @@ -354,10 +355,10 @@ void helper_le_st_name(CPUArchState *env, target_ulong addr, DATA_TYPE val, void helper_be_st_name(CPUArchState *env, target_ulong addr, DATA_TYPE val, TCGMemOpIdx oi, uintptr_t retaddr) { - unsigned mmu_idx = get_mmuidx(oi); - int index = (addr >> TARGET_PAGE_BITS) & (CPU_TLB_SIZE - 1); - target_ulong tlb_addr = - atomic_read(&env->tlb_table[mmu_idx][index].addr_write); + uintptr_t mmu_idx = get_mmuidx(oi); + uintptr_t index = tlb_index(env, mmu_idx, addr); + CPUTLBEntry *entry = tlb_entry(env, mmu_idx, addr); + target_ulong tlb_addr = atomic_read(&entry->addr_write); unsigned a_bits = get_alignment_bits(get_memop(oi)); uintptr_t haddr; @@ -372,8 +373,7 @@ void helper_be_st_name(CPUArchState *env, target_ulong addr, DATA_TYPE val, tlb_fill(ENV_GET_CPU(env), addr, DATA_SIZE, MMU_DATA_STORE, mmu_idx, retaddr); } - tlb_addr = atomic_read(&env->tlb_table[mmu_idx][index].addr_write) & - ~TLB_INVALID_MASK; + tlb_addr = atomic_read(&entry->addr_write) & ~TLB_INVALID_MASK; } /* Handle an IO access. 
*/ @@ -385,8 +385,8 @@ void helper_be_st_name(CPUArchState *env, target_ulong addr, DATA_TYPE val, /* ??? Note that the io helpers always read data in the target byte ordering. We should push the LE/BE request down into io. */ val = TGT_BE(val); - glue(io_write, SUFFIX)(env, mmu_idx, index, val, addr, retaddr, - tlb_addr & TLB_RECHECK); + glue(io_write, SUFFIX)(env, mmu_idx, index, val, addr, + retaddr, tlb_addr & TLB_RECHECK); return; } @@ -394,16 +394,16 @@ void helper_be_st_name(CPUArchState *env, target_ulong addr, DATA_TYPE val, if (DATA_SIZE > 1 && unlikely((addr & ~TARGET_PAGE_MASK) + DATA_SIZE - 1 >= TARGET_PAGE_SIZE)) { - int i, index2; - target_ulong page2, tlb_addr2; + int i; + target_ulong page2; + CPUTLBEntry *entry2; do_unaligned_access: /* Ensure the second page is in the TLB. Note that the first page is already guaranteed to be filled, and that the second page cannot evict the first. */ page2 = (addr + DATA_SIZE) & TARGET_PAGE_MASK; - index2 = (page2 >> TARGET_PAGE_BITS) & (CPU_TLB_SIZE - 1); - tlb_addr2 = atomic_read(&env->tlb_table[mmu_idx][index2].addr_write); - if (!tlb_hit_page(tlb_addr2, page2) + entry2 = tlb_entry(env, mmu_idx, page2); + if (!tlb_hit_page(atomic_read(&entry2->addr_write), page2) && !VICTIM_TLB_HIT(addr_write, page2)) { tlb_fill(ENV_GET_CPU(env), page2, DATA_SIZE, MMU_DATA_STORE, mmu_idx, retaddr); @@ -421,7 +421,7 @@ void helper_be_st_name(CPUArchState *env, target_ulong addr, DATA_TYPE val, return; } - haddr = addr + env->tlb_table[mmu_idx][index].addend; + haddr = addr + entry->addend; glue(glue(st, SUFFIX), _be_p)((uint8_t *)haddr, val); } #endif /* DATA_SIZE > 1 */ diff --git a/include/exec/cpu_ldst.h b/include/exec/cpu_ldst.h index 9581587ce1..e3d8d738aa 100644 --- a/include/exec/cpu_ldst.h +++ b/include/exec/cpu_ldst.h @@ -126,6 +126,20 @@ extern __thread uintptr_t helper_retaddr; /* The memory helpers for tcg-generated code need tcg_target_long etc. */ #include "tcg.h" +/* Find the TLB index corresponding to the mmu_idx + address pair. */ +static inline uintptr_t tlb_index(CPUArchState *env, uintptr_t mmu_idx, + target_ulong addr) +{ + return (addr >> TARGET_PAGE_BITS) & (CPU_TLB_SIZE - 1); +} + +/* Find the TLB entry corresponding to the mmu_idx + address pair. 
*/ +static inline CPUTLBEntry *tlb_entry(CPUArchState *env, uintptr_t mmu_idx, + target_ulong addr) +{ + return &env->tlb_table[mmu_idx][tlb_index(env, mmu_idx, addr)]; +} + #ifdef MMU_MODE0_SUFFIX #define CPU_MMU_INDEX 0 #define MEMSUFFIX MMU_MODE0_SUFFIX @@ -416,8 +430,7 @@ static inline void *tlb_vaddr_to_host(CPUArchState *env, abi_ptr addr, #if defined(CONFIG_USER_ONLY) return g2h(addr); #else - int index = (addr >> TARGET_PAGE_BITS) & (CPU_TLB_SIZE - 1); - CPUTLBEntry *tlbentry = &env->tlb_table[mmu_idx][index]; + CPUTLBEntry *tlbentry = tlb_entry(env, mmu_idx, addr); abi_ptr tlb_addr; uintptr_t haddr; @@ -445,7 +458,7 @@ static inline void *tlb_vaddr_to_host(CPUArchState *env, abi_ptr addr, return NULL; } - haddr = addr + env->tlb_table[mmu_idx][index].addend; + haddr = addr + tlbentry->addend; return (void *)haddr; #endif /* defined(CONFIG_USER_ONLY) */ } diff --git a/include/exec/cpu_ldst_template.h b/include/exec/cpu_ldst_template.h index ba7a11123c..924713eeed 100644 --- a/include/exec/cpu_ldst_template.h +++ b/include/exec/cpu_ldst_template.h @@ -81,7 +81,7 @@ glue(glue(glue(cpu_ld, USUFFIX), MEMSUFFIX), _ra)(CPUArchState *env, target_ulong ptr, uintptr_t retaddr) { - int page_index; + CPUTLBEntry *entry; RES_TYPE res; target_ulong addr; int mmu_idx; @@ -94,15 +94,15 @@ glue(glue(glue(cpu_ld, USUFFIX), MEMSUFFIX), _ra)(CPUArchState *env, #endif addr = ptr; - page_index = (addr >> TARGET_PAGE_BITS) & (CPU_TLB_SIZE - 1); mmu_idx = CPU_MMU_INDEX; - if (unlikely(env->tlb_table[mmu_idx][page_index].ADDR_READ != + entry = tlb_entry(env, mmu_idx, addr); + if (unlikely(entry->ADDR_READ != (addr & (TARGET_PAGE_MASK | (DATA_SIZE - 1))))) { oi = make_memop_idx(SHIFT, mmu_idx); res = glue(glue(helper_ret_ld, URETSUFFIX), MMUSUFFIX)(env, addr, oi, retaddr); } else { - uintptr_t hostaddr = addr + env->tlb_table[mmu_idx][page_index].addend; + uintptr_t hostaddr = addr + entry->addend; res = glue(glue(ld, USUFFIX), _p)((uint8_t *)hostaddr); } return res; @@ -120,7 +120,8 @@ glue(glue(glue(cpu_lds, SUFFIX), MEMSUFFIX), _ra)(CPUArchState *env, target_ulong ptr, uintptr_t retaddr) { - int res, page_index; + CPUTLBEntry *entry; + int res; target_ulong addr; int mmu_idx; TCGMemOpIdx oi; @@ -132,15 +133,15 @@ glue(glue(glue(cpu_lds, SUFFIX), MEMSUFFIX), _ra)(CPUArchState *env, #endif addr = ptr; - page_index = (addr >> TARGET_PAGE_BITS) & (CPU_TLB_SIZE - 1); mmu_idx = CPU_MMU_INDEX; - if (unlikely(env->tlb_table[mmu_idx][page_index].ADDR_READ != + entry = tlb_entry(env, mmu_idx, addr); + if (unlikely(entry->ADDR_READ != (addr & (TARGET_PAGE_MASK | (DATA_SIZE - 1))))) { oi = make_memop_idx(SHIFT, mmu_idx); res = (DATA_STYPE)glue(glue(helper_ret_ld, SRETSUFFIX), MMUSUFFIX)(env, addr, oi, retaddr); } else { - uintptr_t hostaddr = addr + env->tlb_table[mmu_idx][page_index].addend; + uintptr_t hostaddr = addr + entry->addend; res = glue(glue(lds, SUFFIX), _p)((uint8_t *)hostaddr); } return res; @@ -162,7 +163,7 @@ glue(glue(glue(cpu_st, SUFFIX), MEMSUFFIX), _ra)(CPUArchState *env, target_ulong ptr, RES_TYPE v, uintptr_t retaddr) { - int page_index; + CPUTLBEntry *entry; target_ulong addr; int mmu_idx; TCGMemOpIdx oi; @@ -174,15 +175,15 @@ glue(glue(glue(cpu_st, SUFFIX), MEMSUFFIX), _ra)(CPUArchState *env, #endif addr = ptr; - page_index = (addr >> TARGET_PAGE_BITS) & (CPU_TLB_SIZE - 1); mmu_idx = CPU_MMU_INDEX; - if (unlikely(atomic_read(&env->tlb_table[mmu_idx][page_index].addr_write) != + entry = tlb_entry(env, mmu_idx, addr); + if (unlikely(atomic_read(&entry->addr_write) != (addr & (TARGET_PAGE_MASK 
| (DATA_SIZE - 1))))) { oi = make_memop_idx(SHIFT, mmu_idx); glue(glue(helper_ret_st, SUFFIX), MMUSUFFIX)(env, addr, v, oi, retaddr); } else { - uintptr_t hostaddr = addr + env->tlb_table[mmu_idx][page_index].addend; + uintptr_t hostaddr = addr + entry->addend; glue(glue(st, SUFFIX), _p)((uint8_t *)hostaddr, v); } } diff --git a/accel/tcg/cputlb.c b/accel/tcg/cputlb.c index 200444142a..a5972773de 100644 --- a/accel/tcg/cputlb.c +++ b/accel/tcg/cputlb.c @@ -286,7 +286,6 @@ static void tlb_flush_page_async_work(CPUState *cpu, run_on_cpu_data data) { CPUArchState *env = cpu->env_ptr; target_ulong addr = (target_ulong) data.target_ptr; - int i; int mmu_idx; assert_cpu_is_self(cpu); @@ -304,10 +303,9 @@ static void tlb_flush_page_async_work(CPUState *cpu, run_on_cpu_data data) } addr &= TARGET_PAGE_MASK; - i = (addr >> TARGET_PAGE_BITS) & (CPU_TLB_SIZE - 1); qemu_spin_lock(&env->tlb_lock); for (mmu_idx = 0; mmu_idx < NB_MMU_MODES; mmu_idx++) { - tlb_flush_entry_locked(&env->tlb_table[mmu_idx][i], addr); + tlb_flush_entry_locked(tlb_entry(env, mmu_idx, addr), addr); tlb_flush_vtlb_page_locked(env, mmu_idx, addr); } qemu_spin_unlock(&env->tlb_lock); @@ -339,18 +337,17 @@ static void tlb_flush_page_by_mmuidx_async_work(CPUState *cpu, target_ulong addr_and_mmuidx = (target_ulong) data.target_ptr; target_ulong addr = addr_and_mmuidx & TARGET_PAGE_MASK; unsigned long mmu_idx_bitmap = addr_and_mmuidx & ALL_MMUIDX_BITS; - int page = (addr >> TARGET_PAGE_BITS) & (CPU_TLB_SIZE - 1); int mmu_idx; assert_cpu_is_self(cpu); - tlb_debug("page:%d addr:"TARGET_FMT_lx" mmu_idx:0x%lx\n", - page, addr, mmu_idx_bitmap); + tlb_debug("flush page addr:"TARGET_FMT_lx" mmu_idx:0x%lx\n", + addr, mmu_idx_bitmap); qemu_spin_lock(&env->tlb_lock); for (mmu_idx = 0; mmu_idx < NB_MMU_MODES; mmu_idx++) { if (test_bit(mmu_idx, &mmu_idx_bitmap)) { - tlb_flush_entry_locked(&env->tlb_table[mmu_idx][page], addr); + tlb_flush_entry_locked(tlb_entry(env, mmu_idx, addr), addr); tlb_flush_vtlb_page_locked(env, mmu_idx, addr); } } @@ -554,16 +551,14 @@ static inline void tlb_set_dirty1_locked(CPUTLBEntry *tlb_entry, void tlb_set_dirty(CPUState *cpu, target_ulong vaddr) { CPUArchState *env = cpu->env_ptr; - int i; int mmu_idx; assert_cpu_is_self(cpu); vaddr &= TARGET_PAGE_MASK; - i = (vaddr >> TARGET_PAGE_BITS) & (CPU_TLB_SIZE - 1); qemu_spin_lock(&env->tlb_lock); for (mmu_idx = 0; mmu_idx < NB_MMU_MODES; mmu_idx++) { - tlb_set_dirty1_locked(&env->tlb_table[mmu_idx][i], vaddr); + tlb_set_dirty1_locked(tlb_entry(env, mmu_idx, vaddr), vaddr); } for (mmu_idx = 0; mmu_idx < NB_MMU_MODES; mmu_idx++) { @@ -663,8 +658,8 @@ void tlb_set_page_with_attrs(CPUState *cpu, target_ulong vaddr, iotlb = memory_region_section_get_iotlb(cpu, section, vaddr_page, paddr_page, xlat, prot, &address); - index = (vaddr_page >> TARGET_PAGE_BITS) & (CPU_TLB_SIZE - 1); - te = &env->tlb_table[mmu_idx][index]; + index = tlb_index(env, mmu_idx, vaddr_page); + te = tlb_entry(env, mmu_idx, vaddr_page); /* * Hold the TLB lock for the rest of the function. We could acquire/release @@ -786,16 +781,16 @@ static uint64_t io_readx(CPUArchState *env, CPUIOTLBEntry *iotlbentry, * repeat the MMU check here. This tlb_fill() call might * longjump out if this access should cause a guest exception. 
*/ - int index; + CPUTLBEntry *entry; target_ulong tlb_addr; tlb_fill(cpu, addr, size, MMU_DATA_LOAD, mmu_idx, retaddr); - index = (addr >> TARGET_PAGE_BITS) & (CPU_TLB_SIZE - 1); - tlb_addr = env->tlb_table[mmu_idx][index].addr_read; + entry = tlb_entry(env, mmu_idx, addr); + tlb_addr = entry->addr_read; if (!(tlb_addr & ~(TARGET_PAGE_MASK | TLB_RECHECK))) { /* RAM access */ - uintptr_t haddr = addr + env->tlb_table[mmu_idx][index].addend; + uintptr_t haddr = addr + entry->addend; return ldn_p((void *)haddr, size); } @@ -853,16 +848,16 @@ static void io_writex(CPUArchState *env, CPUIOTLBEntry *iotlbentry, * repeat the MMU check here. This tlb_fill() call might * longjump out if this access should cause a guest exception. */ - int index; + CPUTLBEntry *entry; target_ulong tlb_addr; tlb_fill(cpu, addr, size, MMU_DATA_STORE, mmu_idx, retaddr); - index = (addr >> TARGET_PAGE_BITS) & (CPU_TLB_SIZE - 1); - tlb_addr = atomic_read(&env->tlb_table[mmu_idx][index].addr_write); + entry = tlb_entry(env, mmu_idx, addr); + tlb_addr = atomic_read(&entry->addr_write); if (!(tlb_addr & ~(TARGET_PAGE_MASK | TLB_RECHECK))) { /* RAM access */ - uintptr_t haddr = addr + env->tlb_table[mmu_idx][index].addend; + uintptr_t haddr = addr + entry->addend; stn_p((void *)haddr, size, val); return; @@ -943,20 +938,19 @@ static bool victim_tlb_hit(CPUArchState *env, size_t mmu_idx, size_t index, */ tb_page_addr_t get_page_addr_code(CPUArchState *env, target_ulong addr) { - int mmu_idx, index; + uintptr_t mmu_idx = cpu_mmu_index(env, true); + uintptr_t index = tlb_index(env, mmu_idx, addr); + CPUTLBEntry *entry = tlb_entry(env, mmu_idx, addr); void *p; - index = (addr >> TARGET_PAGE_BITS) & (CPU_TLB_SIZE - 1); - mmu_idx = cpu_mmu_index(env, true); - if (unlikely(!tlb_hit(env->tlb_table[mmu_idx][index].addr_code, addr))) { + if (unlikely(!tlb_hit(entry->addr_code, addr))) { if (!VICTIM_TLB_HIT(addr_code, addr)) { tlb_fill(ENV_GET_CPU(env), addr, 0, MMU_INST_FETCH, mmu_idx, 0); } - assert(tlb_hit(env->tlb_table[mmu_idx][index].addr_code, addr)); + assert(tlb_hit(entry->addr_code, addr)); } - if (unlikely(env->tlb_table[mmu_idx][index].addr_code & - (TLB_RECHECK | TLB_MMIO))) { + if (unlikely(entry->addr_code & (TLB_RECHECK | TLB_MMIO))) { /* * Return -1 if we can't translate and execute from an entire * page of RAM here, which will cause us to execute by loading @@ -968,7 +962,7 @@ tb_page_addr_t get_page_addr_code(CPUArchState *env, target_ulong addr) return -1; } - p = (void *)((uintptr_t)addr + env->tlb_table[mmu_idx][index].addend); + p = (void *)((uintptr_t)addr + entry->addend); return qemu_ram_addr_from_host_nofail(p); } @@ -981,11 +975,10 @@ tb_page_addr_t get_page_addr_code(CPUArchState *env, target_ulong addr) void probe_write(CPUArchState *env, target_ulong addr, int size, int mmu_idx, uintptr_t retaddr) { - int index = (addr >> TARGET_PAGE_BITS) & (CPU_TLB_SIZE - 1); - target_ulong tlb_addr = - atomic_read(&env->tlb_table[mmu_idx][index].addr_write); + uintptr_t index = tlb_index(env, mmu_idx, addr); + CPUTLBEntry *entry = tlb_entry(env, mmu_idx, addr); - if (!tlb_hit(tlb_addr, addr)) { + if (!tlb_hit(atomic_read(&entry->addr_write), addr)) { /* TLB entry is for a different page */ if (!VICTIM_TLB_HIT(addr_write, addr)) { tlb_fill(ENV_GET_CPU(env), addr, size, MMU_DATA_STORE, @@ -1001,8 +994,8 @@ static void *atomic_mmu_lookup(CPUArchState *env, target_ulong addr, NotDirtyInfo *ndi) { size_t mmu_idx = get_mmuidx(oi); - size_t index = (addr >> TARGET_PAGE_BITS) & (CPU_TLB_SIZE - 1); - CPUTLBEntry *tlbe = 
&env->tlb_table[mmu_idx][index]; + uintptr_t index = tlb_index(env, mmu_idx, addr); + CPUTLBEntry *tlbe = tlb_entry(env, mmu_idx, addr); target_ulong tlb_addr = atomic_read(&tlbe->addr_write); TCGMemOp mop = get_memop(oi); int a_bits = get_alignment_bits(mop); -- 2.17.1 ^ permalink raw reply related [flat|nested] 18+ messages in thread
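A detail worth keeping in mind when reading the conversion above: the integer
index does not disappear completely, because the parallel
env->iotlb[mmu_idx][index] array is still addressed by it. That is why
tlb_set_page_with_attrs(), the io helpers and the victim-TLB eviction keep both
tlb_index() and tlb_entry(). A condensed fragment (not compilable on its own)
showing the pairing, with tv and vidx naming the victim-TLB slot as in
tlb_set_page_with_attrs():

    uintptr_t index = tlb_index(env, mmu_idx, vaddr_page);
    CPUTLBEntry *te = tlb_entry(env, mmu_idx, vaddr_page);

    /* te and env->iotlb[mmu_idx][index] describe the same guest page, so
     * when an entry is pushed out to the victim TLB both halves move
     * together.
     */
    copy_tlb_helper_locked(tv, te);
    env->iotlb_v[mmu_idx][vidx] = env->iotlb[mmu_idx][index];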
* Re: [Qemu-devel] [RFC v2 1/5] tcg: Add tlb_index and tlb_entry helpers 2018-10-08 23:27 ` [Qemu-devel] [RFC v2 1/5] tcg: Add tlb_index and tlb_entry helpers Emilio G. Cota @ 2018-10-09 14:43 ` Alex Bennée 0 siblings, 0 replies; 18+ messages in thread From: Alex Bennée @ 2018-10-09 14:43 UTC (permalink / raw) To: Emilio G. Cota; +Cc: qemu-devel, Richard Henderson Emilio G. Cota <cota@braap.org> writes: > From: Richard Henderson <richard.henderson@linaro.org> > > Isolate the computation of an index from an address into a > helper before we change that function. > > Signed-off-by: Richard Henderson <richard.henderson@linaro.org> > [ cota: convert tlb_vaddr_to_host; use atomic_read on addr_write ] > Signed-off-by: Emilio G. Cota <cota@braap.org> > --- > accel/tcg/softmmu_template.h | 72 ++++++++++++++++---------------- > include/exec/cpu_ldst.h | 19 +++++++-- > include/exec/cpu_ldst_template.h | 25 +++++------ > accel/tcg/cputlb.c | 61 ++++++++++++--------------- > 4 files changed, 92 insertions(+), 85 deletions(-) > > diff --git a/accel/tcg/softmmu_template.h b/accel/tcg/softmmu_template.h > index 1e50263871..aa34c692ee 100644 > --- a/accel/tcg/softmmu_template.h > +++ b/accel/tcg/softmmu_template.h > @@ -111,9 +111,10 @@ static inline DATA_TYPE glue(io_read, SUFFIX)(CPUArchState *env, > WORD_TYPE helper_le_ld_name(CPUArchState *env, target_ulong addr, > TCGMemOpIdx oi, uintptr_t retaddr) > { > - unsigned mmu_idx = get_mmuidx(oi); > - int index = (addr >> TARGET_PAGE_BITS) & (CPU_TLB_SIZE - 1); > - target_ulong tlb_addr = env->tlb_table[mmu_idx][index].ADDR_READ; > + uintptr_t mmu_idx = get_mmuidx(oi); > + uintptr_t index = tlb_index(env, mmu_idx, addr); > + CPUTLBEntry *entry = tlb_entry(env, mmu_idx, addr); > + target_ulong tlb_addr = entry->ADDR_READ; > unsigned a_bits = get_alignment_bits(get_memop(oi)); > uintptr_t haddr; > DATA_TYPE res; > @@ -129,7 +130,7 @@ WORD_TYPE helper_le_ld_name(CPUArchState *env, target_ulong addr, > tlb_fill(ENV_GET_CPU(env), addr, DATA_SIZE, READ_ACCESS_TYPE, > mmu_idx, retaddr); > } > - tlb_addr = env->tlb_table[mmu_idx][index].ADDR_READ; > + tlb_addr = entry->ADDR_READ; > } Hmm I can see this clashing with the de-macro clean-up > > /* Handle an IO access. */ > @@ -166,7 +167,7 @@ WORD_TYPE helper_le_ld_name(CPUArchState *env, target_ulong addr, > return res; > } > > - haddr = addr + env->tlb_table[mmu_idx][index].addend; > + haddr = addr + entry->addend; > #if DATA_SIZE == 1 > res = glue(glue(ld, LSUFFIX), _p)((uint8_t *)haddr); > #else > @@ -179,9 +180,10 @@ WORD_TYPE helper_le_ld_name(CPUArchState *env, target_ulong addr, > WORD_TYPE helper_be_ld_name(CPUArchState *env, target_ulong addr, > TCGMemOpIdx oi, uintptr_t retaddr) > { > - unsigned mmu_idx = get_mmuidx(oi); > - int index = (addr >> TARGET_PAGE_BITS) & (CPU_TLB_SIZE - 1); > - target_ulong tlb_addr = env->tlb_table[mmu_idx][index].ADDR_READ; > + uintptr_t mmu_idx = get_mmuidx(oi); > + uintptr_t index = tlb_index(env, mmu_idx, addr); > + CPUTLBEntry *entry = tlb_entry(env, mmu_idx, addr); > + target_ulong tlb_addr = entry->ADDR_READ; > unsigned a_bits = get_alignment_bits(get_memop(oi)); > uintptr_t haddr; > DATA_TYPE res; > @@ -197,7 +199,7 @@ WORD_TYPE helper_be_ld_name(CPUArchState *env, target_ulong addr, > tlb_fill(ENV_GET_CPU(env), addr, DATA_SIZE, READ_ACCESS_TYPE, > mmu_idx, retaddr); > } > - tlb_addr = env->tlb_table[mmu_idx][index].ADDR_READ; > + tlb_addr = entry->ADDR_READ; > } > > /* Handle an IO access. 
*/ > @@ -234,7 +236,7 @@ WORD_TYPE helper_be_ld_name(CPUArchState *env, target_ulong addr, > return res; > } > > - haddr = addr + env->tlb_table[mmu_idx][index].addend; > + haddr = addr + entry->addend; > res = glue(glue(ld, LSUFFIX), _be_p)((uint8_t *)haddr); > return res; > } > @@ -275,10 +277,10 @@ static inline void glue(io_write, SUFFIX)(CPUArchState *env, > void helper_le_st_name(CPUArchState *env, target_ulong addr, DATA_TYPE val, > TCGMemOpIdx oi, uintptr_t retaddr) > { > - unsigned mmu_idx = get_mmuidx(oi); > - int index = (addr >> TARGET_PAGE_BITS) & (CPU_TLB_SIZE - 1); > - target_ulong tlb_addr = > - atomic_read(&env->tlb_table[mmu_idx][index].addr_write); > + uintptr_t mmu_idx = get_mmuidx(oi); > + uintptr_t index = tlb_index(env, mmu_idx, addr); > + CPUTLBEntry *entry = tlb_entry(env, mmu_idx, addr); > + target_ulong tlb_addr = atomic_read(&entry->addr_write); > unsigned a_bits = get_alignment_bits(get_memop(oi)); > uintptr_t haddr; > > @@ -293,8 +295,7 @@ void helper_le_st_name(CPUArchState *env, target_ulong addr, DATA_TYPE val, > tlb_fill(ENV_GET_CPU(env), addr, DATA_SIZE, MMU_DATA_STORE, > mmu_idx, retaddr); > } > - tlb_addr = atomic_read(&env->tlb_table[mmu_idx][index].addr_write) & > - ~TLB_INVALID_MASK; > + tlb_addr = atomic_read(&entry->addr_write) & ~TLB_INVALID_MASK; > } > > /* Handle an IO access. */ > @@ -315,16 +316,16 @@ void helper_le_st_name(CPUArchState *env, target_ulong addr, DATA_TYPE val, > if (DATA_SIZE > 1 > && unlikely((addr & ~TARGET_PAGE_MASK) + DATA_SIZE - 1 > >= TARGET_PAGE_SIZE)) { > - int i, index2; > - target_ulong page2, tlb_addr2; > + int i; > + target_ulong page2; > + CPUTLBEntry *entry2; > do_unaligned_access: > /* Ensure the second page is in the TLB. Note that the first page > is already guaranteed to be filled, and that the second page > cannot evict the first. 
*/ > page2 = (addr + DATA_SIZE) & TARGET_PAGE_MASK; > - index2 = (page2 >> TARGET_PAGE_BITS) & (CPU_TLB_SIZE - 1); > - tlb_addr2 = atomic_read(&env->tlb_table[mmu_idx][index2].addr_write); > - if (!tlb_hit_page(tlb_addr2, page2) > + entry2 = tlb_entry(env, mmu_idx, page2); > + if (!tlb_hit_page(atomic_read(&entry2->addr_write), page2) > && !VICTIM_TLB_HIT(addr_write, page2)) { > tlb_fill(ENV_GET_CPU(env), page2, DATA_SIZE, MMU_DATA_STORE, > mmu_idx, retaddr); > @@ -342,7 +343,7 @@ void helper_le_st_name(CPUArchState *env, target_ulong addr, DATA_TYPE val, > return; > } > > - haddr = addr + env->tlb_table[mmu_idx][index].addend; > + haddr = addr + entry->addend; > #if DATA_SIZE == 1 > glue(glue(st, SUFFIX), _p)((uint8_t *)haddr, val); > #else > @@ -354,10 +355,10 @@ void helper_le_st_name(CPUArchState *env, target_ulong addr, DATA_TYPE val, > void helper_be_st_name(CPUArchState *env, target_ulong addr, DATA_TYPE val, > TCGMemOpIdx oi, uintptr_t retaddr) > { > - unsigned mmu_idx = get_mmuidx(oi); > - int index = (addr >> TARGET_PAGE_BITS) & (CPU_TLB_SIZE - 1); > - target_ulong tlb_addr = > - atomic_read(&env->tlb_table[mmu_idx][index].addr_write); > + uintptr_t mmu_idx = get_mmuidx(oi); > + uintptr_t index = tlb_index(env, mmu_idx, addr); > + CPUTLBEntry *entry = tlb_entry(env, mmu_idx, addr); > + target_ulong tlb_addr = atomic_read(&entry->addr_write); > unsigned a_bits = get_alignment_bits(get_memop(oi)); > uintptr_t haddr; > > @@ -372,8 +373,7 @@ void helper_be_st_name(CPUArchState *env, target_ulong addr, DATA_TYPE val, > tlb_fill(ENV_GET_CPU(env), addr, DATA_SIZE, MMU_DATA_STORE, > mmu_idx, retaddr); > } > - tlb_addr = atomic_read(&env->tlb_table[mmu_idx][index].addr_write) & > - ~TLB_INVALID_MASK; > + tlb_addr = atomic_read(&entry->addr_write) & ~TLB_INVALID_MASK; > } > > /* Handle an IO access. */ > @@ -385,8 +385,8 @@ void helper_be_st_name(CPUArchState *env, target_ulong addr, DATA_TYPE val, > /* ??? Note that the io helpers always read data in the target > byte ordering. We should push the LE/BE request down into io. */ > val = TGT_BE(val); > - glue(io_write, SUFFIX)(env, mmu_idx, index, val, addr, retaddr, > - tlb_addr & TLB_RECHECK); > + glue(io_write, SUFFIX)(env, mmu_idx, index, val, addr, > + retaddr, tlb_addr & TLB_RECHECK); This is a nop formatting change. > return; > } > > @@ -394,16 +394,16 @@ void helper_be_st_name(CPUArchState *env, target_ulong addr, DATA_TYPE val, > if (DATA_SIZE > 1 > && unlikely((addr & ~TARGET_PAGE_MASK) + DATA_SIZE - 1 > >= TARGET_PAGE_SIZE)) { > - int i, index2; > - target_ulong page2, tlb_addr2; > + int i; > + target_ulong page2; > + CPUTLBEntry *entry2; > do_unaligned_access: > /* Ensure the second page is in the TLB. Note that the first page > is already guaranteed to be filled, and that the second page > cannot evict the first. 
*/ > page2 = (addr + DATA_SIZE) & TARGET_PAGE_MASK; > - index2 = (page2 >> TARGET_PAGE_BITS) & (CPU_TLB_SIZE - 1); > - tlb_addr2 = atomic_read(&env->tlb_table[mmu_idx][index2].addr_write); > - if (!tlb_hit_page(tlb_addr2, page2) > + entry2 = tlb_entry(env, mmu_idx, page2); > + if (!tlb_hit_page(atomic_read(&entry2->addr_write), page2) > && !VICTIM_TLB_HIT(addr_write, page2)) { > tlb_fill(ENV_GET_CPU(env), page2, DATA_SIZE, MMU_DATA_STORE, > mmu_idx, retaddr); > @@ -421,7 +421,7 @@ void helper_be_st_name(CPUArchState *env, target_ulong addr, DATA_TYPE val, > return; > } > > - haddr = addr + env->tlb_table[mmu_idx][index].addend; > + haddr = addr + entry->addend; > glue(glue(st, SUFFIX), _be_p)((uint8_t *)haddr, val); > } > #endif /* DATA_SIZE > 1 */ > diff --git a/include/exec/cpu_ldst.h b/include/exec/cpu_ldst.h > index 9581587ce1..e3d8d738aa 100644 > --- a/include/exec/cpu_ldst.h > +++ b/include/exec/cpu_ldst.h > @@ -126,6 +126,20 @@ extern __thread uintptr_t helper_retaddr; > /* The memory helpers for tcg-generated code need tcg_target_long etc. */ > #include "tcg.h" > > +/* Find the TLB index corresponding to the mmu_idx + address pair. */ > +static inline uintptr_t tlb_index(CPUArchState *env, uintptr_t mmu_idx, > + target_ulong addr) > +{ > + return (addr >> TARGET_PAGE_BITS) & (CPU_TLB_SIZE - 1); > +} > + > +/* Find the TLB entry corresponding to the mmu_idx + address pair. */ > +static inline CPUTLBEntry *tlb_entry(CPUArchState *env, uintptr_t mmu_idx, > + target_ulong addr) > +{ > + return &env->tlb_table[mmu_idx][tlb_index(env, mmu_idx, addr)]; > +} > + > #ifdef MMU_MODE0_SUFFIX > #define CPU_MMU_INDEX 0 > #define MEMSUFFIX MMU_MODE0_SUFFIX > @@ -416,8 +430,7 @@ static inline void *tlb_vaddr_to_host(CPUArchState *env, abi_ptr addr, > #if defined(CONFIG_USER_ONLY) > return g2h(addr); > #else > - int index = (addr >> TARGET_PAGE_BITS) & (CPU_TLB_SIZE - 1); > - CPUTLBEntry *tlbentry = &env->tlb_table[mmu_idx][index]; > + CPUTLBEntry *tlbentry = tlb_entry(env, mmu_idx, addr); > abi_ptr tlb_addr; > uintptr_t haddr; > > @@ -445,7 +458,7 @@ static inline void *tlb_vaddr_to_host(CPUArchState *env, abi_ptr addr, > return NULL; > } > > - haddr = addr + env->tlb_table[mmu_idx][index].addend; > + haddr = addr + tlbentry->addend; > return (void *)haddr; > #endif /* defined(CONFIG_USER_ONLY) */ > } > diff --git a/include/exec/cpu_ldst_template.h b/include/exec/cpu_ldst_template.h > index ba7a11123c..924713eeed 100644 > --- a/include/exec/cpu_ldst_template.h > +++ b/include/exec/cpu_ldst_template.h > @@ -81,7 +81,7 @@ glue(glue(glue(cpu_ld, USUFFIX), MEMSUFFIX), _ra)(CPUArchState *env, > target_ulong ptr, > uintptr_t retaddr) > { > - int page_index; > + CPUTLBEntry *entry; > RES_TYPE res; > target_ulong addr; > int mmu_idx; > @@ -94,15 +94,15 @@ glue(glue(glue(cpu_ld, USUFFIX), MEMSUFFIX), _ra)(CPUArchState *env, > #endif > > addr = ptr; > - page_index = (addr >> TARGET_PAGE_BITS) & (CPU_TLB_SIZE - 1); > mmu_idx = CPU_MMU_INDEX; > - if (unlikely(env->tlb_table[mmu_idx][page_index].ADDR_READ != > + entry = tlb_entry(env, mmu_idx, addr); > + if (unlikely(entry->ADDR_READ != > (addr & (TARGET_PAGE_MASK | (DATA_SIZE - 1))))) { > oi = make_memop_idx(SHIFT, mmu_idx); > res = glue(glue(helper_ret_ld, URETSUFFIX), MMUSUFFIX)(env, addr, > oi, retaddr); > } else { > - uintptr_t hostaddr = addr + env->tlb_table[mmu_idx][page_index].addend; > + uintptr_t hostaddr = addr + entry->addend; > res = glue(glue(ld, USUFFIX), _p)((uint8_t *)hostaddr); > } > return res; > @@ -120,7 +120,8 @@ glue(glue(glue(cpu_lds, 
SUFFIX), MEMSUFFIX), _ra)(CPUArchState *env, > target_ulong ptr, > uintptr_t retaddr) > { > - int res, page_index; > + CPUTLBEntry *entry; > + int res; > target_ulong addr; > int mmu_idx; > TCGMemOpIdx oi; > @@ -132,15 +133,15 @@ glue(glue(glue(cpu_lds, SUFFIX), MEMSUFFIX), _ra)(CPUArchState *env, > #endif > > addr = ptr; > - page_index = (addr >> TARGET_PAGE_BITS) & (CPU_TLB_SIZE - 1); > mmu_idx = CPU_MMU_INDEX; > - if (unlikely(env->tlb_table[mmu_idx][page_index].ADDR_READ != > + entry = tlb_entry(env, mmu_idx, addr); > + if (unlikely(entry->ADDR_READ != > (addr & (TARGET_PAGE_MASK | (DATA_SIZE - 1))))) { > oi = make_memop_idx(SHIFT, mmu_idx); > res = (DATA_STYPE)glue(glue(helper_ret_ld, SRETSUFFIX), > MMUSUFFIX)(env, addr, oi, retaddr); > } else { > - uintptr_t hostaddr = addr + env->tlb_table[mmu_idx][page_index].addend; > + uintptr_t hostaddr = addr + entry->addend; > res = glue(glue(lds, SUFFIX), _p)((uint8_t *)hostaddr); > } > return res; > @@ -162,7 +163,7 @@ glue(glue(glue(cpu_st, SUFFIX), MEMSUFFIX), _ra)(CPUArchState *env, > target_ulong ptr, > RES_TYPE v, uintptr_t retaddr) > { > - int page_index; > + CPUTLBEntry *entry; > target_ulong addr; > int mmu_idx; > TCGMemOpIdx oi; > @@ -174,15 +175,15 @@ glue(glue(glue(cpu_st, SUFFIX), MEMSUFFIX), _ra)(CPUArchState *env, > #endif > > addr = ptr; > - page_index = (addr >> TARGET_PAGE_BITS) & (CPU_TLB_SIZE - 1); > mmu_idx = CPU_MMU_INDEX; > - if (unlikely(atomic_read(&env->tlb_table[mmu_idx][page_index].addr_write) != > + entry = tlb_entry(env, mmu_idx, addr); > + if (unlikely(atomic_read(&entry->addr_write) != > (addr & (TARGET_PAGE_MASK | (DATA_SIZE - 1))))) { > oi = make_memop_idx(SHIFT, mmu_idx); > glue(glue(helper_ret_st, SUFFIX), MMUSUFFIX)(env, addr, v, oi, > retaddr); > } else { > - uintptr_t hostaddr = addr + env->tlb_table[mmu_idx][page_index].addend; > + uintptr_t hostaddr = addr + entry->addend; > glue(glue(st, SUFFIX), _p)((uint8_t *)hostaddr, v); > } > } > diff --git a/accel/tcg/cputlb.c b/accel/tcg/cputlb.c > index 200444142a..a5972773de 100644 > --- a/accel/tcg/cputlb.c > +++ b/accel/tcg/cputlb.c > @@ -286,7 +286,6 @@ static void tlb_flush_page_async_work(CPUState *cpu, run_on_cpu_data data) > { > CPUArchState *env = cpu->env_ptr; > target_ulong addr = (target_ulong) data.target_ptr; > - int i; > int mmu_idx; > > assert_cpu_is_self(cpu); > @@ -304,10 +303,9 @@ static void tlb_flush_page_async_work(CPUState *cpu, run_on_cpu_data data) > } > > addr &= TARGET_PAGE_MASK; > - i = (addr >> TARGET_PAGE_BITS) & (CPU_TLB_SIZE - 1); > qemu_spin_lock(&env->tlb_lock); > for (mmu_idx = 0; mmu_idx < NB_MMU_MODES; mmu_idx++) { > - tlb_flush_entry_locked(&env->tlb_table[mmu_idx][i], addr); > + tlb_flush_entry_locked(tlb_entry(env, mmu_idx, addr), addr); > tlb_flush_vtlb_page_locked(env, mmu_idx, addr); > } > qemu_spin_unlock(&env->tlb_lock); > @@ -339,18 +337,17 @@ static void tlb_flush_page_by_mmuidx_async_work(CPUState *cpu, > target_ulong addr_and_mmuidx = (target_ulong) data.target_ptr; > target_ulong addr = addr_and_mmuidx & TARGET_PAGE_MASK; > unsigned long mmu_idx_bitmap = addr_and_mmuidx & ALL_MMUIDX_BITS; > - int page = (addr >> TARGET_PAGE_BITS) & (CPU_TLB_SIZE - 1); > int mmu_idx; > > assert_cpu_is_self(cpu); > > - tlb_debug("page:%d addr:"TARGET_FMT_lx" mmu_idx:0x%lx\n", > - page, addr, mmu_idx_bitmap); > + tlb_debug("flush page addr:"TARGET_FMT_lx" mmu_idx:0x%lx\n", > + addr, mmu_idx_bitmap); > > qemu_spin_lock(&env->tlb_lock); > for (mmu_idx = 0; mmu_idx < NB_MMU_MODES; mmu_idx++) { > if (test_bit(mmu_idx, 
&mmu_idx_bitmap)) { > - tlb_flush_entry_locked(&env->tlb_table[mmu_idx][page], addr); > + tlb_flush_entry_locked(tlb_entry(env, mmu_idx, addr), addr); > tlb_flush_vtlb_page_locked(env, mmu_idx, addr); > } > } > @@ -554,16 +551,14 @@ static inline void tlb_set_dirty1_locked(CPUTLBEntry *tlb_entry, > void tlb_set_dirty(CPUState *cpu, target_ulong vaddr) > { > CPUArchState *env = cpu->env_ptr; > - int i; > int mmu_idx; > > assert_cpu_is_self(cpu); > > vaddr &= TARGET_PAGE_MASK; > - i = (vaddr >> TARGET_PAGE_BITS) & (CPU_TLB_SIZE - 1); > qemu_spin_lock(&env->tlb_lock); > for (mmu_idx = 0; mmu_idx < NB_MMU_MODES; mmu_idx++) { > - tlb_set_dirty1_locked(&env->tlb_table[mmu_idx][i], vaddr); > + tlb_set_dirty1_locked(tlb_entry(env, mmu_idx, vaddr), vaddr); > } > > for (mmu_idx = 0; mmu_idx < NB_MMU_MODES; mmu_idx++) { > @@ -663,8 +658,8 @@ void tlb_set_page_with_attrs(CPUState *cpu, target_ulong vaddr, > iotlb = memory_region_section_get_iotlb(cpu, section, vaddr_page, > paddr_page, xlat, prot, &address); > > - index = (vaddr_page >> TARGET_PAGE_BITS) & (CPU_TLB_SIZE - 1); > - te = &env->tlb_table[mmu_idx][index]; > + index = tlb_index(env, mmu_idx, vaddr_page); > + te = tlb_entry(env, mmu_idx, vaddr_page); > > /* > * Hold the TLB lock for the rest of the function. We could acquire/release > @@ -786,16 +781,16 @@ static uint64_t io_readx(CPUArchState *env, CPUIOTLBEntry *iotlbentry, > * repeat the MMU check here. This tlb_fill() call might > * longjump out if this access should cause a guest exception. > */ > - int index; > + CPUTLBEntry *entry; > target_ulong tlb_addr; > > tlb_fill(cpu, addr, size, MMU_DATA_LOAD, mmu_idx, retaddr); > > - index = (addr >> TARGET_PAGE_BITS) & (CPU_TLB_SIZE - 1); > - tlb_addr = env->tlb_table[mmu_idx][index].addr_read; > + entry = tlb_entry(env, mmu_idx, addr); > + tlb_addr = entry->addr_read; > if (!(tlb_addr & ~(TARGET_PAGE_MASK | TLB_RECHECK))) { > /* RAM access */ > - uintptr_t haddr = addr + env->tlb_table[mmu_idx][index].addend; > + uintptr_t haddr = addr + entry->addend; > > return ldn_p((void *)haddr, size); > } > @@ -853,16 +848,16 @@ static void io_writex(CPUArchState *env, CPUIOTLBEntry *iotlbentry, > * repeat the MMU check here. This tlb_fill() call might > * longjump out if this access should cause a guest exception. 
> */ > - int index; > + CPUTLBEntry *entry; > target_ulong tlb_addr; > > tlb_fill(cpu, addr, size, MMU_DATA_STORE, mmu_idx, retaddr); > > - index = (addr >> TARGET_PAGE_BITS) & (CPU_TLB_SIZE - 1); > - tlb_addr = atomic_read(&env->tlb_table[mmu_idx][index].addr_write); > + entry = tlb_entry(env, mmu_idx, addr); > + tlb_addr = atomic_read(&entry->addr_write); > if (!(tlb_addr & ~(TARGET_PAGE_MASK | TLB_RECHECK))) { > /* RAM access */ > - uintptr_t haddr = addr + env->tlb_table[mmu_idx][index].addend; > + uintptr_t haddr = addr + entry->addend; > > stn_p((void *)haddr, size, val); > return; > @@ -943,20 +938,19 @@ static bool victim_tlb_hit(CPUArchState *env, size_t mmu_idx, size_t index, > */ > tb_page_addr_t get_page_addr_code(CPUArchState *env, target_ulong addr) > { > - int mmu_idx, index; > + uintptr_t mmu_idx = cpu_mmu_index(env, true); > + uintptr_t index = tlb_index(env, mmu_idx, addr); > + CPUTLBEntry *entry = tlb_entry(env, mmu_idx, addr); > void *p; > > - index = (addr >> TARGET_PAGE_BITS) & (CPU_TLB_SIZE - 1); > - mmu_idx = cpu_mmu_index(env, true); > - if (unlikely(!tlb_hit(env->tlb_table[mmu_idx][index].addr_code, addr))) { > + if (unlikely(!tlb_hit(entry->addr_code, addr))) { > if (!VICTIM_TLB_HIT(addr_code, addr)) { > tlb_fill(ENV_GET_CPU(env), addr, 0, MMU_INST_FETCH, mmu_idx, 0); > } > - assert(tlb_hit(env->tlb_table[mmu_idx][index].addr_code, addr)); > + assert(tlb_hit(entry->addr_code, addr)); > } > > - if (unlikely(env->tlb_table[mmu_idx][index].addr_code & > - (TLB_RECHECK | TLB_MMIO))) { > + if (unlikely(entry->addr_code & (TLB_RECHECK | TLB_MMIO))) { > /* > * Return -1 if we can't translate and execute from an entire > * page of RAM here, which will cause us to execute by loading > @@ -968,7 +962,7 @@ tb_page_addr_t get_page_addr_code(CPUArchState *env, target_ulong addr) > return -1; > } > > - p = (void *)((uintptr_t)addr + env->tlb_table[mmu_idx][index].addend); > + p = (void *)((uintptr_t)addr + entry->addend); > return qemu_ram_addr_from_host_nofail(p); > } > > @@ -981,11 +975,10 @@ tb_page_addr_t get_page_addr_code(CPUArchState *env, target_ulong addr) > void probe_write(CPUArchState *env, target_ulong addr, int size, int mmu_idx, > uintptr_t retaddr) > { > - int index = (addr >> TARGET_PAGE_BITS) & (CPU_TLB_SIZE - 1); > - target_ulong tlb_addr = > - atomic_read(&env->tlb_table[mmu_idx][index].addr_write); > + uintptr_t index = tlb_index(env, mmu_idx, addr); > + CPUTLBEntry *entry = tlb_entry(env, mmu_idx, addr); > > - if (!tlb_hit(tlb_addr, addr)) { > + if (!tlb_hit(atomic_read(&entry->addr_write), addr)) { > /* TLB entry is for a different page */ > if (!VICTIM_TLB_HIT(addr_write, addr)) { > tlb_fill(ENV_GET_CPU(env), addr, size, MMU_DATA_STORE, > @@ -1001,8 +994,8 @@ static void *atomic_mmu_lookup(CPUArchState *env, target_ulong addr, > NotDirtyInfo *ndi) > { > size_t mmu_idx = get_mmuidx(oi); > - size_t index = (addr >> TARGET_PAGE_BITS) & (CPU_TLB_SIZE - 1); > - CPUTLBEntry *tlbe = &env->tlb_table[mmu_idx][index]; > + uintptr_t index = tlb_index(env, mmu_idx, addr); > + CPUTLBEntry *tlbe = tlb_entry(env, mmu_idx, addr); > target_ulong tlb_addr = atomic_read(&tlbe->addr_write); > TCGMemOp mop = get_memop(oi); > int a_bits = get_alignment_bits(mop); Reviewed-by: Alex Bennée <alex.bennee@linaro.org> -- Alex Bennée ^ permalink raw reply [flat|nested] 18+ messages in thread
* [Qemu-devel] [RFC v2 2/5] (XXX) cputlb: introduce indirection for TLB size 2018-10-08 23:27 [Qemu-devel] [RFC v2 0/5] Dynamic TLB sizing Emilio G. Cota 2018-10-08 23:27 ` [Qemu-devel] [RFC v2 1/5] tcg: Add tlb_index and tlb_entry helpers Emilio G. Cota @ 2018-10-08 23:27 ` Emilio G. Cota 2018-10-08 23:27 ` [Qemu-devel] [RFC v2 3/5] cputlb: do not evict empty entries to the vtlb Emilio G. Cota ` (3 subsequent siblings) 5 siblings, 0 replies; 18+ messages in thread From: Emilio G. Cota @ 2018-10-08 23:27 UTC (permalink / raw) To: qemu-devel; +Cc: Richard Henderson, Alex Bennée This paves the way for implementing dynamic TLB resizing. XXX: convert other TCG backends Signed-off-by: Emilio G. Cota <cota@braap.org> --- include/exec/cpu-defs.h | 10 ++++++---- include/exec/cpu_ldst.h | 14 +++++++++++++- accel/tcg/cputlb.c | 18 +++++++++++++++--- tcg/i386/tcg-target.inc.c | 28 ++++++++++++++-------------- 4 files changed, 48 insertions(+), 22 deletions(-) diff --git a/include/exec/cpu-defs.h b/include/exec/cpu-defs.h index 4ff62f32bf..87cd015f60 100644 --- a/include/exec/cpu-defs.h +++ b/include/exec/cpu-defs.h @@ -141,13 +141,15 @@ typedef struct CPUIOTLBEntry { MemTxAttrs attrs; } CPUIOTLBEntry; -#define CPU_COMMON_TLB \ +#define CPU_COMMON_TLB \ /* The meaning of the MMU modes is defined in the target code. */ \ - /* tlb_lock serializes updates to tlb_table and tlb_v_table */ \ + /* tlb_lock serializes updates to tlb_mask, tlb_table and tlb_v_table */ \ QemuSpin tlb_lock; \ - CPUTLBEntry tlb_table[NB_MMU_MODES][CPU_TLB_SIZE]; \ + /* tlb_mask[i] contains (n_entries - 1) << CPU_TLB_ENTRY_BITS */ \ + uintptr_t tlb_mask[NB_MMU_MODES]; \ + CPUTLBEntry *tlb_table[NB_MMU_MODES]; \ CPUTLBEntry tlb_v_table[NB_MMU_MODES][CPU_VTLB_SIZE]; \ - CPUIOTLBEntry iotlb[NB_MMU_MODES][CPU_TLB_SIZE]; \ + CPUIOTLBEntry *iotlb[NB_MMU_MODES]; \ CPUIOTLBEntry iotlb_v[NB_MMU_MODES][CPU_VTLB_SIZE]; \ size_t tlb_flush_count; \ target_ulong tlb_flush_addr; \ diff --git a/include/exec/cpu_ldst.h b/include/exec/cpu_ldst.h index e3d8d738aa..3ded1df9b7 100644 --- a/include/exec/cpu_ldst.h +++ b/include/exec/cpu_ldst.h @@ -130,7 +130,9 @@ extern __thread uintptr_t helper_retaddr; static inline uintptr_t tlb_index(CPUArchState *env, uintptr_t mmu_idx, target_ulong addr) { - return (addr >> TARGET_PAGE_BITS) & (CPU_TLB_SIZE - 1); + uintptr_t size_mask = env->tlb_mask[mmu_idx] >> CPU_TLB_ENTRY_BITS; + + return (addr >> TARGET_PAGE_BITS) & size_mask; } /* Find the TLB entry corresponding to the mmu_idx + address pair. 
*/ @@ -140,6 +142,16 @@ static inline CPUTLBEntry *tlb_entry(CPUArchState *env, uintptr_t mmu_idx, return &env->tlb_table[mmu_idx][tlb_index(env, mmu_idx, addr)]; } +static inline size_t sizeof_tlb(CPUArchState *env, uintptr_t mmu_idx) +{ + return env->tlb_mask[mmu_idx] + (1 << CPU_TLB_ENTRY_BITS); +} + +static inline size_t tlb_n_entries(CPUArchState *env, uintptr_t mmu_idx) +{ + return (env->tlb_mask[mmu_idx] >> CPU_TLB_ENTRY_BITS) + 1; +} + #ifdef MMU_MODE0_SUFFIX #define CPU_MMU_INDEX 0 #define MEMSUFFIX MMU_MODE0_SUFFIX diff --git a/accel/tcg/cputlb.c b/accel/tcg/cputlb.c index a5972773de..80406f1033 100644 --- a/accel/tcg/cputlb.c +++ b/accel/tcg/cputlb.c @@ -76,8 +76,16 @@ QEMU_BUILD_BUG_ON(NB_MMU_MODES > 16); void tlb_init(CPUState *cpu) { CPUArchState *env = cpu->env_ptr; + int i; qemu_spin_init(&env->tlb_lock); + for (i = 0; i < NB_MMU_MODES; i++) { + size_t n_entries = CPU_TLB_SIZE; + + env->tlb_mask[i] = (n_entries - 1) << CPU_TLB_ENTRY_BITS; + env->tlb_table[i] = g_new(CPUTLBEntry, n_entries); + env->iotlb[i] = g_new0(CPUIOTLBEntry, n_entries); + } } /* flush_all_helper: run fn across all cpus @@ -120,6 +128,7 @@ size_t tlb_flush_count(void) static void tlb_flush_nocheck(CPUState *cpu) { CPUArchState *env = cpu->env_ptr; + int i; /* The QOM tests will trigger tlb_flushes without setting up TCG * so we bug out here in that case. @@ -139,7 +148,9 @@ static void tlb_flush_nocheck(CPUState *cpu) * that do not hold the lock are performed by the same owner thread. */ qemu_spin_lock(&env->tlb_lock); - memset(env->tlb_table, -1, sizeof(env->tlb_table)); + for (i = 0; i < NB_MMU_MODES; i++) { + memset(env->tlb_table[i], -1, sizeof_tlb(env, i)); + } memset(env->tlb_v_table, -1, sizeof(env->tlb_v_table)); qemu_spin_unlock(&env->tlb_lock); @@ -200,7 +211,7 @@ static void tlb_flush_by_mmuidx_async_work(CPUState *cpu, run_on_cpu_data data) if (test_bit(mmu_idx, &mmu_idx_bitmask)) { tlb_debug("%d\n", mmu_idx); - memset(env->tlb_table[mmu_idx], -1, sizeof(env->tlb_table[0])); + memset(env->tlb_table[mmu_idx], -1, sizeof_tlb(env, mmu_idx)); memset(env->tlb_v_table[mmu_idx], -1, sizeof(env->tlb_v_table[0])); } } @@ -523,8 +534,9 @@ void tlb_reset_dirty(CPUState *cpu, ram_addr_t start1, ram_addr_t length) qemu_spin_lock(&env->tlb_lock); for (mmu_idx = 0; mmu_idx < NB_MMU_MODES; mmu_idx++) { unsigned int i; + unsigned int n = tlb_n_entries(env, mmu_idx); - for (i = 0; i < CPU_TLB_SIZE; i++) { + for (i = 0; i < n; i++) { tlb_reset_dirty_range_locked(&env->tlb_table[mmu_idx][i], start1, length); } diff --git a/tcg/i386/tcg-target.inc.c b/tcg/i386/tcg-target.inc.c index 436195894b..33de43305a 100644 --- a/tcg/i386/tcg-target.inc.c +++ b/tcg/i386/tcg-target.inc.c @@ -330,6 +330,7 @@ static inline int tcg_target_const_match(tcg_target_long val, TCGType type, #define OPC_ARITH_GvEv (0x03) /* ... 
plus (ARITH_FOO << 3) */ #define OPC_ANDN (0xf2 | P_EXT38) #define OPC_ADD_GvEv (OPC_ARITH_GvEv | (ARITH_ADD << 3)) +#define OPC_AND_GvEv (OPC_ARITH_GvEv | (ARITH_AND << 3)) #define OPC_BLENDPS (0x0c | P_EXT3A | P_DATA16) #define OPC_BSF (0xbc | P_EXT) #define OPC_BSR (0xbd | P_EXT) @@ -1625,7 +1626,7 @@ static inline void tcg_out_tlb_load(TCGContext *s, TCGReg addrlo, TCGReg addrhi, } if (TCG_TYPE_PTR == TCG_TYPE_I64) { hrexw = P_REXW; - if (TARGET_PAGE_BITS + CPU_TLB_BITS > 32) { + if (TARGET_PAGE_BITS > 32) { tlbtype = TCG_TYPE_I64; tlbrexw = P_REXW; } @@ -1633,6 +1634,15 @@ static inline void tcg_out_tlb_load(TCGContext *s, TCGReg addrlo, TCGReg addrhi, } tcg_out_mov(s, tlbtype, r0, addrlo); + tcg_out_shifti(s, SHIFT_SHR + tlbrexw, r0, + TARGET_PAGE_BITS - CPU_TLB_ENTRY_BITS); + + tcg_out_modrm_offset(s, OPC_AND_GvEv + hrexw, r0, TCG_AREG0, + offsetof(CPUArchState, tlb_mask[mem_index])); + + tcg_out_modrm_offset(s, OPC_ADD_GvEv + hrexw, r0, TCG_AREG0, + offsetof(CPUArchState, tlb_table[mem_index])); + /* If the required alignment is at least as large as the access, simply copy the address and mask. For lesser alignments, check that we don't cross pages for the complete access. */ @@ -1642,20 +1652,10 @@ static inline void tcg_out_tlb_load(TCGContext *s, TCGReg addrlo, TCGReg addrhi, tcg_out_modrm_offset(s, OPC_LEA + trexw, r1, addrlo, s_mask - a_mask); } tlb_mask = (target_ulong)TARGET_PAGE_MASK | a_mask; - - tcg_out_shifti(s, SHIFT_SHR + tlbrexw, r0, - TARGET_PAGE_BITS - CPU_TLB_ENTRY_BITS); - tgen_arithi(s, ARITH_AND + trexw, r1, tlb_mask, 0); - tgen_arithi(s, ARITH_AND + tlbrexw, r0, - (CPU_TLB_SIZE - 1) << CPU_TLB_ENTRY_BITS, 0); - - tcg_out_modrm_sib_offset(s, OPC_LEA + hrexw, r0, TCG_AREG0, r0, 0, - offsetof(CPUArchState, tlb_table[mem_index][0]) - + which); /* cmp 0(r0), r1 */ - tcg_out_modrm_offset(s, OPC_CMP_GvEv + trexw, r1, r0, 0); + tcg_out_modrm_offset(s, OPC_CMP_GvEv + trexw, r1, r0, which); /* Prepare for both the fast path add of the tlb addend, and the slow path function argument setup. There are two cases worth note: @@ -1672,7 +1672,7 @@ static inline void tcg_out_tlb_load(TCGContext *s, TCGReg addrlo, TCGReg addrhi, if (TARGET_LONG_BITS > TCG_TARGET_REG_BITS) { /* cmp 4(r0), addrhi */ - tcg_out_modrm_offset(s, OPC_CMP_GvEv, addrhi, r0, 4); + tcg_out_modrm_offset(s, OPC_CMP_GvEv, addrhi, r0, which + 4); /* jne slow_path */ tcg_out_opc(s, OPC_JCC_long + JCC_JNE, 0, 0, 0); @@ -1684,7 +1684,7 @@ static inline void tcg_out_tlb_load(TCGContext *s, TCGReg addrlo, TCGReg addrhi, /* add addend(r0), r1 */ tcg_out_modrm_offset(s, OPC_ADD_GvEv + hrexw, r1, r0, - offsetof(CPUTLBEntry, addend) - which); + offsetof(CPUTLBEntry, addend)); } /* -- 2.17.1 ^ permalink raw reply related [flat|nested] 18+ messages in thread
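A reading aid for the tcg/i386 hunk above (a host-C paraphrase, not QEMU
source): because the stored mask is pre-shifted by CPU_TLB_ENTRY_BITS, the
emitted fast path needs only a shift, an AND against tlb_mask[mem_index] and an
ADD of tlb_table[mem_index] to land on the CPUTLBEntry, instead of the old LEA
into the table embedded in CPUArchState. What the new sequence computes is
roughly:

    /* Shifting right by (TARGET_PAGE_BITS - CPU_TLB_ENTRY_BITS) leaves the
     * page index already scaled by sizeof(CPUTLBEntry), so ANDing with the
     * pre-shifted mask yields the byte offset of the entry directly.
     */
    uintptr_t off = (addr >> (TARGET_PAGE_BITS - CPU_TLB_ENTRY_BITS))
                    & env->tlb_mask[mem_index];
    CPUTLBEntry *entry = (CPUTLBEntry *)((uintptr_t)env->tlb_table[mem_index]
                                         + off);

    /* The tag comparison then happens at offset 'which' inside *entry, and
     * the host address is still formed by adding entry->addend.
     */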
* [Qemu-devel] [RFC v2 3/5] cputlb: do not evict empty entries to the vtlb 2018-10-08 23:27 [Qemu-devel] [RFC v2 0/5] Dynamic TLB sizing Emilio G. Cota 2018-10-08 23:27 ` [Qemu-devel] [RFC v2 1/5] tcg: Add tlb_index and tlb_entry helpers Emilio G. Cota 2018-10-08 23:27 ` [Qemu-devel] [RFC v2 2/5] (XXX) cputlb: introduce indirection for TLB size Emilio G. Cota @ 2018-10-08 23:27 ` Emilio G. Cota 2018-10-09 14:45 ` Alex Bennée 2018-10-09 14:49 ` Richard Henderson 2018-10-08 23:27 ` [Qemu-devel] [RFC v2 4/5] cputlb: track TLB use rate Emilio G. Cota ` (2 subsequent siblings) 5 siblings, 2 replies; 18+ messages in thread From: Emilio G. Cota @ 2018-10-08 23:27 UTC (permalink / raw) To: qemu-devel; +Cc: Richard Henderson, Alex Bennée Currently we evict an entry to the victim TLB when it doesn't match the current address. But it could be that there's no match because the current entry is empty (i.e. all -1's, for instance via tlb_flush). Do not evict the entry to the vtlb in that case. This change will help us keep track of the TLB's use rate. Signed-off-by: Emilio G. Cota <cota@braap.org> --- include/exec/cpu-all.h | 9 +++++++++ accel/tcg/cputlb.c | 2 +- 2 files changed, 10 insertions(+), 1 deletion(-) diff --git a/include/exec/cpu-all.h b/include/exec/cpu-all.h index 117d2fbbca..e21140049b 100644 --- a/include/exec/cpu-all.h +++ b/include/exec/cpu-all.h @@ -362,6 +362,15 @@ static inline bool tlb_hit(target_ulong tlb_addr, target_ulong addr) return tlb_hit_page(tlb_addr, addr & TARGET_PAGE_MASK); } +/** + * tlb_entry_is_empty - return true if the entry is not in use + * @te: pointer to CPUTLBEntry + */ +static inline bool tlb_entry_is_empty(const CPUTLBEntry *te) +{ + return te->addr_read == -1 && te->addr_write == -1 && te->addr_code == -1; +} + void dump_exec_info(FILE *f, fprintf_function cpu_fprintf); void dump_opcount_info(FILE *f, fprintf_function cpu_fprintf); #endif /* !CONFIG_USER_ONLY */ diff --git a/accel/tcg/cputlb.c b/accel/tcg/cputlb.c index 80406f1033..4dc47e603c 100644 --- a/accel/tcg/cputlb.c +++ b/accel/tcg/cputlb.c @@ -689,7 +689,7 @@ void tlb_set_page_with_attrs(CPUState *cpu, target_ulong vaddr, * Only evict the old entry to the victim tlb if it's for a * different page; otherwise just overwrite the stale data. */ - if (!tlb_hit_page_anyprot(te, vaddr_page)) { + if (!tlb_hit_page_anyprot(te, vaddr_page) && !tlb_entry_is_empty(te)) { unsigned vidx = env->vtlb_index++ % CPU_VTLB_SIZE; CPUTLBEntry *tv = &env->tlb_v_table[mmu_idx][vidx]; -- 2.17.1 ^ permalink raw reply related [flat|nested] 18+ messages in thread
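The -1 convention this check relies on is the one already used by the flush
path: tlb_flush and friends memset the tables to -1, so an entry that has not
been refilled since the last flush has addr_read, addr_write and addr_code all
equal to -1. A toy illustration (standalone, not taken from the patch):

    CPUTLBEntry te;

    memset(&te, -1, sizeof(te));       /* the state tlb_flush leaves behind */
    assert(tlb_entry_is_empty(&te));   /* nothing worth evicting to the vtlb */

    te.addr_write = 0x1000;            /* any refilled field changes that */
    assert(!tlb_entry_is_empty(&te));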
* Re: [Qemu-devel] [RFC v2 3/5] cputlb: do not evict empty entries to the vtlb 2018-10-08 23:27 ` [Qemu-devel] [RFC v2 3/5] cputlb: do not evict empty entries to the vtlb Emilio G. Cota @ 2018-10-09 14:45 ` Alex Bennée 2018-10-09 14:49 ` Richard Henderson 1 sibling, 0 replies; 18+ messages in thread From: Alex Bennée @ 2018-10-09 14:45 UTC (permalink / raw) To: Emilio G. Cota; +Cc: qemu-devel, Richard Henderson Emilio G. Cota <cota@braap.org> writes: > Currently we evict an entry to the victim TLB when it doesn't match > the current address. But it could be that there's no match because > the current entry is empty (i.e. all -1's, for instance via tlb_flush). > Do not evict the entry to the vtlb in that case. > > This change will help us keep track of the TLB's use rate. > > Signed-off-by: Emilio G. Cota <cota@braap.org> Reviewed-by: Alex Bennée <alex.bennee@linaro.org> > --- > include/exec/cpu-all.h | 9 +++++++++ > accel/tcg/cputlb.c | 2 +- > 2 files changed, 10 insertions(+), 1 deletion(-) > > diff --git a/include/exec/cpu-all.h b/include/exec/cpu-all.h > index 117d2fbbca..e21140049b 100644 > --- a/include/exec/cpu-all.h > +++ b/include/exec/cpu-all.h > @@ -362,6 +362,15 @@ static inline bool tlb_hit(target_ulong tlb_addr, target_ulong addr) > return tlb_hit_page(tlb_addr, addr & TARGET_PAGE_MASK); > } > > +/** > + * tlb_entry_is_empty - return true if the entry is not in use > + * @te: pointer to CPUTLBEntry > + */ > +static inline bool tlb_entry_is_empty(const CPUTLBEntry *te) > +{ > + return te->addr_read == -1 && te->addr_write == -1 && te->addr_code == -1; > +} > + > void dump_exec_info(FILE *f, fprintf_function cpu_fprintf); > void dump_opcount_info(FILE *f, fprintf_function cpu_fprintf); > #endif /* !CONFIG_USER_ONLY */ > diff --git a/accel/tcg/cputlb.c b/accel/tcg/cputlb.c > index 80406f1033..4dc47e603c 100644 > --- a/accel/tcg/cputlb.c > +++ b/accel/tcg/cputlb.c > @@ -689,7 +689,7 @@ void tlb_set_page_with_attrs(CPUState *cpu, target_ulong vaddr, > * Only evict the old entry to the victim tlb if it's for a > * different page; otherwise just overwrite the stale data. > */ > - if (!tlb_hit_page_anyprot(te, vaddr_page)) { > + if (!tlb_hit_page_anyprot(te, vaddr_page) && !tlb_entry_is_empty(te)) { > unsigned vidx = env->vtlb_index++ % CPU_VTLB_SIZE; > CPUTLBEntry *tv = &env->tlb_v_table[mmu_idx][vidx]; -- Alex Bennée ^ permalink raw reply [flat|nested] 18+ messages in thread
* Re: [Qemu-devel] [RFC v2 3/5] cputlb: do not evict empty entries to the vtlb 2018-10-08 23:27 ` [Qemu-devel] [RFC v2 3/5] cputlb: do not evict empty entries to the vtlb Emilio G. Cota 2018-10-09 14:45 ` Alex Bennée @ 2018-10-09 14:49 ` Richard Henderson 1 sibling, 0 replies; 18+ messages in thread From: Richard Henderson @ 2018-10-09 14:49 UTC (permalink / raw) To: Emilio G. Cota, qemu-devel; +Cc: Alex Bennée On 10/8/18 4:27 PM, Emilio G. Cota wrote: > Currently we evict an entry to the victim TLB when it doesn't match > the current address. But it could be that there's no match because > the current entry is empty (i.e. all -1's, for instance via tlb_flush). > Do not evict the entry to the vtlb in that case. > > This change will help us keep track of the TLB's use rate. > > Signed-off-by: Emilio G. Cota <cota@braap.org> > --- > include/exec/cpu-all.h | 9 +++++++++ > accel/tcg/cputlb.c | 2 +- > 2 files changed, 10 insertions(+), 1 deletion(-) Reviewed-by: Richard Henderson <richard.henderson@linaro.org> r~ ^ permalink raw reply [flat|nested] 18+ messages in thread
* [Qemu-devel] [RFC v2 4/5] cputlb: track TLB use rate 2018-10-08 23:27 [Qemu-devel] [RFC v2 0/5] Dynamic TLB sizing Emilio G. Cota ` (2 preceding siblings ...) 2018-10-08 23:27 ` [Qemu-devel] [RFC v2 3/5] cputlb: do not evict empty entries to the vtlb Emilio G. Cota @ 2018-10-08 23:27 ` Emilio G. Cota 2018-10-09 14:47 ` Alex Bennée 2018-10-08 23:27 ` [Qemu-devel] [RFC v2 5/5] cputlb: dynamically resize TLBs based on " Emilio G. Cota 2018-10-09 12:34 ` [Qemu-devel] [RFC v2 0/5] Dynamic TLB sizing Alex Bennée 5 siblings, 1 reply; 18+ messages in thread From: Emilio G. Cota @ 2018-10-08 23:27 UTC (permalink / raw) To: qemu-devel; +Cc: Richard Henderson, Alex Bennée This paves the way for implementing a dynamically-sized softmmu. Signed-off-by: Emilio G. Cota <cota@braap.org> --- include/exec/cpu-defs.h | 5 +++++ accel/tcg/cputlb.c | 17 ++++++++++++++--- 2 files changed, 19 insertions(+), 3 deletions(-) diff --git a/include/exec/cpu-defs.h b/include/exec/cpu-defs.h index 87cd015f60..56f1887c7f 100644 --- a/include/exec/cpu-defs.h +++ b/include/exec/cpu-defs.h @@ -141,10 +141,15 @@ typedef struct CPUIOTLBEntry { MemTxAttrs attrs; } CPUIOTLBEntry; +typedef struct CPUTLBDesc { + size_t n_used_entries; +} CPUTLBDesc; + #define CPU_COMMON_TLB \ /* The meaning of the MMU modes is defined in the target code. */ \ /* tlb_lock serializes updates to tlb_mask, tlb_table and tlb_v_table */ \ QemuSpin tlb_lock; \ + CPUTLBDesc tlb_desc[NB_MMU_MODES]; \ /* tlb_mask[i] contains (n_entries - 1) << CPU_TLB_ENTRY_BITS */ \ uintptr_t tlb_mask[NB_MMU_MODES]; \ CPUTLBEntry *tlb_table[NB_MMU_MODES]; \ diff --git a/accel/tcg/cputlb.c b/accel/tcg/cputlb.c index 4dc47e603c..11d6060eb0 100644 --- a/accel/tcg/cputlb.c +++ b/accel/tcg/cputlb.c @@ -82,6 +82,7 @@ void tlb_init(CPUState *cpu) for (i = 0; i < NB_MMU_MODES; i++) { size_t n_entries = CPU_TLB_SIZE; + env->tlb_desc[i].n_used_entries = 0; env->tlb_mask[i] = (n_entries - 1) << CPU_TLB_ENTRY_BITS; env->tlb_table[i] = g_new(CPUTLBEntry, n_entries); env->iotlb[i] = g_new0(CPUIOTLBEntry, n_entries); @@ -150,6 +151,7 @@ static void tlb_flush_nocheck(CPUState *cpu) qemu_spin_lock(&env->tlb_lock); for (i = 0; i < NB_MMU_MODES; i++) { memset(env->tlb_table[i], -1, sizeof_tlb(env, i)); + env->tlb_desc[i].n_used_entries = 0; } memset(env->tlb_v_table, -1, sizeof(env->tlb_v_table)); qemu_spin_unlock(&env->tlb_lock); @@ -213,6 +215,7 @@ static void tlb_flush_by_mmuidx_async_work(CPUState *cpu, run_on_cpu_data data) memset(env->tlb_table[mmu_idx], -1, sizeof_tlb(env, mmu_idx)); memset(env->tlb_v_table[mmu_idx], -1, sizeof(env->tlb_v_table[0])); + env->tlb_desc[mmu_idx].n_used_entries = 0; } } qemu_spin_unlock(&env->tlb_lock); @@ -273,12 +276,14 @@ static inline bool tlb_hit_page_anyprot(CPUTLBEntry *tlb_entry, } /* Called with tlb_lock held */ -static inline void tlb_flush_entry_locked(CPUTLBEntry *tlb_entry, +static inline bool tlb_flush_entry_locked(CPUTLBEntry *tlb_entry, target_ulong page) { if (tlb_hit_page_anyprot(tlb_entry, page)) { memset(tlb_entry, -1, sizeof(*tlb_entry)); + return true; } + return false; } /* Called with tlb_lock held */ @@ -316,7 +321,9 @@ static void tlb_flush_page_async_work(CPUState *cpu, run_on_cpu_data data) addr &= TARGET_PAGE_MASK; qemu_spin_lock(&env->tlb_lock); for (mmu_idx = 0; mmu_idx < NB_MMU_MODES; mmu_idx++) { - tlb_flush_entry_locked(tlb_entry(env, mmu_idx, addr), addr); + if (tlb_flush_entry_locked(tlb_entry(env, mmu_idx, addr), addr)) { + env->tlb_desc[mmu_idx].n_used_entries--; + } tlb_flush_vtlb_page_locked(env, mmu_idx, addr); } 
qemu_spin_unlock(&env->tlb_lock); @@ -358,7 +365,9 @@ static void tlb_flush_page_by_mmuidx_async_work(CPUState *cpu, qemu_spin_lock(&env->tlb_lock); for (mmu_idx = 0; mmu_idx < NB_MMU_MODES; mmu_idx++) { if (test_bit(mmu_idx, &mmu_idx_bitmap)) { - tlb_flush_entry_locked(tlb_entry(env, mmu_idx, addr), addr); + if (tlb_flush_entry_locked(tlb_entry(env, mmu_idx, addr), addr)) { + env->tlb_desc[mmu_idx].n_used_entries--; + } tlb_flush_vtlb_page_locked(env, mmu_idx, addr); } } @@ -696,6 +705,7 @@ void tlb_set_page_with_attrs(CPUState *cpu, target_ulong vaddr, /* Evict the old entry into the victim tlb. */ copy_tlb_helper_locked(tv, te); env->iotlb_v[mmu_idx][vidx] = env->iotlb[mmu_idx][index]; + env->tlb_desc[mmu_idx].n_used_entries--; } /* refill the tlb */ @@ -747,6 +757,7 @@ void tlb_set_page_with_attrs(CPUState *cpu, target_ulong vaddr, } copy_tlb_helper_locked(te, &tn); + env->tlb_desc[mmu_idx].n_used_entries++; qemu_spin_unlock(&env->tlb_lock); } -- 2.17.1 ^ permalink raw reply related [flat|nested] 18+ messages in thread
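(To make the bookkeeping above easier to follow: the counter is incremented once per refill and decremented whenever a matching entry is flushed or evicted to the victim TLB, so at any point it should equal the number of non-empty entries in the table. A purely hypothetical debug cross-check -- not part of the patch -- could look like the sketch below; it assumes tlb_n_entries() from earlier in this series and the "all address comparators set to -1 means empty" convention that tlb_entry_is_empty() checks.)

/* Hypothetical debug helper (not in the series): recount the used entries
 * by scanning the table, so the result can be compared against
 * env->tlb_desc[mmu_idx].n_used_entries. Call with tlb_lock held, like
 * the other _locked helpers.
 */
static size_t tlb_count_used_entries(CPUArchState *env, uintptr_t mmu_idx)
{
    size_t n = tlb_n_entries(env, mmu_idx);
    size_t used = 0;
    size_t i;

    for (i = 0; i < n; i++) {
        CPUTLBEntry *te = &env->tlb_table[mmu_idx][i];

        /* a flushed entry has all of its address comparators set to -1 */
        if (te->addr_read != (target_ulong)-1 ||
            te->addr_write != (target_ulong)-1 ||
            te->addr_code != (target_ulong)-1) {
            used++;
        }
    }
    return used;
}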
* Re: [Qemu-devel] [RFC v2 4/5] cputlb: track TLB use rate 2018-10-08 23:27 ` [Qemu-devel] [RFC v2 4/5] cputlb: track TLB use rate Emilio G. Cota @ 2018-10-09 14:47 ` Alex Bennée 0 siblings, 0 replies; 18+ messages in thread From: Alex Bennée @ 2018-10-09 14:47 UTC (permalink / raw) To: Emilio G. Cota; +Cc: qemu-devel, Richard Henderson Emilio G. Cota <cota@braap.org> writes: > This paves the way for implementing a dynamically-sized softmmu. > > Signed-off-by: Emilio G. Cota <cota@braap.org> Reviewed-by: Alex Bennée <alex.bennee@linaro.org> > --- > include/exec/cpu-defs.h | 5 +++++ > accel/tcg/cputlb.c | 17 ++++++++++++++--- > 2 files changed, 19 insertions(+), 3 deletions(-) > > diff --git a/include/exec/cpu-defs.h b/include/exec/cpu-defs.h > index 87cd015f60..56f1887c7f 100644 > --- a/include/exec/cpu-defs.h > +++ b/include/exec/cpu-defs.h > @@ -141,10 +141,15 @@ typedef struct CPUIOTLBEntry { > MemTxAttrs attrs; > } CPUIOTLBEntry; > > +typedef struct CPUTLBDesc { > + size_t n_used_entries; > +} CPUTLBDesc; > + > #define CPU_COMMON_TLB \ > /* The meaning of the MMU modes is defined in the target code. */ \ > /* tlb_lock serializes updates to tlb_mask, tlb_table and tlb_v_table */ \ > QemuSpin tlb_lock; \ > + CPUTLBDesc tlb_desc[NB_MMU_MODES]; \ > /* tlb_mask[i] contains (n_entries - 1) << CPU_TLB_ENTRY_BITS */ \ > uintptr_t tlb_mask[NB_MMU_MODES]; \ > CPUTLBEntry *tlb_table[NB_MMU_MODES]; \ > diff --git a/accel/tcg/cputlb.c b/accel/tcg/cputlb.c > index 4dc47e603c..11d6060eb0 100644 > --- a/accel/tcg/cputlb.c > +++ b/accel/tcg/cputlb.c > @@ -82,6 +82,7 @@ void tlb_init(CPUState *cpu) > for (i = 0; i < NB_MMU_MODES; i++) { > size_t n_entries = CPU_TLB_SIZE; > > + env->tlb_desc[i].n_used_entries = 0; > env->tlb_mask[i] = (n_entries - 1) << CPU_TLB_ENTRY_BITS; > env->tlb_table[i] = g_new(CPUTLBEntry, n_entries); > env->iotlb[i] = g_new0(CPUIOTLBEntry, n_entries); > @@ -150,6 +151,7 @@ static void tlb_flush_nocheck(CPUState *cpu) > qemu_spin_lock(&env->tlb_lock); > for (i = 0; i < NB_MMU_MODES; i++) { > memset(env->tlb_table[i], -1, sizeof_tlb(env, i)); > + env->tlb_desc[i].n_used_entries = 0; > } > memset(env->tlb_v_table, -1, sizeof(env->tlb_v_table)); > qemu_spin_unlock(&env->tlb_lock); > @@ -213,6 +215,7 @@ static void tlb_flush_by_mmuidx_async_work(CPUState *cpu, run_on_cpu_data data) > > memset(env->tlb_table[mmu_idx], -1, sizeof_tlb(env, mmu_idx)); > memset(env->tlb_v_table[mmu_idx], -1, sizeof(env->tlb_v_table[0])); > + env->tlb_desc[mmu_idx].n_used_entries = 0; > } > } > qemu_spin_unlock(&env->tlb_lock); > @@ -273,12 +276,14 @@ static inline bool tlb_hit_page_anyprot(CPUTLBEntry *tlb_entry, > } > > /* Called with tlb_lock held */ > -static inline void tlb_flush_entry_locked(CPUTLBEntry *tlb_entry, > +static inline bool tlb_flush_entry_locked(CPUTLBEntry *tlb_entry, > target_ulong page) > { > if (tlb_hit_page_anyprot(tlb_entry, page)) { > memset(tlb_entry, -1, sizeof(*tlb_entry)); > + return true; > } > + return false; > } > > /* Called with tlb_lock held */ > @@ -316,7 +321,9 @@ static void tlb_flush_page_async_work(CPUState *cpu, run_on_cpu_data data) > addr &= TARGET_PAGE_MASK; > qemu_spin_lock(&env->tlb_lock); > for (mmu_idx = 0; mmu_idx < NB_MMU_MODES; mmu_idx++) { > - tlb_flush_entry_locked(tlb_entry(env, mmu_idx, addr), addr); > + if (tlb_flush_entry_locked(tlb_entry(env, mmu_idx, addr), addr)) { > + env->tlb_desc[mmu_idx].n_used_entries--; > + } > tlb_flush_vtlb_page_locked(env, mmu_idx, addr); > } > qemu_spin_unlock(&env->tlb_lock); > @@ -358,7 +365,9 @@ static void 
tlb_flush_page_by_mmuidx_async_work(CPUState *cpu, > qemu_spin_lock(&env->tlb_lock); > for (mmu_idx = 0; mmu_idx < NB_MMU_MODES; mmu_idx++) { > if (test_bit(mmu_idx, &mmu_idx_bitmap)) { > - tlb_flush_entry_locked(tlb_entry(env, mmu_idx, addr), addr); > + if (tlb_flush_entry_locked(tlb_entry(env, mmu_idx, addr), addr)) { > + env->tlb_desc[mmu_idx].n_used_entries--; > + } > tlb_flush_vtlb_page_locked(env, mmu_idx, addr); > } > } > @@ -696,6 +705,7 @@ void tlb_set_page_with_attrs(CPUState *cpu, target_ulong vaddr, > /* Evict the old entry into the victim tlb. */ > copy_tlb_helper_locked(tv, te); > env->iotlb_v[mmu_idx][vidx] = env->iotlb[mmu_idx][index]; > + env->tlb_desc[mmu_idx].n_used_entries--; > } > > /* refill the tlb */ > @@ -747,6 +757,7 @@ void tlb_set_page_with_attrs(CPUState *cpu, target_ulong vaddr, > } > > copy_tlb_helper_locked(te, &tn); > + env->tlb_desc[mmu_idx].n_used_entries++; > qemu_spin_unlock(&env->tlb_lock); > } -- Alex Bennée ^ permalink raw reply [flat|nested] 18+ messages in thread
* [Qemu-devel] [RFC v2 5/5] cputlb: dynamically resize TLBs based on use rate 2018-10-08 23:27 [Qemu-devel] [RFC v2 0/5] Dynamic TLB sizing Emilio G. Cota ` (3 preceding siblings ...) 2018-10-08 23:27 ` [Qemu-devel] [RFC v2 4/5] cputlb: track TLB use rate Emilio G. Cota @ 2018-10-08 23:27 ` Emilio G. Cota 2018-10-09 14:54 ` Alex Bennée 2018-10-09 12:34 ` [Qemu-devel] [RFC v2 0/5] Dynamic TLB sizing Alex Bennée 5 siblings, 1 reply; 18+ messages in thread From: Emilio G. Cota @ 2018-10-08 23:27 UTC (permalink / raw) To: qemu-devel; +Cc: Richard Henderson, Alex Bennée Perform the resizing only on flushes, otherwise we'd have to take a perf hit by either rehashing the array or unnecessarily flushing it. We grow the array aggressively, and reduce the size more slowly. This accommodates mixed workloads, where some processes might be memory-heavy while others are not. As the following experiments show, this is a net perf gain, particularly for memory-heavy workloads. Experiments are run on an Intel i7-6700K CPU @ 4.00GHz. 1. System boot + shutdown, debian aarch64: - Before (tb-lock-v3): Performance counter stats for 'taskset -c 0 ../img/aarch64/die.sh' (10 runs): 7469.363393 task-clock (msec) # 0.998 CPUs utilized ( +- 0.07% ) 31,507,707,190 cycles # 4.218 GHz ( +- 0.07% ) 57,101,577,452 instructions # 1.81 insns per cycle ( +- 0.08% ) 10,265,531,804 branches # 1374.352 M/sec ( +- 0.07% ) 173,020,681 branch-misses # 1.69% of all branches ( +- 0.10% ) 7.483359063 seconds time elapsed ( +- 0.08% ) - After: Performance counter stats for 'taskset -c 0 ../img/aarch64/die.sh' (10 runs): 7185.036730 task-clock (msec) # 0.999 CPUs utilized ( +- 0.11% ) 30,303,501,143 cycles # 4.218 GHz ( +- 0.11% ) 54,198,386,487 instructions # 1.79 insns per cycle ( +- 0.08% ) 9,726,518,945 branches # 1353.719 M/sec ( +- 0.08% ) 167,082,307 branch-misses # 1.72% of all branches ( +- 0.08% ) 7.195597842 seconds time elapsed ( +- 0.11% ) That is, a 3.8% improvement. 2. System boot + shutdown, ubuntu 18.04 x86_64: - Before (tb-lock-v3): Performance counter stats for 'taskset -c 0 ../img/x86_64/ubuntu-die.sh -nographic' (2 runs): 49971.036482 task-clock (msec) # 0.999 CPUs utilized ( +- 1.62% ) 210,766,077,140 cycles # 4.218 GHz ( +- 1.63% ) 428,829,830,790 instructions # 2.03 insns per cycle ( +- 0.75% ) 77,313,384,038 branches # 1547.164 M/sec ( +- 0.54% ) 835,610,706 branch-misses # 1.08% of all branches ( +- 2.97% ) 50.003855102 seconds time elapsed ( +- 1.61% ) - After: Performance counter stats for 'taskset -c 0 ../img/x86_64/ubuntu-die.sh -nographic' (2 runs): 50118.124477 task-clock (msec) # 0.999 CPUs utilized ( +- 4.30% ) 132,396 context-switches # 0.003 M/sec ( +- 1.20% ) 0 cpu-migrations # 0.000 K/sec ( +-100.00% ) 167,754 page-faults # 0.003 M/sec ( +- 0.06% ) 211,414,701,601 cycles # 4.218 GHz ( +- 4.30% ) <not supported> stalled-cycles-frontend <not supported> stalled-cycles-backend 431,618,818,597 instructions # 2.04 insns per cycle ( +- 6.40% ) 80,197,256,524 branches # 1600.165 M/sec ( +- 8.59% ) 794,830,352 branch-misses # 0.99% of all branches ( +- 2.05% ) 50.177077175 seconds time elapsed ( +- 4.23% ) No improvement (within noise range). 3. 
x86_64 SPEC06int: SPEC06int (test set) [ Y axis: speedup over master ] 8 +-+--+----+----+-----+----+----+----+----+----+----+-----+----+----+--+-+ | | | tlb-lock-v3 | 7 +-+..................$$$...........................+indirection +-+ | $ $ +resizing | | $ $ | 6 +-+..................$.$..............................................+-+ | $ $ | | $ $ | 5 +-+..................$.$..............................................+-+ | $ $ | | $ $ | 4 +-+..................$.$..............................................+-+ | $ $ | | +++ $ $ | 3 +-+........$$+.......$.$..............................................+-+ | $$ $ $ | | $$ $ $ $$$ | 2 +-+........$$........$.$.................................$.$..........+-+ | $$ $ $ $ $ +$$ | | $$ $$+ $ $ $$$ +$$ $ $ $$$ $$ | 1 +-+***#$***#$+**#$+**#+$**#+$**##$**##$***#$***#$+**#$+**#+$**#+$**##$+-+ | * *#$* *#$ **#$ **# $**# $** #$** #$* *#$* *#$ **#$ **# $**# $** #$ | | * *#$* *#$ **#$ **# $**# $** #$** #$* *#$* *#$ **#$ **# $**# $** #$ | 0 +-+***#$***#$-**#$-**#$$**#$$**##$**##$***#$***#$-**#$-**#$$**#$$**##$+-+ 401.bzi403.gc429445.g456.h462.libq464.h471.omne4483.xalancbgeomean png: https://imgur.com/a/b1wn3wc That is, a 1.53x average speedup over master, with a max speedup of 7.13x. Note that "indirection" (i.e. the first patch in this series) incurs no overhead, on average. To conclude, here is a different look at the SPEC06int results, using linux-user as the baseline and comparing master and this series ("tlb-dyn"): Softmmu slowdown vs. linux-user for SPEC06int (test set) [ Y axis: slowdown over linux-user ] 14 +-+--+----+----+----+----+----+-----+----+----+----+----+----+----+--+-+ | | | master | 12 +-+...............+**..................................tlb-dyn.......+-+ | ** | | ** | | ** | 10 +-+................**................................................+-+ | ** | | ** | 8 +-+................**................................................+-+ | ** | | ** | | ** | 6 +-+................**................................................+-+ | *** ** | | * * ** | 4 +-+.....*.*........**.................................***............+-+ | * * ** * * | | * * +++ ** *** *** * * *** *** | | * * +**++ ** **## *+*# *** * *#+* * * *##* * | 2 +-+.....*.*##.**##.**##.**.#.**##.*+*#.***#.*+*#.*.*#.*.*#+*.*.#*.*##+-+ |++***##*+*+#+**+#+**+#+**+#+**+#+*+*#+*+*#+*+*#+*+*#+*+*#+*+*+#*+*+#++| | * * #* * # ** # ** # ** # ** # * *# * *# * *# * *# * *# * * #* * # | 0 +-+***##***##-**##-**##-**##-**##-***#-***#-***#-***#-***#-***##***##+-+ 401.bzi403.g429445.g456.hm462.libq464.h471.omn4483.xalancbgeomean png: https://imgur.com/a/eXkjMCE After this series, we bring down the average softmmu overhead from 2.77x to 1.80x, with a maximum slowdown of 2.48x (omnetpp). Signed-off-by: Emilio G. Cota <cota@braap.org> --- include/exec/cpu-defs.h | 39 +++++++++------------------------------ accel/tcg/cputlb.c | 39 ++++++++++++++++++++++++++++++++++++++- 2 files changed, 47 insertions(+), 31 deletions(-) diff --git a/include/exec/cpu-defs.h b/include/exec/cpu-defs.h index 56f1887c7f..d4af0b2a2d 100644 --- a/include/exec/cpu-defs.h +++ b/include/exec/cpu-defs.h @@ -67,37 +67,15 @@ typedef uint64_t target_ulong; #define CPU_TLB_ENTRY_BITS 5 #endif -/* TCG_TARGET_TLB_DISPLACEMENT_BITS is used in CPU_TLB_BITS to ensure that - * the TLB is not unnecessarily small, but still small enough for the - * TLB lookup instruction sequence used by the TCG target. 
- * - * TCG will have to generate an operand as large as the distance between - * env and the tlb_table[NB_MMU_MODES - 1][0].addend. For simplicity, - * the TCG targets just round everything up to the next power of two, and - * count bits. This works because: 1) the size of each TLB is a largish - * power of two, 2) and because the limit of the displacement is really close - * to a power of two, 3) the offset of tlb_table[0][0] inside env is smaller - * than the size of a TLB. - * - * For example, the maximum displacement 0xFFF0 on PPC and MIPS, but TCG - * just says "the displacement is 16 bits". TCG_TARGET_TLB_DISPLACEMENT_BITS - * then ensures that tlb_table at least 0x8000 bytes large ("not unnecessarily - * small": 2^15). The operand then will come up smaller than 0xFFF0 without - * any particular care, because the TLB for a single MMU mode is larger than - * 0x10000-0xFFF0=16 bytes. In the end, the maximum value of the operand - * could be something like 0xC000 (the offset of the last TLB table) plus - * 0x18 (the offset of the addend field in each TLB entry) plus the offset - * of tlb_table inside env (which is non-trivial but not huge). +#define MIN_CPU_TLB_BITS 6 +#define DEFAULT_CPU_TLB_BITS 8 +/* + * Assuming TARGET_PAGE_BITS==12, with 2**22 entries we can cover 2**(22+12) == + * 2**34 == 16G of address space. This is roughly what one would expect a + * TLB to cover in a modern (as of 2018) x86_64 CPU. For instance, Intel + * Skylake's Level-2 STLB has 16 1G entries. */ -#define CPU_TLB_BITS \ - MIN(8, \ - TCG_TARGET_TLB_DISPLACEMENT_BITS - CPU_TLB_ENTRY_BITS - \ - (NB_MMU_MODES <= 1 ? 0 : \ - NB_MMU_MODES <= 2 ? 1 : \ - NB_MMU_MODES <= 4 ? 2 : \ - NB_MMU_MODES <= 8 ? 3 : 4)) - -#define CPU_TLB_SIZE (1 << CPU_TLB_BITS) +#define MAX_CPU_TLB_BITS 22 typedef struct CPUTLBEntry { /* bit TARGET_LONG_BITS to TARGET_PAGE_BITS : virtual address @@ -143,6 +121,7 @@ typedef struct CPUIOTLBEntry { typedef struct CPUTLBDesc { size_t n_used_entries; + size_t n_flushes_low_rate; } CPUTLBDesc; #define CPU_COMMON_TLB \ diff --git a/accel/tcg/cputlb.c b/accel/tcg/cputlb.c index 11d6060eb0..5ebfa4fbb5 100644 --- a/accel/tcg/cputlb.c +++ b/accel/tcg/cputlb.c @@ -80,9 +80,10 @@ void tlb_init(CPUState *cpu) qemu_spin_init(&env->tlb_lock); for (i = 0; i < NB_MMU_MODES; i++) { - size_t n_entries = CPU_TLB_SIZE; + size_t n_entries = 1 << DEFAULT_CPU_TLB_BITS; env->tlb_desc[i].n_used_entries = 0; + env->tlb_desc[i].n_flushes_low_rate = 0; env->tlb_mask[i] = (n_entries - 1) << CPU_TLB_ENTRY_BITS; env->tlb_table[i] = g_new(CPUTLBEntry, n_entries); env->iotlb[i] = g_new0(CPUIOTLBEntry, n_entries); @@ -121,6 +122,40 @@ size_t tlb_flush_count(void) return count; } +/* Call with tlb_lock held */ +static void tlb_mmu_resize_locked(CPUArchState *env, int mmu_idx) +{ + CPUTLBDesc *desc = &env->tlb_desc[mmu_idx]; + size_t old_size = tlb_n_entries(env, mmu_idx); + size_t rate = desc->n_used_entries * 100 / old_size; + size_t new_size = old_size; + + if (rate == 100) { + new_size = MIN(old_size << 2, 1 << MAX_CPU_TLB_BITS); + } else if (rate > 70) { + new_size = MIN(old_size << 1, 1 << MAX_CPU_TLB_BITS); + } else if (rate < 30) { + desc->n_flushes_low_rate++; + if (desc->n_flushes_low_rate == 100) { + new_size = MAX(old_size >> 1, 1 << MIN_CPU_TLB_BITS); + desc->n_flushes_low_rate = 0; + } + } + + if (new_size == old_size) { + return; + } + + g_free(env->tlb_table[mmu_idx]); + g_free(env->iotlb[mmu_idx]); + + /* desc->n_used_entries is cleared by the caller */ + desc->n_flushes_low_rate = 0; + 
env->tlb_mask[mmu_idx] = (new_size - 1) << CPU_TLB_ENTRY_BITS; + env->tlb_table[mmu_idx] = g_new(CPUTLBEntry, new_size); + env->iotlb[mmu_idx] = g_new0(CPUIOTLBEntry, new_size); +} + /* This is OK because CPU architectures generally permit an * implementation to drop entries from the TLB at any time, so * flushing more entries than required is only an efficiency issue, @@ -150,6 +185,7 @@ static void tlb_flush_nocheck(CPUState *cpu) */ qemu_spin_lock(&env->tlb_lock); for (i = 0; i < NB_MMU_MODES; i++) { + tlb_mmu_resize_locked(env, i); memset(env->tlb_table[i], -1, sizeof_tlb(env, i)); env->tlb_desc[i].n_used_entries = 0; } @@ -213,6 +249,7 @@ static void tlb_flush_by_mmuidx_async_work(CPUState *cpu, run_on_cpu_data data) if (test_bit(mmu_idx, &mmu_idx_bitmask)) { tlb_debug("%d\n", mmu_idx); + tlb_mmu_resize_locked(env, mmu_idx); memset(env->tlb_table[mmu_idx], -1, sizeof_tlb(env, mmu_idx)); memset(env->tlb_v_table[mmu_idx], -1, sizeof(env->tlb_v_table[0])); env->tlb_desc[mmu_idx].n_used_entries = 0; -- 2.17.1 ^ permalink raw reply related [flat|nested] 18+ messages in thread
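(Stripped of the reallocation and locking, the sizing decision in tlb_mmu_resize_locked() above amounts to the policy sketched below. This is a restatement for readability, not a replacement for the patch; it assumes the MIN_CPU_TLB_BITS/MAX_CPU_TLB_BITS constants defined above and QEMU's MIN/MAX macros.)

/* Sketch of the sizing policy only: grow 4x when the table is full and 2x
 * above 70% use; shrink by half only after 100 consecutive flushes that
 * found it less than 30% used. The patch additionally resets the low-rate
 * counter whenever it actually reallocates the table.
 */
static size_t tlb_new_size(size_t old_size, size_t n_used_entries,
                           size_t *n_flushes_low_rate)
{
    size_t rate = n_used_entries * 100 / old_size;
    size_t new_size = old_size;

    if (rate == 100) {
        new_size = MIN(old_size << 2, 1 << MAX_CPU_TLB_BITS);
    } else if (rate > 70) {
        new_size = MIN(old_size << 1, 1 << MAX_CPU_TLB_BITS);
    } else if (rate < 30) {
        if (++*n_flushes_low_rate == 100) {
            new_size = MAX(old_size >> 1, 1 << MIN_CPU_TLB_BITS);
            *n_flushes_low_rate = 0;
        }
    }
    return new_size;
}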
* Re: [Qemu-devel] [RFC v2 5/5] cputlb: dynamically resize TLBs based on use rate 2018-10-08 23:27 ` [Qemu-devel] [RFC v2 5/5] cputlb: dynamically resize TLBs based on " Emilio G. Cota @ 2018-10-09 14:54 ` Alex Bennée 2018-10-09 16:03 ` Emilio G. Cota 0 siblings, 1 reply; 18+ messages in thread From: Alex Bennée @ 2018-10-09 14:54 UTC (permalink / raw) To: Emilio G. Cota; +Cc: qemu-devel, Richard Henderson Emilio G. Cota <cota@braap.org> writes: > Perform the resizing only on flushes, otherwise we'd > have to take a perf hit by either rehashing the array > or unnecessarily flushing it. > > We grow the array aggressively, and reduce the size more > slowly. This accommodates mixed workloads, where some > processes might be memory-heavy while others are not. > > As the following experiments show, this a net perf gain, > particularly for memory-heavy workloads. Experiments > are run on an Intel i7-6700K CPU @ 4.00GHz. > > 1. System boot + shudown, debian aarch64: > > - Before (tb-lock-v3): > Performance counter stats for 'taskset -c 0 ../img/aarch64/die.sh' (10 runs): > > 7469.363393 task-clock (msec) # 0.998 CPUs utilized ( +- 0.07% ) > 31,507,707,190 cycles # 4.218 GHz ( +- 0.07% ) > 57,101,577,452 instructions # 1.81 insns per cycle ( +- 0.08% ) > 10,265,531,804 branches # 1374.352 M/sec ( +- 0.07% ) > 173,020,681 branch-misses # 1.69% of all branches ( +- 0.10% ) > > 7.483359063 seconds time elapsed ( +- 0.08% ) > > - After: > Performance counter stats for 'taskset -c 0 ../img/aarch64/die.sh' (10 runs): > > 7185.036730 task-clock (msec) # 0.999 CPUs utilized ( +- 0.11% ) > 30,303,501,143 cycles # 4.218 GHz ( +- 0.11% ) > 54,198,386,487 instructions # 1.79 insns per cycle ( +- 0.08% ) > 9,726,518,945 branches # 1353.719 M/sec ( +- 0.08% ) > 167,082,307 branch-misses # 1.72% of all branches ( +- 0.08% ) > > 7.195597842 seconds time elapsed ( +- 0.11% ) > > That is, a 3.8% improvement. > > 2. System boot + shutdown, ubuntu 18.04 x86_64: > > - Before (tb-lock-v3): > Performance counter stats for 'taskset -c 0 ../img/x86_64/ubuntu-die.sh -nographic' (2 runs): > > 49971.036482 task-clock (msec) # 0.999 CPUs utilized ( +- 1.62% ) > 210,766,077,140 cycles # 4.218 GHz ( +- 1.63% ) > 428,829,830,790 instructions # 2.03 insns per cycle ( +- 0.75% ) > 77,313,384,038 branches # 1547.164 M/sec ( +- 0.54% ) > 835,610,706 branch-misses # 1.08% of all branches ( +- 2.97% ) > > 50.003855102 seconds time elapsed ( +- 1.61% ) > > - After: > Performance counter stats for 'taskset -c 0 ../img/x86_64/ubuntu-die.sh -nographic' (2 runs): > > 50118.124477 task-clock (msec) # 0.999 CPUs utilized ( +- 4.30% ) > 132,396 context-switches # 0.003 M/sec ( +- 1.20% ) > 0 cpu-migrations # 0.000 K/sec ( +-100.00% ) > 167,754 page-faults # 0.003 M/sec ( +- 0.06% ) > 211,414,701,601 cycles # 4.218 GHz ( +- 4.30% ) > <not supported> stalled-cycles-frontend > <not supported> stalled-cycles-backend > 431,618,818,597 instructions # 2.04 insns per cycle ( +- 6.40% ) > 80,197,256,524 branches # 1600.165 M/sec ( +- 8.59% ) > 794,830,352 branch-misses # 0.99% of all branches ( +- 2.05% ) > > 50.177077175 seconds time elapsed ( +- 4.23% ) > > No improvement (within noise range). > > 3. 
x86_64 SPEC06int: > SPEC06int (test set) > [ Y axis: speedup over master ] > 8 +-+--+----+----+-----+----+----+----+----+----+----+-----+----+----+--+-+ > | | > | tlb-lock-v3 | > 7 +-+..................$$$...........................+indirection +-+ > | $ $ +resizing | > | $ $ | > 6 +-+..................$.$..............................................+-+ > | $ $ | > | $ $ | > 5 +-+..................$.$..............................................+-+ > | $ $ | > | $ $ | > 4 +-+..................$.$..............................................+-+ > | $ $ | > | +++ $ $ | > 3 +-+........$$+.......$.$..............................................+-+ > | $$ $ $ | > | $$ $ $ $$$ | > 2 +-+........$$........$.$.................................$.$..........+-+ > | $$ $ $ $ $ +$$ | > | $$ $$+ $ $ $$$ +$$ $ $ $$$ $$ | > 1 +-+***#$***#$+**#$+**#+$**#+$**##$**##$***#$***#$+**#$+**#+$**#+$**##$+-+ > | * *#$* *#$ **#$ **# $**# $** #$** #$* *#$* *#$ **#$ **# $**# $** #$ | > | * *#$* *#$ **#$ **# $**# $** #$** #$* *#$* *#$ **#$ **# $**# $** #$ | > 0 +-+***#$***#$-**#$-**#$$**#$$**##$**##$***#$***#$-**#$-**#$$**#$$**##$+-+ > 401.bzi403.gc429445.g456.h462.libq464.h471.omne4483.xalancbgeomean > png: https://imgur.com/a/b1wn3wc > > That is, a 1.53x average speedup over master, with a max speedup of 7.13x. > > Note that "indirection" (i.e. the first patch in this series) incurs > no overhead, on average. > > To conclude, here is a different look at the SPEC06int results, using > linux-user as the baseline and comparing master and this series ("tlb-dyn"): > > Softmmu slowdown vs. linux-user for SPEC06int (test set) > [ Y axis: slowdown over linux-user ] > 14 +-+--+----+----+----+----+----+-----+----+----+----+----+----+----+--+-+ > | | > | master | > 12 +-+...............+**..................................tlb-dyn.......+-+ > | ** | > | ** | > | ** | > 10 +-+................**................................................+-+ > | ** | > | ** | > 8 +-+................**................................................+-+ > | ** | > | ** | > | ** | > 6 +-+................**................................................+-+ > | *** ** | > | * * ** | > 4 +-+.....*.*........**.................................***............+-+ > | * * ** * * | > | * * +++ ** *** *** * * *** *** | > | * * +**++ ** **## *+*# *** * *#+* * * *##* * | > 2 +-+.....*.*##.**##.**##.**.#.**##.*+*#.***#.*+*#.*.*#.*.*#+*.*.#*.*##+-+ > |++***##*+*+#+**+#+**+#+**+#+**+#+*+*#+*+*#+*+*#+*+*#+*+*#+*+*+#*+*+#++| > | * * #* * # ** # ** # ** # ** # * *# * *# * *# * *# * *# * * #* * # | > 0 +-+***##***##-**##-**##-**##-**##-***#-***#-***#-***#-***#-***##***##+-+ > 401.bzi403.g429445.g456.hm462.libq464.h471.omn4483.xalancbgeomean > > png: https://imgur.com/a/eXkjMCE > > After this series, we bring down the average softmmu overhead > from 2.77x to 1.80x, with a maximum slowdown of 2.48x (omnetpp). > > Signed-off-by: Emilio G. 
Cota <cota@braap.org> > --- > include/exec/cpu-defs.h | 39 +++++++++------------------------------ > accel/tcg/cputlb.c | 39 ++++++++++++++++++++++++++++++++++++++- > 2 files changed, 47 insertions(+), 31 deletions(-) > > diff --git a/include/exec/cpu-defs.h b/include/exec/cpu-defs.h > index 56f1887c7f..d4af0b2a2d 100644 > --- a/include/exec/cpu-defs.h > +++ b/include/exec/cpu-defs.h > @@ -67,37 +67,15 @@ typedef uint64_t target_ulong; > #define CPU_TLB_ENTRY_BITS 5 > #endif > > -/* TCG_TARGET_TLB_DISPLACEMENT_BITS is used in CPU_TLB_BITS to ensure that > - * the TLB is not unnecessarily small, but still small enough for the > - * TLB lookup instruction sequence used by the TCG target. > - * > - * TCG will have to generate an operand as large as the distance between > - * env and the tlb_table[NB_MMU_MODES - 1][0].addend. For simplicity, > - * the TCG targets just round everything up to the next power of two, and > - * count bits. This works because: 1) the size of each TLB is a largish > - * power of two, 2) and because the limit of the displacement is really close > - * to a power of two, 3) the offset of tlb_table[0][0] inside env is smaller > - * than the size of a TLB. > - * > - * For example, the maximum displacement 0xFFF0 on PPC and MIPS, but TCG > - * just says "the displacement is 16 bits". TCG_TARGET_TLB_DISPLACEMENT_BITS > - * then ensures that tlb_table at least 0x8000 bytes large ("not unnecessarily > - * small": 2^15). The operand then will come up smaller than 0xFFF0 without > - * any particular care, because the TLB for a single MMU mode is larger than > - * 0x10000-0xFFF0=16 bytes. In the end, the maximum value of the operand > - * could be something like 0xC000 (the offset of the last TLB table) plus > - * 0x18 (the offset of the addend field in each TLB entry) plus the offset > - * of tlb_table inside env (which is non-trivial but not huge). > +#define MIN_CPU_TLB_BITS 6 > +#define DEFAULT_CPU_TLB_BITS 8 > +/* > + * Assuming TARGET_PAGE_BITS==12, with 2**22 entries we can cover 2**(22+12) == > + * 2**34 == 16G of address space. This is roughly what one would expect a > + * TLB to cover in a modern (as of 2018) x86_64 CPU. For instance, Intel > + * Skylake's Level-2 STLB has 16 1G entries. > */ > -#define CPU_TLB_BITS \ > - MIN(8, \ > - TCG_TARGET_TLB_DISPLACEMENT_BITS - CPU_TLB_ENTRY_BITS - \ > - (NB_MMU_MODES <= 1 ? 0 : \ > - NB_MMU_MODES <= 2 ? 1 : \ > - NB_MMU_MODES <= 4 ? 2 : \ > - NB_MMU_MODES <= 8 ? 
3 : 4)) > - > -#define CPU_TLB_SIZE (1 << CPU_TLB_BITS) > +#define MAX_CPU_TLB_BITS 22 > > typedef struct CPUTLBEntry { > /* bit TARGET_LONG_BITS to TARGET_PAGE_BITS : virtual address > @@ -143,6 +121,7 @@ typedef struct CPUIOTLBEntry { > > typedef struct CPUTLBDesc { > size_t n_used_entries; > + size_t n_flushes_low_rate; > } CPUTLBDesc; > > #define CPU_COMMON_TLB \ > diff --git a/accel/tcg/cputlb.c b/accel/tcg/cputlb.c > index 11d6060eb0..5ebfa4fbb5 100644 > --- a/accel/tcg/cputlb.c > +++ b/accel/tcg/cputlb.c > @@ -80,9 +80,10 @@ void tlb_init(CPUState *cpu) > > qemu_spin_init(&env->tlb_lock); > for (i = 0; i < NB_MMU_MODES; i++) { > - size_t n_entries = CPU_TLB_SIZE; > + size_t n_entries = 1 << DEFAULT_CPU_TLB_BITS; > > env->tlb_desc[i].n_used_entries = 0; > + env->tlb_desc[i].n_flushes_low_rate = 0; > env->tlb_mask[i] = (n_entries - 1) << CPU_TLB_ENTRY_BITS; > env->tlb_table[i] = g_new(CPUTLBEntry, n_entries); > env->iotlb[i] = g_new0(CPUIOTLBEntry, n_entries); > @@ -121,6 +122,40 @@ size_t tlb_flush_count(void) > return count; > } > > +/* Call with tlb_lock held */ > +static void tlb_mmu_resize_locked(CPUArchState *env, int mmu_idx) > +{ > + CPUTLBDesc *desc = &env->tlb_desc[mmu_idx]; > + size_t old_size = tlb_n_entries(env, mmu_idx); > + size_t rate = desc->n_used_entries * 100 / old_size; > + size_t new_size = old_size; > + > + if (rate == 100) { > + new_size = MIN(old_size << 2, 1 << MAX_CPU_TLB_BITS); > + } else if (rate > 70) { > + new_size = MIN(old_size << 1, 1 << MAX_CPU_TLB_BITS); > + } else if (rate < 30) { > + desc->n_flushes_low_rate++; > + if (desc->n_flushes_low_rate == 100) { > + new_size = MAX(old_size >> 1, 1 << MIN_CPU_TLB_BITS); > + desc->n_flushes_low_rate = 0; > + } > + } > + > + if (new_size == old_size) { > + return; > + } > + > + g_free(env->tlb_table[mmu_idx]); > + g_free(env->iotlb[mmu_idx]); > + > + /* desc->n_used_entries is cleared by the caller */ > + desc->n_flushes_low_rate = 0; > + env->tlb_mask[mmu_idx] = (new_size - 1) << CPU_TLB_ENTRY_BITS; > + env->tlb_table[mmu_idx] = g_new(CPUTLBEntry, new_size); > + env->iotlb[mmu_idx] = g_new0(CPUIOTLBEntry, new_size); I guess the allocation is a big enough stall there is no point either pre-allocating or using RCU to clean-up the old data? Given this is a new behaviour it would be nice to expose the occupancy of the TLBs in "info jit" much like we do for TBs. Nevertheless: Reviewed-by: Alex Bennée <alex.bennee@linaro.org> > +} > + > /* This is OK because CPU architectures generally permit an > * implementation to drop entries from the TLB at any time, so > * flushing more entries than required is only an efficiency issue, > @@ -150,6 +185,7 @@ static void tlb_flush_nocheck(CPUState *cpu) > */ > qemu_spin_lock(&env->tlb_lock); > for (i = 0; i < NB_MMU_MODES; i++) { > + tlb_mmu_resize_locked(env, i); > memset(env->tlb_table[i], -1, sizeof_tlb(env, i)); > env->tlb_desc[i].n_used_entries = 0; > } > @@ -213,6 +249,7 @@ static void tlb_flush_by_mmuidx_async_work(CPUState *cpu, run_on_cpu_data data) > if (test_bit(mmu_idx, &mmu_idx_bitmask)) { > tlb_debug("%d\n", mmu_idx); > > + tlb_mmu_resize_locked(env, mmu_idx); > memset(env->tlb_table[mmu_idx], -1, sizeof_tlb(env, mmu_idx)); > memset(env->tlb_v_table[mmu_idx], -1, sizeof(env->tlb_v_table[0])); > env->tlb_desc[mmu_idx].n_used_entries = 0; -- Alex Bennée ^ permalink raw reply [flat|nested] 18+ messages in thread
* Re: [Qemu-devel] [RFC v2 5/5] cputlb: dynamically resize TLBs based on use rate 2018-10-09 14:54 ` Alex Bennée @ 2018-10-09 16:03 ` Emilio G. Cota 2018-10-09 16:34 ` Alex Bennée 0 siblings, 1 reply; 18+ messages in thread From: Emilio G. Cota @ 2018-10-09 16:03 UTC (permalink / raw) To: Alex Bennée; +Cc: qemu-devel, Richard Henderson On Tue, Oct 09, 2018 at 15:54:21 +0100, Alex Bennée wrote: > Emilio G. Cota <cota@braap.org> writes: > > + if (new_size == old_size) { > > + return; > > + } > > + > > + g_free(env->tlb_table[mmu_idx]); > > + g_free(env->iotlb[mmu_idx]); > > + > > + /* desc->n_used_entries is cleared by the caller */ > > + desc->n_flushes_low_rate = 0; > > + env->tlb_mask[mmu_idx] = (new_size - 1) << CPU_TLB_ENTRY_BITS; > > + env->tlb_table[mmu_idx] = g_new(CPUTLBEntry, new_size); > > + env->iotlb[mmu_idx] = g_new0(CPUIOTLBEntry, new_size); For the iotlb we can use g_new, right? iotlb[foo][bar] is only checked after having checked tlb_table[foo][bar]. Otherwise tlb_flush would also flush the iotlb. > I guess the allocation is a big enough stall there is no point either > pre-allocating or using RCU to clean-up the old data? I tried this. Turns out not to make a difference, because (1) we only resize on flushes, which do not happen that often, and (2) we size up aggressively, but the shrink rate is more conservative. So in the end, it's a drop in the ocean. For instance, bootup+shutdown requires 100 calls to g_new+g_free -- at ~300 cycles each, that's about 30us out of ~8s of execution time. > Given this is a new behaviour it would be nice to expose the occupancy > of the TLBs in "info jit" much like we do for TBs. The occupancy changes *very* quickly, so by the time the report is out, the info is stale. So I'm not sure that's very useful. The TLB size changes less often, but reporting on it is not obvious, since we have NB_MMU_MODES sizes per CPU. Say we have 20 CPUs, what should we report? A table with 20 * NB_MMU_MODES cells? I dunno. > Reviewed-by: Alex Bennée <alex.bennee@linaro.org> Thanks! Emilio ^ permalink raw reply [flat|nested] 18+ messages in thread
* Re: [Qemu-devel] [RFC v2 5/5] cputlb: dynamically resize TLBs based on use rate 2018-10-09 16:03 ` Emilio G. Cota @ 2018-10-09 16:34 ` Alex Bennée 0 siblings, 0 replies; 18+ messages in thread From: Alex Bennée @ 2018-10-09 16:34 UTC (permalink / raw) To: Emilio G. Cota; +Cc: qemu-devel, Richard Henderson Emilio G. Cota <cota@braap.org> writes: > On Tue, Oct 09, 2018 at 15:54:21 +0100, Alex Bennée wrote: >> Emilio G. Cota <cota@braap.org> writes: >> > + if (new_size == old_size) { >> > + return; >> > + } >> > + >> > + g_free(env->tlb_table[mmu_idx]); >> > + g_free(env->iotlb[mmu_idx]); >> > + >> > + /* desc->n_used_entries is cleared by the caller */ >> > + desc->n_flushes_low_rate = 0; >> > + env->tlb_mask[mmu_idx] = (new_size - 1) << CPU_TLB_ENTRY_BITS; >> > + env->tlb_table[mmu_idx] = g_new(CPUTLBEntry, new_size); >> > + env->iotlb[mmu_idx] = g_new0(CPUIOTLBEntry, new_size); > > For the iotlb we can use g_new, right? > > iotlb[foo][bar] is only checked after having checked tlb_table[foo][bar]. > Otherwise tlb_flush would also flush the iotlb. > >> I guess the allocation is a big enough stall there is no point either >> pre-allocating or using RCU to clean-up the old data? > > I tried this. Turns out not to make a difference, because (1) we only > resize on flushes, which do not happen that often, and (2) we > size up aggressively, but the shrink rate is more conservative. So > in the end, it's a drop in the ocean. For instance, bootup+shutdown > requires 100 calls to g_new+g_free -- at ~300 cycles each, that's > about 30us out of ~8s of execution time. > >> Given this is a new behaviour it would be nice to expose the occupancy >> of the TLBs in "info jit" much like we do for TBs. > > The occupancy changes *very* quickly, so by the time the report is out, > the info is stale. So I'm not sure that's very useful. Hmm do I mean occupancy or utilisation? I guess I want to get an idea of how much of the TLB has been used and how much is empty never to be used space. In theory as the TLB tends towards guest page size out TLB turnover should be of the order of the guests' TLB re-fill rate? > The TLB size changes less often, but reporting on it is not obvious, > since we have NB_MMU_MODES sizes per CPU. Say we have 20 CPUs, what should > we report? A table with 20 * NB_MMU_MODES cells? I dunno. I guess not. Although I suspect some MMU_MODES are more interesting than others. I'm hoping the usage of EL3 related modes is negligible if we haven't booted with a secure firmware for example. > >> Reviewed-by: Alex Bennée <alex.bennee@linaro.org> > > Thanks! > > Emilio -- Alex Bennée ^ permalink raw reply [flat|nested] 18+ messages in thread
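(On the reporting question discussed above, a purely hypothetical sketch -- not part of this series -- of what a per-vCPU occupancy dump could look like, reusing the n_used_entries counter and tlb_n_entries() from this series; how or whether to surface this through "info jit" is left open.)

/* Hypothetical only: print the per-mmu-idx TLB size and utilisation for
 * one vCPU. The snapshot is inherently racy/stale, as noted above.
 */
static void tlb_dump_occupancy(CPUArchState *env, GString *buf)
{
    int i;

    for (i = 0; i < NB_MMU_MODES; i++) {
        size_t size = tlb_n_entries(env, i);
        size_t used = env->tlb_desc[i].n_used_entries;

        g_string_append_printf(buf, "mmu_idx %d: %zu/%zu entries (%zu%% used)\n",
                               i, used, size, used * 100 / size);
    }
}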
* Re: [Qemu-devel] [RFC v2 0/5] Dynamic TLB sizing 2018-10-08 23:27 [Qemu-devel] [RFC v2 0/5] Dynamic TLB sizing Emilio G. Cota ` (4 preceding siblings ...) 2018-10-08 23:27 ` [Qemu-devel] [RFC v2 5/5] cputlb: dynamically resize TLBs based on " Emilio G. Cota @ 2018-10-09 12:34 ` Alex Bennée 2018-10-09 14:38 ` Emilio G. Cota 5 siblings, 1 reply; 18+ messages in thread From: Alex Bennée @ 2018-10-09 12:34 UTC (permalink / raw) To: Emilio G. Cota; +Cc: qemu-devel, Richard Henderson Emilio G. Cota <cota@braap.org> writes: > v1: https://lists.gnu.org/archive/html/qemu-devel/2018-10/msg01146.html > > Changes since v1: Hmm I'm seeing some qtest failures, for example: $ make check-qtest-alpha V=1 ... QTEST_QEMU_BINARY=alpha-softmmu/qemu-system-alpha QTEST_QEMU_IMG=qemu-img MALLOC_PERTURB_=${MALLOC_PERTURB_:-$(( ${RANDOM:-0} % 255 + 1))} gtester -k --verbose -m=quick test s/boot-serial-test tests/qmp-test tests/qmp-cmd-test tests/device-introspect-test tests/cdrom-test tests/machine-none-test tests/qom-test tests/test-hmp TEST: tests/boot-serial-test... (pid=31091) /alpha/boot-serial/clipper: Broken pipe tests/libqtest.c:129: kill_qemu() detected QEMU death from signal 11 (Segmentation fault) (core dumped) FAIL GTester: last random seed: R02S948c4a5112fd7682934f4d96e1aff38e (pid=31099) FAIL: tests/boot-serial-test > > - Add tlb_index and tlb_entry helpers from Richard > > - Introduce sizeof_tlb() and tlb_n_entries() > > - Extract tlb_mask as its own array in CPUArchState, as > suggested by Richard. For the associated helpers (tlb_index etc) > I tried several approaches, and performance-wise they're all > the same, so went for the simplest implementation. > > - Use uintptr_t for tlb_mask, as done in Richard's patch > + tcg/i386: use hrexw when reading tlb_mask > + Define tlbtype and tlbrexw solely based on TARGET_PAGE_BITS > > - Rename tlb_is_invalid to tlb_entry_is_empty, comparing all > fields (except .addend) against -1. > > - Rename CPUTLBDesc.used to .n_used_entries. > > - Drop the MIN/MAX CPU_TLB_BITS patches, defining instead > some values for MIN/MAX as well as a default. > > - Use new_size and old_size consistently in the resizing function, > as suggested by Richard. > > - Add an additional chart to the last patch, where softmmu > performance is compared against user-mode: > https://imgur.com/a/eXkjMCE > > You can fetch this series from: > https://github.com/cota/qemu/tree/tlb-dyn-v2 > > Note that it applies on top of my tlb-lock-v4 series: > https://lists.gnu.org/archive/html/qemu-devel/2018-10/msg01421.html > > Thanks, > > Emilio -- Alex Bennée ^ permalink raw reply [flat|nested] 18+ messages in thread
* Re: [Qemu-devel] [RFC v2 0/5] Dynamic TLB sizing 2018-10-09 12:34 ` [Qemu-devel] [RFC v2 0/5] Dynamic TLB sizing Alex Bennée @ 2018-10-09 14:38 ` Emilio G. Cota 2018-10-09 14:45 ` Alex Bennée 0 siblings, 1 reply; 18+ messages in thread From: Emilio G. Cota @ 2018-10-09 14:38 UTC (permalink / raw) To: Alex Bennée; +Cc: qemu-devel, Richard Henderson On Tue, Oct 09, 2018 at 13:34:40 +0100, Alex Bennée wrote: > > Emilio G. Cota <cota@braap.org> writes: > > > v1: https://lists.gnu.org/archive/html/qemu-devel/2018-10/msg01146.html > > > > Changes since v1: > > Hmm I'm seeing some qtest failures, for example: > > $ make check-qtest-alpha V=1 > ... > QTEST_QEMU_BINARY=alpha-softmmu/qemu-system-alpha QTEST_QEMU_IMG=qemu-img MALLOC_PERTURB_=${MALLOC_PERTURB_:-$(( ${RANDOM:-0} % 255 + 1))} gtester -k --verbose -m=quick test > s/boot-serial-test tests/qmp-test tests/qmp-cmd-test tests/device-introspect-test tests/cdrom-test tests/machine-none-test tests/qom-test tests/test-hmp > TEST: tests/boot-serial-test... (pid=31091) > /alpha/boot-serial/clipper: Broken pipe > tests/libqtest.c:129: kill_qemu() detected QEMU death from signal 11 (Segmentation fault) (core dumped) > FAIL > GTester: last random seed: R02S948c4a5112fd7682934f4d96e1aff38e > (pid=31099) > FAIL: tests/boot-serial-test I'm pretty sure that the problem is that tlb_init is not being called at all. Note that this applies to the tlb-lock series as well, although there we're just calling qemu_spin_init, which is not really necessary because CPUArchState is 0-allocated. I'll take a look. Thanks, E. ^ permalink raw reply [flat|nested] 18+ messages in thread
* Re: [Qemu-devel] [RFC v2 0/5] Dynamic TLB sizing 2018-10-09 14:38 ` Emilio G. Cota @ 2018-10-09 14:45 ` Alex Bennée 2018-10-09 15:19 ` Emilio G. Cota 0 siblings, 1 reply; 18+ messages in thread From: Alex Bennée @ 2018-10-09 14:45 UTC (permalink / raw) To: Emilio G. Cota; +Cc: qemu-devel, Richard Henderson Emilio G. Cota <cota@braap.org> writes: > On Tue, Oct 09, 2018 at 13:34:40 +0100, Alex Bennée wrote: >> >> Emilio G. Cota <cota@braap.org> writes: >> >> > v1: https://lists.gnu.org/archive/html/qemu-devel/2018-10/msg01146.html >> > >> > Changes since v1: >> >> Hmm I'm seeing some qtest failures, for example: >> >> $ make check-qtest-alpha V=1 >> ... >> QTEST_QEMU_BINARY=alpha-softmmu/qemu-system-alpha QTEST_QEMU_IMG=qemu-img MALLOC_PERTURB_=${MALLOC_PERTURB_:-$(( ${RANDOM:-0} % 255 + 1))} gtester -k --verbose -m=quick test >> s/boot-serial-test tests/qmp-test tests/qmp-cmd-test tests/device-introspect-test tests/cdrom-test tests/machine-none-test tests/qom-test tests/test-hmp >> TEST: tests/boot-serial-test... (pid=31091) >> /alpha/boot-serial/clipper: Broken pipe >> tests/libqtest.c:129: kill_qemu() detected QEMU death from signal 11 (Segmentation fault) (core dumped) >> FAIL >> GTester: last random seed: R02S948c4a5112fd7682934f4d96e1aff38e >> (pid=31099) >> FAIL: tests/boot-serial-test > > I'm pretty sure that the problem is that tlb_init is not being > called at all. Note that this applies to the tlb-lock series > as well, although there we're just calling qemu_spin_init, > which is not really necessary because CPUArchState is 0-allocated. > > I'll take a look. Yeah I hadn't tried to bisect it, but I'm on top of tlb-lock-v4 as requested. -- Alex Bennée ^ permalink raw reply [flat|nested] 18+ messages in thread
* Re: [Qemu-devel] [RFC v2 0/5] Dynamic TLB sizing 2018-10-09 14:45 ` Alex Bennée @ 2018-10-09 15:19 ` Emilio G. Cota 2018-10-09 15:46 ` Alex Bennée 0 siblings, 1 reply; 18+ messages in thread From: Emilio G. Cota @ 2018-10-09 15:19 UTC (permalink / raw) To: Alex Bennée; +Cc: qemu-devel, Richard Henderson On Tue, Oct 09, 2018 at 15:45:36 +0100, Alex Bennée wrote: > > Emilio G. Cota <cota@braap.org> writes: > > > On Tue, Oct 09, 2018 at 13:34:40 +0100, Alex Bennée wrote: > >> > >> Emilio G. Cota <cota@braap.org> writes: > >> > >> > v1: https://lists.gnu.org/archive/html/qemu-devel/2018-10/msg01146.html > >> > > >> > Changes since v1: > >> > >> Hmm I'm seeing some qtest failures, for example: > >> > >> $ make check-qtest-alpha V=1 > >> ... > >> QTEST_QEMU_BINARY=alpha-softmmu/qemu-system-alpha QTEST_QEMU_IMG=qemu-img MALLOC_PERTURB_=${MALLOC_PERTURB_:-$(( ${RANDOM:-0} % 255 + 1))} gtester -k --verbose -m=quick test > >> s/boot-serial-test tests/qmp-test tests/qmp-cmd-test tests/device-introspect-test tests/cdrom-test tests/machine-none-test tests/qom-test tests/test-hmp > >> TEST: tests/boot-serial-test... (pid=31091) > >> /alpha/boot-serial/clipper: Broken pipe > >> tests/libqtest.c:129: kill_qemu() detected QEMU death from signal 11 (Segmentation fault) (core dumped) > >> FAIL > >> GTester: last random seed: R02S948c4a5112fd7682934f4d96e1aff38e > >> (pid=31099) > >> FAIL: tests/boot-serial-test > > > > I'm pretty sure that the problem is that tlb_init is not being > > called at all. Note that this applies to the tlb-lock series > > as well, although there we're just calling qemu_spin_init, > > which is not really necessary because CPUArchState is 0-allocated. > > > > I'll take a look. > > Yeah I hadn't tried to bisect it, but I'm on top of tlb-lock-v4 as > requested. It's the tlb_flush in alpha_cpu_initfn: static void alpha_cpu_initfn(Object *obj) { CPUState *cs = CPU(obj); AlphaCPU *cpu = ALPHA_CPU(obj); CPUAlphaState *env = &cpu->env; cs->env_ptr = env; tlb_flush(cs); We call tlb_init later on at realize time. I think we can safely get rid of this tlb_flush. Unicore also has it. I'll add patches for both to the tlb-lock series. Emilio ^ permalink raw reply [flat|nested] 18+ messages in thread
* Re: [Qemu-devel] [RFC v2 0/5] Dynamic TLB sizing 2018-10-09 15:19 ` Emilio G. Cota @ 2018-10-09 15:46 ` Alex Bennée 0 siblings, 0 replies; 18+ messages in thread From: Alex Bennée @ 2018-10-09 15:46 UTC (permalink / raw) To: Emilio G. Cota; +Cc: qemu-devel, Richard Henderson Emilio G. Cota <cota@braap.org> writes: > On Tue, Oct 09, 2018 at 15:45:36 +0100, Alex Bennée wrote: >> >> Emilio G. Cota <cota@braap.org> writes: >> >> > On Tue, Oct 09, 2018 at 13:34:40 +0100, Alex Bennée wrote: >> >> >> >> Emilio G. Cota <cota@braap.org> writes: >> >> >> >> > v1: https://lists.gnu.org/archive/html/qemu-devel/2018-10/msg01146.html >> >> > >> >> > Changes since v1: >> >> >> >> Hmm I'm seeing some qtest failures, for example: >> >> >> >> $ make check-qtest-alpha V=1 >> >> ... >> >> QTEST_QEMU_BINARY=alpha-softmmu/qemu-system-alpha QTEST_QEMU_IMG=qemu-img MALLOC_PERTURB_=${MALLOC_PERTURB_:-$(( ${RANDOM:-0} % 255 + 1))} gtester -k --verbose -m=quick test >> >> s/boot-serial-test tests/qmp-test tests/qmp-cmd-test tests/device-introspect-test tests/cdrom-test tests/machine-none-test tests/qom-test tests/test-hmp >> >> TEST: tests/boot-serial-test... (pid=31091) >> >> /alpha/boot-serial/clipper: Broken pipe >> >> tests/libqtest.c:129: kill_qemu() detected QEMU death from signal 11 (Segmentation fault) (core dumped) >> >> FAIL >> >> GTester: last random seed: R02S948c4a5112fd7682934f4d96e1aff38e >> >> (pid=31099) >> >> FAIL: tests/boot-serial-test >> > >> > I'm pretty sure that the problem is that tlb_init is not being >> > called at all. Note that this applies to the tlb-lock series >> > as well, although there we're just calling qemu_spin_init, >> > which is not really necessary because CPUArchState is 0-allocated. >> > >> > I'll take a look. >> >> Yeah I hadn't tried to bisect it, but I'm on top of tlb-lock-v4 as >> requested. > > It's the tlb_flush in alpha_cpu_initfn: > > static void alpha_cpu_initfn(Object *obj) > { > CPUState *cs = CPU(obj); > AlphaCPU *cpu = ALPHA_CPU(obj); > CPUAlphaState *env = &cpu->env; > > cs->env_ptr = env; > tlb_flush(cs); > > We call tlb_init later on at realize time. > > I think we can safely get rid of this tlb_flush. Agreed. > Unicore > also has it. I'll add patches for both to the tlb-lock series. Weirdly this didn't fail, but I agree it's superfluous. > > Emilio -- Alex Bennée ^ permalink raw reply [flat|nested] 18+ messages in thread