qemu-devel.nongnu.org archive mirror
* [Qemu-devel] [PATCH] cpu: implementing victim TLB for QEMU system emulated TLB
@ 2014-01-22 14:48 Xin Tong
  2014-01-22 21:55 ` Richard Henderson
  2014-01-23 11:23 ` Alex Bennée
  0 siblings, 2 replies; 6+ messages in thread
From: Xin Tong @ 2014-01-22 14:48 UTC (permalink / raw)
  To: QEMU Developers, afaerber, aliguori

[-- Attachment #1: Type: text/plain, Size: 12990 bytes --]

This patch adds a victim TLB to the QEMU system mode TLB.

QEMU system mode page table walks are expensive. Measured by running
qemu-system-x86_64 under Intel PIN, a TLB miss plus a walk of the 4-level
page tables of a guest Linux OS takes ~450 x86 instructions on average.

The QEMU system mode TLB is implemented as a direct-mapped hash table, a
structure that suffers from conflict misses. Simply increasing the
associativity of the main TLB is not an attractive fix, because all of the
ways may then have to be searched serially on every lookup.

A victim TLB holds translations evicted from the primary TLB upon
replacement. It sits between the main TLB and its refill path, and it has
greater associativity (fully associative in this patch). Looking up the
victim TLB takes longer than the inline lookup, but it is likely still far
cheaper than a full page table walk. The memory translation path changes
as follows:

Before Victim TLB:
1. Inline TLB lookup
2. Exit code cache on TLB miss.
3. Check for unaligned and I/O accesses
4. TLB refill.
5. Do the memory access.
6. Return to code cache.

After Victim TLB:
1. Inline TLB lookup
2. Exit code cache on TLB miss.
3. Check for unaligned and I/O accesses
4. Victim TLB lookup.
5. If victim TLB misses, TLB refill
6. Do the memory access.
7. Return to code cache

The advantage is that the victim TLB adds associativity to the
direct-mapped main TLB, and thus can avoid a number of page table walks,
while keeping the time needed to flush the TLB within reasonable limits.
The cost is a slightly longer refill path, since the victim TLB is
consulted before falling back to tlb_fill. The performance results below
demonstrate that the pros outweigh the cons.
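
For the load helpers, the new slow path boils down to roughly the following
(sketch only; the complete code is in the diff below):

    /* the inline, direct-mapped lookup has already missed at this point */
    for (vidx = 0; vidx < CPU_VTLB_SIZE; ++vidx) {
        if (env->tlb_v_table[mmu_idx][vidx].addr_read == (addr & TARGET_PAGE_MASK)) {
            /* victim hit: swap the victim entry with the conflicting main entry */
            swap_tlb(&env->tlb_table[mmu_idx][index],
                     &env->tlb_v_table[mmu_idx][vidx],
                     &env->iotlb[mmu_idx][index],
                     &env->iotlb_v[mmu_idx][vidx]);
            break;
        }
    }
    if (vidx == CPU_VTLB_SIZE) {
        /* missed in both TLBs: fall back to a guest page table walk */
        tlb_fill(env, addr, READ_ACCESS_TYPE, mmu_idx, retaddr);
    }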

Attached are some performance results taken with the SPECINT2006 train
dataset on an Intel(R) Xeon(R) CPU E5620 @ 2.40GHz Linux machine. In
summary, the victim TLB improves the performance of qemu-system-x86_64 by
11% on average on SPECINT2006, with the highest improvement of 25.4% in
464.h264ref. The victim TLB does not cause a performance degradation in
any of the measured benchmarks. Furthermore, the implementation is
architecture independent and is expected to benefit the other target
architectures in QEMU as well.

Although there are measurement fluctuations, the performance improvements
are significant and well outside the range of noise.

Signed-off-by: Xin Tong <trent.tong@gmail.com>
---
 cputlb.c                        |   47 ++++++++++++++++++++++++--
 include/exec/cpu-defs.h         |   15 ++++++---
 include/exec/exec-all.h         |    2 ++
 include/exec/softmmu_template.h |   69 ++++++++++++++++++++++++++++++++++++---
 4 files changed, 122 insertions(+), 11 deletions(-)

diff --git a/cputlb.c b/cputlb.c
index b533f3f..bb83c07 100644
--- a/cputlb.c
+++ b/cputlb.c
@@ -34,6 +34,19 @@
 /* statistics */
 int tlb_flush_count;

+#define TLB_XOR_SWAP(X, Y) do {*X = *X ^ *Y; *Y = *X ^ *Y; *X = *X ^ *Y;}while(0);
+
+/* used by victim tlb. swap the 2 given TLB entries as well as their corresponding IOTLB */
+void swap_tlb(CPUTLBEntry *te, CPUTLBEntry *se, hwaddr *iote, hwaddr *iose)
+{
+   /* tlb and iotlb swap */
+   TLB_XOR_SWAP(iote, iose);
+   TLB_XOR_SWAP(&te->addend,     &se->addend);
+   TLB_XOR_SWAP(&te->addr_code,  &se->addr_code);
+   TLB_XOR_SWAP(&te->addr_read,  &se->addr_read);
+   TLB_XOR_SWAP(&te->addr_write, &se->addr_write);
+}
+
 /* NOTE:
  * If flush_global is true (the usual case), flush all tlb entries.
  * If flush_global is false, flush (at least) all tlb entries not
@@ -58,6 +71,7 @@ void tlb_flush(CPUArchState *env, int flush_global)
     cpu->current_tb = NULL;

     memset(env->tlb_table, -1, sizeof(env->tlb_table));
+    memset(env->tlb_v_table, -1, sizeof(env->tlb_v_table));
     memset(env->tb_jmp_cache, 0, sizeof(env->tb_jmp_cache));

     env->tlb_flush_addr = -1;
@@ -106,6 +120,14 @@ void tlb_flush_page(CPUArchState *env, target_ulong addr)
         tlb_flush_entry(&env->tlb_table[mmu_idx][i], addr);
     }

+    /* check whether there are entries that need to be flushed in the vtlb */
+    for (mmu_idx = 0; mmu_idx < NB_MMU_MODES; mmu_idx++) {
+        unsigned int k;
+        for (k = 0;k < CPU_VTLB_SIZE; k++) {
+             tlb_flush_entry(&env->tlb_v_table[mmu_idx][k], addr);
+        }
+    }
+
     tb_flush_jmp_cache(env, addr);
 }

@@ -165,11 +187,15 @@ void cpu_tlb_reset_dirty_all(ram_addr_t start1, ram_addr_t length)
         env = cpu->env_ptr;
         for (mmu_idx = 0; mmu_idx < NB_MMU_MODES; mmu_idx++) {
             unsigned int i;
-
             for (i = 0; i < CPU_TLB_SIZE; i++) {
                 tlb_reset_dirty_range(&env->tlb_table[mmu_idx][i],
                                       start1, length);
             }
+
+            for (i = 0; i < CPU_VTLB_SIZE; i++) {
+                tlb_reset_dirty_range(&env->tlb_v_table[mmu_idx][i],
+                                      start1, length);
+            }
         }
     }
 }
@@ -193,6 +219,13 @@ void tlb_set_dirty(CPUArchState *env, target_ulong vaddr)
     for (mmu_idx = 0; mmu_idx < NB_MMU_MODES; mmu_idx++) {
         tlb_set_dirty1(&env->tlb_table[mmu_idx][i], vaddr);
     }
+
+    for (mmu_idx = 0; mmu_idx < NB_MMU_MODES; mmu_idx++) {
+        unsigned int k;
+        for (k = 0; k < CPU_VTLB_SIZE; k++) {
+            tlb_set_dirty1(&env->tlb_v_table[mmu_idx][k], vaddr);
+        }
+    }
 }

 /* Our TLB does not support large pages, so remember the area covered by
@@ -264,8 +297,18 @@ void tlb_set_page(CPUArchState *env, target_ulong vaddr,
                                             prot, &address);

     index = (vaddr >> TARGET_PAGE_BITS) & (CPU_TLB_SIZE - 1);
-    env->iotlb[mmu_idx][index] = iotlb - vaddr;
     te = &env->tlb_table[mmu_idx][index];
+
+    /* do not discard the translation in te, evict it into a random victim tlb */
+    unsigned vidx = rand() % CPU_VTLB_SIZE;
+    env->tlb_v_table[mmu_idx][vidx].addr_read  = te->addr_read;
+    env->tlb_v_table[mmu_idx][vidx].addr_write  = te->addr_write;
+    env->tlb_v_table[mmu_idx][vidx].addr_code  = te->addr_code;
+    env->tlb_v_table[mmu_idx][vidx].addend      = te->addend;
+    env->iotlb_v[mmu_idx][vidx] = env->iotlb[mmu_idx][index];
+
+    /* refill the tlb */
+    env->iotlb[mmu_idx][index] = iotlb - vaddr;
     te->addend = addend - vaddr;
     if (prot & PAGE_READ) {
         te->addr_read = address;
diff --git a/include/exec/cpu-defs.h b/include/exec/cpu-defs.h
index 01cd8c7..771f39c 100644
--- a/include/exec/cpu-defs.h
+++ b/include/exec/cpu-defs.h
@@ -72,8 +72,10 @@ typedef uint64_t target_ulong;
 #define TB_JMP_PAGE_MASK (TB_JMP_CACHE_SIZE - TB_JMP_PAGE_SIZE)

 #if !defined(CONFIG_USER_ONLY)
-#define CPU_TLB_BITS 8
-#define CPU_TLB_SIZE (1 << CPU_TLB_BITS)
+#define CPU_TLB_BITS  8
+#define CPU_TLB_SIZE  (1 << CPU_TLB_BITS)
+/* use a fully associative victim tlb */
+#define CPU_VTLB_SIZE 8

 #if HOST_LONG_BITS == 32 && TARGET_LONG_BITS == 32
 #define CPU_TLB_ENTRY_BITS 4
@@ -103,12 +105,15 @@ typedef struct CPUTLBEntry {

 QEMU_BUILD_BUG_ON(sizeof(CPUTLBEntry) != (1 << CPU_TLB_ENTRY_BITS));

+/* The meaning of the MMU modes is defined in the target code. */
 #define CPU_COMMON_TLB \
     /* The meaning of the MMU modes is defined in the target code. */   \
-    CPUTLBEntry tlb_table[NB_MMU_MODES][CPU_TLB_SIZE];                  \
-    hwaddr iotlb[NB_MMU_MODES][CPU_TLB_SIZE];               \
+    CPUTLBEntry  tlb_table[NB_MMU_MODES][CPU_TLB_SIZE];                 \
+    CPUTLBEntry  tlb_v_table[NB_MMU_MODES][CPU_VTLB_SIZE];              \
+    hwaddr       iotlb[NB_MMU_MODES][CPU_TLB_SIZE];                     \
+    hwaddr       iotlb_v[NB_MMU_MODES][CPU_VTLB_SIZE];                  \
     target_ulong tlb_flush_addr;                                        \
-    target_ulong tlb_flush_mask;
+    target_ulong tlb_flush_mask;                                        \

 #else

diff --git a/include/exec/exec-all.h b/include/exec/exec-all.h
index ea90b64..74eb674 100644
--- a/include/exec/exec-all.h
+++ b/include/exec/exec-all.h
@@ -102,6 +102,8 @@ void tlb_set_page(CPUArchState *env, target_ulong vaddr,
                   hwaddr paddr, int prot,
                   int mmu_idx, target_ulong size);
 void tb_invalidate_phys_addr(hwaddr addr);
+/* swap the 2 given tlb entries as well as their iotlb */
+void swap_tlb(CPUTLBEntry *te, CPUTLBEntry *se, hwaddr *iote, hwaddr *iose);
 #else
 static inline void tlb_flush_page(CPUArchState *env, target_ulong addr)
 {
diff --git a/include/exec/softmmu_template.h b/include/exec/softmmu_template.h
index c6a5440..d63f694 100644
--- a/include/exec/softmmu_template.h
+++ b/include/exec/softmmu_template.h
@@ -153,7 +153,23 @@ WORD_TYPE helper_le_ld_name(CPUArchState *env, target_ulong addr, int mmu_idx,
             do_unaligned_access(env, addr, READ_ACCESS_TYPE, mmu_idx, retaddr);
         }
 #endif
-        tlb_fill(env, addr, READ_ACCESS_TYPE, mmu_idx, retaddr);
+        /* we are about to do a page table walk. our last hope is the victim tlb.
+         * try to refill from the victim tlb before walking the page table. */
+        int vidx, vhit = false;
+        for(vidx = 0;vidx < CPU_VTLB_SIZE; ++vidx) {
+          if (env->tlb_v_table[mmu_idx][vidx].ADDR_READ == (addr & TARGET_PAGE_MASK)) {
+             /* found entry in victim tlb */
+             swap_tlb(&env->tlb_table[mmu_idx][index],
+                      &env->tlb_v_table[mmu_idx][vidx],
+                      &env->iotlb[mmu_idx][index],
+                      &env->iotlb_v[mmu_idx][vidx]);
+             vhit = true;
+             break;
+          }
+        }
+
+        /* missed in victim tlb */
+        if (!vhit) tlb_fill(env, addr, READ_ACCESS_TYPE, mmu_idx, retaddr);
         tlb_addr = env->tlb_table[mmu_idx][index].ADDR_READ;
     }

@@ -235,7 +251,22 @@ WORD_TYPE helper_be_ld_name(CPUArchState *env, target_ulong addr, int mmu_idx,
             do_unaligned_access(env, addr, READ_ACCESS_TYPE, mmu_idx, retaddr);
         }
 #endif
-        tlb_fill(env, addr, READ_ACCESS_TYPE, mmu_idx, retaddr);
+        /* we are about to do a page table walk. our last hope is the victim tlb.
+         * try to refill from the victim tlb before walking the page table. */
+        int vidx, vhit = false;
+        for(vidx = 0;vidx < CPU_VTLB_SIZE; ++vidx) {
+          if (env->tlb_v_table[mmu_idx][vidx].ADDR_READ == (addr & TARGET_PAGE_MASK)) {
+             /* found entry in victim tlb */
+             swap_tlb(&env->tlb_table[mmu_idx][index],
+                      &env->tlb_v_table[mmu_idx][vidx],
+                      &env->iotlb[mmu_idx][index],
+                      &env->iotlb_v[mmu_idx][vidx]);
+             vhit = true;
+             break;
+          }
+        }
+
+        if (!vhit) tlb_fill(env, addr, READ_ACCESS_TYPE, mmu_idx, retaddr);
         tlb_addr = env->tlb_table[mmu_idx][index].ADDR_READ;
     }

@@ -354,7 +385,22 @@ void helper_le_st_name(CPUArchState *env, target_ulong addr, DATA_TYPE val,
             do_unaligned_access(env, addr, 1, mmu_idx, retaddr);
         }
 #endif
-        tlb_fill(env, addr, 1, mmu_idx, retaddr);
+        /* we are about to do a page table walk. our last hope is the victim tlb.
+         * try to refill from the victim tlb before walking the page table. */
+        int vidx, vhit = false;
+        for(vidx = 0;vidx < CPU_VTLB_SIZE; ++vidx) {
+          if (env->tlb_v_table[mmu_idx][vidx].addr_write == (addr & TARGET_PAGE_MASK)) {
+             /* found entry in victim tlb */
+             swap_tlb(&env->tlb_table[mmu_idx][index],
+                      &env->tlb_v_table[mmu_idx][vidx],
+                      &env->iotlb[mmu_idx][index],
+                      &env->iotlb_v[mmu_idx][vidx]);
+             vhit = true;
+             break;
+          }
+        }
+
+        if (!vhit) tlb_fill(env, addr, 1, mmu_idx, retaddr);
         tlb_addr = env->tlb_table[mmu_idx][index].addr_write;
     }

@@ -430,7 +476,22 @@ void helper_be_st_name(CPUArchState *env, target_ulong addr, DATA_TYPE val,
             do_unaligned_access(env, addr, 1, mmu_idx, retaddr);
         }
 #endif
-        tlb_fill(env, addr, 1, mmu_idx, retaddr);
+        /* we are about to do a page table walk. our last hope is the victim tlb.
+         * try to refill from the victim tlb before walking the page table. */
+        int vidx, vhit = false;
+        for(vidx = 0;vidx < CPU_VTLB_SIZE; ++vidx) {
+          if (env->tlb_v_table[mmu_idx][vidx].addr_write == (addr & TARGET_PAGE_MASK)) {
+             /* found entry in victim tlb */
+             swap_tlb(&env->tlb_table[mmu_idx][index],
+                      &env->tlb_v_table[mmu_idx][vidx],
+                      &env->iotlb[mmu_idx][index],
+                      &env->iotlb_v[mmu_idx][vidx]);
+             vhit = true;
+             break;
+          }
+        }
+
+        if (!vhit) tlb_fill(env, addr, 1, mmu_idx, retaddr);
         tlb_addr = env->tlb_table[mmu_idx][index].addr_write;
     }

-- 
1.7.9.5

[-- Attachment #2: vtlb.xlsx --]
[-- Type: application/vnd.openxmlformats-officedocument.spreadsheetml.sheet, Size: 14939 bytes --]


* Re: [Qemu-devel] [PATCH] cpu: implementing victim TLB for QEMU system emulated TLB
  2014-01-22 14:48 [Qemu-devel] [PATCH] cpu: implementing victim TLB for QEMU system emulated TLB Xin Tong
@ 2014-01-22 21:55 ` Richard Henderson
  2014-01-22 22:40   ` Xin Tong
  2014-01-23 11:23 ` Alex Bennée
  1 sibling, 1 reply; 6+ messages in thread
From: Richard Henderson @ 2014-01-22 21:55 UTC (permalink / raw)
  To: Xin Tong, QEMU Developers, afaerber, aliguori

On 01/22/2014 06:48 AM, Xin Tong wrote:
> +#define TLB_XOR_SWAP(X, Y) do {*X = *X ^ *Y; *Y = *X ^ *Y; *X = *X ^
> *Y;}while(0);

First, your patch is line wrapped.  You really really really need to follow the
directions Peter gave you.

Second, using xor to swap values is a cute assembler trick, but it has no place
in high-level programming.  Look at the generated assembly and you'll find way
more memory accesses than necessary.

> +void swap_tlb(CPUTLBEntry *te, CPUTLBEntry *se, hwaddr *iote, hwaddr *iose)

This function could probably stand to be inline, so that we produce better code
for softmmu_template.h.
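
Something along these lines would be enough (just a sketch -- a plain
temporary plus struct assignment, marked inline):

  static inline void swap_tlb(CPUTLBEntry *te, CPUTLBEntry *se,
                              hwaddr *iote, hwaddr *iose)
  {
      CPUTLBEntry t = *te;
      hwaddr io = *iote;

      *te = *se;
      *se = t;
      *iote = *iose;
      *iose = io;
  }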

> +        for (k = 0;k < CPU_VTLB_SIZE; k++) {

Watch your spacing.  Did the patch pass checkpatch.pl?

>          for (mmu_idx = 0; mmu_idx < NB_MMU_MODES; mmu_idx++) {
>              unsigned int i;
> -
>              for (i = 0; i < CPU_TLB_SIZE; i++) {

Don't randomly change whitespace.

> +    /* do not discard the translation in te, evict it into a random
> victim tlb */
> +    unsigned vidx = rand() % CPU_VTLB_SIZE;

Don't use rand.  That's a huge heavy-weight function.  Treating the victim
table as a circular buffer would surely be quicker.  Using a LRU algorithm
might do better, but could also be overkill.
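
For example, a round-robin counter would do (sketch only; "vtlb_index"
would be a new per-mmu-idx field you add to the env):

  /* vtlb_index: hypothetical new per-mmu-idx counter in env */
  unsigned vidx = env->vtlb_index[mmu_idx]++ % CPU_VTLB_SIZE;

Since CPU_VTLB_SIZE is a power of two, the modulo reduces to a mask.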

>              do_unaligned_access(env, addr, READ_ACCESS_TYPE, mmu_idx, retaddr);
>          }
>  #endif
> -        tlb_fill(env, addr, READ_ACCESS_TYPE, mmu_idx, retaddr);
> +        /* we are about to do a page table walk. our last hope is the
> victim tlb.
> +         * try to refill from the victim tlb before walking the page table. */
> +        int vidx, vhit = false;

We're supposed to be c89 compliant.  No declarations in the middle of the
block.  Also, you can avoid the vhit variable entirely with

> +        for(vidx = 0;vidx < CPU_VTLB_SIZE; ++vidx) {

  for (vidx = CPU_VTLB_SIZE - 1; vidx >= 0; --vidx) {
      ...
  }
  if (vidx < 0) {
      tlb_fill(...);
  }
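
Spelled out with the names from your patch, that would look roughly like
(sketch for the read case):

  for (vidx = CPU_VTLB_SIZE - 1; vidx >= 0; --vidx) {
      if (env->tlb_v_table[mmu_idx][vidx].ADDR_READ
          == (addr & TARGET_PAGE_MASK)) {
          /* found in victim tlb: swap it into the main tlb */
          swap_tlb(&env->tlb_table[mmu_idx][index],
                   &env->tlb_v_table[mmu_idx][vidx],
                   &env->iotlb[mmu_idx][index],
                   &env->iotlb_v[mmu_idx][vidx]);
          break;
      }
  }
  if (vidx < 0) {
      tlb_fill(env, addr, READ_ACCESS_TYPE, mmu_idx, retaddr);
  }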


r~


* Re: [Qemu-devel] [PATCH] cpu: implementing victim TLB for QEMU system emulated TLB
  2014-01-22 21:55 ` Richard Henderson
@ 2014-01-22 22:40   ` Xin Tong
  2014-01-22 22:56     ` Richard Henderson
  0 siblings, 1 reply; 6+ messages in thread
From: Xin Tong @ 2014-01-22 22:40 UTC (permalink / raw)
  To: Richard Henderson; +Cc: QEMU Developers, aliguori, afaerber

Richard.

Thank you very much for your comments. I will fix the problems you raised
and make sure my next patch passes checkpatch.pl. One question, just to be
sure: after I go back and fix the problems with this patch, should I send
another (separate) email with the title "[Qemu-devel] [PATCH v2] cpu:
implementing victim TLB for QEMU system emulated TLB", containing the
changes from both of the patches?

Xin

On Wed, Jan 22, 2014 at 3:55 PM, Richard Henderson <rth@twiddle.net> wrote:
> On 01/22/2014 06:48 AM, Xin Tong wrote:
>> +#define TLB_XOR_SWAP(X, Y) do {*X = *X ^ *Y; *Y = *X ^ *Y; *X = *X ^
>> *Y;}while(0);
>
> First, your patch is line wrapped.  You really really really need to follow the
> directions Peter gave you.
>
> Second, using xor to swap values is a cute assembler trick, but it has no place
> in high-level programming.  Look at the generated assembly and you'll find way
> more memory accesses than necessary.
>
>> +void swap_tlb(CPUTLBEntry *te, CPUTLBEntry *se, hwaddr *iote, hwaddr *iose)
>
> This function could probably stand to be inline, so that we produce better code
> for softmmu_template.h.
>
>> +        for (k = 0;k < CPU_VTLB_SIZE; k++) {
>
> Watch your spacing.  Did the patch pass checkpatch.pl?
>
>>          for (mmu_idx = 0; mmu_idx < NB_MMU_MODES; mmu_idx++) {
>>              unsigned int i;
>> -
>>              for (i = 0; i < CPU_TLB_SIZE; i++) {
>
> Don't randomly change whitespace.
>
>> +    /* do not discard the translation in te, evict it into a random
>> victim tlb */
>> +    unsigned vidx = rand() % CPU_VTLB_SIZE;
>
> Don't use rand.  That's a huge heavy-weight function.  Treating the victim
> table as a circular buffer would surely be quicker.  Using a LRU algorithm
> might do better, but could also be overkill.
>
>>              do_unaligned_access(env, addr, READ_ACCESS_TYPE, mmu_idx, retaddr);
>>          }
>>  #endif
>> -        tlb_fill(env, addr, READ_ACCESS_TYPE, mmu_idx, retaddr);
>> +        /* we are about to do a page table walk. our last hope is the
>> victim tlb.
>> +         * try to refill from the victim tlb before walking the page table. */
>> +        int vidx, vhit = false;
>
> We're supposed to be c89 compliant.  No declarations in the middle of the
> block.  Also, you can avoid the vhit variable entirely with
>
>> +        for(vidx = 0;vidx < CPU_VTLB_SIZE; ++vidx) {
>
>   for (vidx = CPU_VTLB_SIZE - 1; vidx >= 0; --vidx) {
>       ...
>   }
>   if (vidx < 0) {
>       tlb_fill(...);
>   }
>
>
> r~


* Re: [Qemu-devel] [PATCH] cpu: implementing victim TLB for QEMU system emulated TLB
  2014-01-22 22:40   ` Xin Tong
@ 2014-01-22 22:56     ` Richard Henderson
  0 siblings, 0 replies; 6+ messages in thread
From: Richard Henderson @ 2014-01-22 22:56 UTC (permalink / raw)
  To: Xin Tong; +Cc: QEMU Developers, aliguori, afaerber

On 01/22/2014 02:40 PM, Xin Tong wrote:
> Thank you very much for your comments. I will provide fixes to the
> problems you raised and make sure my next patch passes the
> checkpatch.pl. I have a question I would like to make sure. After i go
> back and fixed the problems with this patch. I need to send another
> (different) email with the title of "[Qemu-devel] [PATCH v2] cpu:
> implementing victim TLB for QEMU system emulated TLB"  and with the
> changes from both of the patches?

Well, "both" patches is a misnomer.  One patch.  The new patch.

The subject does not need to contain [Qemu-devel].  Try

  git format-patch --subject-prefix='PATCH v2' master

from the branch containing your work.  Then use "git send-email" to avoid the
line wrapping you had before.
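
For example, something like:

  git config sendemail.smtpserver smtp.example.com   # your outgoing server
  git send-email --to=qemu-devel@nongnu.org 0001-*.patch

(adjust the SMTP settings for your mail provider).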

But otherwise yes.  And if you can get some different benchmarks, something
with more context switches than running a single program, that'd be great.
Peter mentioned an OS boot.  I might suggest a complex shell script.

An example that comes to mind is qemu's own configure script.  Set up a linux
virtual machine with everything to build qemu within the virtual machine.  This
part can be done with -enable-kvm for speed.  Then restart the vm without
-enable-kvm so that we use TCG.  Then you can do something as simple as "time
./configure" to get your number.
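
For instance (illustrative command lines; image name and memory size are
just placeholders):

  # set up the guest and its build environment with KVM for speed
  qemu-system-x86_64 -enable-kvm -m 2048 -hda qemu-build.img
  # rerun the same guest without -enable-kvm so that TCG is used
  qemu-system-x86_64 -m 2048 -hda qemu-build.img
  # inside the guest, in the qemu source tree:
  time ./configure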


r~


* Re: [Qemu-devel] [PATCH] cpu: implementing victim TLB for QEMU system emulated TLB
  2014-01-22 14:48 [Qemu-devel] [PATCH] cpu: implementing victim TLB for QEMU system emulated TLB Xin Tong
  2014-01-22 21:55 ` Richard Henderson
@ 2014-01-23 11:23 ` Alex Bennée
  2014-01-23 13:50   ` Xin Tong
  1 sibling, 1 reply; 6+ messages in thread
From: Alex Bennée @ 2014-01-23 11:23 UTC (permalink / raw)
  To: Xin Tong; +Cc: Stefan Hajnoczi, QEMU Developers, aliguori, afaerber


trent.tong@gmail.com writes:

> This patch adds a victim TLB to the QEMU system mode TLB.
>
> QEMU system mode page table walks are expensive. Measured by running
> qemu-system-x86_64 under Intel PIN, a TLB miss plus a walk of the 4-level
> page tables of a guest Linux OS takes ~450 x86 instructions on average.
<snip>
>
> Attached are some performance results taken with the SPECINT2006 train
> dataset on an Intel(R) Xeon(R) CPU E5620 @ 2.40GHz Linux machine. In
> summary, the victim TLB improves the performance of qemu-system-x86_64 by
> 11% on average on SPECINT2006, with the highest improvement of 25.4% in
> 464.h264ref. The victim TLB does not cause a performance degradation in
> any of the measured benchmarks. Furthermore, the implementation is
> architecture independent and is expected to benefit the other target
> architectures in QEMU as well.
>
> Although there are measurement fluctuations, the performance improvements
> are significant and well outside the range of noise.
<snip>

I'm curious as the implication seems to be that entries are evicted from
initial TLB lookup before they are "done". What would the impact be of
simply growing the size of the main TLB cache?

What's the current state of instrumentation around the system TLB
handling? Can we trace the hit rates of the various caches with
perf/oprofile/whatever (Stefan?)?

-- 
Alex Bennée


* Re: [Qemu-devel] [PATCH] cpu: implementing victim TLB for QEMU system emulated TLB
  2014-01-23 11:23 ` Alex Bennée
@ 2014-01-23 13:50   ` Xin Tong
  0 siblings, 0 replies; 6+ messages in thread
From: Xin Tong @ 2014-01-23 13:50 UTC (permalink / raw)
  To: Alex Bennée; +Cc: Stefan Hajnoczi, QEMU Developers, aliguori, afaerber

On Thu, Jan 23, 2014 at 5:23 AM, Alex Bennée <alex.bennee@linaro.org> wrote:
>
> trent.tong@gmail.com writes:
>
>> This patch adds a victim TLB to the QEMU system mode TLB.
>>
>> QEMU system mode page table walks are expensive. Measured by running
>> qemu-system-x86_64 under Intel PIN, a TLB miss plus a walk of the 4-level
>> page tables of a guest Linux OS takes ~450 x86 instructions on average.
> <snip>
>>
>> Attached are some performance results taken with the SPECINT2006 train
>> dataset on an Intel(R) Xeon(R) CPU E5620 @ 2.40GHz Linux machine. In
>> summary, the victim TLB improves the performance of qemu-system-x86_64 by
>> 11% on average on SPECINT2006, with the highest improvement of 25.4% in
>> 464.h264ref. The victim TLB does not cause a performance degradation in
>> any of the measured benchmarks. Furthermore, the implementation is
>> architecture independent and is expected to benefit the other target
>> architectures in QEMU as well.
>>
>> Although there are measurement fluctuations, the performance improvements
>> are significant and well outside the range of noise.
> <snip>
>
> I'm curious as the implication seems to be that entries are evicted from
> initial TLB lookup before they are "done". What would the impact be of
> simply growing the size of the main TLB cache?

Growing the size of the TLB gives a significant performance improvement as
well. I only have an incomplete set of numbers, but the numbers I do have
show a significant improvement. That said, the victim TLB is still a nice
addition: no matter how big you make the main TLB, there will always be
conflict misses due to the low associativity of the direct-mapped TLB
table.

>
> What's the current state of instrumentation around the system TLB
> handling? Can we trace the hit rates of the various caches with
> perf/oprofile/whatever (Stefan?)?
>

We do not have any TLB hit/miss tracking in the QEMU mainline code right
now. I think perf/oprofile can tell us how much time we spend in TLB lookup
and TLB refill, but we would need TCG-generated instrumentation to get the
TLB hit/miss rates.
> --
> Alex Bennée
>

