* [Qemu-devel] outlined TLB lookup on x86
From: Xin Tong @ 2013-11-27 7:41 UTC
To: qemu-devel

I am trying to implement an out-of-line TLB lookup for QEMU softmmu-x86-64 on an
x86-64 machine, potentially for better instruction cache performance, and I have
a few questions.

1. I see that tcg_out_qemu_ld_slow_path/tcg_out_qemu_st_slow_path are generated
when tcg_out_tb_finalize is called. When a TLB lookup misses, the generated code
jumps to the slow path; the slow path refills the TLB, performs the load/store,
and jumps to the next emulated instruction. I am wondering whether it is easy to
outline the code for the slow path. I am thinking that when the TLB misses, the
outlined TLB lookup code should call out to qemu_ld/st_helpers[opc & ~MO_SIGN]
and re-walk the TLB after it is refilled. This code is off the critical path, so
it is not as important as the code for a TLB hit.

2. Why not use a TLB of a bigger size? Currently the TLB has 1<<8 entries. The
TLB lookup is 10 x86 instructions, but every miss needs ~450 instructions (I
measured this using Intel PIN), so even if the miss rate is low (say 3%), the
overall time spent in cpu_x86_handle_mmu_fault is still significant. I am
thinking the TLB may need to be organized in a set-associative fashion to reduce
conflict misses, e.g. 2-way set associative to reduce the miss rate, or to have a
victim TLB that is 4-way associative and use x86 SIMD instructions to do the
lookup once the direct-mapped TLB misses. Has anybody done any work on this
front?

3. What are some of the drawbacks of using a superlarge TLB, i.e. a TLB with 4K
entries?

Xin
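For reference, here is a minimal C sketch of what the ~10 inlined fast-path
instructions compute. The real fast path is emitted by the TCG backend as host
assembly and the entry layout carries extra flag bits; the struct, macros and
function below are simplified assumptions made purely for illustration:

    /* Simplified model of QEMU's direct-mapped softmmu TLB (illustrative only). */
    #include <stdint.h>
    #include <stddef.h>

    #define TARGET_PAGE_BITS 12
    #define TARGET_PAGE_MASK (~((uintptr_t)(1 << TARGET_PAGE_BITS) - 1))
    #define CPU_TLB_BITS     8
    #define CPU_TLB_SIZE     (1 << CPU_TLB_BITS)

    typedef struct CPUTLBEntrySketch {
        uintptr_t addr_read;   /* page-aligned guest vaddr valid for reads, or -1 */
        uintptr_t addend;      /* host_addr = guest_vaddr + addend */
    } CPUTLBEntrySketch;

    static CPUTLBEntrySketch tlb_table[CPU_TLB_SIZE];

    /* Fast path: index, compare the page tag, and either hit or fall back. */
    static inline void *tlb_lookup_read(uintptr_t vaddr)
    {
        size_t idx = (vaddr >> TARGET_PAGE_BITS) & (CPU_TLB_SIZE - 1);
        CPUTLBEntrySketch *e = &tlb_table[idx];

        if (e->addr_read == (vaddr & TARGET_PAGE_MASK)) {
            return (void *)(vaddr + e->addend);   /* hit: direct host pointer */
        }
        return NULL;  /* miss: the caller would take the slow path (helper call) */
    }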
* Re: [Qemu-devel] outlined TLB lookup on x86
From: Lluís Vilanova @ 2013-11-27 13:12 UTC
To: qemu-devel

Xin Tong writes:

> I am trying to implement a out-of-line TLB lookup for QEMU softmmu-x86-64 on
> x86-64 machine, potentially for better instruction cache performance, I have a
> few questions.

> 1. I see that tcg_out_qemu_ld_slow_path/tcg_out_qemu_st_slow_path are generated
> when tcg_out_tb_finalize is called. And when a TLB lookup misses, it jumps to
> the generated slow path and slow path refills the TLB, then load/store and
> jumps to the next emulated instruction. I am wondering is it easy to outline
> the code for the slow path. I am thinking when a TLB misses, the outlined TLB
> lookup code should generate a call out to the qemu_ld/st_helpers[opc & ~MO_SIGN]
> and rewalk the TLB after its refilled ? This code is off the critical path, so
> its not as important as the code when TLB hits.
> 2. why not use a TLB or bigger size? currently the TLB has 1<<8 entries. the
> TLB lookup is 10 x86 instructions , but every miss needs ~450 instructions, i
> measured this using Intel PIN. so even the miss rate is low (say 3%) the
> overall time spent in the cpu_x86_handle_mmu_fault is still signifcant. I am
> thinking the tlb may need to be organized in a set associative fashion to
> reduce conflict miss, e.g. 2 way set associative to reduce the miss rate. or
> have a victim tlb that is 4 way associative and use x86 simd instructions to
> do the lookup once the direct-mapped tlb misses. Has anybody done any work on
> this front ?
> 3. what are some of the drawbacks of using a superlarge TLB, i.e. a TLB with
> 4K entries ?

Using vector intrinsics for the TLB lookup will probably make the code less
portable. I don't know how compatible the GCC and LLVM vectorizing intrinsics
are with each other (since there have been some efforts on making QEMU also
compile with LLVM).

A larger TLB will make some operations slower (e.g., look for CPU_TLB_SIZE in
cputlb.c), but the higher hit ratio could pay off, although I don't know how
the current size was chosen.


Lluis

--
 "And it's much the same thing with knowledge, for whenever you learn
 something new, the whole world becomes that much richer."
 -- The Princess of Pure Reason, as told by Norton Juster in The Phantom
 Tollbooth
* Re: [Qemu-devel] outlined TLB lookup on x86
From: Xin Tong @ 2013-11-28 1:58 UTC
To: qemu-devel

Hi Lluís,

We can probably generate vector intrinsics using the TCG, e.g. add support to
TCG to emit vector instructions directly into the code cache.

Why would a larger TLB make some operations slower? The TLB is a direct-mapped
hash and lookup should be O(1) there. In cputlb, CPU_TLB_SIZE is always used to
index into the TLB, i.e. (X & (CPU_TLB_SIZE - 1)).

Thank you,
Xin


On Wed, Nov 27, 2013 at 5:12 AM, Lluís Vilanova <vilanova@ac.upc.edu> wrote:
> Xin Tong writes:
>
> > I am trying to implement a out-of-line TLB lookup for QEMU softmmu-x86-64
> > on x86-64 machine, potentially for better instruction cache performance, I
> > have a few questions.
>
> > 1. I see that tcg_out_qemu_ld_slow_path/tcg_out_qemu_st_slow_path are
> > generated when tcg_out_tb_finalize is called. And when a TLB lookup misses,
> > it jumps to the generated slow path and slow path refills the TLB, then
> > load/store and jumps to the next emulated instruction. I am wondering is it
> > easy to outline the code for the slow path. I am thinking when a TLB
> > misses, the outlined TLB lookup code should generate a call out to the
> > qemu_ld/st_helpers[opc & ~MO_SIGN] and rewalk the TLB after its refilled ?
> > This code is off the critical path, so its not as important as the code
> > when TLB hits.
> > 2. why not use a TLB or bigger size? currently the TLB has 1<<8 entries.
> > the TLB lookup is 10 x86 instructions , but every miss needs ~450
> > instructions, i measured this using Intel PIN. so even the miss rate is low
> > (say 3%) the overall time spent in the cpu_x86_handle_mmu_fault is still
> > signifcant. I am thinking the tlb may need to be organized in a set
> > associative fashion to reduce conflict miss, e.g. 2 way set associative to
> > reduce the miss rate. or have a victim tlb that is 4 way associative and
> > use x86 simd instructions to do the lookup once the direct-mapped tlb
> > misses. Has anybody done any work on this front ?
> > 3. what are some of the drawbacks of using a superlarge TLB, i.e. a TLB
> > with 4K entries ?
>
> Using vector intrinsics for the TLB lookup will probably make the code less
> portable. I don't know how compatible are the GCC and LLVM vectorizing
> intrinsics between each other (since there has been some efforts on making
> QEMU also compile with LLVM).
>
> A larger TLB will make some operations slower (e.g., look for CPU_TLB_SIZE
> in cputlb.c), but the higher hit ratio could pay off, although I don't know
> how the current size was chosen.
>
>
> Lluis
>
> --
> "And it's much the same thing with knowledge, for whenever you learn
> something new, the whole world becomes that much richer."
> -- The Princess of Pure Reason, as told by Norton Juster in The Phantom
> Tollbooth
* Re: [Qemu-devel] outlined TLB lookup on x86
From: Lluís Vilanova @ 2013-11-28 16:12 UTC
To: Xin Tong
Cc: qemu-devel

Xin Tong writes:

> Hi LIuis
> we can probably generate vector intrinsics using the tcg, e.g. add support to
> tcg to emit vector instructions directly in code cache

There was some discussion long ago about adding vector instructions to TCG, but
I don't remember what the conclusion was.

Also remember that using vector instructions will "emulate" a low-associativity
TLB; I don't know how much better than a 1-way TLB that will be, though.

> why would a larger TLB make some operations slower, the TLB is a direct-mapped
> hash and lookup should be O(1) there. In the cputlb, the CPU_TLB_SIZE is
> always used to index into the TLB, i.e. (X & (CPU_TLB_SIZE -1)).

It would make TLB invalidations slower (e.g., see 'tlb_flush' in "cputlb.c").
And right now QEMU performs full TLB invalidations more frequently than the
equivalent HW needs to, although I suppose that should be quantified too.


Lluis

--
 "And it's much the same thing with knowledge, for whenever you learn
 something new, the whole world becomes that much richer."
 -- The Princess of Pure Reason, as told by Norton Juster in The Phantom
 Tollbooth
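To make the invalidation cost concrete, here is a rough sketch of why a full
flush scales with the table size while a lookup stays O(1). The real tlb_flush
in cputlb.c also resets other caches; the types and names below are simplified
assumptions for illustration:

    /* Illustrative: a full TLB flush touches every entry, so its cost is
     * O(NB_MMU_MODES * CPU_TLB_SIZE), while a lookup is a single indexed probe. */
    #include <string.h>
    #include <stdint.h>

    #define NB_MMU_MODES 3
    #define CPU_TLB_BITS 8
    #define CPU_TLB_SIZE (1 << CPU_TLB_BITS)

    typedef struct {
        uintptr_t addr_read, addr_write, addr_code, addend;
    } TLBEntrySketch;

    static TLBEntrySketch tlb_table[NB_MMU_MODES][CPU_TLB_SIZE];

    static void tlb_flush_sketch(void)
    {
        /* -1 is a tag no page-aligned guest address can match, so every
         * subsequent lookup misses and goes through the refill path. */
        memset(tlb_table, 0xff, sizeof(tlb_table));
    }

Doubling CPU_TLB_SIZE doubles the amount of memory this memset has to touch on
every full invalidation, which is the trade-off being discussed.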
* Re: [Qemu-devel] outlined TLB lookup on x86
From: Xin Tong @ 2013-12-08 10:54 UTC
To: Xin Tong, qemu-devel

On Thu, Nov 28, 2013 at 8:12 AM, Lluís Vilanova <vilanova@ac.upc.edu> wrote:
> Xin Tong writes:
>
> > Hi LIuis
> > we can probably generate vector intrinsics using the tcg, e.g. add support
> > to tcg to emit vector instructions directly in code cache
>
> There was some discussion long ago about adding vector instructions to TCG,
> but I don't remember what was the conclusion.
>
> Also remember that using vector instructions will "emulate" a
> low-associativity TLB; don't know how much better than a 1-way TLB will that
> be, though.
>
> > why would a larger TLB make some operations slower, the TLB is a
> > direct-mapped hash and lookup should be O(1) there. In the cputlb, the
> > CPU_TLB_SIZE is always used to index into the TLB, i.e.
> > (X & (CPU_TLB_SIZE -1)).
>
> It would make TLB invalidations slower (e.g., see 'tlb_flush' in "cputlb.c").
> And right now QEMU performs full TLB invalidations more frequently than the
> equivalent HW needs to, although I suppose that should be quantified too.
>

You are right, Lluís. QEMU does context switch quite a bit more often than real
hw; this is probably primarily due to the fact that QEMU is magnitudes slower
than real hw. I am wondering where the timer is emulated in QEMU
system-x86_64. I imagine the guest OS must program the timers to interrupt for
context switches.

Another question: what happens when a vcpu is stuck in an infinite loop? QEMU
must need a timer interrupt somewhere as well?

Is my understanding correct?

Xin

>
> Lluis
>
> --
> "And it's much the same thing with knowledge, for whenever you learn
> something new, the whole world becomes that much richer."
> -- The Princess of Pure Reason, as told by Norton Juster in The Phantom
> Tollbooth
* Re: [Qemu-devel] outlined TLB lookup on x86
From: Xin Tong @ 2013-12-17 13:52 UTC
To: Xin Tong, QEMU Developers

On Sun, Dec 8, 2013 at 2:54 AM, Xin Tong <trent.tong@gmail.com> wrote:
>
> On Thu, Nov 28, 2013 at 8:12 AM, Lluís Vilanova <vilanova@ac.upc.edu> wrote:
>>
>> Xin Tong writes:
>>
>> > Hi LIuis
>> > we can probably generate vector intrinsics using the tcg, e.g. add
>> > support to tcg to emit vector instructions directly in code cache
>>
>> There was some discussion long ago about adding vector instructions to
>> TCG, but I don't remember what was the conclusion.
>>
>> Also remember that using vector instructions will "emulate" a
>> low-associativity TLB; don't know how much better than a 1-way TLB will
>> that be, though.
>>
>> > why would a larger TLB make some operations slower, the TLB is a
>> > direct-mapped hash and lookup should be O(1) there. In the cputlb, the
>> > CPU_TLB_SIZE is always used to index into the TLB, i.e.
>> > (X & (CPU_TLB_SIZE -1)).
>>
>> It would make TLB invalidations slower (e.g., see 'tlb_flush' in
>> "cputlb.c"). And right now QEMU performs full TLB invalidations more
>> frequently than the equivalent HW needs to, although I suppose that should
>> be quantified too.

I see QEMU executes ~1M instructions per context switch for
qemu-system-x86_64. Is this because the periodic timer interrupt is delivered
in real time while QEMU is significantly slower than real hw?

Xin

>>
> you are right LIuis. QEMU does context switch quite more often that real hw,
> this is probably primarily due to the fact that QEMU is magnitude slower
> than real hw. I am wondering where timer is emulated in QEMU system-x86_64.
> I imagine the guest OS must program the timers to do interrupt for context
> switches.
>
> Another question, what happens when a vcpu is stuck in an infinite loop ?
> QEMU must need an timer interrupt somewhere as well ?
>
> Is my understanding correct ?
>
> Xin
>>
>>
>> Lluis
>>
>> --
>> "And it's much the same thing with knowledge, for whenever you learn
>> something new, the whole world becomes that much richer."
>> -- The Princess of Pure Reason, as told by Norton Juster in The Phantom
>> Tollbooth
* Re: [Qemu-devel] outlined TLB lookup on x86
From: Xin Tong @ 2013-12-18 2:22 UTC
To: Xin Tong, QEMU Developers

Why is the QEMU TLB organized based on the modes? E.g. on x86 there are 3
modes. What I think is that there may be conflicts between virtual addresses
and physical addresses, and organizing it by modes guarantees that QEMU does
not hit a physical-address translation entry when in user mode, and vice
versa?

Thank you,
Xin

On Tue, Dec 17, 2013 at 10:52 PM, Xin Tong <trent.tong@gmail.com> wrote:
> On Sun, Dec 8, 2013 at 2:54 AM, Xin Tong <trent.tong@gmail.com> wrote:
>>
>> On Thu, Nov 28, 2013 at 8:12 AM, Lluís Vilanova <vilanova@ac.upc.edu> wrote:
>>>
>>> Xin Tong writes:
>>>
>>> > Hi LIuis
>>> > we can probably generate vector intrinsics using the tcg, e.g. add
>>> > support to tcg to emit vector instructions directly in code cache
>>>
>>> There was some discussion long ago about adding vector instructions to
>>> TCG, but I don't remember what was the conclusion.
>>>
>>> Also remember that using vector instructions will "emulate" a
>>> low-associativity TLB; don't know how much better than a 1-way TLB will
>>> that be, though.
>>>
>>> > why would a larger TLB make some operations slower, the TLB is a
>>> > direct-mapped hash and lookup should be O(1) there. In the cputlb, the
>>> > CPU_TLB_SIZE is always used to index into the TLB, i.e.
>>> > (X & (CPU_TLB_SIZE -1)).
>>>
>>> It would make TLB invalidations slower (e.g., see 'tlb_flush' in
>>> "cputlb.c"). And right now QEMU performs full TLB invalidations more
>>> frequently than the equivalent HW needs to, although I suppose that should
>>> be quantified too.
>
> I see QEMU executed ~1M instructions per context switch for
> qemu-system-x86_64. Is this because of the fact that the periodical
> time interval interrupt is delivered in real time while QEMU is
> significantly slower than real hw ?
>
> Xin
>
>>>
>> you are right LIuis. QEMU does context switch quite more often that real
>> hw, this is probably primarily due to the fact that QEMU is magnitude
>> slower than real hw. I am wondering where timer is emulated in QEMU
>> system-x86_64. I imagine the guest OS must program the timers to do
>> interrupt for context switches.
>>
>> Another question, what happens when a vcpu is stuck in an infinite loop ?
>> QEMU must need an timer interrupt somewhere as well ?
>>
>> Is my understanding correct ?
>>
>> Xin
>>>
>>>
>>> Lluis
>>>
>>> --
>>> "And it's much the same thing with knowledge, for whenever you learn
>>> something new, the whole world becomes that much richer."
>>> -- The Princess of Pure Reason, as told by Norton Juster in The Phantom
>>> Tollbooth
>>
>>
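For context, a sketch of what the per-mode organization looks like: the TLB is
indexed first by an MMU mode index and then by the page-derived slot, so an
entry filled in one mode can never satisfy a lookup made in another mode. The
layout and names below are simplified assumptions, not the actual QEMU
definitions:

    #include <stdint.h>
    #include <stddef.h>

    #define NB_MMU_MODES     3          /* e.g. separate kernel/user modes on x86 */
    #define TARGET_PAGE_BITS 12
    #define CPU_TLB_BITS     8
    #define CPU_TLB_SIZE     (1 << CPU_TLB_BITS)

    typedef struct { uintptr_t addr_read, addend; } TLBEntrySketch;

    static TLBEntrySketch tlb_table[NB_MMU_MODES][CPU_TLB_SIZE];

    static TLBEntrySketch *tlb_entry_for(int mmu_idx, uintptr_t vaddr)
    {
        size_t slot = (vaddr >> TARGET_PAGE_BITS) & (CPU_TLB_SIZE - 1);
        /* Same guest address, different mode -> different entry. */
        return &tlb_table[mmu_idx][slot];
    }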
* Re: [Qemu-devel] outlined TLB lookup on x86
From: Xin Tong @ 2014-01-21 14:22 UTC
To: QEMU Developers

Hi,

I have found that adding a small (8-entry) fully associative victim TLB
(http://en.wikipedia.org/wiki/Victim_Cache) before the refill path (page-table
walking) improves the performance of QEMU x86_64 system emulation mode
significantly on the specint2006 benchmarks. This is primarily due to the fact
that the primary TLB is direct-mapped and suffers from conflict misses. I have
this implemented on QEMU trunk and would like to contribute it back to QEMU.
Where should I start?

Xin

On Tue, Dec 17, 2013 at 8:22 PM, Xin Tong <trent.tong@gmail.com> wrote:
> why is QEMU TLB organized based on the modes, e.g. on x86 there are 3
> modes. what i think is that there may be conflicts between virtual
> addresses and physical addresses. organizing it by modes guarantees
> that QEMU does not hit a physical address translation entry when in
> user mode and vice versa ?
>
> Thank you,
> Xin
>
> On Tue, Dec 17, 2013 at 10:52 PM, Xin Tong <trent.tong@gmail.com> wrote:
>> On Sun, Dec 8, 2013 at 2:54 AM, Xin Tong <trent.tong@gmail.com> wrote:
>>>
>>> On Thu, Nov 28, 2013 at 8:12 AM, Lluís Vilanova <vilanova@ac.upc.edu> wrote:
>>>>
>>>> Xin Tong writes:
>>>>
>>>> > Hi LIuis
>>>> > we can probably generate vector intrinsics using the tcg, e.g. add
>>>> > support to tcg to emit vector instructions directly in code cache
>>>>
>>>> There was some discussion long ago about adding vector instructions to
>>>> TCG, but I don't remember what was the conclusion.
>>>>
>>>> Also remember that using vector instructions will "emulate" a
>>>> low-associativity TLB; don't know how much better than a 1-way TLB will
>>>> that be, though.
>>>>
>>>> > why would a larger TLB make some operations slower, the TLB is a
>>>> > direct-mapped hash and lookup should be O(1) there. In the cputlb, the
>>>> > CPU_TLB_SIZE is always used to index into the TLB, i.e.
>>>> > (X & (CPU_TLB_SIZE -1)).
>>>>
>>>> It would make TLB invalidations slower (e.g., see 'tlb_flush' in
>>>> "cputlb.c"). And right now QEMU performs full TLB invalidations more
>>>> frequently than the equivalent HW needs to, although I suppose that
>>>> should be quantified too.
>>
>> I see QEMU executed ~1M instructions per context switch for
>> qemu-system-x86_64. Is this because of the fact that the periodical
>> time interval interrupt is delivered in real time while QEMU is
>> significantly slower than real hw ?
>>
>> Xin
>>
>>>>
>>> you are right LIuis. QEMU does context switch quite more often that real
>>> hw, this is probably primarily due to the fact that QEMU is magnitude
>>> slower than real hw. I am wondering where timer is emulated in QEMU
>>> system-x86_64. I imagine the guest OS must program the timers to do
>>> interrupt for context switches.
>>>
>>> Another question, what happens when a vcpu is stuck in an infinite loop ?
>>> QEMU must need an timer interrupt somewhere as well ?
>>>
>>> Is my understanding correct ?
>>>
>>> Xin
>>>>
>>>>
>>>> Lluis
>>>>
>>>> --
>>>> "And it's much the same thing with knowledge, for whenever you learn
>>>> something new, the whole world becomes that much richer."
>>>> -- The Princess of Pure Reason, as told by Norton Juster in The Phantom
>>>> Tollbooth
>>>
>>>
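A minimal sketch of the victim-TLB idea (not the actual patch; the sizes, names
and the round-robin fill policy below are assumptions made for illustration):
on a primary-TLB miss, the 8 victim entries are scanned before falling back to
the page-table walk, and an evicted primary entry is moved into the victim
array instead of being thrown away.

    #include <stdint.h>
    #include <stddef.h>
    #include <stdbool.h>

    #define TARGET_PAGE_BITS 12
    #define CPU_VTLB_SIZE    8

    typedef struct { uintptr_t page; uintptr_t addend; bool valid; } VTLBEntrySketch;

    static VTLBEntrySketch vtlb[CPU_VTLB_SIZE];

    /* Returns true and fills *host on a victim hit; false means walk the page table. */
    static bool victim_tlb_lookup(uintptr_t vaddr, void **host)
    {
        uintptr_t page = (vaddr >> TARGET_PAGE_BITS) << TARGET_PAGE_BITS;
        for (size_t i = 0; i < CPU_VTLB_SIZE; i++) {
            if (vtlb[i].valid && vtlb[i].page == page) {
                *host = (void *)(vaddr + vtlb[i].addend);
                return true;              /* much cheaper than the ~450-insn refill */
            }
        }
        return false;
    }

    /* Called when a primary-TLB slot is about to be overwritten. */
    static void victim_tlb_insert(uintptr_t page, uintptr_t addend)
    {
        static size_t next;               /* simple round-robin replacement */
        vtlb[next] = (VTLBEntrySketch){ .page = page, .addend = addend, .valid = true };
        next = (next + 1) % CPU_VTLB_SIZE;
    }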
* Re: [Qemu-devel] outlined TLB lookup on x86
From: Peter Maydell @ 2014-01-21 14:28 UTC
To: Xin Tong
Cc: QEMU Developers

On 21 January 2014 14:22, Xin Tong <trent.tong@gmail.com> wrote:
> I have found that adding a small (8-entry) fully associative victim
> TLB (http://en.wikipedia.org/wiki/Victim_Cache) before the refill path
> (page table walking) improves the performance of QEMU x86_64 system
> emulation mode significantly on the specint2006 benchmarks. This is
> primarily due to the fact that the primary TLB is directly mapped and
> suffer from conflict misses. I have this implemented on QEMU trunk and
> would like to contribute this back to QEMU. Where should i start ?

The wiki page http://wiki.qemu.org/Contribute/SubmitAPatch tries to
describe our usual process for reviewing code submissions. If you
make sure your changes follow the guidelines described there and
then send them to the mailing list as a series of patch emails in
the right format, we can start reviewing the code.

thanks
-- PMM
* Re: [Qemu-devel] outlined TLB lookup on x86
From: Xin Tong @ 2013-12-09 12:18 UTC
To: Xin Tong, qemu-devel

On Thu, Nov 28, 2013 at 8:12 AM, Lluís Vilanova <vilanova@ac.upc.edu> wrote:
> Xin Tong writes:
>
> > Hi LIuis
> > we can probably generate vector intrinsics using the tcg, e.g. add support
> > to tcg to emit vector instructions directly in code cache
>
> There was some discussion long ago about adding vector instructions to TCG,
> but I don't remember what was the conclusion.
>

Hi Lluís,

Can you please forward me that email if it is not difficult to find?
Otherwise, it is OK!

Thank you,
Xin

> Also remember that using vector instructions will "emulate" a
> low-associativity TLB; don't know how much better than a 1-way TLB will that
> be, though.
>
> > why would a larger TLB make some operations slower, the TLB is a
> > direct-mapped hash and lookup should be O(1) there. In the cputlb, the
> > CPU_TLB_SIZE is always used to index into the TLB, i.e.
> > (X & (CPU_TLB_SIZE -1)).
>
> It would make TLB invalidations slower (e.g., see 'tlb_flush' in "cputlb.c").
> And right now QEMU performs full TLB invalidations more frequently than the
> equivalent HW needs to, although I suppose that should be quantified too.
>
>
> Lluis
>
> --
> "And it's much the same thing with knowledge, for whenever you learn
> something new, the whole world becomes that much richer."
> -- The Princess of Pure Reason, as told by Norton Juster in The Phantom
> Tollbooth
* Re: [Qemu-devel] outlined TLB lookup on x86
From: Lluís Vilanova @ 2013-12-09 15:31 UTC
To: Xin Tong
Cc: qemu-devel

Xin Tong writes:

> On Thu, Nov 28, 2013 at 8:12 AM, Lluís Vilanova <vilanova@ac.upc.edu> wrote:
> > Xin Tong writes:
> > > Hi LIuis
> > > we can probably generate vector intrinsics using the tcg, e.g. add
> > > support to tcg to emit vector instructions directly in code cache
> > There was some discussion long ago about adding vector instructions to TCG,
> > but I don't remember what was the conclusion.

> Hi LIuis
> Can you please forward me that email if it is not difficult to find.
> otherwise, it is ok!.

Sorry, I can't remember when it was. You'll have to look it up in the mailing
list archive.


Lluis

--
 "And it's much the same thing with knowledge, for whenever you learn
 something new, the whole world becomes that much richer."
 -- The Princess of Pure Reason, as told by Norton Juster in The Phantom
 Tollbooth
* Re: [Qemu-devel] outlined TLB lookup on x86
From: Richard Henderson @ 2013-11-28 2:12 UTC
To: Xin Tong, qemu-devel

On 11/27/2013 08:41 PM, Xin Tong wrote:
> I am trying to implement a out-of-line TLB lookup for QEMU softmmu-x86-64 on
> x86-64 machine, potentially for better instruction cache performance, I have
> a few questions.
>
> 1. I see that tcg_out_qemu_ld_slow_path/tcg_out_qemu_st_slow_path are
> generated when tcg_out_tb_finalize is called. And when a TLB lookup misses,
> it jumps to the generated slow path and slow path refills the TLB, then
> load/store and jumps to the next emulated instruction. I am wondering is it
> easy to outline the code for the slow path.

Hard.  There's quite a bit of code on that slow path that's unique to the
surrounding code context -- which registers contain inputs and outputs, and
where to continue after the slow path.

The amount of code that's in the TB slow path now is approximately minimal, as
far as I can see.  If you've got an idea for improvement, please share.  ;-)

> I am thinking when a TLB misses, the outlined TLB
> lookup code should generate a call out to the qemu_ld/st_helpers[opc &
> ~MO_SIGN] and rewalk the TLB after its refilled ? This code is off the
> critical path, so its not as important as the code when TLB hits.

That would work for true TLB misses to RAM, but does not work for memory-mapped
I/O.

> 2. why not use a TLB or bigger size? currently the TLB has 1<<8 entries. the
> TLB lookup is 10 x86 instructions , but every miss needs ~450 instructions, i
> measured this using Intel PIN. so even the miss rate is low (say 3%) the
> overall time spent in the cpu_x86_handle_mmu_fault is still signifcant.

I'd be interested to experiment with different TLB sizes, to see what effect
that has on performance.  But I suspect that the lack of TLB contexts means
that we wind up flushing the TLB more often than real hardware does, and
therefore a larger TLB merely takes longer to flush.

But be aware that we can't simply make the change universally.  E.g. ARM can
use an immediate 8-bit operand during the TLB lookup, but would have to use
several insns to perform a 9-bit mask.

> I am
> thinking the tlb may need to be organized in a set associative fashion to
> reduce conflict miss, e.g. 2 way set associative to reduce the miss rate. or
> have a victim tlb that is 4 way associative and use x86 simd instructions to
> do the lookup once the direct-mapped tlb misses. Has anybody done any work on
> this front ?

Even with SIMD, I don't believe you could make the fast path of a set
associative lookup fast.  This is the sort of thing for which you really need
the dedicated hardware of the real TLB.  Feel free to prove me wrong with code,
of course.


r~
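To illustrate the point being made here, this is roughly what a 2-way
set-associative fast path would have to do in plain C; the second
compare-and-branch is the extra work on every access that is hard to hide in
generated code. The types and initialization convention are assumptions made
for illustration only:

    #include <stdint.h>
    #include <stddef.h>

    #define TARGET_PAGE_BITS 12
    #define TARGET_PAGE_MASK (~(uintptr_t)((1 << TARGET_PAGE_BITS) - 1))
    #define TLB_SETS         128          /* 2 ways x 128 sets = 256 entries */

    typedef struct { uintptr_t tag, addend; } WaySketch;

    /* Assume tags are initialised to -1 (an unmatchable value). */
    static WaySketch tlb_2way[TLB_SETS][2];

    static void *tlb_lookup_2way(uintptr_t vaddr)
    {
        uintptr_t tag = vaddr & TARGET_PAGE_MASK;
        size_t set = (vaddr >> TARGET_PAGE_BITS) & (TLB_SETS - 1);

        if (tlb_2way[set][0].tag == tag) {        /* way 0 */
            return (void *)(vaddr + tlb_2way[set][0].addend);
        }
        if (tlb_2way[set][1].tag == tag) {        /* way 1: extra compare + branch */
            return (void *)(vaddr + tlb_2way[set][1].addend);
        }
        return NULL;                              /* slow path */
    }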
* Re: [Qemu-devel] outlined TLB lookup on x86
From: Xin Tong @ 2013-11-28 3:56 UTC
To: Richard Henderson
Cc: qemu-devel

On Wed, Nov 27, 2013 at 6:12 PM, Richard Henderson <rth@twiddle.net> wrote:
> On 11/27/2013 08:41 PM, Xin Tong wrote:
> > I am trying to implement a out-of-line TLB lookup for QEMU softmmu-x86-64
> > on x86-64 machine, potentially for better instruction cache performance, I
> > have a few questions.
> >
> > 1. I see that tcg_out_qemu_ld_slow_path/tcg_out_qemu_st_slow_path are
> > generated when tcg_out_tb_finalize is called. And when a TLB lookup misses,
> > it jumps to the generated slow path and slow path refills the TLB, then
> > load/store and jumps to the next emulated instruction. I am wondering is it
> > easy to outline the code for the slow path.
>
> Hard.  There's quite a bit of code on that slow path that's unique to the
> surrounding code context -- which registers contain inputs and outputs, where
> to continue after slow path.
>
> The amount of code that's in the TB slow path now is approximately minimal,
> as far as I can see.  If you've got an idea for improvement, please share. ;-)
>
> > I am thinking when a TLB misses, the outlined TLB lookup code should
> > generate a call out to the qemu_ld/st_helpers[opc & ~MO_SIGN] and rewalk
> > the TLB after its refilled ? This code is off the critical path, so its not
> > as important as the code when TLB hits.
>
> That would work for true TLB misses to RAM, but does not work for memory
> mapped I/O.
>
> > 2. why not use a TLB or bigger size? currently the TLB has 1<<8 entries.
> > the TLB lookup is 10 x86 instructions , but every miss needs ~450
> > instructions, i measured this using Intel PIN. so even the miss rate is low
> > (say 3%) the overall time spent in the cpu_x86_handle_mmu_fault is still
> > signifcant.
>
> I'd be interested to experiment with different TLB sizes, to see what effect
> that has on performance.  But I suspect that lack of TLB contexts mean that
> we wind up flushing the TLB more often than real hardware does, and therefore
> a larger TLB merely takes longer to flush.

Hardware TLBs are limited in size primarily by the fact that increasing their
size increases their access latency as well, but a software TLB does not
suffer from that problem, so I think the size of the soft TLB should not be
influenced by the size of the hardware TLB. The cost of flushing the TLB is
minimal unless we have a really, really large TLB, e.g. a TLB with 1M entries.
I vaguely remember seeing ~8% of the time spent in the cpu_x86_mmu_fault
function in one of the speccpu2006 workloads some time ago, so if we increase
the size of the TLB significantly and potentially get rid of most of the TLB
misses, we can get rid of most of that 8%. (There are still compulsory misses
and a few conflict misses, but I think compulsory misses are not the major
player here.)

> But be aware that we can't simply make the change universally.  E.g. ARM can
> use an immediate 8-bit operand during the TLB lookup, but would have to use
> several insns to perform a 9-bit mask.

This can be handled with #ifdefs; most of the TLB code common to all CPUs need
not be changed.

> > I am thinking the tlb may need to be organized in a set associative fashion
> > to reduce conflict miss, e.g. 2 way set associative to reduce the miss
> > rate. or have a victim tlb that is 4 way associative and use x86 simd
> > instructions to do the lookup once the direct-mapped tlb misses. Has
> > anybody done any work on this front ?
>
> Even with SIMD, I don't believe you could make the fast-path of a set
> associative lookup fast.  This is the sort of thing for which you really need
> the dedicated hardware of the real TLB.  Feel free to prove me wrong with
> code, of course.

I am thinking the primary TLB should remain what it is, i.e. direct-mapped,
but we can have a victim TLB with bigger associativity. The victim TLB can be
walked either sequentially or in parallel with SIMD instructions; it is going
to be slower than hitting in the direct-mapped TLB, but (much) better than
having to re-walk the page table. For an OS with ASIDs, we can have a shared
TLB tagged with the ASID, and this can potentially get rid of some compulsory
misses for us: e.g. multiple threads in a process share the same ASID, and we
can check the shared TLB on a miss before walking the page table.

> r~
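As a concrete illustration of "walk the victim TLB in parallel with SIMD",
here is a small SSE2 sketch that compares four page tags at once. It is purely
illustrative: real QEMU tags are target_ulong-sized and carry flag bits in the
low bits, the 32-bit tags, struct layout and __builtin_ctz (a GCC/Clang
builtin) are all simplifying assumptions, and this is not the proposed
implementation:

    #include <emmintrin.h>   /* SSE2 */
    #include <stdint.h>

    #define VTLB_WAYS 4

    /* Four page tags stored contiguously so they can be loaded as one vector. */
    typedef struct {
        uint32_t  tag[VTLB_WAYS];     /* vaddr >> TARGET_PAGE_BITS, 32-bit for simplicity */
        uintptr_t addend[VTLB_WAYS];
    } VTLBSetSketch;

    /* Returns the matching way (0..3) or -1 on miss. */
    static int vtlb_match_simd(const VTLBSetSketch *set, uint32_t page_tag)
    {
        __m128i tags = _mm_loadu_si128((const __m128i *)set->tag);
        __m128i key  = _mm_set1_epi32((int)page_tag);
        __m128i eq   = _mm_cmpeq_epi32(tags, key);
        int mask = _mm_movemask_ps(_mm_castsi128_ps(eq));  /* one bit per 32-bit lane */
        if (mask == 0) {
            return -1;                 /* miss in all four ways */
        }
        return __builtin_ctz(mask);    /* index of the first matching way */
    }

Whether the vector load, broadcast, compare and movemask end up cheaper than
four scalar compares in generated code is exactly the open question in this
exchange.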
* Re: [Qemu-devel] outlined TLB lookup on x86
From: Avi Kivity @ 2013-12-08 11:19 UTC
To: Richard Henderson, Xin Tong, qemu-devel

On 11/28/2013 04:12 AM, Richard Henderson wrote:
>> 2. why not use a TLB or bigger size? currently the TLB has 1<<8 entries. the
>> TLB lookup is 10 x86 instructions , but every miss needs ~450 instructions,
>> i measured this using Intel PIN. so even the miss rate is low (say 3%) the
>> overall time spent in the cpu_x86_handle_mmu_fault is still signifcant.
> I'd be interested to experiment with different TLB sizes, to see what effect
> that has on performance.  But I suspect that lack of TLB contexts mean that
> we wind up flushing the TLB more often than real hardware does, and therefore
> a larger TLB merely takes longer to flush.

You could use a generation counter to flush the TLB in O(1) by incrementing
the counter.  That slows down the fast path though.

Maybe you can do that for the larger second level TLB only.
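A sketch of the generation-counter idea (illustrative only; the field names
and sizes are assumptions): each entry records the generation it was filled
in, a lookup only hits if the generations match, and a "flush" just bumps the
global counter.

    #include <stdint.h>
    #include <stdbool.h>
    #include <stddef.h>

    #define L2_TLB_SIZE 4096

    typedef struct {
        uintptr_t page;
        uintptr_t addend;
        uint32_t  gen;          /* generation this entry was filled in */
    } L2EntrySketch;

    static L2EntrySketch l2_tlb[L2_TLB_SIZE];
    static uint32_t l2_gen = 1;

    static inline void l2_tlb_flush(void)
    {
        l2_gen++;               /* O(1): every existing entry is now stale */
    }

    static bool l2_tlb_hit(uintptr_t page, size_t idx, uintptr_t *addend)
    {
        L2EntrySketch *e = &l2_tlb[idx];
        /* The extra generation compare is the fast-path cost mentioned above. */
        if (e->gen == l2_gen && e->page == page) {
            *addend = e->addend;
            return true;
        }
        return false;
    }

A real implementation would also have to deal with counter wrap-around, e.g.
by falling back to a full memset when the counter overflows.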
* Re: [Qemu-devel] outlined TLB lookup on x86
From: Xin Tong @ 2014-01-22 15:28 UTC
To: Richard Henderson
Cc: QEMU Developers

On Wed, Nov 27, 2013 at 8:12 PM, Richard Henderson <rth@twiddle.net> wrote:
> On 11/27/2013 08:41 PM, Xin Tong wrote:
>> I am trying to implement a out-of-line TLB lookup for QEMU softmmu-x86-64 on
>> x86-64 machine, potentially for better instruction cache performance, I have
>> a few questions.
>>
>> 1. I see that tcg_out_qemu_ld_slow_path/tcg_out_qemu_st_slow_path are
>> generated when tcg_out_tb_finalize is called. And when a TLB lookup misses,
>> it jumps to the generated slow path and slow path refills the TLB, then
>> load/store and jumps to the next emulated instruction. I am wondering is it
>> easy to outline the code for the slow path.
>
> Hard.  There's quite a bit of code on that slow path that's unique to the
> surrounding code context -- which registers contain inputs and outputs, where
> to continue after slow path.
>
> The amount of code that's in the TB slow path now is approximately minimal,
> as far as I can see.  If you've got an idea for improvement, please share. ;-)
>
>> I am thinking when a TLB misses, the outlined TLB lookup code should
>> generate a call out to the qemu_ld/st_helpers[opc & ~MO_SIGN] and rewalk the
>> TLB after its refilled ? This code is off the critical path, so its not as
>> important as the code when TLB hits.
>
> That would work for true TLB misses to RAM, but does not work for memory
> mapped I/O.
>
>> 2. why not use a TLB or bigger size? currently the TLB has 1<<8 entries. the
>> TLB lookup is 10 x86 instructions , but every miss needs ~450 instructions,
>> i measured this using Intel PIN. so even the miss rate is low (say 3%) the
>> overall time spent in the cpu_x86_handle_mmu_fault is still signifcant.
>
> I'd be interested to experiment with different TLB sizes, to see what effect
> that has on performance.  But I suspect that lack of TLB contexts mean that
> we wind up flushing the TLB more often than real hardware does, and therefore
> a larger TLB merely takes longer to flush.
>
> But be aware that we can't simply make the change universally.  E.g. ARM can
> use an immediate 8-bit operand during the TLB lookup, but would have to use
> several insns to perform a 9-bit mask.

Hi Richard,

I've done some experiments on increasing the size of the TLB. Increasing the
size of the TLB from 256 entries to 4096 entries gives a significant
performance improvement on the specint2006 benchmarks on qemu-system-x86_64
running on an x86_64 Linux machine. I am in the process of exploring more TLB
sizes and will post the data after I am done.

Can you tell me whether ARM is the only architecture that requires special
treatment for increasing the TLB size beyond 256 entries, so that I can whip
up a patch for the QEMU mainline?

Thank you,
Xin

>> I am
>> thinking the tlb may need to be organized in a set associative fashion to
>> reduce conflict miss, e.g. 2 way set associative to reduce the miss rate. or
>> have a victim tlb that is 4 way associative and use x86 simd instructions to
>> do the lookup once the direct-mapped tlb misses. Has anybody done any work
>> on this front ?
>
> Even with SIMD, I don't believe you could make the fast-path of a set
> associative lookup fast.  This is the sort of thing for which you really need
> the dedicated hardware of the real TLB.  Feel free to prove me wrong with
> code, of course.
>
>
> r~
* Re: [Qemu-devel] outlined TLB lookup on x86
From: Richard Henderson @ 2014-01-22 16:34 UTC
To: Xin Tong
Cc: QEMU Developers

On 01/22/2014 07:28 AM, Xin Tong wrote:
> Can you tell me whether ARM is the only architecture that requires
> special treatment for increasing tlb size beyond 256 entries so that i
> can whip up a patch to the QEMU mainline.

The major constraint for the non-arm ports is

  CPU_TLB_ENTRY_BITS + CPU_TLB_BITS < immediate bit size

i.e. (CPU_TLB_SIZE - 1) << CPU_TLB_ENTRY_BITS is representable as an immediate
within an AND instruction.

MIPS has a 16-bit unsigned immediate, and as written would generate bad code
for CPU_TLB_BITS > 11.

I386 has a 32-bit signed immediate, and would generate bad code for
CPU_TLB_BITS > 26.  Though I can't imagine you want to make it that big.

SPARC has a 13-bit signed immediate, but it's written with a routine which
checks the size of the constant and loads it if necessary.  Which is good,
because that's clearly already happening for CPU_TLB_BITS > 7.

AArch64, ia64, ppc, ppc64 all use fully capable extract-bit-field type insns
and could handle any change you make.

S390 is written using generic routines like sparc, so it won't fail with any
change.  It ought to be adjusted to use the extract-bit-field type insns that
exist in the current generation of machines.  The oldest generation of machine
would have reduced performance with CPU_TLB_BITS > 11.

ARM is also a case in which armv6t2 and later could be written with an
extract-bit-field insn, but previous versions would need to use 2 insns to
form the constant.  But at least we'd be able to combine the shift and and
insns.


r~
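A small worked example of that constraint. It assumes CPU_TLB_ENTRY_BITS is 5,
i.e. a 32-byte TLB entry on a 64-bit host (treat that exact value as an
assumption); the program prints the AND mask the TCG backends would need to
encode for a few candidate TLB sizes:

    #include <stdio.h>

    #define CPU_TLB_ENTRY_BITS 5   /* assumed: log2(sizeof(CPUTLBEntry)) on a 64-bit host */

    int main(void)
    {
        int sizes[] = { 8, 11, 12, 26, 27 };   /* candidate CPU_TLB_BITS values */
        for (unsigned i = 0; i < sizeof(sizes) / sizeof(sizes[0]); i++) {
            int bits = sizes[i];
            unsigned long mask = ((1ul << bits) - 1) << CPU_TLB_ENTRY_BITS;
            /* The mask spans bits + CPU_TLB_ENTRY_BITS bits: for bits = 8 it is
             * 0x1fe0, which fits MIPS's 16-bit andi and x86's imm32, while
             * bits = 12 (0x1ffe0) already exceeds the 16-bit MIPS immediate and
             * bits = 27 overflows x86's 32-bit signed immediate. */
            printf("CPU_TLB_BITS=%2d -> mask = 0x%lx (%d significant bits)\n",
                   bits, mask, bits + CPU_TLB_ENTRY_BITS);
        }
        return 0;
    }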
* Re: [Qemu-devel] outlined TLB lookup on x86
From: Peter Maydell @ 2014-01-22 16:55 UTC
To: Xin Tong
Cc: QEMU Developers, Richard Henderson

On 22 January 2014 15:28, Xin Tong <trent.tong@gmail.com> wrote:
> On Wed, Nov 27, 2013 at 8:12 PM, Richard Henderson <rth@twiddle.net> wrote:
>> I'd be interested to experiment with different TLB sizes, to see what effect
>> that has on performance. But I suspect that lack of TLB contexts mean that
>> we wind up flushing the TLB more often than real hardware does, and
>> therefore a larger TLB merely takes longer to flush.

> I've done some experiments on increasing the size of the tlb.
> increasing the size of the tlb from 256 entries to 4096 entries gives
> significant performance improvement on the specint2006 benchmarks on
> qemu-system-x86_64 running on a x86_64 linux machine . i am in the
> process of exploring more tlb sizes and will post the data after i am
> done.

Of course a single big benchmark program is probably the best case for
"not having lots of TLB flushing". It would probably also be instructive
to benchmark other cases, like OS bootup, running multiple different
programs simultaneously and system call heavy workloads.

Has anybody ever looked at implementing proper TLB contexts?

thanks
-- PMM
* Re: [Qemu-devel] outlined TLB lookup on x86
From: Richard Henderson @ 2014-01-22 17:32 UTC
To: Peter Maydell, Xin Tong
Cc: QEMU Developers

On 01/22/2014 08:55 AM, Peter Maydell wrote:
> Has anybody ever looked at implementing proper TLB contexts?

I've thought about it.  The best I could come up with is a pointer within ENV
that points to the current TLB context.  It definitely adds another load insn
on the fast path, but we should be able to schedule that first, since it
depends on nothing but the mem_index constant.  Depending on the schedule, it
may require reserving another register on the fast path, which could be a
problem for i686.

It would also greatly expand the size of ENV.

E.g. Alpha would need to implement 256 contexts to match the hardware.  We
currently get away with pretending to implement contexts by implementing none
at all, and flushing the TLB at every context change.

Our current TLB size is 8k.  Times 256 contexts is 2MB.  Which might just be
within the range of possibility.  Certainly not if we expand the size of the
individual TLBs.

Although interestingly, for Alpha we don't need 256 * NB_MMU_MODES, because
MMU_KERNEL_IDX always uses context 0, the "global context".  I don't recall
enough about the intimate details of other TLB hardware to know if there are
similar savings that can be had elsewhere.  But the amount of memory involved
is large enough to suggest that some sort of target-specific sizing of the
number of contexts might be required.


r~
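A rough sketch of the layout described here (the names and sizes are
assumptions for illustration, not the real CPUArchState): the env holds an
array of per-context TLBs plus a pointer to the current one, so the fast path
pays one extra load and a context switch becomes a pointer update instead of a
full flush.

    #include <stdint.h>

    #define NB_MMU_MODES    3
    #define CPU_TLB_SIZE    256
    #define NB_TLB_CONTEXTS 256          /* e.g. one per Alpha ASN */

    typedef struct {
        uintptr_t addr_read, addr_write, addr_code, addend;
    } TLBEntrySketch;                    /* 32 bytes: 256 entries = 8 KB per MMU mode */

    typedef struct {
        TLBEntrySketch tlb_table[NB_MMU_MODES][CPU_TLB_SIZE];
    } TLBContextSketch;

    typedef struct {
        TLBContextSketch *tlb_ctx;       /* the extra load on the fast path */
        TLBContextSketch  contexts[NB_TLB_CONTEXTS];
        /* ... rest of the CPU state ... */
    } EnvSketch;

    /* Context switch becomes a pointer update instead of a full flush. */
    static void set_tlb_context(EnvSketch *env, unsigned asn)
    {
        env->tlb_ctx = &env->contexts[asn % NB_TLB_CONTEXTS];
    }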
* Re: [Qemu-devel] outlined TLB lookup on x86
From: Peter Maydell @ 2014-01-22 17:35 UTC
To: Richard Henderson
Cc: Xin Tong, QEMU Developers

On 22 January 2014 17:32, Richard Henderson <rth@twiddle.net> wrote:
> On 01/22/2014 08:55 AM, Peter Maydell wrote:
>> Has anybody ever looked at implementing proper TLB contexts?
>
> I've thought about it.  The best I could come up with is a pointer within ENV
> that points to the current TLB context.  It definitely adds another load insn
> on the fast path, but we should be able to schedule that first, since it
> depends on nothing but the mem_index constant.  Depending on the schedule, it
> may require reserving another register on the fast path, which could be a
> problem for i686.
>
> It would also greatly expand the size of ENV.
>
> E.g. Alpha would need to implement 256 contexts to match the hardware.  We
> currently get away with pretending to implement contexts by implementing none
> at all, and flushing the TLB at every context change.

I don't really know the details of Alpha, but can you get away with just
"we implement N contexts, and only actually keep the most recently
used N"? This is effectively what we're doing at the moment, with N==1.

(ASIDs on ARM are 16 bit now, so we definitely wouldn't want to keep an
entire TLB for each ASID; if we ever implemented virtualization there's
another 8 bits of VMID and each VMID has its own ASID range...)

thanks
-- PMM
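A sketch of the "keep only the most recently used N" approach (illustrative;
the eviction policy, slot count and names are assumptions): the full ASID
space is mapped onto N cached TLB contexts, and when a new ASID shows up the
least recently used slot is flushed and reused.

    #include <stdint.h>

    #define N_CACHED_CONTEXTS 8

    typedef struct {
        uint32_t asid;
        uint64_t last_used;
        /* the per-context TLB arrays would live here */
    } CtxSlotSketch;

    /* A real version would initialise the asid fields to an invalid value. */
    static CtxSlotSketch slots[N_CACHED_CONTEXTS];
    static uint64_t use_clock;

    /* Returns the slot to use for this ASID, reusing the LRU slot on a miss. */
    static int context_for_asid(uint32_t asid)
    {
        int lru = 0;
        for (int i = 0; i < N_CACHED_CONTEXTS; i++) {
            if (slots[i].asid == asid) {
                slots[i].last_used = ++use_clock;
                return i;                         /* reuse: no flush needed */
            }
            if (slots[i].last_used < slots[lru].last_used) {
                lru = i;
            }
        }
        /* Miss: evict the LRU context; its TLB contents would be wiped here. */
        slots[lru].asid = asid;
        slots[lru].last_used = ++use_clock;
        return lru;
    }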
* Re: [Qemu-devel] outlined TLB lookup on x86
From: Richard Henderson @ 2014-01-22 17:45 UTC
To: Peter Maydell
Cc: Xin Tong, QEMU Developers

On 01/22/2014 09:35 AM, Peter Maydell wrote:
> I don't really know the details of Alpha, but can you get away with just
> "we implement N contexts, and only actually keep the most recently
> used N"? This is effectively what we're doing at the moment, with N==1.

Yes, I suppose we could do that.  Rather than have the ASN switching code be a
simple store, it could be a full helper function.  At which point we can do
just about anything at all.


r~
* Re: [Qemu-devel] outlined TLB lookup on x86
From: Xin Tong @ 2014-01-22 17:56 UTC
To: Richard Henderson
Cc: Peter Maydell, QEMU Developers

I have submitted a patch to the QEMU devel list on implementing a victim TLB
in QEMU. I should have you both CC'ed on the patch email so that you can help
review the patch in case no one else is reviewing it. The name of the patch is
[Qemu-devel] [PATCH] cpu: implementing victim TLB for QEMU system emulated TLB.

Thank you,
Xin

On Wed, Jan 22, 2014 at 11:45 AM, Richard Henderson <rth@twiddle.net> wrote:
> On 01/22/2014 09:35 AM, Peter Maydell wrote:
>> I don't really know the details of Alpha, but can you get away with just
>> "we implement N contexts, and only actually keep the most recently
>> used N"? This is effectively what we're doing at the moment, with N==1.
>
> Yes, I suppose we could do that.  Rather than have the ASN switching code be
> a simple store, it could be a full helper function.  At which point we can do
> just about anything at all.
>
>
> r~