* [Qemu-devel] outlined TLB lookup on x86
From: Xin Tong @ 2013-11-27 7:41 UTC (permalink / raw)
To: qemu-devel
I am trying to implement an out-of-line TLB lookup for QEMU softmmu-x86-64 on
an x86-64 host, potentially for better instruction cache performance, and I
have a few questions.
1. I see that tcg_out_qemu_ld_slow_path/tcg_out_qemu_st_slow_path are
generated when tcg_out_tb_finalize is called. When a TLB lookup misses, the
code jumps to the generated slow path, the slow path refills the TLB, performs
the load/store, and then jumps to the next emulated instruction. I am
wondering whether it is easy to outline the code for the slow path. I am
thinking that when the TLB misses, the outlined TLB lookup code should call
out to qemu_ld/st_helpers[opc & ~MO_SIGN] and rewalk the TLB after it is
refilled. This code is off the critical path, so it is not as important as the
code for the TLB-hit case.
2. Why not use a TLB of bigger size? Currently the TLB has 1<<8 entries. The
TLB lookup is about 10 x86 instructions, but every miss needs ~450
instructions (I measured this using Intel PIN), so even if the miss rate is
low (say 3%) the overall time spent in cpu_x86_handle_mmu_fault is still
significant. I am thinking the TLB may need to be organized in a
set-associative fashion to reduce conflict misses, e.g. 2-way set associative,
or to have a victim TLB that is 4-way associative and use x86 SIMD
instructions to do the lookup once the direct-mapped TLB misses (a sketch of
the direct-mapped hit path I mean is below). Has anybody done any work on this
front?
3. What are some of the drawbacks of using a super-large TLB, i.e. a TLB with
4K entries?
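To make the fast path in question 2 concrete, here is a minimal C sketch of
the kind of direct-mapped lookup I mean; the names and sizes are simplified
illustrations, not QEMU's actual definitions:

/* Minimal sketch of a direct-mapped software TLB hit path.  All names and
   sizes here are simplified illustrations, not QEMU's actual definitions. */
#include <stddef.h>
#include <stdint.h>

#define TARGET_PAGE_BITS 12
#define TARGET_PAGE_MASK (~(((uint64_t)1 << TARGET_PAGE_BITS) - 1))
#define CPU_TLB_BITS     8                    /* 1 << 8 = 256 entries */
#define CPU_TLB_SIZE     (1 << CPU_TLB_BITS)

typedef struct {
    uint64_t  addr_read;   /* tag: page-aligned guest virtual address */
    uintptr_t addend;      /* guest-virtual to host-virtual offset */
} tlb_entry;

static tlb_entry tlb_table[CPU_TLB_SIZE];

/* Returns the host address on a hit, or NULL to take the slow path. */
static void *tlb_lookup_read(uint64_t vaddr)
{
    unsigned idx = (vaddr >> TARGET_PAGE_BITS) & (CPU_TLB_SIZE - 1);
    if (tlb_table[idx].addr_read == (vaddr & TARGET_PAGE_MASK)) {
        return (void *)(uintptr_t)(vaddr + tlb_table[idx].addend);
    }
    return NULL;   /* miss: refill via the page-table walk, then retry */
}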
Xin
* Re: [Qemu-devel] outlined TLB lookup on x86
From: Lluís Vilanova @ 2013-11-27 13:12 UTC (permalink / raw)
To: qemu-devel
Xin Tong writes:
> I am trying to implement a out-of-line TLB lookup for QEMU softmmu-x86-64 on
> x86-64 machine, potentially for better instruction cache performance, I have a
> few questions.
> 1. I see that tcg_out_qemu_ld_slow_path/tcg_out_qemu_st_slow_path are generated
> when tcg_out_tb_finalize is called. And when a TLB lookup misses, it jumps to
> the generated slow path and slow path refills the TLB, then load/store and jumps
> to the next emulated instruction. I am wondering is it easy to outline the code
> for the slow path. I am thinking when a TLB misses, the outlined TLB lookup code
> should generate a call out to the qemu_ld/st_helpers[opc & ~MO_SIGN] and rewalk
> the TLB after its refilled ? This code is off the critical path, so its not as
> important as the code when TLB hits.
> 2. why not use a TLB or bigger size? currently the TLB has 1<<8 entries. the TLB
> lookup is 10 x86 instructions , but every miss needs ~450 instructions, i
> measured this using Intel PIN. so even the miss rate is low (say 3%) the overall
> time spent in the cpu_x86_handle_mmu_fault is still signifcant. I am thinking
> the tlb may need to be organized in a set associative fashion to reduce conflict
> miss, e.g. 2 way set associative to reduce the miss rate. or have a victim tlb
> that is 4 way associative and use x86 simd instructions to do the lookup once
> the direct-mapped tlb misses. Has anybody done any work on this front ?
> 3. what are some of the drawbacks of using a superlarge TLB, i.e. a TLB with 4K
> entries ?
Using vector intrinsics for the TLB lookup will probably make the code less
portable. I don't know how compatible the GCC and LLVM vector intrinsics are
with each other (since there have been some efforts to make QEMU also compile
with LLVM).
A larger TLB will make some operations slower (e.g., look for CPU_TLB_SIZE in
cputlb.c), but the higher hit ratio could pay off, although I don't know how the
current size was chosen.
Lluis
--
"And it's much the same thing with knowledge, for whenever you learn
something new, the whole world becomes that much richer."
-- The Princess of Pure Reason, as told by Norton Juster in The Phantom
Tollbooth
* Re: [Qemu-devel] outlined TLB lookup on x86
From: Xin Tong @ 2013-11-28 1:58 UTC (permalink / raw)
To: qemu-devel
Hi Lluís,
We can probably generate vector intrinsics using TCG, e.g. add support to TCG
to emit vector instructions directly into the code cache.
Why would a larger TLB make some operations slower? The TLB is a direct-mapped
hash and lookup should be O(1) there. In cputlb, CPU_TLB_SIZE is always used
to index into the TLB, i.e. (X & (CPU_TLB_SIZE - 1)).
Thank you
Xin
* Re: [Qemu-devel] outlined TLB lookup on x86
From: Richard Henderson @ 2013-11-28 2:12 UTC (permalink / raw)
To: Xin Tong, qemu-devel
On 11/27/2013 08:41 PM, Xin Tong wrote:
> I am trying to implement a out-of-line TLB lookup for QEMU softmmu-x86-64 on
> x86-64 machine, potentially for better instruction cache performance, I have a
> few questions.
>
> 1. I see that tcg_out_qemu_ld_slow_path/tcg_out_qemu_st_slow_path are generated
> when tcg_out_tb_finalize is called. And when a TLB lookup misses, it jumps to
> the generated slow path and slow path refills the TLB, then load/store and
> jumps to the next emulated instruction. I am wondering is it easy to outline
> the code for the slow path.
Hard. There's quite a bit of code on that slow path that's unique to the
surrounding code context -- which registers contain inputs and outputs, where
to continue after slow path.
The amount of code that's in the TB slow path now is approximately minimal, as
far as I can see. If you've got an idea for improvement, please share. ;-)
> I am thinking when a TLB misses, the outlined TLB
> lookup code should generate a call out to the qemu_ld/st_helpers[opc &
> ~MO_SIGN] and rewalk the TLB after its refilled ? This code is off the critical
> path, so its not as important as the code when TLB hits.
That would work for true TLB misses to RAM, but does not work for memory mapped
I/O.
> 2. why not use a TLB or bigger size? currently the TLB has 1<<8 entries. the
> TLB lookup is 10 x86 instructions , but every miss needs ~450 instructions, i
> measured this using Intel PIN. so even the miss rate is low (say 3%) the
> overall time spent in the cpu_x86_handle_mmu_fault is still signifcant.
I'd be interested to experiment with different TLB sizes, to see what effect
that has on performance. But I suspect that lack of TLB contexts mean that we
wind up flushing the TLB more often than real hardware does, and therefore a
larger TLB merely takes longer to flush.
But be aware that we can't simply make the change universally. E.g. ARM can
use an immediate 8-bit operand during the TLB lookup, but would have to use
several insns to perform a 9-bit mask.
> I am
> thinking the tlb may need to be organized in a set associative fashion to
> reduce conflict miss, e.g. 2 way set associative to reduce the miss rate. or
> have a victim tlb that is 4 way associative and use x86 simd instructions to do
> the lookup once the direct-mapped tlb misses. Has anybody done any work on this
> front ?
Even with SIMD, I don't believe you could make the fast-path of a set
associative lookup fast. This is the sort of thing for which you really need
the dedicated hardware of the real TLB. Feel free to prove me wrong with code,
of course.
r~
* Re: [Qemu-devel] outlined TLB lookup on x86
From: Xin Tong @ 2013-11-28 3:56 UTC (permalink / raw)
To: Richard Henderson; +Cc: qemu-devel
On Wed, Nov 27, 2013 at 6:12 PM, Richard Henderson <rth@twiddle.net> wrote:
> On 11/27/2013 08:41 PM, Xin Tong wrote:
> > I am trying to implement a out-of-line TLB lookup for QEMU
> > softmmu-x86-64 on x86-64 machine, potentially for better instruction
> > cache performance, I have a few questions.
> >
> > 1. I see that tcg_out_qemu_ld_slow_path/tcg_out_qemu_st_slow_path are
> > generated when tcg_out_tb_finalize is called. And when a TLB lookup
> > misses, it jumps to the generated slow path and slow path refills the
> > TLB, then load/store and jumps to the next emulated instruction. I am
> > wondering is it easy to outline the code for the slow path.
>
> Hard. There's quite a bit of code on that slow path that's unique to the
> surrounding code context -- which registers contain inputs and outputs,
> where to continue after slow path.
>
> The amount of code that's in the TB slow path now is approximately
> minimal, as far as I can see. If you've got an idea for improvement,
> please share. ;-)
>
> > I am thinking when a TLB misses, the outlined TLB lookup code should
> > generate a call out to the qemu_ld/st_helpers[opc & ~MO_SIGN] and
> > rewalk the TLB after its refilled ? This code is off the critical path,
> > so its not as important as the code when TLB hits.
>
> That would work for true TLB misses to RAM, but does not work for memory
> mapped I/O.
>
> > 2. why not use a TLB or bigger size? currently the TLB has 1<<8
> > entries. the TLB lookup is 10 x86 instructions, but every miss needs
> > ~450 instructions, i measured this using Intel PIN. so even the miss
> > rate is low (say 3%) the overall time spent in the
> > cpu_x86_handle_mmu_fault is still signifcant.
>
> I'd be interested to experiment with different TLB sizes, to see what
> effect that has on performance. But I suspect that lack of TLB contexts
> mean that we wind up flushing the TLB more often than real hardware does,
> and therefore a larger TLB merely takes longer to flush.
Hardware TLBs are limited in size primarily because increasing their size
increases their access latency as well, but a software TLB does not suffer
from that problem, so I think the size of the soft TLB should not be dictated
by the size of the hardware TLB.
Flushing the TLB is cheap unless we have a really, really large TLB, e.g. a
TLB with 1M entries. I vaguely remember seeing ~8% of the time spent in the
cpu_x86_handle_mmu_fault function in one of the SPEC CPU2006 workloads some
time ago, so if we increase the size of the TLB significantly and get rid of
most of the TLB misses, we can get rid of most of that 8%. (There are still
compulsory misses and a few conflict misses, but I think compulsory misses are
not the major player here.)
>
> But be aware that we can't simply make the change universally. E.g. ARM
> can use an immediate 8-bit operand during the TLB lookup, but would have
> to use several insns to perform a 9-bit mask.
This can be handled with ifdefs; most of the TLB code common to all CPUs need
not be changed.
>
>
> > I am thinking the tlb may need to be organized in a set associative
> > fashion to reduce conflict miss, e.g. 2 way set associative to reduce
> > the miss rate. or have a victim tlb that is 4 way associative and use
> > x86 simd instructions to do the lookup once the direct-mapped tlb
> > misses. Has anybody done any work on this front ?
>
> Even with SIMD, I don't believe you could make the fast-path of a set
> associative lookup fast. This is the sort of thing for which you really
> need the dedicated hardware of the real TLB. Feel free to prove me wrong
> with code, of course.
>
I am thinking the primary TLB should remain what it is, i.e. direct mapped,
but we can have a victim TLB with higher associativity. The victim TLB can be
walked either sequentially or in parallel with SIMD instructions. It is going
to be slower than hitting in the direct-mapped TLB, but (much) better than
having to rewalk the page table. A rough sketch of what I mean is below.
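This is only an illustration of the sequential probe; the names and sizes are
made up for the example and are not actual QEMU code:

/* Illustrative sketch of a small fully associative victim TLB probed
   sequentially after a primary-TLB miss.  Names and sizes are hypothetical;
   unused entries are assumed to hold an invalid tag such as -1. */
#include <stdbool.h>
#include <stdint.h>

#define TARGET_PAGE_BITS 12
#define TARGET_PAGE_MASK (~(((uint64_t)1 << TARGET_PAGE_BITS) - 1))
#define VTLB_ENTRIES     8

typedef struct {
    uint64_t  page;     /* page-aligned guest virtual address */
    uintptr_t addend;   /* guest-virtual to host-virtual offset */
} vtlb_entry;

static vtlb_entry vtlb[VTLB_ENTRIES];

/* Returns true on a hit; a real implementation would also swap the hit entry
   back into the direct-mapped primary TLB so the next access hits there. */
static bool vtlb_lookup(uint64_t vaddr, uintptr_t *addend)
{
    uint64_t page = vaddr & TARGET_PAGE_MASK;
    for (int i = 0; i < VTLB_ENTRIES; i++) {
        if (vtlb[i].page == page) {
            *addend = vtlb[i].addend;
            return true;
        }
    }
    return false;   /* fall through to the page-table walk */
}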
For an OS with ASIDs, we can have a shared TLB tagged with the ASID, and this
can potentially get rid of some compulsory misses for us: e.g. multiple
threads in a process share the same ASID, so we can check the shared TLB on a
miss before walking the page table.
>
>
> r~
>
* Re: [Qemu-devel] outlined TLB lookup on x86
From: Lluís Vilanova @ 2013-11-28 16:12 UTC (permalink / raw)
To: Xin Tong; +Cc: qemu-devel
Xin Tong writes:
> Hi LIuis
> we can probably generate vector intrinsics using the tcg, e.g. add support to
> tcg to emit vector instructions directly in code cache
There was some discussion long ago about adding vector instructions to TCG,
but I don't remember what the conclusion was.
Also remember that using vector instructions will "emulate" a
low-associativity TLB; I don't know how much better than a 1-way TLB that
will be, though.
> why would a larger TLB make some operations slower, the TLB is a direct-mapped
> hash and lookup should be O(1) there. In the cputlb, the CPU_TLB_SIZE is always
> used to index into the TLB, i.e. (X & (CPU_TLB_SIZE -1)).
It would make TLB invalidations slower (e.g., see 'tlb_flush' in
"cputlb.c"). And right now QEMU performs full TLB invalidations more frequently
than the equivalent HW needs to, although I suppose that should be quantified
too.
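For instance, a full flush has to touch every entry, so its cost grows with
the table size; a rough illustration of the shape of that cost (simplified
names, not the actual cputlb.c code):

/* Illustration of why a full flush scales with the table size: every entry
   in every MMU mode must be reset.  Simplified names, not cputlb.c itself. */
#include <string.h>

#define NB_MMU_MODES 3
#define CPU_TLB_SIZE (1 << 8)

typedef struct {
    unsigned long addr_read, addr_write, addr_code;
    unsigned long addend;
} tlb_entry;

static tlb_entry tlb_table[NB_MMU_MODES][CPU_TLB_SIZE];

static void tlb_flush_all(void)
{
    /* O(NB_MMU_MODES * CPU_TLB_SIZE): making the table 16x larger makes
       this memset 16x larger on every full invalidation. */
    memset(tlb_table, -1, sizeof(tlb_table));
}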
Lluis
--
"And it's much the same thing with knowledge, for whenever you learn
something new, the whole world becomes that much richer."
-- The Princess of Pure Reason, as told by Norton Juster in The Phantom
Tollbooth
* Re: [Qemu-devel] outlined TLB lookup on x86
From: Xin Tong @ 2013-12-08 10:54 UTC (permalink / raw)
To: Xin Tong, qemu-devel
On Thu, Nov 28, 2013 at 8:12 AM, Lluís Vilanova <vilanova@ac.upc.edu> wrote:
> Xin Tong writes:
>
> > Hi LIuis
> > we can probably generate vector intrinsics using the tcg, e.g. add
> > support to tcg to emit vector instructions directly in code cache
>
> There was some discussion long ago about adding vector instructions to
> TCG, but I don't remember what was the conclusion.
>
> Also remember that using vector instructions will "emulate" a
> low-associativity TLB; don't know how much better than a 1-way TLB will
> that be, though.
>
> > why would a larger TLB make some operations slower, the TLB is a
> > direct-mapped hash and lookup should be O(1) there. In the cputlb, the
> > CPU_TLB_SIZE is always used to index into the TLB, i.e. (X &
> > (CPU_TLB_SIZE -1)).
>
> It would make TLB invalidations slower (e.g., see 'tlb_flush' in
> "cputlb.c"). And right now QEMU performs full TLB invalidations more
> frequently than the equivalent HW needs to, although I suppose that
> should be quantified too.
>
You are right, Lluís. QEMU does context switch much more often than real hw;
this is probably primarily due to the fact that QEMU is orders of magnitude
slower than real hw. I am wondering where the timer is emulated in QEMU
system-x86_64. I imagine the guest OS must program the timers to get
interrupts for context switches.
Another question: what happens when a vcpu is stuck in an infinite loop? QEMU
must need a timer interrupt somewhere as well?
Is my understanding correct?
Xin
* Re: [Qemu-devel] outlined TLB lookup on x86
From: Avi Kivity @ 2013-12-08 11:19 UTC (permalink / raw)
To: Richard Henderson, Xin Tong, qemu-devel
On 11/28/2013 04:12 AM, Richard Henderson wrote:
>> 2. why not use a TLB or bigger size? currently the TLB has 1<<8 entries. the
>> TLB lookup is 10 x86 instructions , but every miss needs ~450 instructions, i
>> measured this using Intel PIN. so even the miss rate is low (say 3%) the
>> overall time spent in the cpu_x86_handle_mmu_fault is still signifcant.
> I'd be interested to experiment with different TLB sizes, to see what effect
> that has on performance. But I suspect that lack of TLB contexts mean that we
> wind up flushing the TLB more often than real hardware does, and therefore a
> larger TLB merely takes longer to flush.
>
You could use a generation counter to flush the TLB in O(1) by
incrementing the counter. That slows down the fast path though. Maybe
you can do that for the larger second level TLB only.
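A minimal sketch of that idea for a hypothetical second-level TLB, with
entries tagged by the generation in which they were filled (illustrative
names and sizes only, not QEMU code):

/* Sketch of O(1) invalidation for a second-level TLB: a flush just bumps the
   generation, and stale entries fail the generation check on lookup.
   Counter wrap-around would need handling in a real implementation. */
#include <stdbool.h>
#include <stdint.h>

#define L2_TLB_SIZE 4096

typedef struct {
    uint64_t  page;        /* page-aligned guest virtual address */
    uintptr_t addend;
    uint32_t  generation;  /* generation in which this entry was filled */
} l2_tlb_entry;

static l2_tlb_entry l2_tlb[L2_TLB_SIZE];
static uint32_t l2_generation = 1;

static void l2_tlb_flush(void)
{
    l2_generation++;   /* O(1): every existing entry becomes stale */
}

static bool l2_tlb_lookup(uint64_t page, unsigned idx, uintptr_t *addend)
{
    /* The extra generation compare is the fast-path cost mentioned above. */
    if (l2_tlb[idx].generation == l2_generation && l2_tlb[idx].page == page) {
        *addend = l2_tlb[idx].addend;
        return true;
    }
    return false;
}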
* Re: [Qemu-devel] outlined TLB lookup on x86
From: Xin Tong @ 2013-12-09 12:18 UTC (permalink / raw)
To: Xin Tong, qemu-devel
On Thu, Nov 28, 2013 at 8:12 AM, Lluís Vilanova <vilanova@ac.upc.edu> wrote:
> Xin Tong writes:
>
> > Hi LIuis
> > we can probably generate vector intrinsics using the tcg, e.g. add
> > support to tcg to emit vector instructions directly in code cache
>
> There was some discussion long ago about adding vector instructions to
> TCG, but I don't remember what was the conclusion.
>
Hi Lluís,
Can you please forward me that email if it is not difficult to find?
Otherwise it is OK.
Thank you
Xin
* Re: [Qemu-devel] outlined TLB lookup on x86
From: Lluís Vilanova @ 2013-12-09 15:31 UTC (permalink / raw)
To: Xin Tong; +Cc: qemu-devel
Xin Tong writes:
> On Thu, Nov 28, 2013 at 8:12 AM, Lluís Vilanova <vilanova@ac.upc.edu> wrote:
> Xin Tong writes:
>> Hi LIuis
>> we can probably generate vector intrinsics using the tcg, e.g. add support
>> to tcg to emit vector instructions directly in code cache
> There was some discussion long ago about adding vector instructions to TCG,
> but I don't remember what was the conclusion.
> Hi LIuis
> Can you please forward me that email if it is not difficult to find.
> otherwise, it is ok!.
Sorry, I can't remember when it was. You'll have to look it up in the mailing
list archive.
Lluis
--
"And it's much the same thing with knowledge, for whenever you learn
something new, the whole world becomes that much richer."
-- The Princess of Pure Reason, as told by Norton Juster in The Phantom
Tollbooth
* Re: [Qemu-devel] outlined TLB lookup on x86
From: Xin Tong @ 2013-12-17 13:52 UTC (permalink / raw)
To: Xin Tong, QEMU Developers
On Sun, Dec 8, 2013 at 2:54 AM, Xin Tong <trent.tong@gmail.com> wrote:
>
>
>
> On Thu, Nov 28, 2013 at 8:12 AM, Lluís Vilanova <vilanova@ac.upc.edu> wrote:
>>
>> Xin Tong writes:
>>
>> > Hi LIuis
>> > we can probably generate vector intrinsics using the tcg, e.g. add
>> > support to
>> > tcg to emit vector instructions directly in code cache
>>
>> There was some discussion long ago about adding vector instructions to
>> TCG, but
>> I don't remember what was the conclusion.
>>
>> Also remember that using vector instructions will "emulate" a
>> low-associativity
>> TLB; don't know how much better than a 1-way TLB will that be, though.
>>
>>
>> > why would a larger TLB make some operations slower, the TLB is a
>> > direct-mapped
>> > hash and lookup should be O(1) there. In the cputlb, the CPU_TLB_SIZE is
>> > always
>> > used to index into the TLB, i.e. (X & (CPU_TLB_SIZE -1)).
>>
>> It would make TLB invalidations slower (e.g., see 'tlb_flush' in
>> "cputlb.c"). And right now QEMU performs full TLB invalidations more
>> frequently
>> than the equivalent HW needs to, although I suppose that should be
>> quantified
>> too.
I see QEMU executes ~1M instructions per context switch for
qemu-system-x86_64. Is this because the periodic timer interrupt is delivered
in real time while QEMU is significantly slower than real hw?
Xin
* Re: [Qemu-devel] outlined TLB lookup on x86
From: Xin Tong @ 2013-12-18 2:22 UTC (permalink / raw)
To: Xin Tong, QEMU Developers
Why is the QEMU TLB organized based on the modes (e.g. on x86 there are 3
modes)? What I think is that there may be conflicts between virtual-address
translations made in different modes; organizing it by mode guarantees that
QEMU does not hit a translation entry belonging to another mode when in user
mode, and vice versa?
Thank you,
Xin
* Re: [Qemu-devel] outlined TLB lookup on x86
From: Xin Tong @ 2014-01-21 14:22 UTC (permalink / raw)
To: QEMU Developers
Hi,
I have found that adding a small (8-entry) fully associative victim
TLB (http://en.wikipedia.org/wiki/Victim_Cache) in front of the refill path
(page table walking) improves the performance of QEMU x86_64 system
emulation mode significantly on the SPECint2006 benchmarks. This is
primarily because the primary TLB is direct mapped and suffers from conflict
misses. I have this implemented on QEMU trunk and would like to contribute it
back to QEMU. Where should I start?
Xin
* Re: [Qemu-devel] outlined TLB lookup on x86
From: Peter Maydell @ 2014-01-21 14:28 UTC (permalink / raw)
To: Xin Tong; +Cc: QEMU Developers
On 21 January 2014 14:22, Xin Tong <trent.tong@gmail.com> wrote:
> I have found that adding a small (8-entry) fully associative victim
> TLB (http://en.wikipedia.org/wiki/Victim_Cache) before the refill path
> (page table walking) improves the performance of QEMU x86_64 system
> emulation mode significantly on the specint2006 benchmarks. This is
> primarily due to the fact that the primary TLB is directly mapped and
> suffer from conflict misses. I have this implemented on QEMU trunk and
> would like to contribute this back to QEMU. Where should i start ?
The wiki page http://wiki.qemu.org/Contribute/SubmitAPatch
tries to describe our usual process for reviewing code
submissions. If you make sure your changes follow the
guidelines described there and then send them to the mailing
list as a series of patch emails in the right format, we
can start reviewing the code.
thanks
-- PMM
* Re: [Qemu-devel] outlined TLB lookup on x86
From: Xin Tong @ 2014-01-22 15:28 UTC (permalink / raw)
To: Richard Henderson; +Cc: QEMU Developers
On Wed, Nov 27, 2013 at 8:12 PM, Richard Henderson <rth@twiddle.net> wrote:
> On 11/27/2013 08:41 PM, Xin Tong wrote:
>> I am trying to implement a out-of-line TLB lookup for QEMU softmmu-x86-64 on
>> x86-64 machine, potentially for better instruction cache performance, I have a
>> few questions.
>>
>> 1. I see that tcg_out_qemu_ld_slow_path/tcg_out_qemu_st_slow_path are generated
>> when tcg_out_tb_finalize is called. And when a TLB lookup misses, it jumps to
>> the generated slow path and slow path refills the TLB, then load/store and
>> jumps to the next emulated instruction. I am wondering is it easy to outline
>> the code for the slow path.
>
> Hard. There's quite a bit of code on that slow path that's unique to the
> surrounding code context -- which registers contain inputs and outputs, where
> to continue after slow path.
>
> The amount of code that's in the TB slow path now is approximately minimal, as
> far as I can see. If you've got an idea for improvement, please share. ;-)
>
>
>> I am thinking when a TLB misses, the outlined TLB
>> lookup code should generate a call out to the qemu_ld/st_helpers[opc &
>> ~MO_SIGN] and rewalk the TLB after its refilled ? This code is off the critical
>> path, so its not as important as the code when TLB hits.
>
> That would work for true TLB misses to RAM, but does not work for memory mapped
> I/O.
>
>> 2. why not use a TLB or bigger size? currently the TLB has 1<<8 entries. the
>> TLB lookup is 10 x86 instructions , but every miss needs ~450 instructions, i
>> measured this using Intel PIN. so even the miss rate is low (say 3%) the
>> overall time spent in the cpu_x86_handle_mmu_fault is still signifcant.
>
> I'd be interested to experiment with different TLB sizes, to see what effect
> that has on performance. But I suspect that lack of TLB contexts mean that we
> wind up flushing the TLB more often than real hardware does, and therefore a
> larger TLB merely takes longer to flush.
>
> But be aware that we can't simply make the change universally. E.g. ARM can
> use an immediate 8-bit operand during the TLB lookup, but would have to use
> several insns to perform a 9-bit mask.
Hi Richard,
I've done some experiments on increasing the size of the TLB.
Increasing the size of the TLB from 256 entries to 4096 entries gives a
significant performance improvement on the SPECint2006 benchmarks on
qemu-system-x86_64 running on an x86_64 Linux machine. I am in the
process of exploring more TLB sizes and will post the data after I am
done.
Can you tell me whether ARM is the only architecture that requires
special treatment for increasing the TLB size beyond 256 entries, so that I
can whip up a patch for QEMU mainline?
Thank you,
Xin
* Re: [Qemu-devel] outlined TLB lookup on x86
From: Richard Henderson @ 2014-01-22 16:34 UTC (permalink / raw)
To: Xin Tong; +Cc: QEMU Developers
On 01/22/2014 07:28 AM, Xin Tong wrote:
> Can you tell me whether ARM is the only architecture that requires
> special treatment for increasing tlb size beyond 256 entries so that i
> can whip up a patch to the QEMU mainline.
The major constraint for the non-ARM ports is
CPU_TLB_ENTRY_BITS + CPU_TLB_BITS < immediate bit size
i.e. that
(CPU_TLB_SIZE - 1) << CPU_TLB_ENTRY_BITS
is representable as an immediate within an AND instruction.
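For illustration, assuming a 32-byte TLB entry (CPU_TLB_ENTRY_BITS = 5, which
is an assumption for this example, not a quoted QEMU value), the mask grows
like this:

/* Illustrative arithmetic only; CPU_TLB_ENTRY_BITS = 5 assumes a 32-byte
   TLB entry. */
#define CPU_TLB_ENTRY_BITS 5
#define TLB_MASK(tlb_bits) (((1u << (tlb_bits)) - 1) << CPU_TLB_ENTRY_BITS)

/* CPU_TLB_BITS = 8  -> mask = 0xff  << 5 = 0x1fe0  (13 significant bits)
   CPU_TLB_BITS = 11 -> mask = 0x7ff << 5 = 0xffe0  (16 bits: the largest
                        size that still fits a 16-bit unsigned AND immediate)
   CPU_TLB_BITS = 12 -> mask = 0xfff << 5 = 0x1ffe0 (17 bits: now needs extra
                        instructions to form the constant on such hosts) */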
MIPS has a 16-bit unsigned immediate, and as written would generate bad code
for CPU_TLB_BITS > 11.
I386 has a 32-bit signed immediate, and would generate bad code for
CPU_TLB_BITS > 26. Though I can't imagine you want to make it that big.
SPARC has a 13-bit signed immediate, but it's written with a routine which
checks the size of the constant and loads it if necessary. Which is good,
because that's clearly already happening for CPU_TLB_BITS > 7.
AArch64, ia64, ppc, ppc64 all use fully capable extract-bit-field type insns
and could handle any change you make.
S390 is written using generic routines like sparc, so it won't fail with any
change. It ought to be adjusted to use the extract-bit-field type insns that
exist in the current generation of machines. The oldest generation of machine
would have reduced performance with CPU_TLB_BITS > 11.
ARM is also a case in which armv6t2 and later could be written with an
extract-bit-field insn, but previous versions would need to use 2 insns to form
the constant. But at least we'd be able to combine the shift and and insns.
r~
* Re: [Qemu-devel] outlined TLB lookup on x86
From: Peter Maydell @ 2014-01-22 16:55 UTC (permalink / raw)
To: Xin Tong; +Cc: QEMU Developers, Richard Henderson
On 22 January 2014 15:28, Xin Tong <trent.tong@gmail.com> wrote:
> On Wed, Nov 27, 2013 at 8:12 PM, Richard Henderson <rth@twiddle.net> wrote:
>> I'd be interested to experiment with different TLB sizes, to see what effect
>> that has on performance. But I suspect that lack of TLB contexts mean that we
>> wind up flushing the TLB more often than real hardware does, and therefore a
>> larger TLB merely takes longer to flush.
> I've done some experiments on increasing the size of the tlb.
> increasing the size of the tlb from 256 entries to 4096 entries gives
> significant performance improvement on the specint2006 benchmarks on
> qemu-system-x86_64 running on a x86_64 linux machine . i am in the
> process of exploring more tlb sizes and will post the data after i am
> done.
Of course a single big benchmark program is probably the best case
for "not having lots of TLB flushing". It would probably also be instructive
to benchmark other cases, like OS bootup, running multiple different
programs simultaneously and system call heavy workloads.
Has anybody ever looked at implementing proper TLB contexts?
thanks
-- PMM
* Re: [Qemu-devel] outlined TLB lookup on x86
From: Richard Henderson @ 2014-01-22 17:32 UTC (permalink / raw)
To: Peter Maydell, Xin Tong; +Cc: QEMU Developers
On 01/22/2014 08:55 AM, Peter Maydell wrote:
> Has anybody ever looked at implementing proper TLB contexts?
I've thought about it. The best I could come up with is a pointer within ENV
that points to the current TLB context. It definitely adds another load insn
on the fast path, but we should be able to schedule that first, since it
depends on nothing but the mem_index constant. Depending on the schedule, it
may require reserving another register on the fast path, which could be a
problem for i686.
It would also greatly expand the size of ENV.
E.g. Alpha would need to implement 256 contexts to match the hardware. We
currently get away with pretending to implement contexts by implementing none
at all, and flushing the TLB at every context change.
Our current TLB size is 8k. Times 256 contexts is 2MB. Which might just be
within the range of possibility. Certainly not if we expand the size of the
individual TLBs.
Although interestingly, for Alpha we don't need 256 * NB_MMU_MODES, because
MMU_KERNEL_IDX always uses context 0, the "global context".
I don't recall enough about the intimate details of other TLB hardware to know
if there are similar savings that can be had elsewhere. But the amount of
memory involved is large enough to suggest that some sort of target-specific
sizing of the number of contexts might be required.
r~
* Re: [Qemu-devel] outlined TLB lookup on x86
From: Peter Maydell @ 2014-01-22 17:35 UTC (permalink / raw)
To: Richard Henderson; +Cc: Xin Tong, QEMU Developers
On 22 January 2014 17:32, Richard Henderson <rth@twiddle.net> wrote:
> On 01/22/2014 08:55 AM, Peter Maydell wrote:
>> Has anybody ever looked at implementing proper TLB contexts?
>
> I've thought about it. The best I could come up with is a pointer within ENV
> that points to the current TLB context. It definitely adds another load insn
> on the fast path, but we should be able to schedule that first, since it
> depends on nothing but the mem_index constant. Depending on the schedule, it
> may require reserving another register on the fast path, which could be a
> problem for i686.
>
> It would also greatly expand the size of ENV.
>
> E.g. Alpha would need to implement 256 contexts to match the hardware. We
> currently get away with pretending to implement contexts by implementing none
> at all, and flushing the TLB at every context change.
I don't really know the details of Alpha, but can you get away with just
"we implement N contexts, and only actually keep the most recently
used N"? This is effectively what we're doing at the moment, with N==1.
(ASIDs on ARM are 16 bit now, so we definitely wouldn't want to keep
an entire TLB for each ASID; if we ever implemented virtualization there's
another 8 bits of VMID and each VMID has its own ASID range...)
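Something like the following rough sketch is what I have in mind: the N most
recently used ASIDs each keep their own TLB, and the least recently used slot
is recycled (and flushed) on a switch to an uncached ASID. The structure and
names are purely illustrative, not an actual proposal for the env layout:

/* Sketch of caching the N most recently used ASIDs, each owning its own
   software TLB.  Slot 0 starts out holding ASID 0; a real implementation
   would track validity explicitly. */
#include <stdint.h>

#define NB_TLB_CONTEXTS 4   /* "N"; current behaviour corresponds to N == 1 */

typedef struct {
    uint16_t asid;
    uint64_t last_used;     /* for LRU victim selection */
    /* tlb_entry tlb[NB_MMU_MODES][CPU_TLB_SIZE] would live here */
} tlb_context;

static tlb_context contexts[NB_TLB_CONTEXTS];
static uint64_t use_clock;

/* Called from the ASID-switch helper; returns the slot that env's
   current-TLB pointer should be switched to. */
static tlb_context *switch_asid(uint16_t asid)
{
    tlb_context *victim = &contexts[0];
    for (int i = 0; i < NB_TLB_CONTEXTS; i++) {
        if (contexts[i].asid == asid) {
            contexts[i].last_used = ++use_clock;
            return &contexts[i];          /* cached: no flush needed */
        }
        if (contexts[i].last_used < victim->last_used) {
            victim = &contexts[i];
        }
    }
    victim->asid = asid;                  /* miss: recycle the LRU slot */
    victim->last_used = ++use_clock;
    /* flush victim's TLB here before reuse */
    return victim;
}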
thanks
-- PMM
* Re: [Qemu-devel] outlined TLB lookup on x86
From: Richard Henderson @ 2014-01-22 17:45 UTC (permalink / raw)
To: Peter Maydell; +Cc: Xin Tong, QEMU Developers
On 01/22/2014 09:35 AM, Peter Maydell wrote:
> I don't really know the details of Alpha, but can you get away with just
> "we implement N contexts, and only actually keep the most recently
> used N"? This is effectively what we're doing at the moment, with N==1.
Yes, I suppose we could do that. Rather than have the ASN switching code be a
simple store, it could be a full helper function. At which point we can do
just about anything at all.
r~
* Re: [Qemu-devel] outlined TLB lookup on x86
From: Xin Tong @ 2014-01-22 17:56 UTC (permalink / raw)
To: Richard Henderson; +Cc: Peter Maydell, QEMU Developers
I have submitted a patch to the qemu-devel list implementing a victim TLB in
QEMU. I should have you both CC'ed on the patch email so that you can help
review it in case no one else is reviewing it. The name of the patch is
[Qemu-devel] [PATCH] cpu: implementing victim TLB for QEMU system emulated TLB
Thank you,
Xin