[Qemu-devel] Should we introduce a TranslationRegion with its own codegen buffer?

qemu-devel.nongnu.org archive mirror
 help / color / mirror / Atom feed

* [Qemu-devel] Should we introduce a TranslationRegion with its own codegen buffer?
@ 2016-04-04  8:54 Alex Bennée
  2016-04-04  9:01 ` Paolo Bonzini
  0 siblings, 1 reply; 7+ messages in thread
From: Alex Bennée @ 2016-04-04  8:54 UTC (permalink / raw)
  To: QEMU Developers, MTTCG Devel
  Cc: Peter Maydell, Sergey Fedorov, Richard Henderson, ".Cota",
	Paolo Bonzini

Hi,

While reviewing the recent TB patching cleanup patches I wondered if
there is a cleaner way of handling TB invalidation. Currently we have a
single code generation buffer which slowly fills up with TBs as we
execute code. These TBs are chained together if they exist in the same
physical page (so we always exit to the run-loop if crossing a page
boundary).

We hold a bunch if extra information in the TBs to facilitate looking
things up. We have:

    struct TranslationBlock *phys_hash_next;

to facilitate looking up TBs which have matching hashes in the physical
address lookup. We also have:

    uintptr_t jmp_list_next[2];
    uintptr_t jmp_list_first;

Which are used for unwinding any jump patching should we invalidate a
page and hence don't want code jumping to potentially invalid
translations.

We also have a number of associated jump caches held against each CPU
which is used to optimise re-entry into generated code as we go round
the main run-loop. These also have to be cleanly invalidated as TBs are
marked invalid.

Finally as the TBs are generated on demand the actual code may not be
locally jump-able which makes atomic patching of the jumps trickier to
do.

TB invalidation is almost always due to page mapping changes although
SMC code and debugging are also causes for throwing away translations.
I'm wondering if it is time to add a layer of indirection to simplify
the process?

If we introduce a TranslationRegion which could initially cover a pages
worth of code. It would have its own code generation buffer protected by
an RCU lock to make it easier to swap out on code buffers is a clean
manner:

  Normal Execution (cpu_exec):
    - lookup TranslationRegion
    - take RCU read lock
    - lookup-or-generate TB
    - jump into code
    - exit TB
    - release RCU read lock

  Invalidation of Page:
    - lookup TranslationRegion
    - take RCU write lock
      - create fresh empty region
      - signal cpu_exit to all vCPUs
    - release RCU write lock
    - take RCU read lock
    - lookup-or-generate TB
    - jump into code
    - exit TB
    - release RCU read lock*

* when the last vCPU releases the read lock on the old code it can be
  cleanly thrown away. No fiddly jump patching required.

There are some potential optimisation's that could be made to this system
as well.

Jump patching would become easier on backends with limited jump ranges
as local code is kept together in a shared code buffer.

For one there is no reason the area covered by a TranslationRegion has
to be a page. For example the kernel segment once mapped will never
change. Then all internal TBs could still be chained together.

I'm sure there is scope for localising the jump cache to regions as
there are likely to be only a few entry points to any given page with
the rest all being internal branches for loops and conditionals.

The only thing I can currently think that may be a problem is
potentially causing heap fragmentation by having a large number of code
buffers. This could probably be ameliorated by using custom allocation
routines for the code buffers.

I'm going to have a bit of a play to see what this sort of solution
would look like in the code but I thought I'd sketch the idea out to see
if there are any obvious glaring holes or others things to consider.

Thoughts, objections? Discuss ;-)

--
Alex Bennée

^ permalink raw reply	[flat|nested] 7+ messages in thread

* Re: [Qemu-devel] Should we introduce a TranslationRegion with its own codegen buffer?
  2016-04-04  8:54 [Qemu-devel] Should we introduce a TranslationRegion with its own codegen buffer? Alex Bennée
@ 2016-04-04  9:01 ` Paolo Bonzini
  2016-04-04 10:39   ` Sergey Fedorov
  0 siblings, 1 reply; 7+ messages in thread
From: Paolo Bonzini @ 2016-04-04  9:01 UTC (permalink / raw)
  To: Alex Bennée, QEMU Developers, MTTCG Devel
  Cc: Peter Maydell, Sergey Fedorov

On 04/04/2016 10:54, Alex Bennée wrote:
> Thoughts, objections? Discuss ;-)

I think we're putting a lot of carts before the horse.

We have like half a dozen subprojects and none are moving because
there's no clear idea of what to do next and why.  The first thing to do
is to focus on fixing and speeding up user-mode MTTCG (Sergey's work on
upstreaming the patches from Fred and me) and on upstreaming Alvise's
work on ll/sc and TLB flushes.

The second thing to do is to make tb_flush thread safe; here there are
three competing mechanisms (Fred's run_safe_on_cpu, Emilio's mass
invalidation and my idea of using RCU) that we can discuss.  Your idea
here is a fourth one.  It's not a bad one, not at all.  But even if it
has other positive side effects (e.g. easier jump patching), it's a
large project to embark on when there are at least three other
possibilities---two of them with code and the third (mine) being a
subset of yours.

Paolo

^ permalink raw reply	[flat|nested] 7+ messages in thread

* Re: [Qemu-devel] Should we introduce a TranslationRegion with its own codegen buffer?
  2016-04-04  9:01 ` Paolo Bonzini
@ 2016-04-04 10:39   ` Sergey Fedorov
  2016-04-04 11:24     ` Alex Bennée
  0 siblings, 1 reply; 7+ messages in thread
From: Sergey Fedorov @ 2016-04-04 10:39 UTC (permalink / raw)
  To: Paolo Bonzini, Alex Bennée, QEMU Developers, MTTCG Devel
  Cc: Peter Maydell

On 04/04/16 12:01, Paolo Bonzini wrote:
> On 04/04/2016 10:54, Alex Bennée wrote:
>> Thoughts, objections? Discuss ;-)
> I think we're putting a lot of carts before the horse.
>
> We have like half a dozen subprojects and none are moving because
> there's no clear idea of what to do next and why.  The first thing to do
> is to focus on fixing and speeding up user-mode MTTCG (Sergey's work on
> upstreaming the patches from Fred and me) and on upstreaming Alvise's
> work on ll/sc and TLB flushes.
>
> The second thing to do is to make tb_flush thread safe; here there are
> three competing mechanisms (Fred's run_safe_on_cpu, Emilio's mass
> invalidation and my idea of using RCU) that we can discuss.  Your idea
> here is a fourth one.  It's not a bad one, not at all.  But even if it
> has other positive side effects (e.g. easier jump patching), it's a
> large project to embark on when there are at least three other
> possibilities---two of them with code and the third (mine) being a
> subset of yours.
>
>

I would agree with Paolo that this may be a bit ahead of time. But
actually, this is an interesting idea we should definitely keep in mind
while we're focused on more immediate work and concerning the overall
design on MTTCG project.

Thanks,
Sergey

^ permalink raw reply	[flat|nested] 7+ messages in thread

* Re: [Qemu-devel] Should we introduce a TranslationRegion with its own codegen buffer?
  2016-04-04 10:39   ` Sergey Fedorov
@ 2016-04-04 11:24     ` Alex Bennée
  2016-04-04 11:33       ` Paolo Bonzini
  0 siblings, 1 reply; 7+ messages in thread
From: Alex Bennée @ 2016-04-04 11:24 UTC (permalink / raw)
  To: Sergey Fedorov
  Cc: MTTCG Devel, Peter Maydell, QEMU Developers, Cota, Paolo Bonzini,
	Richard Henderson

Sergey Fedorov <serge.fdrv@gmail.com> writes:

> On 04/04/16 12:01, Paolo Bonzini wrote:
>> On 04/04/2016 10:54, Alex Bennée wrote:
>>> Thoughts, objections? Discuss ;-)
>> I think we're putting a lot of carts before the horse.

It is a very valid objection. I don't want to add another major code
change in while getting everything else lined up. However I thought it
merited discussion if it makes solving some of the other problems
easier.

>>
>> We have like half a dozen subprojects and none are moving because
>> there's no clear idea of what to do next and why.  The first thing to do
>> is to focus on fixing and speeding up user-mode MTTCG (Sergey's work on
>> upstreaming the patches from Fred and me) and on upstreaming Alvise's
>> work on ll/sc and TLB flushes.

FWIW I've dropped the patch that moves tb_find_fast out of tb_lock in
the upcoming base-patches-v2 series because it:
  - weirdly breaks the pxe tests
  - is pretty ugly anyway

This is obviously going to have an impact on performance which bought me
back to should we be thinking more holistically about how to approach
this problem.

In terms of focus I'm currently interested in getting Sergey's cleanups
in (which stand on their own merit of being generally cleaner for TCG)
and having a solid base-patches that is of use to everyone for testing
out the various solutions.

After the base patches are out I was going to rebuild Fred's
multi_tcg_v8 tree as *a* working implementation for ARMv7 based on a
hopefully firm foundation. I'm also looking forward to Emilio's
upcoming tree and Alvise's stop-the-world solution.

>> The second thing to do is to make tb_flush thread safe; here there are
>> three competing mechanisms (Fred's run_safe_on_cpu, Emilio's mass
>> invalidation and my idea of using RCU) that we can discuss.  Your idea
>> here is a fourth one.  It's not a bad one, not at all.  But even if it
>> has other positive side effects (e.g. easier jump patching), it's a
>> large project to embark on when there are at least three other
>> possibilities---two of them with code and the third (mine) being a
>> subset of yours.

 - async_safe_work

  I think this works but I've certainly found corner cases. Under heavy
  load the various vCPU queues do not drain evenly so you end up with
  high memory usage and the later CPUs having 1000s of deferred tasks.

 - Emilio's mass invalidation

  I need to re-read his series to remind myself of what's going on here.
  To be honest as I know he was re-spinning I was going to wait until
  the next iteration.

 - RCU based approaches

  As you say the two approaches are related, one is just an extension of
  the other but with a little more indirection. I'm fairly confident the
  architecture works as I've seen it used in other similar systems but
  making it work for QEMU is potentially messy implementation detail.

> I would agree with Paolo that this may be a bit ahead of time. But
> actually, this is an interesting idea we should definitely keep in mind
> while we're focused on more immediate work and concerning the overall
> design on MTTCG project.

My thoughts was such an approach could obviate the need to clean up the
existing patching code. But it is a big change so there is a certain
wisdom in continuing with the current less radical clean-up even if we
envision a future when we can throw that new code away ;-)

--
Alex Bennée

^ permalink raw reply	[flat|nested] 7+ messages in thread

* Re: [Qemu-devel] Should we introduce a TranslationRegion with its own codegen buffer?
  2016-04-04 11:24     ` Alex Bennée
@ 2016-04-04 11:33       ` Paolo Bonzini
  2016-04-04 12:01         ` Alex Bennée
  0 siblings, 1 reply; 7+ messages in thread
From: Paolo Bonzini @ 2016-04-04 11:33 UTC (permalink / raw)
  To: Alex Bennée, Sergey Fedorov
  Cc: MTTCG Devel, Peter Maydell, Cota, QEMU Developers,
	Richard Henderson



On 04/04/2016 13:24, Alex Bennée wrote:
> FWIW I've dropped the patch that moves tb_find_fast out of tb_lock in
> the upcoming base-patches-v2 series because it:
>   - weirdly breaks the pxe tests

I'm happy to take a look.

>   - is pretty ugly anyway

What exactly makes it ugly?

Paolo

> This is obviously going to have an impact on performance which bought me
> back to should we be thinking more holistically about how to approach
> this problem.

^ permalink raw reply	[flat|nested] 7+ messages in thread

* Re: [Qemu-devel] Should we introduce a TranslationRegion with its own codegen buffer?
  2016-04-04 11:33       ` Paolo Bonzini
@ 2016-04-04 12:01         ` Alex Bennée
  2016-04-04 12:13           ` Paolo Bonzini
  0 siblings, 1 reply; 7+ messages in thread
From: Alex Bennée @ 2016-04-04 12:01 UTC (permalink / raw)
  To: Paolo Bonzini
  Cc: MTTCG Devel, Peter Maydell, QEMU Developers, Cota, Sergey Fedorov,
	Richard Henderson


Paolo Bonzini <pbonzini@redhat.com> writes:

> On 04/04/2016 13:24, Alex Bennée wrote:
>> FWIW I've dropped the patch that moves tb_find_fast out of tb_lock in
>> the upcoming base-patches-v2 series because it:
>>   - weirdly breaks the pxe tests
>
> I'm happy to take a look.

Be my guest:

https://github.com/stsquad/qemu/tree/mttcg/base-patches-v2-with-tb-fast-delock

It breaks on i386 and x86_64 softmmu targets with make check. I also got
a command line failing with a dump of the pxe bootfile and handbuilt seabios:

./x86_64-softmmu/qemu-system-x86_64 -display vnc=:0 -machine accel=tcg
-netdev user,id=net0,tftp=./,bootfile=tests/pxe-test-disk.raw -device
virtio-net-pci,netdev=net0,romfile=pc-bios/pxe-virtio.rom -chardev
stdio,id=seabios -device isa-debugcon,iobase=0x402,chardev=seabios -bios roms/seabios/out/bios.bin

>
>>   - is pretty ugly anyway
>
> What exactly makes it ugly?

Subjective I guess but the use of tb_lock_recursive() and
tb_lock_reset() seem not as clean as they could be. Is this stuff Fred
re-introduced?

>
> Paolo
>
>> This is obviously going to have an impact on performance which bought me
>> back to should we be thinking more holistically about how to approach
>> this problem.


--
Alex Bennée

^ permalink raw reply	[flat|nested] 7+ messages in thread

* Re: [Qemu-devel] Should we introduce a TranslationRegion with its own codegen buffer?
  2016-04-04 12:01         ` Alex Bennée
@ 2016-04-04 12:13           ` Paolo Bonzini
  0 siblings, 0 replies; 7+ messages in thread
From: Paolo Bonzini @ 2016-04-04 12:13 UTC (permalink / raw)
  To: Alex Bennée
  Cc: MTTCG Devel, Peter Maydell, QEMU Developers, Cota, Sergey Fedorov,
	Richard Henderson

On 04/04/2016 14:01, Alex Bennée wrote:
> > >   - is pretty ugly anyway
> >
> > What exactly makes it ugly?
>
> Subjective I guess but the use of tb_lock_recursive() and
> tb_lock_reset() seem not as clean as they could be. Is this stuff Fred
> re-introduced?

I think tb_lock_recursive is mine, while tb_lock_reset is from Fred.

I concede that those functions are a bit ugly, but that is unrelated to
the "tb_find_fast outside tb_lock" idea (which was from Fred and gives a
~20% or so speedup at -smp 2; more as the number of VCPUs increases).

You could make tb_lock recursive I guess if you want to avoid
tb_lock_recursive() but it really just sweeps it under the carpet.
tb_lock_reset() is needed anyway because it is used after longjmp.

Paolo

^ permalink raw reply	[flat|nested] 7+ messages in thread

end of thread, other threads:[~2016-04-04 12:13 UTC | newest]

Thread overview: 7+ messages (download: mbox.gz follow: Atom feed
-- links below jump to the message on this page --
2016-04-04  8:54 [Qemu-devel] Should we introduce a TranslationRegion with its own codegen buffer? Alex Bennée
2016-04-04  9:01 ` Paolo Bonzini
2016-04-04 10:39   ` Sergey Fedorov
2016-04-04 11:24     ` Alex Bennée
2016-04-04 11:33       ` Paolo Bonzini
2016-04-04 12:01         ` Alex Bennée
2016-04-04 12:13           ` Paolo Bonzini

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).