* [Qemu-devel] tcg: reworking tb_invalidated_flag
From: Sergey Fedorov @ 2016-03-30 17:08 UTC
To: QEMU Developers, Paolo Bonzini, Richard Henderson,
Peter Crosthwaite
Cc: Sergey Fedorov, Alex Bennée
Hi,
This is a follow-up to the discussion in [1]. The main focus is to move
towards thread-safe TB invalidation and translation buffer flush. In
addition, we can get cleaner, more readable and more reliable code.
First, I'd like to summarize how 'tb_invalidated_flag' is used.
Basically, 'tb_invalidated_flag' is used to catch two events:
* some TB has been invalidated by tb_phys_invalidate();
* the whole translation buffer has been flushed by tb_flush().
This is important because we need to be sure that:
* the last executed TB can be safely patched by tb_add_jump() to
directly call the next one in cpu_exec();
* the original TB is recorded in 'tb->orig_tb' in cpu_exec_nocache(),
so that it can later be invalidated along with the temporarily
generated TB.
cpu_exec_nocache() is the simple case, because it is not as hot a piece
of code as cpu_exec(), and it is only used in system mode, which is not
currently driven from multiple threads (though it could be in MTTCG).
Supposing it is safe to invalidate an already invalidated TB, it only
needs to check whether tb_flush() has been called during tb_gen_code().
This can be done by resetting 'tb_invalidated_flag' before calling
tb_gen_code() and checking it afterwards, with 'tb_lock' held
throughout. To make sure this doesn't affect other code relying on the
value of the flag, we can simply preserve its previous value across the
'tb_lock' critical section.
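A minimal sketch of that idea, keeping the flag in 'tcg_ctx.tb_ctx'; the
'saved' local and the exact lock placement are illustrative, the rest
follows the current shape of cpu_exec_nocache():

    static void cpu_exec_nocache(CPUState *cpu, int max_cycles,
                                 TranslationBlock *orig_tb)
    {
        TranslationBlock *tb;
        bool saved;

        tb_lock();
        /* Remember the flag so other users of it are unaffected, then
         * reset it so a tb_flush() inside tb_gen_code() can be seen. */
        saved = tcg_ctx.tb_ctx.tb_invalidated_flag;
        tcg_ctx.tb_ctx.tb_invalidated_flag = false;

        tb = tb_gen_code(cpu, orig_tb->pc, orig_tb->cs_base,
                         orig_tb->flags, max_cycles | CF_NOCACHE);
        /* If tb_flush() ran, 'orig_tb' is gone and must not be
         * recorded for later invalidation. */
        tb->orig_tb = tcg_ctx.tb_ctx.tb_invalidated_flag ? NULL : orig_tb;

        /* Preserve the flag's previous value for everyone else. */
        tcg_ctx.tb_ctx.tb_invalidated_flag |= saved;
        tb_unlock();

        /* Execute the temporary TB, then throw it away as before. */
        cpu_tb_exec(cpu, tb->tc_ptr);

        tb_lock();
        tb_phys_invalidate(tb, -1);
        tb_free(tb);
        tb_unlock();
    }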
The cpu_exec() case is a bit more subtle. Regarding tb_phys_invalidate(),
it shouldn't be harmful if an invalidated TB gets patched, because it is
not going to be executed anymore; it is only a matter of efficiency, and
it doesn't seem to happen frequently anyway.
As for catching tb_flush() in cpu_exec(), three approaches have been
proposed.
The first approach is to get rid of 'tb_invalidated_flag' and use
'tb_flush_count' instead. Capture 'tb_flush_count' inside the 'tb_lock'
critical section of cpu_exec() and compare it on each execution loop
iteration before trying to do tb_add_jump(). This would be simple and
clear, but it would cost an extra load from the shared variable
'tb_flush_count' each time we go around the execution loop.
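A rough sketch of this first approach; 'flush_count', 'last_tb' and
'tb_exit' are illustrative locals, while 'tb_flush_count', tb_find_fast()
and tb_add_jump() are the existing names:

    /* In cpu_exec(), capture the count once under tb_lock. */
    tb_lock();
    unsigned flush_count = tcg_ctx.tb_ctx.tb_flush_count;
    tb_unlock();

    for (;;) {
        TranslationBlock *tb = tb_find_fast(cpu);

        /* Only chain TBs while no tb_flush() has happened since the
         * capture; the extra load of the shared counter on every
         * iteration is the cost of this scheme. */
        if (last_tb && tcg_ctx.tb_ctx.tb_flush_count == flush_count) {
            tb_add_jump(last_tb, tb_exit, tb);
        }

        /* ... execute 'tb', update 'last_tb' and 'tb_exit' ... */
    }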
The second approach is to make 'tb_invalidated_flag' per-CPU. This
would be conceptually similar to what we have, but would give us thread
safety. With this approach, we need to be careful to correctly clear and
set the flag.
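A sketch of this second approach; the per-CPU field and the helper are
hypothetical, and all accesses are assumed to happen under 'tb_lock':

    /* Hypothetically added to struct CPUState, protected by tb_lock:
     *     bool tb_invalidated_flag;
     */

    /* tb_flush() and tb_phys_invalidate() would raise the flag on
     * every vCPU, so that each vCPU notices the event independently
     * and can clear its own copy without racing the others. */
    static void tb_invalidated_flag_set_all(void)
    {
        CPUState *cpu;

        CPU_FOREACH(cpu) {
            cpu->tb_invalidated_flag = true;
        }
    }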
The third approach is to mark each individual TB as valid/invalid. This
is what Emilio has in his MTTCG series [2]. Following this approach, we
could have very clean code with no extra overhead on the hot path.
However, it would require marking all TBs as invalid on tb_flush().
Given that tb_flush() is rare, it shouldn't be a significant overhead.
Also, there are several options for how to mark a TB valid/invalid:
a dedicated flag could be introduced, or some invalid value of
pc/cs_base/flags could be used.
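A sketch of the third approach with a dedicated flag; the 'invalid'
field and the marking loop are the additions, the rest of tb_flush() is
elided:

    /* Hypothetically added to TranslationBlock:
     *     bool invalid;    (set when the TB must not be chained to)
     */

    void tb_flush(CPUState *cpu)
    {
        int i;

        /* Mark every live TB invalid so that a stale pointer held by
         * a vCPU is never patched into a jump chain again. */
        for (i = 0; i < tcg_ctx.tb_ctx.nb_tbs; i++) {
            tcg_ctx.tb_ctx.tbs[i].invalid = true;
        }

        /* ... existing flush work: reset code_gen_ptr, clear the
         * hash tables, bump tb_flush_count ... */
    }

The hot path in cpu_exec() would then reduce to checking 'tb->invalid'
before tb_add_jump(), with no shared flag involved.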
So the question is, what would be the most appropriate solution?
[1] http://lists.nongnu.org/archive/html/qemu-devel/2016-03/msg06180.html
[2] http://lists.nongnu.org/archive/html/qemu-devel/2015-08/msg02582.html
* Re: [Qemu-devel] tcg: reworking tb_invalidated_flag
From: Paolo Bonzini @ 2016-03-30 18:13 UTC
To: Sergey Fedorov, QEMU Developers, Richard Henderson,
Peter Crosthwaite
Cc: Sergey Fedorov, Alex Bennée
On 30/03/2016 19:08, Sergey Fedorov wrote:
> [...]
>
> The second approach is to make 'tb_invalidated_flag' per-CPU. This
> would be conceptually similar to what we have, but would give us thread
> safety. With this approach, we need to be careful to correctly clear and
> set the flag.
You can just ensure that setting and clearing it is done under tb_lock.
> The third approach is to mark each individual TB as valid/invalid. This
> is what Emilio has in his MTTCG series [2]. Following this approach, we
> could have very clean code with no extra overhead on the hot path.
> However, it would require marking all TBs as invalid on tb_flush().
> Given that tb_flush() is rare, it shouldn't be a significant overhead.
I agree; the problem with this solution is that it adds complexity to
tb_flush. Because TranslationBlocks live in tcg_ctx.tb_ctx.tbs you need
special code to exit all CPUs at tb_flush time, otherwise you risk that
a tb_alloc reuses a TranslationBlock while it is in use by a VCPU. This
would be complicated, hard to test, and only needed for a very rare
occurrence (*). Because tb_flush() is rare, I believe we should keep it
as simple as possible.
(*) Both Emilio and Fred have a similar "exit all CPUs"
primitive. Fred used it for tb_flush() and IIRC also
for some TLB flush primitives. However, Alvise has been
reworking the TLB flush code to use a "CPU halted" state
that's less prone to deadlocks.
> Also, there are several options for how to mark a TB valid/invalid:
> a dedicated flag could be introduced, or some invalid value of
> pc/cs_base/flags could be used.
This is probably necessary nevertheless in order to make
tb_find_physical run outside tb_lock.
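To illustrate, a sketch of the tb_find_physical() hash-chain walk
tolerating execution outside tb_lock, assuming the hypothetical per-TB
'invalid' flag discussed above (the other fields are as in the current
code):

    /* Walk the physical hash chain; skip TBs already marked invalid
     * so that a concurrent tb_phys_invalidate() or tb_flush() can
     * never hand out a stale match. */
    for (tb = tcg_ctx.tb_ctx.tb_phys_hash[h]; tb != NULL;
         tb = tb->phys_hash_next) {
        if (tb->pc == pc &&
            tb->page_addr[0] == phys_page1 &&
            tb->cs_base == cs_base &&
            tb->flags == flags &&
            !atomic_read(&tb->invalid)) {
            return tb;
        }
    }
    return NULL;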
Paolo
* Re: [Qemu-devel] tcg: reworking tb_invalidated_flag
From: Sergey Fedorov @ 2016-03-31 13:14 UTC
To: Paolo Bonzini, Sergey Fedorov, QEMU Developers, Richard Henderson,
Peter Crosthwaite
Cc: Alex Bennée
On 30/03/16 21:13, Paolo Bonzini wrote:
>
> On 30/03/2016 19:08, Sergey Fedorov wrote:
>> The second approach is to make 'tb_invalidated_flag' per-CPU. This
>> would be conceptually similar to what we have, but would give us thread
>> safety. With this approach, we need to be careful to correctly clear and
>> set the flag.
> You can just ensure that setting and clearing it is done under tb_lock.
So it could remain sitting in 'tcg_ctx.tb_ctx'. I'm just wondering,
what would be the real benefit of making it per-CPU then?
> Because TranslationBlocks live in tcg_ctx.tb_ctx.tbs you need
> special code to exit all CPUs at tb_flush time, otherwise you risk that
> a tb_alloc reuses a TranslationBlock while it is in use by a VCPU.
Looks like no matter which approach we use, it's ultimately necessary to
ensure all CPUs have exited from translated code before the translation
buffer may be safely flushed.
Kind regards,
Sergey
* Re: [Qemu-devel] tcg: reworking tb_invalidated_flag
From: Alex Bennée @ 2016-03-31 13:37 UTC
To: Sergey Fedorov
Cc: Paolo Bonzini, Peter Crosthwaite, QEMU Developers, Sergey Fedorov,
Richard Henderson
Sergey Fedorov <serge.fdrv@gmail.com> writes:
> [...]
> Looks like no matter which approach we use, it's ultimately necessary to
> ensure all CPUs have exited from translated code before the translation
> buffer may be safely flushed.
One approach would be to have multiple translation contexts with their
own buffers and then you can safely flush TBs if no vCPUs are currently
executing in those regions. But I suspect that is a much more complex
future optimisation.
Having said that, is it safe to flush TBs from a given page if we know
no vCPUs are currently executing in that page? As the execution loop has
to exit the chained TBs as we cross page boundaries, we could just keep
track of which vCPUs are currently in which page.
--
Alex Bennée
* Re: [Qemu-devel] tcg: reworking tb_invalidated_flag
From: Sergey Fedorov @ 2016-03-31 14:06 UTC
To: Alex Bennée
Cc: Paolo Bonzini, Peter Crosthwaite, QEMU Developers, Sergey Fedorov,
Richard Henderson
On 31/03/16 16:37, Alex Bennée wrote:
> Sergey Fedorov <serge.fdrv@gmail.com> writes:
>> Looks like no matter which approach we use, it's ultimately necessary to
>> ensure all CPUs have exited from translated code before the translation
>> buffer may be safely flushed.
> One approach would be to have multiple translation contexts with their
> own buffers and then you can safely flush TBs if no vCPUs are currently
> executing in those regions. But I suspect that is a much more complex
> future optimisation.
Yes, this is much more complex and its performance impact should be
investigated.
> Having said that, is it safe to flush TBs from a given page if we know
> no vCPUs are currently executing in that page? As the execution loop has
> to exit the chained TBs as we cross page boundaries, we could just keep
> track of which vCPUs are currently in which page.
It should be safe to invalidate a TB while some other CPU is executing
its translated code. But it must be guaranteed that no CPU executes any
old TB after tb_flush(), since we're going to start reusing those TBs.
I can see how a TB cannot be patched if it spans two pages; is there any
check on when a TCG goto_tb can be generated?
Kind regards,
Sergey
* Re: [Qemu-devel] tcg: reworking tb_invalidated_flag
From: Sergey Fedorov @ 2016-03-31 19:03 UTC
To: Alex Bennée
Cc: Paolo Bonzini, Peter Crosthwaite, QEMU Developers, Sergey Fedorov,
Richard Henderson
On 31/03/16 17:06, Sergey Fedorov wrote:
> It should be safe to invalidate a TB while some other CPU is executing
> its translated code.
Probably it's not safe to invalidate a TB while some other thread is
executing its translated code: direct jumps to the TB being invalidated
have to be reset, and with the direct jump method that means patching a
native jump instruction in the translated code. There are restrictions
on modifying concurrently executing code, e.g. see section 3.4, "Atomic
Modification of Machine-Code Instructions", in [1]. For instance, only
aligned 8-byte atomic code modifications are safe on AMD processors;
otherwise we can wind up executing a corrupted instruction stream. I
can't see the i386 TCG backend doing any alignment of the jump operand
when translating the goto_tb TCG op, and I suspect other TCG backends
have similar limitations.
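For reference, direct-jump patching in the i386 backend is essentially a
plain 32-bit store into the instruction stream (quoted from memory from
tcg/i386/tcg-target.h, so treat it as a sketch); nothing here aligns the
patched operand:

    static inline void tb_set_jmp_target1(uintptr_t jmp_addr,
                                          uintptr_t addr)
    {
        /* Patch the 32-bit displacement of an E9 'jmp rel32'.  The
         * store is only atomic for a concurrent executor if the four
         * bytes at jmp_addr happen to be suitably aligned, and
         * nothing here enforces that. */
        *(uint32_t *)jmp_addr = addr - (jmp_addr + 4);
        /* no need to flush icache explicitly */
    }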
Looks like we have to ensure all vCPUs are out of translated code when
patching TBs, whether in tb_add_jump() or tb_phys_invalidate(). Did I
miss something?
[1]
http://static.googleusercontent.com/media/research.google.com/en//pubs/archive/37204.pdf
Kind regards,
Sergey
* Re: [Qemu-devel] tcg: reworking tb_invalidated_flag
From: Paolo Bonzini @ 2016-03-31 19:56 UTC
To: Sergey Fedorov, Alex Bennée
Cc: Richard Henderson, QEMU Developers, Sergey Fedorov,
Peter Crosthwaite
On 31/03/2016 21:03, Sergey Fedorov wrote:
> Looks like we have to ensure all vCPUs are out of translated code when
> patching TBs, whether in tb_add_jump() or tb_phys_invalidate(). Did I
> miss something?
Almost all TCG targets have naturally aligned instructions, so that's
not a problem; we can assume that 32-bit writes are atomic, though
perhaps we should change them to atomic_set just to be safe.
Only s390 and x86 can have unaligned instructions. For x86 I suppose
you can emit one to three one-byte NOPs so that the first byte of the
jump ends up at ip%4=3. For s390 you can do the same; I don't know the
encoding of the canonical nop, but an "or 0,0" instruction would do and
is 16 bits wide (in this case instructions are 16-bit aligned, so you'd
want ip%4=2).
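A sketch of that idea for the goto_tb case in the i386 backend; the
alignment loop is the addition, the rest follows the existing code
(OPC_JMP_long, tcg_out8()/tcg_out32() and tb_jmp_offset are the
backend's existing names):

    case INDEX_op_goto_tb:
        if (s->tb_jmp_offset) {
            /* Direct jump method: pad with one-byte NOPs until the E9
             * opcode lands at ip%4 == 3, so that the rel32 operand
             * following it is 4-byte aligned and can be patched
             * atomically. */
            while (((uintptr_t)s->code_ptr & 3) != 3) {
                tcg_out8(s, 0x90);                  /* nop */
            }
            tcg_out8(s, OPC_JMP_long);              /* jmp rel32 */
            s->tb_jmp_offset[args[0]] = tcg_current_code_size(s);
            tcg_out32(s, 0);                        /* patched later */
        }
        break;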
Paolo
* Re: [Qemu-devel] tcg: reworking tb_invalidated_flag
From: Alex Bennée @ 2016-04-01 11:11 UTC
To: Sergey Fedorov
Cc: Paolo Bonzini, Peter Crosthwaite, QEMU Developers, Sergey Fedorov,
Richard Henderson
Sergey Fedorov <serge.fdrv@gmail.com> writes:
> [...]
> I can see how a TB cannot be patched if it spans two pages; is there any
> check on when a TCG goto_tb can be generated?
Do you mean tcg_gen_goto_tb?
AFAIUI all blocks end with goto_tb postambles, but they should only
directly jump to another TB if they are in the same page.
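The check lives in the front ends; roughly, the ARM translator's
gen_goto_tb() looks like this (reproduced from memory, so treat it as a
sketch):

    static inline void gen_goto_tb(DisasContext *s, int n,
                                   target_ulong dest)
    {
        TranslationBlock *tb = s->tb;

        if ((tb->pc & TARGET_PAGE_MASK) == (dest & TARGET_PAGE_MASK)) {
            /* Same guest page: a direct, patchable jump is allowed. */
            tcg_gen_goto_tb(n);
            gen_set_pc_im(s, dest);
            tcg_gen_exit_tb((uintptr_t)tb + n);
        } else {
            /* Page crossing: fall back to a plain exit from the TB. */
            gen_set_pc_im(s, dest);
            tcg_gen_exit_tb(0);
        }
    }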
--
Alex Bennée
* Re: [Qemu-devel] tcg: reworking tb_invalidated_flag
From: Sergey Fedorov @ 2016-04-01 11:23 UTC
To: Alex Bennée
Cc: Paolo Bonzini, Peter Crosthwaite, QEMU Developers, Sergey Fedorov,
Richard Henderson
On 01/04/16 14:11, Alex Bennée wrote:
> Sergey Fedorov <serge.fdrv@gmail.com> writes:
>
>> [...]
>> I can see how a TB cannot be patched if it spans two pages; is there any
>> check on when a TCG goto_tb can be generated?
> Do you mean tcg_gen_goto_tb?
>
> AFAIUI all blocks end with goto_tb postambles, but they should only
> directly jump to another TB if they are in the same page.
Thanks, I see the checks.
Regards,
Sergey
* Re: [Qemu-devel] tcg: reworking tb_invalidated_flag
From: Paolo Bonzini @ 2016-03-31 13:40 UTC
To: Sergey Fedorov, Sergey Fedorov, QEMU Developers,
Richard Henderson, Peter Crosthwaite
Cc: Alex Bennée
On 31/03/2016 15:14, Sergey Fedorov wrote:
>> [...]
>> You can just ensure that setting and clearing it is done under tb_lock.
>
> So it could remain sitting in 'tcg_ctx.tb_ctx'. I'm just wondering,
> what would be the real benefit of making it per-CPU then?
All CPUs need to observe it in order to clear their own local next_tb
variable. It is not enough to do that once, so it has to be per-CPU.
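So the consumer side in the cpu_exec() loop would look roughly like
this, with the hypothetical per-CPU field and 'next_tb' being the
loop-local variable that encodes the previously executed TB:

    tb_lock();
    tb = tb_find_fast(cpu);
    if (cpu->tb_invalidated_flag) {
        /* A flush or invalidation happened since this vCPU last
         * looked: forget the previous TB instead of chaining to it,
         * and clear only this vCPU's flag. */
        next_tb = 0;
        cpu->tb_invalidated_flag = false;
    }
    if (next_tb != 0 && tb->page_addr[1] == -1) {
        tb_add_jump((TranslationBlock *)(next_tb & ~TB_EXIT_MASK),
                    next_tb & TB_EXIT_MASK, tb);
    }
    tb_unlock();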
>> Because TranslationBlocks live in tcg_ctx.tb_ctx.tbs you need
>> special code to exit all CPUs at tb_flush time, otherwise you risk that
>> a tb_alloc reuses a TranslationBlock while it is in use by a VCPU.
>
> Looks like no matter which approach we use, it's ultimately necessary to
> ensure all CPUs have exited from translated code before the translation
> buffer may be safely flushed.
My plan was to use some kind of double buffering, where only half of
code_gen_buffer is in use. At the end of tb_flush you call cpu_exit()
on all CPUs, so that CPUs stop executing chained TBs from the old half
before they can see one from the new half.
If code_gen_buffer is static you have to preallocate two buffers (and
two tbs arrays) and waste one of them; while it is theoretically
possible to have CPUs still executing from the old half while you finish
the new half, it can be more or less ignored.
If it is dynamic, the previously used areas can be freed with call_rcu,
and you can safely allocate a new code_gen_buffer and tbs array.
I haven't thought much about it; it might require keeping a cache of the
tbs array per CPU, and possibly changing the code under "if
(tcg_ctx.tb_ctx.tb_invalidated_flag)" to simply exit cpu_exec.
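A very rough sketch of the dynamic variant; every name here is
hypothetical, and it assumes the region object embeds a 'struct
rcu_head' and that tb_lock is held:

    typedef struct CodeGenRegion {
        struct rcu_head rcu;
        void *code_gen_buffer;
        TranslationBlock *tbs;
        /* ... sizes, code_gen_ptr, nb_tbs ... */
    } CodeGenRegion;

    static void tb_flush_by_switching_region(void)
    {
        CPUState *cpu;
        CodeGenRegion *old = active_region;

        /* New translations go to a fresh buffer and tbs array. */
        active_region = code_gen_region_new();

        /* Kick every vCPU out of translated code so nobody keeps
         * running or chaining TBs from the old region. */
        CPU_FOREACH(cpu) {
            cpu_exit(cpu);
        }

        /* Reclaim the old region once all RCU readers are done. */
        call_rcu(old, code_gen_region_free, rcu);
    }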
Paolo
* Re: [Qemu-devel] tcg: reworking tb_invalidated_flag
From: Sergey Fedorov @ 2016-03-31 14:35 UTC
To: Paolo Bonzini, Sergey Fedorov, QEMU Developers, Richard Henderson,
Peter Crosthwaite
Cc: Alex Bennée
On 31/03/16 16:40, Paolo Bonzini wrote:
>
>> [...]
>> So it could remain sitting in 'tcg_ctx.tb_ctx'. I'm just wondering,
>> what would be the real benefit of making it per-CPU then?
> All CPUs need to observe it in order to clear their own local next_tb
> variable. It is not enough to do that once, so it has to be per-CPU.
So for each vCPU thread we'd have a separate flag that it can clear
safely. Got it, thanks.
>
>>> Because TranslationBlocks live in tcg_ctx.tb_ctx.tbs you need
>>> special code to exit all CPUs at tb_flush time, otherwise you risk that
>>> a tb_alloc reuses a TranslationBlock while it is in use by a VCPU.
>> Looks like no matter which approach we use, it's ultimately necessary to
>> ensure all CPUs have exited from translated code before the translation
>> buffer may be safely flushed.
> My plan was to use some kind of double buffering, where only half of
> code_gen_buffer is in use. At the end of tb_flush you call cpu_exit()
> on all CPUs, so that CPUs stop executing chained TBs from the old half
> before they can see one from the new half.
>
> If code_gen_buffer is static you have to preallocate two buffers (and
> two tbs arrays) and waste one of them; while it is theoretically
> possible to have CPUs still executing from the old half while you finish
> the new half, it can be more or less ignored.
>
> If it is dynamic, the previously used areas can be freed with call_rcu,
> and you can safely allocate a new code_gen_buffer and tbs array.
>
> I haven't thought much about it; it might require keeping a cache of the
> tbs array per CPU, and possibly changing the code under "if
> (tcg_ctx.tb_ctx.tb_invalidated_flag)" to simply exit cpu_exec.
Maybe save this idea for later? :) We'd better use a simpler approach
first and then move on and optimize. BTW, a few years ago I came across
an interesting paper on code cache eviction granularities [1].
[1]
http://www.cs.virginia.edu/kim/courses/cs851/papers/hazelwood04mediumgrained.pdf
Kind regards,
Sergey
* Re: [Qemu-devel] tcg: reworking tb_invalidated_flag
From: Paolo Bonzini @ 2016-03-31 14:44 UTC
To: Sergey Fedorov, Sergey Fedorov, QEMU Developers,
Richard Henderson, Peter Crosthwaite
Cc: Alex Bennée
On 31/03/2016 16:35, Sergey Fedorov wrote:
> [...]
> Maybe save this idea for later? :) We'd better use a simpler approach
> first and then move on and optimize. BTW, a few years ago I came across
> an interesting paper on code cache eviction granularities [1].
It depends on what is simpler. Emilio and Fred's code was tricky and
touched the central CPU execution loop in cpus.c.
Paolo
* Re: [Qemu-devel] tcg: reworking tb_invalidated_flag
From: Richard Henderson @ 2016-03-30 19:08 UTC
To: Sergey Fedorov, QEMU Developers, Paolo Bonzini, Peter Crosthwaite
Cc: Sergey Fedorov, Alex Bennée
On 03/30/2016 10:08 AM, Sergey Fedorov wrote:
> [...]
> The third approach is to mark each individual TB as valid/invalid. This
> is what Emilio has in his MTTCG series [2]. Following this approach, we
> could have very clean code with no extra overhead on the hot path.
> However, it would require marking all TBs as invalid on tb_flush().
> Given that tb_flush() is rare, it shouldn't be a significant overhead.
> Also, there are several options for how to mark a TB valid/invalid:
> a dedicated flag could be introduced, or some invalid value of
> pc/cs_base/flags could be used.
I'm not really fond of this third option. Yes, tb_flush is rare on some
targets, but on targets with software-managed TLBs flushes are much more
common; see e.g. mips and sparc.
Even when tb_flush is rare, there can be 500k-1M TBs live when the flush does
happen. I simply cannot imagine that individually touching 1M variables
performs as well as setting one global variable, or taking a global lock.
r~
* Re: [Qemu-devel] tcg: reworking tb_invalidated_flag
From: Sergey Fedorov @ 2016-03-30 21:21 UTC
To: Richard Henderson, Sergey Fedorov, QEMU Developers, Paolo Bonzini,
Peter Crosthwaite
Cc: Alex Bennée
On 30/03/16 22:08, Richard Henderson wrote:
> On 03/30/2016 10:08 AM, Sergey Fedorov wrote:
>> [...]
>> The third approach is to mark each individual TB as valid/invalid. This
>> is what Emilio has in his MTTCG series [2]. Following this approach, we
>> could have very clean code with no extra overhead on the hot path.
>> However, it would require marking all TBs as invalid on tb_flush().
>> Given that tb_flush() is rare, it shouldn't be a significant overhead.
>> Also, there are several options for how to mark a TB valid/invalid:
>> a dedicated flag could be introduced, or some invalid value of
>> pc/cs_base/flags could be used.
> I'm not really fond of this third option. Yes, tb_flush is rare on some
> targets, but on targets with software-managed TLBs flushes are much more
> common; see e.g. mips and sparc.
>
> Even when tb_flush is rare, there can be 500k-1M TBs live when the flush does
> happen. I simply cannot imagine that individually touching 1M variables
> performs as well as setting one global variable, or taking a global lock.
Ah, you are right, I missed that fact, thanks.
Kind regards,
Sergey
* Re: [Qemu-devel] tcg: reworking tb_invalidated_flag
From: Alex Bennée @ 2016-03-31 10:48 UTC
To: Sergey Fedorov
Cc: Paolo Bonzini, Sergey Fedorov, Peter Crosthwaite, QEMU Developers,
Richard Henderson
Sergey Fedorov <sergey.fedorov@linaro.org> writes:
> Hi,
>
> This is a follow-up to the discussion in [1]. The main focus is to move
> towards thread-safe TB invalidation and translation buffer flush. In
> addition, we can get cleaner, more readable and more reliable code.
>
> First, I'd like to summarize how 'tb_invalidated_flag' is used.
> Basically, 'tb_invalidated_flag' is used to catch two events:
> * some TB has been invalidated by tb_phys_invalidate();
> * the whole translation buffer has been flushed by tb_flush().
I know we are system-focused at the moment, but does linux-user ever
flush groups of TBs, say when mappings change? Or does this trigger a
whole tb_flush?
> [...]
> The third approach is to mark each individual TB as valid/invalid. This
> is what Emilio has in his MTTCG series [2]. Following this approach, we
> could have very clean code with no extra overhead on the hot path.
> However, it would require marking all TBs as invalid on tb_flush().
> Given that tb_flush() is rare, it shouldn't be a significant overhead.
I'm with Richard on this: it sounds like something that should be
quantified. That said, I'm sure there are mitigations, at the cost of
some complexity, that could help.
Are there times when it might help with eliminating whole blocks of TBs
at a time? Or is a page at a time about it?
> [...]
--
Alex Bennée
* Re: [Qemu-devel] tcg: reworking tb_invalidated_flag
From: Sergey Fedorov @ 2016-03-31 12:42 UTC
To: Alex Bennée, Sergey Fedorov
Cc: Paolo Bonzini, Peter Crosthwaite, QEMU Developers,
Richard Henderson
On 31/03/16 13:48, Alex Bennée wrote:
> I know we are system-focused at the moment, but does linux-user ever
> flush groups of TBs, say when mappings change? Or does this trigger a
> whole tb_flush?
Yes, e.g. target_mmap() calls tb_invalidate_phys_range().
Kind regards,
Sergey
* Re: [Qemu-devel] tcg: reworking tb_invalidated_flag
From: Richard Henderson @ 2016-03-31 16:25 UTC
To: Sergey Fedorov, Alex Bennée, Sergey Fedorov
Cc: Paolo Bonzini, QEMU Developers, Peter Crosthwaite
On 03/31/2016 05:42 AM, Sergey Fedorov wrote:
> On 31/03/16 13:48, Alex Bennée wrote:
>> I know we are system-focused at the moment, but does linux-user ever
>> flush groups of TBs, say when mappings change? Or does this trigger a
>> whole tb_flush?
>
> Yes, e.g. target_mmap() calls tb_invalidate_phys_range().
Mapping over executable code is exceedingly rare, however.
We could do just as well with a full flush, if that's simpler.
r~