From: Alex Bennée
Date: Mon, 04 Apr 2016 09:54:14 +0100
Message-ID: <87y48tu53d.fsf@linaro.org>
Subject: [Qemu-devel] Should we introduce a TranslationRegion with its own codegen buffer?
To: QEMU Developers, MTTCG Devel
Cc: Peter Maydell, Sergey Fedorov, Richard Henderson, Emilio G. Cota <cota@braap.org>, Paolo Bonzini

Hi,

While reviewing the recent TB patching cleanup patches I wondered if there is a cleaner way of handling TB invalidation.

Currently we have a single code generation buffer which slowly fills up with TBs as we execute code. These TBs are chained together if they exist in the same physical page (so we always exit to the run-loop when crossing a page boundary). We hold a bunch of extra information in the TBs to facilitate looking things up. We have:

  struct TranslationBlock *phys_hash_next;

to facilitate looking up TBs which have matching hashes in the physical address lookup. We also have:

  uintptr_t jmp_list_next[2];
  uintptr_t jmp_list_first;

which are used for unwinding any jump patching should we invalidate a page, so that no code is left jumping to potentially invalid translations.

We also have a number of associated jump caches, held against each CPU, which are used to optimise re-entry into generated code as we go round the main run-loop. These also have to be cleanly invalidated as TBs are marked invalid.

Finally, as TBs are generated on demand, the actual code may not be locally jump-able, which makes atomic patching of the jumps trickier to do.

TB invalidation is almost always due to page mapping changes, although self-modifying code and debugging are also causes for throwing away translations.

I'm wondering if it is time to add a layer of indirection to simplify the process. If we introduce a TranslationRegion, which could initially cover a page's worth of code, it would have its own code generation buffer protected by RCU, making it easy to swap out code buffers in a clean manner:

Normal Execution (cpu_exec):
  - lookup TranslationRegion
  - take RCU read lock
  - lookup-or-generate TB
  - jump into code
  - exit TB
  - release RCU read lock

Invalidation of Page:
  - lookup TranslationRegion
  - take RCU write lock
  - create fresh empty region
  - signal cpu_exit to all vCPUs
  - release RCU write lock
  - take RCU read lock
  - lookup-or-generate TB
  - jump into code
  - exit TB
  - release RCU read lock*

* when the last vCPU releases the read lock on the old code it can be cleanly thrown away. No fiddly jump patching required.
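To make the shape of this more concrete, here is a rough sketch of what the read side could look like using the RCU primitives we already have (rcu_read_lock/unlock, atomic_rcu_read). The TranslationRegion structure and region_lookup_or_generate are of course invented names for illustration, and cpu_tb_exec stands in for however we actually enter the generated code:

  /* Hypothetical per-region structure; the fields are illustrative only. */
  typedef struct TranslationRegion {
      tb_page_addr_t base;        /* start of the covered range */
      size_t size;                /* initially TARGET_PAGE_SIZE */
      void *code_gen_buffer;      /* region-private codegen buffer */
      struct rcu_head rcu;        /* lets stale regions be reclaimed */
  } TranslationRegion;

  /* Normal execution: hold the RCU read lock across the TB lookup and
   * the time spent inside generated code, so a concurrent invalidation
   * can never free the code buffer under our feet. */
  static void region_exec(CPUState *cpu, TranslationRegion **slot)
  {
      rcu_read_lock();
      TranslationRegion *region = atomic_rcu_read(slot);
      TranslationBlock *tb = region_lookup_or_generate(region, cpu);
      cpu_tb_exec(cpu, tb);       /* jump into code, run until TB exit */
      rcu_read_unlock();
  }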
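The invalidation side then becomes a pointer swap plus a deferred free. Again this is only a sketch, with region_alloc/region_free as invented names, but atomic_rcu_set, CPU_FOREACH, cpu_exit and call_rcu are all things we have today:

  /* Invalidation: publish a fresh empty region, kick every vCPU back to
   * the run-loop, and let RCU reclaim the old code buffer once the last
   * reader has dropped its read lock. No jump patching required. */
  static void region_invalidate(TranslationRegion **slot)
  {
      TranslationRegion *old = *slot;   /* caller holds the write side */
      TranslationRegion *fresh = region_alloc(old->base, old->size);

      atomic_rcu_set(slot, fresh);      /* new lookups see the empty region */

      CPUState *cpu;
      CPU_FOREACH(cpu) {
          cpu_exit(cpu);                /* force each vCPU out of old code */
      }

      /* region_free() runs only once every reader is done with 'old'. */
      call_rcu(old, region_free, rcu);
  }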
There are some potential optimisations that could be made to this scheme as well. For one, there is no reason the area covered by a TranslationRegion has to be a page: the kernel segment, for example, will never change once mapped, so all of its internal TBs could still be chained together. Jump patching would also become easier on backends with limited jump ranges, as local code is kept together in a shared code buffer.

I'm sure there is also scope for localising the jump cache to regions, as there are likely to be only a few entry points to any given page, with the rest all being internal branches for loops and conditionals.

The only problem I can currently think of is potential heap fragmentation caused by having a large number of code buffers. This could probably be ameliorated by using custom allocation routines for the code buffers.

I'm going to have a bit of a play to see what this sort of solution would look like in the code, but I thought I'd sketch the idea out first to see if there are any obvious glaring holes or other things to consider.

Thoughts, objections? Discuss ;-)

--
Alex Bennée