From mboxrd@z Thu Jan  1 00:00:00 1970
Received: from eggs.gnu.org ([2001:4830:134:3::10]:57523)
	by lists.gnu.org with esmtp (Exim 4.71)
	(envelope-from <fred.konrad@greensocs.com>) id 1Yrq9g-0000yG-9K
	for qemu-devel@nongnu.org; Mon, 11 May 2015 12:02:12 -0400
Received: from Debian-exim by eggs.gnu.org with spam-scanned (Exim 4.71)
	(envelope-from <fred.konrad@greensocs.com>) id 1Yrq9b-000723-55
	for qemu-devel@nongnu.org; Mon, 11 May 2015 12:02:08 -0400
Received: from greensocs.com ([193.104.36.180]:39449)
	by eggs.gnu.org with esmtp (Exim 4.71)
	(envelope-from <fred.konrad@greensocs.com>) id 1Yrq9a-00070W-Sg
	for qemu-devel@nongnu.org; Mon, 11 May 2015 12:02:03 -0400
Message-ID: <5550D275.7010807@greensocs.com>
Date: Mon, 11 May 2015 18:01:57 +0200
From: Frederic Konrad <fred.konrad@greensocs.com>
MIME-Version: 1.0
References: <1431118934-20900-1-git-send-email-cota@braap.org>
In-Reply-To: <1431118934-20900-1-git-send-email-cota@braap.org>
Content-Type: text/plain; charset=ISO-8859-1; format=flowed
Content-Transfer-Encoding: 7bit
Subject: Re: [Qemu-devel] [RFC 0/8] Helper-based Atomic Instruction
	Emulation (AIE)
List-Id: <qemu-devel.nongnu.org>
List-Unsubscribe: <https://lists.nongnu.org/mailman/options/qemu-devel>,
	<mailto:qemu-devel-request@nongnu.org?subject=unsubscribe>
List-Archive: <http://lists.nongnu.org/archive/html/qemu-devel>
List-Post: <mailto:qemu-devel@nongnu.org>
List-Help: <mailto:qemu-devel-request@nongnu.org?subject=help>
List-Subscribe: <https://lists.nongnu.org/mailman/listinfo/qemu-devel>,
	<mailto:qemu-devel-request@nongnu.org?subject=subscribe>
To: "Emilio G. Cota" <cota@braap.org>, qemu-devel@nongnu.org
Cc: mttcg@listserver.greensocs.com, Peter Maydell <peter.maydell@linaro.org>, Alvise Rigo <a.rigo@virtualopensystems.com>, Paolo Bonzini <pbonzini@redhat.com>, alex.bennee@linaro.org, Richard Henderson <rth@twiddle.net>

On 08/05/2015 23:02, Emilio G. Cota wrote:
> Hi all,
>
> These are patches I've been working on for some time now.
> Since emulation of atomic instructions has recently been getting
> attention ([1], [2]), I'm submitting them for comment.
>
> [1] http://thread.gmane.org/gmane.comp.emulators.qemu/314406
> [2] http://thread.gmane.org/gmane.comp.emulators.qemu/334561
>
> Main features of this design:
>
> - Performance and scalability are the main design goals: guest code should
>    scale as much as it would scale running natively on the host.
>
>    For this, a host lock is (if necessary) assigned to each 16-byte
>    aligned chunk of the physical address space.
>    The assignment (i.e. lock allocation + registration) only happens
>    after an atomic operation on a particular physical address
>    is performed. To keep track of this sparse set of locks,
>    a lockless radix tree is used, so lookups are fast and scalable.
>
> - Translation helpers are employed to call the 'aie' module, which is
>    the common code that accesses the radix tree, locking the appropriate
>    entry depending on the access' physical address.
>
> - No special host atomic instructions (e.g. cmpxchg16b) are required;
>    mutexes and include/qemu/atomic.h are all that's needed.
>
> - Usermode and full-system are supported with the same code. Note that
>    the newly-added tiny_set module is necessary to properly emulate LL/SC,
>    since the number of "cpus" (i.e. threads) is unbounded in usermode--
>    for full-system mode a bitmap would have been sufficient.
>
> - ARM: Stores concurrent with LL/SC primitives are initially not dealt
>    with.
>    This is my choice, since I'm assuming most sane code will only
>    handle data atomically using LL/SC primitives. However, SWP can
>    be used, so whenever a SWP instruction is issued, stores start checking
>    that they do not clash with concurrent SWP instructions. This is
>    implemented via pre/post-store helpers. I've stress-tested this with a
>    heavily contended guest lock (64 cores), and it works fine. Executing
>    non-trivial pre/post-store helpers adds a 5% perf overhead to Linux
>    bootup, and is negligible on regular programs. Anyway, most
>    sane code doesn't use SWP (Linux bootup certainly doesn't), so this
>    overhead is rarely seen.
>
> - x86: Instead of acquiring the same host lock every time LOCK is found,
>    the acquisition of an AIE lock (via the radix tree) is done when the
>    address of the ensuing load/store is known.
>    Loads perform this check at compile-time.
>    Stores are emulated using the same trick as in ARM; non-atomic stores
>    are executed as atomic stores iff there's a prior atomic operation that
>    has been executed on their target address. This for instance ensures
>    that a regular store cannot race with a cmpxchg.
>    This has very small overhead (negligible with OpenSSL's bntest in
>    user-only), and scales as native code.
>
> - Barriers: not emulated yet. They're needed to correctly run non-trivial
>    lockless code (I'm using concurrencykit's testbenches).
>    The strongly-ordered-guest-on-weakly-ordered-host problem remains; my
>    guess is that we'll have to sacrifice single-threaded performance to
>    make it work (e.g. using pre-post ld/st helpers).
>
> - 64-bit guest on 32-bit host: Not supported yet. Note that 64-bit
>    loads/stores on a 32-bit host are not atomic, yet 64-bit code might
>    have been written assuming that they are. Checks for this will be needed.
>
> - Other ISAs: not done yet, but they should be like either ARM or x86.
>
> - License of new files: is there a preferred license for new code?
I think it's GPLv2 or later.

Fred

>
> - Please tolerate the lack of comments in code and commit logs; when
>    preparing this RFC I thought it better to put all the info
>    here. If this weren't an RFC I'd have done it differently.
>
> Thanks for reading this far, comments welcome!
>
> 		Emilio
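
For readers following the thread, the per-chunk locking idea quoted above
can be sketched roughly as follows. This is only an illustration under
stated assumptions, not the code from the patches: the names
(chunk_lock_get, emulated_cmpxchg32, emulated_store32, TABLE_SIZE) are
hypothetical, and a small hash table guarded by one global mutex stands in
for the lockless radix tree the cover letter describes. It shows a lock
being allocated lazily for a 16-byte aligned chunk on the first atomic
operation, and a regular store taking that lock only if the chunk has
already seen an atomic operation.

```c
#include <assert.h>
#include <pthread.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>
#include <stdlib.h>

#define CHUNK_SHIFT 4     /* 16-byte aligned chunks, as in the cover letter */
#define TABLE_SIZE  1024  /* hypothetical bucket count, power of two */

/* One host lock per guest-physical chunk that has seen an atomic op. */
struct chunk_lock {
    uint64_t chunk;             /* paddr >> CHUNK_SHIFT */
    pthread_mutex_t mutex;
    struct chunk_lock *next;    /* bucket chain */
};

/* Assumption: a mutex-protected hash table replaces the lockless radix tree. */
static struct chunk_lock *table[TABLE_SIZE];
static pthread_mutex_t table_mutex = PTHREAD_MUTEX_INITIALIZER;

/* Return the lock covering paddr, or NULL if none exists and !create. */
static struct chunk_lock *chunk_lock_get(uint64_t paddr, bool create)
{
    uint64_t chunk = paddr >> CHUNK_SHIFT;
    size_t bucket = (size_t)(chunk & (TABLE_SIZE - 1));
    struct chunk_lock *cl;

    pthread_mutex_lock(&table_mutex);
    for (cl = table[bucket]; cl != NULL; cl = cl->next) {
        if (cl->chunk == chunk) {
            break;
        }
    }
    if (cl == NULL && create) {
        /* Lazy allocation + registration on the first atomic access. */
        cl = malloc(sizeof(*cl));
        assert(cl != NULL);
        cl->chunk = chunk;
        pthread_mutex_init(&cl->mutex, NULL);
        cl->next = table[bucket];
        table[bucket] = cl;
    }
    pthread_mutex_unlock(&table_mutex);
    return cl;
}

/* Emulated 32-bit compare-and-swap: always goes through the chunk lock. */
static bool emulated_cmpxchg32(uint64_t paddr, uint32_t *host,
                               uint32_t expected, uint32_t desired)
{
    struct chunk_lock *cl = chunk_lock_get(paddr, true);
    bool ok;

    pthread_mutex_lock(&cl->mutex);
    ok = (*host == expected);
    if (ok) {
        *host = desired;
    }
    pthread_mutex_unlock(&cl->mutex);
    return ok;
}

/*
 * Regular store: locked only if the chunk already has a lock.
 * Returns true when the locked (slow) path was taken.
 */
static bool emulated_store32(uint64_t paddr, uint32_t *host, uint32_t val)
{
    struct chunk_lock *cl = chunk_lock_get(paddr, false);

    if (cl == NULL) {
        *host = val;            /* no prior atomic op on this chunk */
        return false;
    }
    pthread_mutex_lock(&cl->mutex);
    *host = val;
    pthread_mutex_unlock(&cl->mutex);
    return true;
}
```

In this sketch, a store to 0x1008 takes the locked path once a cmpxchg has
touched 0x1004, because both addresses fall in the same 16-byte chunk. A
store to 0x1010 would still take the plain path. The sketch ignores the
window in which a lock is created while a plain store is in flight; a real
implementation would have to close that window.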