Message-ID: <5550D275.7010807@greensocs.com>
Date: Mon, 11 May 2015 18:01:57 +0200
From: Frederic Konrad
References: <1431118934-20900-1-git-send-email-cota@braap.org>
In-Reply-To: <1431118934-20900-1-git-send-email-cota@braap.org>
Subject: Re: [Qemu-devel] [RFC 0/8] Helper-based Atomic Instruction Emulation (AIE)
To: "Emilio G. Cota", qemu-devel@nongnu.org
Cc: mttcg@listserver.greensocs.com, Peter Maydell, Alvise Rigo, Paolo Bonzini, alex.bennee@linaro.org, Richard Henderson

On 08/05/2015 23:02, Emilio G. Cota wrote:
> Hi all,
>
> These are patches I've been working on for some time now.
> Since emulation of atomic instructions is recently getting
> attention ([1], [2]), I'm submitting them for comment.
>
> [1] http://thread.gmane.org/gmane.comp.emulators.qemu/314406
> [2] http://thread.gmane.org/gmane.comp.emulators.qemu/334561
>
> Main features of this design:
>
> - Performance and scalability are the main design goals: guest code should
>   scale as much as it would scale running natively on the host.
>
>   For this, a host lock is (if necessary) assigned to each 16-byte
>   aligned chunk of the physical address space.
>   The assignment (i.e. lock allocation + registration) only happens
>   after an atomic operation on a particular physical address
>   is performed. To keep track of this sparse set of locks,
>   a lockless radix tree is used, so lookups are fast and scalable.
>
> - Translation helpers are employed to call the 'aie' module, which is
>   the common code that accesses the radix tree, locking the appropriate
>   entry depending on the access' physical address.
>
> - No special host atomic instructions (e.g. cmpxchg16b) are required;
>   mutexes and include/qemu/atomic.h are all that's needed.
>
> - Usermode and full-system are supported with the same code. Note that
>   the newly-added tiny_set module is necessary to properly emulate LL/SC,
>   since the number of "cpus" (i.e. threads) is unbounded in usermode--
>   for full-system mode a bitmap would have been sufficient.
>
> - ARM: Stores concurrent with LL/SC primitives are initially not dealt
>   with.
>   This is my choice, since I'm assuming most sane code will only
>   handle data atomically using LL/SC primitives. However, SWP can
>   be used, so whenever a SWP instruction is issued, stores start checking
>   that they do not clash with concurrent SWP instructions. This is
>   implemented via pre/post-store helpers. I've stress-tested this with a
>   heavily contended guest lock (64 cores), and it works fine. Executing
>   non-trivial pre/post-store helpers adds a 5% perf overhead to Linux
>   bootup, and is negligible on regular programs. Anyway, most
>   sane code doesn't use SWP (Linux bootup certainly doesn't), so this
>   overhead is rarely seen.
>
> - x86: Instead of acquiring the same host lock every time LOCK is found,
>   the acquisition of an AIE lock (via the radix tree) is done when the
>   address of the ensuing load/store is known.
>   Loads perform this check at compile-time.
>   Stores are emulated using the same trick as in ARM; non-atomic stores
>   are executed as atomic stores iff there's a prior atomic operation that
>   has been executed on their target address. This for instance ensures
>   that a regular store cannot race with a cmpxchg.
>   This has very small overhead (negligible with OpenSSL's bntest in
>   user-only), and scales as native code.
>
> - Barriers: not emulated yet. They're needed to correctly run non-trivial
>   lockless code (I'm using concurrencykit's testbenches).
>   The strongly-ordered-guest-on-weakly-ordered-host problem remains; my
>   guess is that we'll have to sacrifice single-threaded performance to
>   make it work (e.g. using pre-post ld/st helpers).
>
> - 64-bit guest on 32-bit host: Not supported yet. Note that 64-bit
>   loads/stores on a 32-bit guest are not atomic, yet 64-bit code might
>   have been written assuming that they are. Checks for this will be needed.
>
> - Other ISAs: not done yet, but they should be like either ARM or x86.
>
> - License of new files: is there a preferred license for new code?

I think it's GPL v2 or later.

Fred

>
> - Please tolerate the lack of comments in code and commit logs; when
>   preparing this RFC I thought it better to put all the info
>   here. If this wasn't an RFC I'd have done it differently.
>
> Thanks for reading this far, comments welcome!
>
> Emilio
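
For readers skimming the thread, the per-chunk locking idea quoted above (one lock per 16-byte aligned chunk, assigned only after the first atomic operation, with plain stores taking the lock iff the chunk has already seen an atomic access) could be sketched roughly as below. This is not code from the patch series. Every name here (aie_entry, aie_index, aie_cmpxchg32, aie_store32, aie_is_registered) is hypothetical, a small direct-mapped table stands in for the lockless radix tree, and a C11 spin flag stands in for the host mutex.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <stdint.h>

#define AIE_CHUNK_SHIFT 4       /* 16-byte aligned chunks, as in the RFC text */
#define AIE_TABLE_SIZE  1024u   /* hypothetical size; the RFC uses a radix tree */

static_assert((AIE_TABLE_SIZE & (AIE_TABLE_SIZE - 1)) == 0,
              "table size must be a power of two");

struct aie_entry {
    atomic_bool locked;      /* stand-in for a host mutex */
    atomic_bool registered;  /* set once an atomic op has touched the chunk */
};

/* Static storage is zero-initialized: unlocked and unregistered. */
static struct aie_entry aie_table[AIE_TABLE_SIZE];

/* Map a guest physical address to the entry covering its 16-byte chunk.
 * Distinct chunks may alias to one entry, which only adds contention. */
static unsigned aie_index(uint64_t paddr)
{
    return (unsigned)((paddr >> AIE_CHUNK_SHIFT) & (AIE_TABLE_SIZE - 1));
}

static void aie_lock(struct aie_entry *e)
{
    while (atomic_exchange_explicit(&e->locked, true, memory_order_acquire)) {
        /* spin until the current holder releases the entry */
    }
}

static void aie_unlock(struct aie_entry *e)
{
    atomic_store_explicit(&e->locked, false, memory_order_release);
}

static bool aie_is_registered(uint64_t paddr)
{
    return atomic_load(&aie_table[aie_index(paddr)].registered);
}

/* Emulated 32-bit compare-and-swap: registers the chunk, then performs
 * the read-modify-write under the chunk's lock. Returns the old value. */
static uint32_t aie_cmpxchg32(uint32_t *host_ptr, uint64_t paddr,
                              uint32_t expected, uint32_t desired)
{
    struct aie_entry *e = &aie_table[aie_index(paddr)];
    uint32_t old;

    atomic_store(&e->registered, true);
    aie_lock(e);
    old = *host_ptr;
    if (old == expected) {
        *host_ptr = desired;
    }
    aie_unlock(e);
    return old;
}

/* Regular store: takes the lock only if the chunk is already registered.
 * Simplification: a store that observes "unregistered" can still overlap
 * with the very first atomic op on that chunk; this sketch ignores that. */
static void aie_store32(uint32_t *host_ptr, uint64_t paddr, uint32_t val)
{
    struct aie_entry *e = &aie_table[aie_index(paddr)];

    if (atomic_load(&e->registered)) {
        aie_lock(e);
        *host_ptr = val;
        aie_unlock(e);
    } else {
        *host_ptr = val;
    }
}
```

A translation helper would call aie_cmpxchg32() for an emulated atomic instruction and aie_store32() from the store path. The direct-mapped table keeps the sketch short; it trades the sparse, collision-free lookup of a radix tree for occasional false sharing of locks between unrelated chunks.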