From mboxrd@z Thu Jan  1 00:00:00 1970
Received: from eggs.gnu.org ([2001:4830:134:3::10]:55224)
	by lists.gnu.org with esmtp (Exim 4.71)
	(envelope-from <paolo.bonzini@gmail.com>) id 1ZP8vr-00087u-0N
	for qemu-devel@nongnu.org; Tue, 11 Aug 2015 08:45:33 -0400
Received: from Debian-exim by eggs.gnu.org with spam-scanned (Exim 4.71)
	(envelope-from <paolo.bonzini@gmail.com>) id 1ZP8vm-0005H5-Ne
	for qemu-devel@nongnu.org; Tue, 11 Aug 2015 08:45:30 -0400
Received: from mail-wi0-x229.google.com ([2a00:1450:400c:c05::229]:38407)
	by eggs.gnu.org with esmtp (Exim 4.71)
	(envelope-from <paolo.bonzini@gmail.com>) id 1ZP8vm-0005GU-CK
	for qemu-devel@nongnu.org; Tue, 11 Aug 2015 08:45:26 -0400
Received: by wicja10 with SMTP id ja10so67309860wic.1
	for <qemu-devel@nongnu.org>; Tue, 11 Aug 2015 05:45:25 -0700 (PDT)
Sender: Paolo Bonzini <paolo.bonzini@gmail.com>
References: <1439220437-23957-1-git-send-email-fred.konrad@greensocs.com>
From: Paolo Bonzini <pbonzini@redhat.com>
Message-ID: <55C9EE60.80004@redhat.com>
Date: Tue, 11 Aug 2015 14:45:20 +0200
MIME-Version: 1.0
In-Reply-To: <1439220437-23957-1-git-send-email-fred.konrad@greensocs.com>
Content-Type: text/plain; charset=utf-8
Content-Transfer-Encoding: 8bit
Subject: Re: [Qemu-devel] [RFC PATCH V7 00/19] Multithread TCG.
List-Id: <qemu-devel.nongnu.org>
List-Unsubscribe: <https://lists.nongnu.org/mailman/options/qemu-devel>,
	<mailto:qemu-devel-request@nongnu.org?subject=unsubscribe>
List-Archive: <http://lists.nongnu.org/archive/html/qemu-devel>
List-Post: <mailto:qemu-devel@nongnu.org>
List-Help: <mailto:qemu-devel-request@nongnu.org?subject=help>
List-Subscribe: <https://lists.nongnu.org/mailman/listinfo/qemu-devel>,
	<mailto:qemu-devel-request@nongnu.org?subject=subscribe>
To: fred.konrad@greensocs.com, qemu-devel@nongnu.org, mttcg@listserver.greensocs.com
Cc: alex.bennee@linaro.org, mark.burton@greensocs.com, a.rigo@virtualopensystems.com, guillaume.delbergue@greensocs.com

On 10/08/2015 17:26, fred.konrad@greensocs.com wrote:
> From: KONRAD Frederic <fred.konrad@greensocs.com>
> 
> This is the 7th round of the MTTCG patch series.

Here is a list of issues that I found:

- tb_lock usage in tb_find_fast is complicated and introduces the need
for other complicated code such as the tb_invalidate callback.  Instead,
the tb locking should reuse the cpu-exec.c code for user-mode emulation,
with additional locking in the spots identified by Fred.

- tb_lock uses a recursive lock, but this is not necessary.  Did I ever
say I dislike recursive mutexes? :)  The wrappers
tb_lock()/tb_unlock()/tb_lock_reset() can catch recursive locking for
us, so it's not hard to do without it.

- code_bitmap is not protected by any mutex
(tb_invalidate_phys_page_fast is called with the iothread mutex taken,
but other users of code_bitmap do not use it).  Writes should be
protected by the tb_lock, reads by either tb_lock or RCU.

- memory barriers are probably required around accesses to
->exit_request.  ->thread_kicked also needs to be accessed with atomics,
because async_run_{,safe_}on_cpu can be called outside the big QEMU lock.

- the whole signal-based qemu_cpu_kick can just go away.  Just setting
tcg_exit_req and exit_request will kick the TCG thread.  The hairy Win32
SuspendThread/ResumeThread goes away too.  I suggest doing it now,
because proving it unnecessary is easier than proving it correct.

- user-mode emulation is broken (does not compile)

- the big QEMU lock is not taken anywhere for MMIO accesses that require
it (i.e. basically all of them)

- some code wants to be called _outside_ the big QEMU lock, for example
because it longjmps back to cpu_exec.  In particular, I suspect that the
notdirty callbacks must be marked with memory_region_clear_global_locking.
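
To illustrate the second item above, here is a minimal sketch of
non-recursive tb_lock wrappers that catch recursive locking with a
per-thread flag.  The names tb_mutex and have_tb_lock are made up for
the sketch, and the semantics given to tb_lock_reset (drop the lock only
if this thread holds it, e.g. after a longjmp) are an assumption, not
something taken from the series.

```c
#include <assert.h>
#include <pthread.h>
#include <stdbool.h>

/* Hypothetical stand-in for the mutex in TBContext: a plain,
 * non-recursive mutex. */
static pthread_mutex_t tb_mutex = PTHREAD_MUTEX_INITIALIZER;

/* Per-thread flag: does the calling thread currently hold tb_mutex? */
static _Thread_local bool have_tb_lock;

void tb_lock(void)
{
    /* Locking twice from the same thread is a bug; catch it here
     * instead of hiding it behind a recursive mutex. */
    assert(!have_tb_lock);
    pthread_mutex_lock(&tb_mutex);
    have_tb_lock = true;
}

void tb_unlock(void)
{
    assert(have_tb_lock);
    have_tb_lock = false;
    pthread_mutex_unlock(&tb_mutex);
}

/* Assumed semantics: for paths that may or may not hold the lock
 * (for example after a longjmp back to the execution loop), release
 * it if held and do nothing otherwise. */
void tb_lock_reset(void)
{
    if (have_tb_lock) {
        have_tb_lock = false;
        pthread_mutex_unlock(&tb_mutex);
    }
}
```

With this shape, a second tb_lock() from the same thread trips the
assertion right away rather than silently nesting, which is the
property that makes the recursive mutex unnecessary.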

I've started looking at these issues (and documenting the locking
conventions for functions), and I hope to post the result to some git
repo later this week.

Paolo

> 
> It can be cloned from:
> git@git.greensocs.com:fkonrad/mttcg.git branch multi_tcg_v7.
> 
> This patch set tries to address the different issues in the global picture of
> MTTCG, presented on the wiki.
> 
> == Needed patch for our work ==
> 
> Some preliminaries are needed for our work:
>  * current_cpu doesn't make sense in mttcg so a tcg_executing flag is added to
>    the CPUState.
>  * We need to run some work safely when all VCPUs are outside their execution
>    loop. This is done with the async_run_safe_work_on_cpu function introduced
>    in this series.
>  * QemuSpin lock is introduced (POSIX only for now) to allow faster handling of
>    atomic instructions.
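
For readers unfamiliar with the idea, a spin lock of this kind can be
sketched with C11 atomics as below.  This is not the QemuSpin code from
the series; DemoSpin and the demo_spin_* functions are hypothetical
names used only for illustration.

```c
#include <stdatomic.h>
#include <stdbool.h>

/* Hypothetical spin lock type, not the QemuSpin API. */
typedef struct DemoSpin {
    atomic_flag locked;
} DemoSpin;

static inline void demo_spin_init(DemoSpin *s)
{
    atomic_flag_clear(&s->locked);
}

/* Returns true if the lock was free and is now held by the caller. */
static inline bool demo_spin_trylock(DemoSpin *s)
{
    return !atomic_flag_test_and_set_explicit(&s->locked,
                                              memory_order_acquire);
}

static inline void demo_spin_lock(DemoSpin *s)
{
    /* Busy-wait until the flag is observed clear and we set it. */
    while (!demo_spin_trylock(s)) {
        /* spin */
    }
}

static inline void demo_spin_unlock(DemoSpin *s)
{
    atomic_flag_clear_explicit(&s->locked, memory_order_release);
}
```

A lock like this never sleeps, so it only pays off when the critical
section is very short, as is the case for emulating a single atomic
instruction.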
> 
> == Code generation and cache ==
> 
> As QEMU stands, there is no protection at all against two threads attempting to
> generate code at the same time or modifying a TranslationBlock.
> The "protect TBContext with tb_lock" patch addresses the issue of code generation
> and makes all the tb_* functions thread safe (except tb_flush).
> This raised the question of one or multiple caches. We chose to use one
> unified cache because it's easier as a first step and since the structure of
> QEMU effectively has a ‘local’ cache per CPU in the form of the jump cache, we
> don't see the benefit of having two pools of tbs.
> 
> == Dirty tracking ==
> 
> Protecting the IOs:
> To allow all VCPU threads to run at the same time we need to drop the
> global_mutex as soon as possible. I/O accesses need to take the mutex. This is
> likely to change when http://thread.gmane.org/gmane.comp.emulators.qemu/345258
> will be upstreamed.
> 
> Invalidation of TranslationBlocks:
> We can have all VCPUs running during an invalidation. Each VCPU is able to clean
> its jump cache itself as it is in CPUState so that can be handled by a simple
> call to async_run_on_cpu. However tb_invalidate also writes to the
> TranslationBlock which is shared as we have only one pool.
> Hence this part of invalidate requires all VCPUs to exit before it can be done.
> Hence the async_run_safe_work_on_cpu is introduced to handle this case.
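
The "run only when every VCPU is outside its execution loop" rule can
be sketched as follows.  This is a hypothetical illustration, not the
async_run_safe_work_on_cpu implementation: all demo_* names are
invented, only one work item can be queued, and the kick that makes
running VCPUs leave their loop is left out.

```c
#include <pthread.h>
#include <stdbool.h>
#include <stddef.h>

typedef void (*demo_work_fn)(void *opaque);

typedef struct DemoSafeWork {
    pthread_mutex_t lock;
    int executing;         /* VCPUs currently inside their execution loop */
    demo_work_fn pending;  /* at most one queued item, to keep this short */
    void *opaque;
} DemoSafeWork;

#define DEMO_SAFE_WORK_INIT { PTHREAD_MUTEX_INITIALIZER, 0, NULL, NULL }

/* Run and clear the pending item if no VCPU is executing.
 * Called with s->lock held. */
static void demo_run_if_quiescent(DemoSafeWork *s)
{
    if (s->executing == 0 && s->pending) {
        demo_work_fn fn = s->pending;
        s->pending = NULL;
        fn(s->opaque);
    }
}

/* Called by a VCPU before entering its execution loop.  Returns false
 * if safe work is pending, in which case the VCPU must stay out. */
bool demo_vcpu_enter(DemoSafeWork *s)
{
    pthread_mutex_lock(&s->lock);
    bool ok = (s->pending == NULL);
    if (ok) {
        s->executing++;
    }
    pthread_mutex_unlock(&s->lock);
    return ok;
}

/* Called by a VCPU after leaving its execution loop; the last one out
 * runs the pending safe work. */
void demo_vcpu_exit(DemoSafeWork *s)
{
    pthread_mutex_lock(&s->lock);
    s->executing--;
    demo_run_if_quiescent(s);
    pthread_mutex_unlock(&s->lock);
}

/* Queue safe work; it runs immediately if nobody is executing. */
void demo_queue_safe_work(DemoSafeWork *s, demo_work_fn fn, void *opaque)
{
    pthread_mutex_lock(&s->lock);
    s->pending = fn;
    s->opaque = opaque;
    demo_run_if_quiescent(s);
    pthread_mutex_unlock(&s->lock);
}

/* Example work item: counts how many times it ran. */
static void demo_count_work(void *opaque)
{
    ++*(int *)opaque;
}
```

With two VCPUs inside the loop, queued work stays pending until the
second demo_vcpu_exit() call, and no VCPU is allowed back in before it
has run.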
> 
> == Atomic instruction ==
> 
> For now only ARM on x64 is supported by using a cmpxchg instruction.
> Specifically the limitation of this approach is that it is harder to support
> 64bit ARM on a host architecture that is multi-core, but only supports 32 bit
> cmpxchg (we believe this could be the case for some PPC cores).  For now this
> case is not correctly handled. The existing atomic patch will attempt to execute
> the 64 bit cmpxchg functionality in a non thread safe fashion. Our intention is
> to provide a new multi-thread ARM atomic patch for 64bit ARM on effective 32bit
> hosts.
> This atomic instruction part has been tested with Alexander's atomic stress repo
> available here:
> https://lists.gnu.org/archive/html/qemu-devel/2015-06/msg05585.html
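
The general idea of emulating a load-exclusive/store-exclusive pair
with a host compare-and-swap can be sketched like this.  It is not the
code from the atomic patch: DemoCPU, demo_ldrex and demo_strex are
hypothetical names, and only the 32-bit case is shown.

```c
#include <stdatomic.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical per-VCPU exclusive monitor state. */
typedef struct DemoCPU {
    _Atomic uint32_t *exclusive_addr; /* address of last load-exclusive */
    uint32_t exclusive_val;           /* value seen by that load */
    bool exclusive_valid;             /* is there an open reservation? */
} DemoCPU;

/* Load-exclusive: read the value and remember address and value. */
uint32_t demo_ldrex(DemoCPU *cpu, _Atomic uint32_t *addr)
{
    uint32_t val = atomic_load(addr);
    cpu->exclusive_addr = addr;
    cpu->exclusive_val = val;
    cpu->exclusive_valid = true;
    return val;
}

/* Store-exclusive: succeeds only if memory still holds the value seen
 * by the matching load-exclusive.  Returns 0 on success and 1 on
 * failure, like the STREX status result. */
int demo_strex(DemoCPU *cpu, _Atomic uint32_t *addr, uint32_t newval)
{
    bool ok = false;
    if (cpu->exclusive_valid && cpu->exclusive_addr == addr) {
        uint32_t expected = cpu->exclusive_val;
        /* The host cmpxchg does the check and the store atomically. */
        ok = atomic_compare_exchange_strong(addr, &expected, newval);
    }
    cpu->exclusive_valid = false; /* the reservation is always consumed */
    return ok ? 0 : 1;
}
```

Because the check compares values rather than monitoring the address,
a store that writes the old value back in between goes unnoticed (the
ABA case).  The 64-bit variant needs a 64-bit host cmpxchg, which is
the limitation described above.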
> 
> The execution is a little slower than upstream, probably because the different
> VCPUs fight for the mutex. Swapping arm_exclusive_lock from a mutex to a spin_lock
> considerably reduces the difference.
> 
> == Testing ==
> 
> A simple double dhrystone test in SMP 2 with vexpress-a15 in a Linux guest shows
> a good performance progression: it takes basically 18s upstream to complete vs
> 10s with MTTCG.
> 
> Testing image is available here:
> https://cloud.greensocs.com/index.php/s/CfHSLzDH5pmTkW3
> 
> Then simply:
> ./configure --target-list=arm-softmmu
> make -j8
> ./arm-softmmu/qemu-system-arm -M vexpress-a15 -smp 2 -kernel zImage
> -initrd rootfs.ext2 -dtb vexpress-v2p-ca15-tc1.dtb --nographic
> --append "console=ttyAMA0"
> 
> login: root
> 
> The dhrystone command is the last one in the history.
> "dhrystone 10000000 & dhrystone 10000000"
> 
> The atomic spinlock benchmark from Alexander shows that atomics basically work.
> Just follow the instructions here:
> https://lists.gnu.org/archive/html/qemu-devel/2015-06/msg05585.html
> 
> == Known issues ==
> 
> * GDB stub:
>   GDB stub is not tested right now; it will probably require some changes to
>   work.
> 
> * deadlock on exit:
>   When exiting QEMU with Ctrl-C, some VCPU threads are not able to exit and
>   continue executing.
>   http://git.greensocs.com/fkonrad/mttcg/issues/1
> 
> * memory_region_rom_device_set_romd from pflash01 just crashes the TCG code.
>   Strangely this happens only with "-smp 4" and 2 in the DTB.
>   http://git.greensocs.com/fkonrad/mttcg/issues/2
> 
> Changes V6 -> V7:
>   * global_lock:
>      * Don't protect softmmu read/write helper as it's now done in
>        address_space_rw.
>   * tcg_exec_flag:
>      * Make the flag atomically test and set through an API.
>   * introduce async_safe_work:
>      * move qemu_cpu_kick_thread to avoid prototype declaration.
>      * use the work_mutex.
>   * async_work:
>      * protect it with a mutex (work_mutex) against concurrent access.
>   * tb_lock:
>      * protect tcg_malloc_internal as well.
>   * signal the VCPU even if current_cpu is NULL.
>   * added PSCI patch.
>   * rebased on v2.4.0-rc0 (6169b60285fe1ff730d840a49527e721bfb30899).
> 
> Changes V5 -> V6:
>   * Introduce async_safe_work to do the tb_flush and some part of tb_invalidate.
>   * Introduce QemuSpin from Guillaume which allows faster atomic instructions
>     (6s to pass Alexander's atomic test instead of 30s before).
>   * Don't take tb_lock before tb_find_fast.
>   * Handle tb_flush with async_safe_work.
>   * Handle tb_invalidate with async_work and async_safe_work.
>   * Drop the tlb_flush_request mechanism and use async_work as well.
>   * Fix the wrong length in atomic patch.
>   * Fix the wrong return address for exception in atomic patch.
> 
> Alex Bennée (1):
>   target-arm/psci.c: wake up sleeping CPUs (MTTCG)
> 
> Guillaume Delbergue (1):
>   add support for spin lock on POSIX systems exclusively
> 
> KONRAD Frederic (17):
>   cpus: protect queued_work_* with work_mutex.
>   cpus: add tcg_exec_flag.
>   cpus: introduce async_run_safe_work_on_cpu.
>   replace spinlock by QemuMutex.
>   remove unused spinlock.
>   protect TBContext with tb_lock.
>   tcg: remove tcg_halt_cond global variable.
>   Drop global lock during TCG code execution
>   cpu: remove exit_request global.
>   tcg: switch on multithread.
>   Use atomic cmpxchg to atomically check the exclusive value in a STREX
>   add a callback when tb_invalidate is called.
>   cpu: introduce tlb_flush*_all.
>   arm: use tlb_flush*_all
>   translate-all: introduces tb_flush_safe.
>   translate-all: (wip) use tb_flush_safe when we can't alloc more tb.
>   mttcg: signal the associated cpu anyway.
> 
>  cpu-exec.c                  |  98 +++++++++------
>  cpus.c                      | 295 +++++++++++++++++++++++++-------------------
>  cputlb.c                    |  81 ++++++++++++
>  include/exec/exec-all.h     |   8 +-
>  include/exec/spinlock.h     |  49 --------
>  include/qemu/thread-posix.h |   4 +
>  include/qemu/thread-win32.h |   4 +
>  include/qemu/thread.h       |   7 ++
>  include/qom/cpu.h           |  57 +++++++++
>  linux-user/main.c           |   6 +-
>  qom/cpu.c                   |  20 +++
>  target-arm/cpu.c            |  21 ++++
>  target-arm/cpu.h            |   6 +
>  target-arm/helper.c         |  58 +++------
>  target-arm/helper.h         |   4 +
>  target-arm/op_helper.c      | 128 ++++++++++++++++++-
>  target-arm/psci.c           |   2 +
>  target-arm/translate.c      | 101 +++------------
>  target-i386/mem_helper.c    |  16 ++-
>  target-i386/misc_helper.c   |  27 +++-
>  tcg/i386/tcg-target.c       |   8 ++
>  tcg/tcg.h                   |  14 ++-
>  translate-all.c             | 217 +++++++++++++++++++++++++++-----
>  util/qemu-thread-posix.c    |  45 +++++++
>  util/qemu-thread-win32.c    |  30 +++++
>  vl.c                        |   6 +
>  26 files changed, 934 insertions(+), 378 deletions(-)
>  delete mode 100644 include/exec/spinlock.h
>