From mboxrd@z Thu Jan 1 00:00:00 1970 Received: from eggs.gnu.org ([2001:4830:134:3::10]:55224) by lists.gnu.org with esmtp (Exim 4.71) (envelope-from ) id 1ZP8vr-00087u-0N for qemu-devel@nongnu.org; Tue, 11 Aug 2015 08:45:33 -0400 Received: from Debian-exim by eggs.gnu.org with spam-scanned (Exim 4.71) (envelope-from ) id 1ZP8vm-0005H5-Ne for qemu-devel@nongnu.org; Tue, 11 Aug 2015 08:45:30 -0400 Received: from mail-wi0-x229.google.com ([2a00:1450:400c:c05::229]:38407) by eggs.gnu.org with esmtp (Exim 4.71) (envelope-from ) id 1ZP8vm-0005GU-CK for qemu-devel@nongnu.org; Tue, 11 Aug 2015 08:45:26 -0400 Received: by wicja10 with SMTP id ja10so67309860wic.1 for ; Tue, 11 Aug 2015 05:45:25 -0700 (PDT) Sender: Paolo Bonzini References: <1439220437-23957-1-git-send-email-fred.konrad@greensocs.com> From: Paolo Bonzini Message-ID: <55C9EE60.80004@redhat.com> Date: Tue, 11 Aug 2015 14:45:20 +0200 MIME-Version: 1.0 In-Reply-To: <1439220437-23957-1-git-send-email-fred.konrad@greensocs.com> Content-Type: text/plain; charset=utf-8 Content-Transfer-Encoding: 8bit Subject: Re: [Qemu-devel] [RFC PATCH V7 00/19] Multithread TCG. List-Id: List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , To: fred.konrad@greensocs.com, qemu-devel@nongnu.org, mttcg@listserver.greensocs.com Cc: alex.bennee@linaro.org, mark.burton@greensocs.com, a.rigo@virtualopensystems.com, guillaume.delbergue@greensocs.com On 10/08/2015 17:26, fred.konrad@greensocs.com wrote: > From: KONRAD Frederic > > This is the 7th round of the MTTCG patch series. Here is a list of issues that I found: - tb_lock usage in tb_find_fast is complicated and introduces the need for other complicated code such as the tb_invalidate callback. Instead, the tb locking should reuse the cpu-exec.c code for user-mode emulation, with additional locking in the spots identified by Fred. - tb_lock uses a recursive lock, but this is not necessary. Did I ever say I dislike recursive mutexes? :) The wrappers tb_lock()/tb_unlock()/tb_lock_reset() can catch recursive locking for us, so it's not hard to do without it. - code_bitmap is not protected by any mutex (tb_invalidate_phys_page_fast is called with the iothread mutex taken, but other users of code_bitmap do not use it). Writes should be protected by the tb_lock, reads by either tb_lock or RCU. - memory barriers are probably requested around accesses to ->exit_request. ->thread_kicked also needs to be accessed with atomics, because async_run_{,safe_}on_cpu can be called outside the big QEMU lock. - the whole signal-based qemu_cpu_kick can just go away. Just setting tcg_exit_req and exit_request will kick the TCG thread. The hairy Win32 SuspendThread/ResumeThread goes away too. I suggest doing it now, because proving it unnecessary is easier than proving it correct. - user-mode emulation is broken (does not compile) - the big QEMU lock is not taken anywhere for MMIO accesses that require it (i.e. basically all of them) - some code wants to be called _outside_ the big QEMU lock, for example because it longjmps back to cpu_exec. For example, I suspect that the notdirty callbacks must be marked with memory_region_clear_global_locking. I've started looking at them (and documenting the locking conventions for functions), and I hope to post it to some git repo later this week. Paolo > > It can be cloned from: > git@git.greensocs.com:fkonrad/mttcg.git branch multi_tcg_v7. > > This patch-set try to address the different issues in the global picture of > MTTCG, presented on the wiki. > > == Needed patch for our work == > > Some preliminaries are needed for our work: > * current_cpu doesn't make sense in mttcg so a tcg_executing flag is added to > the CPUState. > * We need to run some work safely when all VCPUs are outside their execution > loop. This is done with the async_run_safe_work_on_cpu function introduced > in this series. > * QemuSpin lock is introduced (on posix only yet) to allow a faster handling of > atomic instruction. > > == Code generation and cache == > > As Qemu stands, there is no protection at all against two threads attempting to > generate code at the same time or modifying a TranslationBlock. > The "protect TBContext with tb_lock" patch address the issue of code generation > and makes all the tb_* function thread safe (except tb_flush). > This raised the question of one or multiple caches. We choosed to use one > unified cache because it's easier as a first step and since the structure of > QEMU effectively has a ‘local’ cache per CPU in the form of the jump cache, we > don't see the benefit of having two pools of tbs. > > == Dirty tracking == > > Protecting the IOs: > To allows all VCPUs threads to run at the same time we need to drop the > global_mutex as soon as possible. The io access need to take the mutex. This is > likely to change when http://thread.gmane.org/gmane.comp.emulators.qemu/345258 > will be upstreamed. > > Invalidation of TranslationBlocks: > We can have all VCPUs running during an invalidation. Each VCPU is able to clean > it's jump cache itself as it is in CPUState so that can be handled by a simple > call to async_run_on_cpu. However tb_invalidate also writes to the > TranslationBlock which is shared as we have only one pool. > Hence this part of invalidate requires all VCPUs to exit before it can be done. > Hence the async_run_safe_work_on_cpu is introduced to handle this case. > > == Atomic instruction == > > For now only ARM on x64 is supported by using an cmpxchg instruction. > Specifically the limitation of this approach is that it is harder to support > 64bit ARM on a host architecture that is multi-core, but only supports 32 bit > cmpxchg (we believe this could be the case for some PPC cores). For now this > case is not correctly handled. The existing atomic patch will attempt to execute > the 64 bit cmpxchg functionality in a non thread safe fashion. Our intention is > to provide a new multi-thread ARM atomic patch for 64bit ARM on effective 32bit > hosts. > This atomic instruction part has been tested with Alexander's atomic stress repo > available here: > https://lists.gnu.org/archive/html/qemu-devel/2015-06/msg05585.html > > The execution is a little slower than upstream probably because of the different > VCPU fight for the mutex. Swaping arm_exclusive_lock from mutex to spin_lock > reduce considerably the difference. > > == Testing == > > A simple double dhrystone test in SMP 2 with vexpress-a15 in a linux guest show > a good performance progression: it takes basically 18s upstream to complete vs > 10s with MTTCG. > > Testing image is available here: > https://cloud.greensocs.com/index.php/s/CfHSLzDH5pmTkW3 > > Then simply: > ./configure --target-list=arm-softmmu > make -j8 > ./arm-softmmu/qemu-system-arm -M vexpress-a15 -smp 2 -kernel zImage > -initrd rootfs.ext2 -dtb vexpress-v2p-ca15-tc1.dtb --nographic > --append "console=ttyAMA0" > > login: root > > The dhrystone command is the last one in the history. > "dhrystone 10000000 & dhrystone 10000000" > > The atomic spinlock benchmark from Alexander shows that atomic basically work. > Just follow the instruction here: > https://lists.gnu.org/archive/html/qemu-devel/2015-06/msg05585.html > > == Known issues == > > * GDB stub: > GDB stub is not tested right now it will probably requires some changes to > work. > > * deadlock on exit: > When exiting QEMU Ctrl-C some VCPU's thread are not able to exit and continue > execution. > http://git.greensocs.com/fkonrad/mttcg/issues/1 > > * memory_region_rom_device_set_romd from pflash01 just crashes the TCG code. > Strangely this happen only with "-smp 4" and 2 in the DTB. > http://git.greensocs.com/fkonrad/mttcg/issues/2 > > Changes V6 -> V7: > * global_lock: > * Don't protect softmmu read/write helper as it's now done in > adress_space_rw. > * tcg_exec_flag: > * Make the flag atomically test and set through an API. > * introduce async_safe_work: > * move qemu_cpu_kick_thread to avoid prototype declaration. > * use the work_mutex. > * async_work: > * protect it with a mutex (work_mutex) against concurent access. > * tb_lock: > * protect tcg_malloc_internal as well. > * signal the VCPU even of current_cpu is NULL. > * added PSCI patch. > * rebased on v2.4.0-rc0 (6169b60285fe1ff730d840a49527e721bfb30899). > > Changes V5 -> V6: > * Introduce async_safe_work to do the tb_flush and some part of tb_invalidate. > * Introduce QemuSpin from Guillaume which allow a faster atomic instruction > (6s to pass Alexander's atomic test instead of 30s before). > * Don't take tb_lock before tb_find_fast. > * Handle tb_flush with async_safe_work. > * Handle tb_invalidate with async_work and async_safe_work. > * Drop the tlb_flush_request mechanism and use async_work as well. > * Fix the wrong lenght in atomic patch. > * Fix the wrong return address for exception in atomic patch. > > Alex Bennée (1): > target-arm/psci.c: wake up sleeping CPUs (MTTCG) > > Guillaume Delbergue (1): > add support for spin lock on POSIX systems exclusively > > KONRAD Frederic (17): > cpus: protect queued_work_* with work_mutex. > cpus: add tcg_exec_flag. > cpus: introduce async_run_safe_work_on_cpu. > replace spinlock by QemuMutex. > remove unused spinlock. > protect TBContext with tb_lock. > tcg: remove tcg_halt_cond global variable. > Drop global lock during TCG code execution > cpu: remove exit_request global. > tcg: switch on multithread. > Use atomic cmpxchg to atomically check the exclusive value in a STREX > add a callback when tb_invalidate is called. > cpu: introduce tlb_flush*_all. > arm: use tlb_flush*_all > translate-all: introduces tb_flush_safe. > translate-all: (wip) use tb_flush_safe when we can't alloc more tb. > mttcg: signal the associated cpu anyway. > > cpu-exec.c | 98 +++++++++------ > cpus.c | 295 +++++++++++++++++++++++++------------------- > cputlb.c | 81 ++++++++++++ > include/exec/exec-all.h | 8 +- > include/exec/spinlock.h | 49 -------- > include/qemu/thread-posix.h | 4 + > include/qemu/thread-win32.h | 4 + > include/qemu/thread.h | 7 ++ > include/qom/cpu.h | 57 +++++++++ > linux-user/main.c | 6 +- > qom/cpu.c | 20 +++ > target-arm/cpu.c | 21 ++++ > target-arm/cpu.h | 6 + > target-arm/helper.c | 58 +++------ > target-arm/helper.h | 4 + > target-arm/op_helper.c | 128 ++++++++++++++++++- > target-arm/psci.c | 2 + > target-arm/translate.c | 101 +++------------ > target-i386/mem_helper.c | 16 ++- > target-i386/misc_helper.c | 27 +++- > tcg/i386/tcg-target.c | 8 ++ > tcg/tcg.h | 14 ++- > translate-all.c | 217 +++++++++++++++++++++++++++----- > util/qemu-thread-posix.c | 45 +++++++ > util/qemu-thread-win32.c | 30 +++++ > vl.c | 6 + > 26 files changed, 934 insertions(+), 378 deletions(-) > delete mode 100644 include/exec/spinlock.h >