From mboxrd@z Thu Jan 1 00:00:00 1970 Received: from eggs.gnu.org ([2001:4830:134:3::10]:55830) by lists.gnu.org with esmtp (Exim 4.71) (envelope-from ) id 1ZPXoV-0000WD-Sr for qemu-devel@nongnu.org; Wed, 12 Aug 2015 11:19:41 -0400 Received: from Debian-exim by eggs.gnu.org with spam-scanned (Exim 4.71) (envelope-from ) id 1ZPXoR-0007DB-5V for qemu-devel@nongnu.org; Wed, 12 Aug 2015 11:19:35 -0400 Received: from greensocs.com ([193.104.36.180]:58492) by eggs.gnu.org with esmtp (Exim 4.71) (envelope-from ) id 1ZPXoQ-0007Co-Q5 for qemu-devel@nongnu.org; Wed, 12 Aug 2015 11:19:31 -0400 Message-ID: <55CB63FA.1070200@greensocs.com> Date: Wed, 12 Aug 2015 17:19:22 +0200 From: Frederic Konrad MIME-Version: 1.0 References: <1439220437-23957-1-git-send-email-fred.konrad@greensocs.com> <55C9EE60.80004@redhat.com> <55C9FFC0.7040502@greensocs.com> In-Reply-To: <55C9FFC0.7040502@greensocs.com> Content-Type: text/plain; charset=UTF-8; format=flowed Content-Transfer-Encoding: quoted-printable Subject: Re: [Qemu-devel] [RFC PATCH V7 00/19] Multithread TCG. List-Id: List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , To: Paolo Bonzini , qemu-devel@nongnu.org, mttcg@listserver.greensocs.com Cc: mark.burton@greensocs.com, alex.bennee@linaro.org, a.rigo@virtualopensystems.com, guillaume.delbergue@greensocs.com On 11/08/2015 15:59, Frederic Konrad wrote: > On 11/08/2015 14:45, Paolo Bonzini wrote: >> On 10/08/2015 17:26, fred.konrad@greensocs.com wrote: >>> From: KONRAD Frederic >>> >>> This is the 7th round of the MTTCG patch series. > Thanks to look at this. >> Here is a list of issues that I found: >> >> - tb_lock usage in tb_find_fast is complicated and introduces the need >> for other complicated code such as the tb_invalidate callback. Instead= , >> the tb locking should reuse the cpu-exec.c code for user-mode emulatio= n, >> with additional locking in the spots identified by Fred. > The reason for this is that locking around tb_find_fast just kills the=20 > performance. > >> >> - tb_lock uses a recursive lock, but this is not necessary. Did I eve= r >> say I dislike recursive mutexes? :) The wrappers >> tb_lock()/tb_unlock()/tb_lock_reset() can catch recursive locking for >> us, so it's not hard to do without it. > True this is definitely not nice but seems some code eg: tb_invalidate=20 > are called from > different place with or without the tb_lock. We might probably be able=20 > to lock the > caller. > >> >> - code_bitmap is not protected by any mutex >> (tb_invalidate_phys_page_fast is called with the iothread mutex taken, >> but other users of code_bitmap do not use it). Writes should be >> protected by the tb_lock, reads by either tb_lock or RCU. > Yes good point I missed this one. > >> - memory barriers are probably requested around accesses to >> ->exit_request. ->thread_kicked also needs to be accessed with atomic= s, >> because async_run_{,safe_}on_cpu can be called outside the big QEMU=20 >> lock. >> >> - the whole signal-based qemu_cpu_kick can just go away. Just setting >> tcg_exit_req and exit_request will kick the TCG thread. The hairy Win= 32 >> SuspendThread/ResumeThread goes away too. I suggest doing it now, >> because proving it unnecessary is easier than proving it correct. > Just setting tcg_exit_req and exit_request and signal the=20 > cpu->halt_cond I guess? BTW that affect KVM as well. Seems this mechanism is used as well with qemu_cpu_kick_self().. Which is a little strange as it seems the SIGIPI=20 trigger a dummy signal handler? memset(&sigact, 0, sizeof(sigact)); sigact.sa_handler =3D dummy_signal; sigaction(SIG_IPI, &sigact, NULL); > >> - user-mode emulation is broken (does not compile) >> >> - the big QEMU lock is not taken anywhere for MMIO accesses that requi= re >> it (i.e. basically all of them) > Isn't that handled by prepare_mmio_access? > >> >> - some code wants to be called _outside_ the big QEMU lock, for exampl= e >> because it longjmps back to cpu_exec. For example, I suspect that the >> notdirty callbacks must be marked with=20 >> memory_region_clear_global_locking. > > Fred >> >> I've started looking at them (and documenting the locking conventions >> for functions), and I hope to post it to some git repo later this week= . >> >> Paolo >> >>> It can be cloned from: >>> git@git.greensocs.com:fkonrad/mttcg.git branch multi_tcg_v7. >>> >>> This patch-set try to address the different issues in the global=20 >>> picture of >>> MTTCG, presented on the wiki. >>> >>> =3D=3D Needed patch for our work =3D=3D >>> >>> Some preliminaries are needed for our work: >>> * current_cpu doesn't make sense in mttcg so a tcg_executing flag=20 >>> is added to >>> the CPUState. >>> * We need to run some work safely when all VCPUs are outside their=20 >>> execution >>> loop. This is done with the async_run_safe_work_on_cpu function=20 >>> introduced >>> in this series. >>> * QemuSpin lock is introduced (on posix only yet) to allow a=20 >>> faster handling of >>> atomic instruction. >>> >>> =3D=3D Code generation and cache =3D=3D >>> >>> As Qemu stands, there is no protection at all against two threads=20 >>> attempting to >>> generate code at the same time or modifying a TranslationBlock. >>> The "protect TBContext with tb_lock" patch address the issue of code=20 >>> generation >>> and makes all the tb_* function thread safe (except tb_flush). >>> This raised the question of one or multiple caches. We choosed to=20 >>> use one >>> unified cache because it's easier as a first step and since the=20 >>> structure of >>> QEMU effectively has a =E2=80=98local=E2=80=99 cache per CPU in the f= orm of the jump=20 >>> cache, we >>> don't see the benefit of having two pools of tbs. >>> >>> =3D=3D Dirty tracking =3D=3D >>> >>> Protecting the IOs: >>> To allows all VCPUs threads to run at the same time we need to drop t= he >>> global_mutex as soon as possible. The io access need to take the=20 >>> mutex. This is >>> likely to change when=20 >>> http://thread.gmane.org/gmane.comp.emulators.qemu/345258 >>> will be upstreamed. >>> >>> Invalidation of TranslationBlocks: >>> We can have all VCPUs running during an invalidation. Each VCPU is=20 >>> able to clean >>> it's jump cache itself as it is in CPUState so that can be handled=20 >>> by a simple >>> call to async_run_on_cpu. However tb_invalidate also writes to the >>> TranslationBlock which is shared as we have only one pool. >>> Hence this part of invalidate requires all VCPUs to exit before it=20 >>> can be done. >>> Hence the async_run_safe_work_on_cpu is introduced to handle this cas= e. >>> >>> =3D=3D Atomic instruction =3D=3D >>> >>> For now only ARM on x64 is supported by using an cmpxchg instruction. >>> Specifically the limitation of this approach is that it is harder to=20 >>> support >>> 64bit ARM on a host architecture that is multi-core, but only=20 >>> supports 32 bit >>> cmpxchg (we believe this could be the case for some PPC cores). For=20 >>> now this >>> case is not correctly handled. The existing atomic patch will=20 >>> attempt to execute >>> the 64 bit cmpxchg functionality in a non thread safe fashion. Our=20 >>> intention is >>> to provide a new multi-thread ARM atomic patch for 64bit ARM on=20 >>> effective 32bit >>> hosts. >>> This atomic instruction part has been tested with Alexander's atomic=20 >>> stress repo >>> available here: >>> https://lists.gnu.org/archive/html/qemu-devel/2015-06/msg05585.html >>> >>> The execution is a little slower than upstream probably because of=20 >>> the different >>> VCPU fight for the mutex. Swaping arm_exclusive_lock from mutex to=20 >>> spin_lock >>> reduce considerably the difference. >>> >>> =3D=3D Testing =3D=3D >>> >>> A simple double dhrystone test in SMP 2 with vexpress-a15 in a linux=20 >>> guest show >>> a good performance progression: it takes basically 18s upstream to=20 >>> complete vs >>> 10s with MTTCG. >>> >>> Testing image is available here: >>> https://cloud.greensocs.com/index.php/s/CfHSLzDH5pmTkW3 >>> >>> Then simply: >>> ./configure --target-list=3Darm-softmmu >>> make -j8 >>> ./arm-softmmu/qemu-system-arm -M vexpress-a15 -smp 2 -kernel zImage >>> -initrd rootfs.ext2 -dtb vexpress-v2p-ca15-tc1.dtb --nographic >>> --append "console=3DttyAMA0" >>> >>> login: root >>> >>> The dhrystone command is the last one in the history. >>> "dhrystone 10000000 & dhrystone 10000000" >>> >>> The atomic spinlock benchmark from Alexander shows that atomic=20 >>> basically work. >>> Just follow the instruction here: >>> https://lists.gnu.org/archive/html/qemu-devel/2015-06/msg05585.html >>> >>> =3D=3D Known issues =3D=3D >>> >>> * GDB stub: >>> GDB stub is not tested right now it will probably requires some=20 >>> changes to >>> work. >>> >>> * deadlock on exit: >>> When exiting QEMU Ctrl-C some VCPU's thread are not able to exit=20 >>> and continue >>> execution. >>> http://git.greensocs.com/fkonrad/mttcg/issues/1 >>> >>> * memory_region_rom_device_set_romd from pflash01 just crashes the=20 >>> TCG code. >>> Strangely this happen only with "-smp 4" and 2 in the DTB. >>> http://git.greensocs.com/fkonrad/mttcg/issues/2 >>> >>> Changes V6 -> V7: >>> * global_lock: >>> * Don't protect softmmu read/write helper as it's now done in >>> adress_space_rw. >>> * tcg_exec_flag: >>> * Make the flag atomically test and set through an API. >>> * introduce async_safe_work: >>> * move qemu_cpu_kick_thread to avoid prototype declaration. >>> * use the work_mutex. >>> * async_work: >>> * protect it with a mutex (work_mutex) against concurent access= . >>> * tb_lock: >>> * protect tcg_malloc_internal as well. >>> * signal the VCPU even of current_cpu is NULL. >>> * added PSCI patch. >>> * rebased on v2.4.0-rc0 (6169b60285fe1ff730d840a49527e721bfb30899)= . >>> >>> Changes V5 -> V6: >>> * Introduce async_safe_work to do the tb_flush and some part of=20 >>> tb_invalidate. >>> * Introduce QemuSpin from Guillaume which allow a faster atomic=20 >>> instruction >>> (6s to pass Alexander's atomic test instead of 30s before). >>> * Don't take tb_lock before tb_find_fast. >>> * Handle tb_flush with async_safe_work. >>> * Handle tb_invalidate with async_work and async_safe_work. >>> * Drop the tlb_flush_request mechanism and use async_work as well. >>> * Fix the wrong lenght in atomic patch. >>> * Fix the wrong return address for exception in atomic patch. >>> >>> Alex Benn=C3=A9e (1): >>> target-arm/psci.c: wake up sleeping CPUs (MTTCG) >>> >>> Guillaume Delbergue (1): >>> add support for spin lock on POSIX systems exclusively >>> >>> KONRAD Frederic (17): >>> cpus: protect queued_work_* with work_mutex. >>> cpus: add tcg_exec_flag. >>> cpus: introduce async_run_safe_work_on_cpu. >>> replace spinlock by QemuMutex. >>> remove unused spinlock. >>> protect TBContext with tb_lock. >>> tcg: remove tcg_halt_cond global variable. >>> Drop global lock during TCG code execution >>> cpu: remove exit_request global. >>> tcg: switch on multithread. >>> Use atomic cmpxchg to atomically check the exclusive value in a=20 >>> STREX >>> add a callback when tb_invalidate is called. >>> cpu: introduce tlb_flush*_all. >>> arm: use tlb_flush*_all >>> translate-all: introduces tb_flush_safe. >>> translate-all: (wip) use tb_flush_safe when we can't alloc more tb= . >>> mttcg: signal the associated cpu anyway. >>> >>> cpu-exec.c | 98 +++++++++------ >>> cpus.c | 295=20 >>> +++++++++++++++++++++++++------------------- >>> cputlb.c | 81 ++++++++++++ >>> include/exec/exec-all.h | 8 +- >>> include/exec/spinlock.h | 49 -------- >>> include/qemu/thread-posix.h | 4 + >>> include/qemu/thread-win32.h | 4 + >>> include/qemu/thread.h | 7 ++ >>> include/qom/cpu.h | 57 +++++++++ >>> linux-user/main.c | 6 +- >>> qom/cpu.c | 20 +++ >>> target-arm/cpu.c | 21 ++++ >>> target-arm/cpu.h | 6 + >>> target-arm/helper.c | 58 +++------ >>> target-arm/helper.h | 4 + >>> target-arm/op_helper.c | 128 ++++++++++++++++++- >>> target-arm/psci.c | 2 + >>> target-arm/translate.c | 101 +++------------ >>> target-i386/mem_helper.c | 16 ++- >>> target-i386/misc_helper.c | 27 +++- >>> tcg/i386/tcg-target.c | 8 ++ >>> tcg/tcg.h | 14 ++- >>> translate-all.c | 217 +++++++++++++++++++++++++++----- >>> util/qemu-thread-posix.c | 45 +++++++ >>> util/qemu-thread-win32.c | 30 +++++ >>> vl.c | 6 + >>> 26 files changed, 934 insertions(+), 378 deletions(-) >>> delete mode 100644 include/exec/spinlock.h >>> > >