Message-ID: <55C92D7E.80604@greensocs.com>
Date: Tue, 11 Aug 2015 01:02:22 +0200
From: Frederic Konrad
References: <1439220437-23957-1-git-send-email-fred.konrad@greensocs.com>
 <87bnefgfgx.fsf@linaro.org>
In-Reply-To: <87bnefgfgx.fsf@linaro.org>
Subject: Re: [Qemu-devel] [RFC PATCH V7 00/19] Multithread TCG.
To: Alex Bennée
Cc: mttcg@listserver.greensocs.com, mark.burton@greensocs.com,
 qemu-devel@nongnu.org, a.rigo@virtualopensystems.com,
 guillaume.delbergue@greensocs.com, pbonzini@redhat.com

On 10/08/2015 20:34, Alex Bennée wrote:
> fred.konrad@greensocs.com writes:
>
>> From: KONRAD Frederic
>>
>> This is the 7th round of the MTTCG patch series.
>>
>>
>> It can be cloned from:
>> git@git.greensocs.com:fkonrad/mttcg.git branch multi_tcg_v7.
> I'm not seeing this yet, did you remember to push?
Oops, sorry. Done!

>
>> This patch set tries to address the different issues in the global picture of
>> MTTCG, presented on the wiki.
>>
>> == Needed patch for our work ==
>>
>> Some preliminaries are needed for our work:
>> * current_cpu doesn't make sense in mttcg so a tcg_executing flag is added to
>>   the CPUState.
>> * We need to run some work safely when all VCPUs are outside their execution
>>   loop. This is done with the async_run_safe_work_on_cpu function introduced
>>   in this series.
>> * QemuSpin lock is introduced (POSIX only so far) to allow faster handling of
>>   atomic instructions.
>>
>> == Code generation and cache ==
>>
>> As QEMU stands, there is no protection at all against two threads attempting to
>> generate code at the same time or modifying a TranslationBlock.
>> The "protect TBContext with tb_lock" patch addresses the issue of code
>> generation and makes all the tb_* functions thread safe (except tb_flush).
>> This raised the question of one or multiple caches. We chose to use one
>> unified cache because it's easier as a first step and since the structure of
>> QEMU effectively has a ‘local’ cache per CPU in the form of the jump cache, we
>> don't see the benefit of having two pools of tbs.
>>
>> == Dirty tracking ==
>>
>> Protecting the IOs:
>> To allow all VCPU threads to run at the same time we need to drop the
>> global_mutex as soon as possible. The IO accesses need to take the mutex. This
>> is likely to change when
>> http://thread.gmane.org/gmane.comp.emulators.qemu/345258
>> is upstreamed.
>>
>> Invalidation of TranslationBlocks:
>> We can have all VCPUs running during an invalidation. Each VCPU is able to clean
>> its jump cache itself as it is in CPUState, so that can be handled by a simple
>> call to async_run_on_cpu. However tb_invalidate also writes to the
>> TranslationBlock, which is shared as we have only one pool.
>> Hence this part of invalidate requires all VCPUs to exit before it can be done,
>> and async_run_safe_work_on_cpu is introduced to handle this case.
>>
>> == Atomic instruction ==
>>
>> For now only ARM on x64 is supported, by using a cmpxchg instruction.
>> Specifically, the limitation of this approach is that it is harder to support
>> 64bit ARM on a host architecture that is multi-core, but only supports 32 bit
>> cmpxchg (we believe this could be the case for some PPC cores). For now this
>> case is not correctly handled. The existing atomic patch will attempt to execute
>> the 64 bit cmpxchg functionality in a non thread safe fashion. Our intention is
>> to provide a new multi-thread ARM atomic patch for 64bit ARM on effective 32bit
>> hosts.
>> This atomic instruction part has been tested with Alexander's atomic stress repo
>> available here:
>> https://lists.gnu.org/archive/html/qemu-devel/2015-06/msg05585.html
>>
>> The execution is a little slower than upstream, probably because the different
>> VCPUs fight for the mutex. Swapping arm_exclusive_lock from a mutex to a
>> spin_lock considerably reduces the difference.
>>
>> == Testing ==
>>
>> A simple double dhrystone test in SMP 2 with vexpress-a15 in a Linux guest shows
>> a good performance progression: it takes basically 18s upstream to complete vs
>> 10s with MTTCG.
>>
>> Testing image is available here:
>> https://cloud.greensocs.com/index.php/s/CfHSLzDH5pmTkW3
>>
>> Then simply:
>> ./configure --target-list=arm-softmmu
>> make -j8
>> ./arm-softmmu/qemu-system-arm -M vexpress-a15 -smp 2 -kernel zImage
>> -initrd rootfs.ext2 -dtb vexpress-v2p-ca15-tc1.dtb --nographic
>> --append "console=ttyAMA0"
>>
>> login: root
>>
>> The dhrystone command is the last one in the history.
>> "dhrystone 10000000 & dhrystone 10000000"
>>
>> The atomic spinlock benchmark from Alexander shows that atomics basically work.
>> Just follow the instructions here:
>> https://lists.gnu.org/archive/html/qemu-devel/2015-06/msg05585.html
>>
>> == Known issues ==
>>
>> * GDB stub:
>>   The GDB stub is not tested right now; it will probably require some changes
>>   to work.
>>
>> * deadlock on exit:
>>   When exiting QEMU with Ctrl-C, some VCPU threads are not able to exit and
>>   keep executing.
>>   http://git.greensocs.com/fkonrad/mttcg/issues/1
>>
>> * memory_region_rom_device_set_romd from pflash01 just crashes the TCG code.
>>   Strangely this happens only with "-smp 4" and 2 in the DTB.
>>   http://git.greensocs.com/fkonrad/mttcg/issues/2
>>
>> Changes V6 -> V7:
>>   * global_lock:
>>     * Don't protect softmmu read/write helper as it's now done in
>>       address_space_rw.
>>   * tcg_exec_flag:
>>     * Make the flag atomically test and set through an API.
>>   * introduce async_safe_work:
>>     * move qemu_cpu_kick_thread to avoid prototype declaration.
>>     * use the work_mutex.
>>   * async_work:
>>     * protect it with a mutex (work_mutex) against concurrent access.
>>   * tb_lock:
>>     * protect tcg_malloc_internal as well.
>>   * signal the VCPU even if current_cpu is NULL.
>>   * added PSCI patch.
>>   * rebased on v2.4.0-rc0 (6169b60285fe1ff730d840a49527e721bfb30899).
>>
>> Changes V5 -> V6:
>>   * Introduce async_safe_work to do the tb_flush and some part of
>>     tb_invalidate.
>>   * Introduce QemuSpin from Guillaume which allows a faster atomic instruction
>>     (6s to pass Alexander's atomic test instead of 30s before).
>>   * Don't take tb_lock before tb_find_fast.
>>   * Handle tb_flush with async_safe_work.
>>   * Handle tb_invalidate with async_work and async_safe_work.
>>   * Drop the tlb_flush_request mechanism and use async_work as well.
>>   * Fix the wrong length in atomic patch.
>>   * Fix the wrong return address for exception in atomic patch.
>>
>> Alex Bennée (1):
>>   target-arm/psci.c: wake up sleeping CPUs (MTTCG)
>>
>> Guillaume Delbergue (1):
>>   add support for spin lock on POSIX systems exclusively
>>
>> KONRAD Frederic (17):
>>   cpus: protect queued_work_* with work_mutex.
>>   cpus: add tcg_exec_flag.
>>   cpus: introduce async_run_safe_work_on_cpu.
>>   replace spinlock by QemuMutex.
>>   remove unused spinlock.
>>   protect TBContext with tb_lock.
>>   tcg: remove tcg_halt_cond global variable.
>>   Drop global lock during TCG code execution
>>   cpu: remove exit_request global.
>>   tcg: switch on multithread.
>>   Use atomic cmpxchg to atomically check the exclusive value in a STREX
>>   add a callback when tb_invalidate is called.
>>   cpu: introduce tlb_flush*_all.
>>   arm: use tlb_flush*_all
>>   translate-all: introduces tb_flush_safe.
>>   translate-all: (wip) use tb_flush_safe when we can't alloc more tb.
>>   mttcg: signal the associated cpu anyway.
>>
>>  cpu-exec.c                  |  98 +++++++++------
>>  cpus.c                      | 295 +++++++++++++++++++++++++------------------
>>  cputlb.c                    |  81 ++++++++++++
>>  include/exec/exec-all.h     |   8 +-
>>  include/exec/spinlock.h     |  49 --------
>>  include/qemu/thread-posix.h |   4 +
>>  include/qemu/thread-win32.h |   4 +
>>  include/qemu/thread.h       |   7 ++
>>  include/qom/cpu.h           |  57 +++++++++
>>  linux-user/main.c           |   6 +-
>>  qom/cpu.c                   |  20 +++
>>  target-arm/cpu.c            |  21 ++++
>>  target-arm/cpu.h            |   6 +
>>  target-arm/helper.c         |  58 +++------
>>  target-arm/helper.h         |   4 +
>>  target-arm/op_helper.c      | 128 ++++++++++++++++++-
>>  target-arm/psci.c           |   2 +
>>  target-arm/translate.c      | 101 +++------------
>>  target-i386/mem_helper.c    |  16 ++-
>>  target-i386/misc_helper.c   |  27 +++-
>>  tcg/i386/tcg-target.c       |   8 ++
>>  tcg/tcg.h                   |  14 ++-
>>  translate-all.c             | 217 +++++++++++++++++++++++++++-----
>>  util/qemu-thread-posix.c    |  45 +++++++
>>  util/qemu-thread-win32.c    |  30 +++++
>>  vl.c                        |   6 +
>>  26 files changed, 934 insertions(+), 378 deletions(-)
>>  delete mode 100644 include/exec/spinlock.h
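
For anyone reading the "Atomic instruction" section of the cover letter without the patches at hand, here is a minimal sketch of the general idea of checking the exclusive value with a compare-and-swap on a store-exclusive. It is not the code from the series: GuestCPU, emu_ldrex and emu_strex are hypothetical names, it uses C11 <stdatomic.h> rather than QEMU's own atomic helpers, and it only covers the 32 bit case.

```c
/* Sketch only: emulating a 32 bit LDREX/STREX pair with a host
 * compare-and-swap. GuestCPU, emu_ldrex and emu_strex are hypothetical
 * names for illustration, not identifiers from the patch series. */
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

typedef struct GuestCPU {
    _Atomic uint32_t *exclusive_addr; /* location monitored by LDREX, or NULL */
    uint32_t exclusive_val;           /* value seen by the last LDREX */
} GuestCPU;

/* LDREX: load the value and remember both the address and what we saw. */
uint32_t emu_ldrex(GuestCPU *cpu, _Atomic uint32_t *addr)
{
    cpu->exclusive_addr = addr;
    cpu->exclusive_val = atomic_load(addr);
    return cpu->exclusive_val;
}

/* STREX: store only if the location still holds the value seen by LDREX.
 * Returns 0 on success and 1 on failure, following the ARM convention. */
int emu_strex(GuestCPU *cpu, _Atomic uint32_t *addr, uint32_t newval)
{
    bool ok = false;

    if (cpu->exclusive_addr == addr) {
        uint32_t expected = cpu->exclusive_val;
        /* One atomic step: compare with the remembered value, then swap. */
        ok = atomic_compare_exchange_strong(addr, &expected, newval);
    }
    cpu->exclusive_addr = NULL; /* the monitor is cleared either way */
    return ok ? 0 : 1;
}
```

Because the check compares values rather than tracking accesses, a location that was changed and then restored to the same value between the LDREX and the STREX is not detected by this sketch. The 64 bit case on a host with only a 32 bit cmpxchg is exactly the limitation the cover letter describes and is not addressed here.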