References: <1439220437-23957-1-git-send-email-fred.konrad@greensocs.com>
 <1439273709.14448.102.camel@kernel.crashing.org>
 <55C995CB.3080300@greensocs.com>
From: Claudio Fontana
Message-ID: <56151427.4080809@huawei.com>
Date: Wed, 7 Oct 2015 14:46:31 +0200
In-Reply-To: <55C995CB.3080300@greensocs.com>
Subject: Re: [Qemu-devel] [RFC PATCH V7 00/19] Multithread TCG.
To: Frederic Konrad , Benjamin Herrenschmidt
Cc: mttcg@greensocs.com, mark.burton@greensocs.com, qemu-devel@nongnu.org,
 a.rigo@virtualopensystems.com, guillaume.delbergue@greensocs.com,
 pbonzini@redhat.com, alex.bennee@linaro.org

Hello Frederic,

On 11.08.2015 08:27, Frederic Konrad wrote:
> On 11/08/2015 08:15, Benjamin Herrenschmidt wrote:
>> On Mon, 2015-08-10 at 17:26 +0200, fred.konrad@greensocs.com wrote:
>>> From: KONRAD Frederic
>>>
>>> This is the 7th round of the MTTCG patch series.
>>>
>>> It can be cloned from:
>>> git@git.greensocs.com:fkonrad/mttcg.git branch multi_tcg_v7.

would it be possible to rebase on latest qemu? I wonder if mttcg is
diverging a bit too much from mainline, which will make it more
difficult to rebase later.. (Or did I get confused about all these
repos?)

Thank you!

Claudio

>>>
>>> This patch-set tries to address the different issues in the global
>>> picture of MTTCG, presented on the wiki.
>>>
>>> == Needed patches for our work ==
>>>
>>> Some preliminaries are needed for our work:
>>> * current_cpu doesn't make sense in MTTCG, so a tcg_executing flag is
>>>   added to the CPUState.
>> Can't you just make it a TLS?
>
> True, that can be done as well. But the tcg_exec_flags has a second
> meaning, saying "you can't start executing code right now because I
> want to do a safe_work".
>>
>>> * We need to run some work safely when all VCPUs are outside their
>>>   execution loop. This is done with the async_run_safe_work_on_cpu
>>>   function introduced in this series.
>>> * A QemuSpin lock is introduced (POSIX only for now) to allow faster
>>>   handling of atomic instructions.
>> How do you handle the memory model? I.e., ARM and PPC are out-of-order
>> while x86 is (mostly) in order, so emulating ARM/PPC on x86 is fine,
>> but emulating x86 on ARM or PPC will lead to problems unless you
>> generate memory barriers with every load/store...
>
> For the moment we are trying to do the first case.
>>
>> At least on POWER7 and later on PPC we have the possibility of setting
>> the attribute "Strong Access Ordering" with mremap/mprotect (I don't
>> remember which one), which gives us x86-like memory semantics...
>>
>> I don't know if ARM supports something similar. On the other hand, when
>> emulating ARM on PPC or vice versa, we can probably get away with no
>> barriers.
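
For illustration only, a rough sketch of the SAO idea described above
could look like the following, assuming a powerpc Linux host where the
PROT_SAO mprotect flag is available (POWER7 and later); the helper name
enable_strong_ordering is made up and is not code from this series:

/* Sketch: remap guest RAM (page-aligned) with Strong Access Ordering so
 * that x86-like ordering is enforced by the hardware instead of explicit
 * barriers emitted around every guest load/store. */
#include <sys/mman.h>
#include <stddef.h>
#include <stdio.h>

static int enable_strong_ordering(void *guest_ram, size_t size)
{
#ifdef PROT_SAO                     /* defined only in powerpc Linux headers */
    if (mprotect(guest_ram, size, PROT_READ | PROT_WRITE | PROT_SAO) != 0) {
        perror("mprotect(PROT_SAO)");
        return -1;
    }
    return 0;
#else
    (void)guest_ram;
    (void)size;
    return -1;   /* no SAO on this host: fall back to explicit barriers */
#endif
}

If the flag is not available, the TCG backend would have to fall back to
generating barriers around guest memory accesses, as discussed above.
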
>>
>> Do you expose some kind of guest memory model info to the TCG backend
>> so it can decide how to handle these things?
>>
>>> == Code generation and cache ==
>>>
>>> As QEMU stands, there is no protection at all against two threads
>>> attempting to generate code at the same time or modifying a
>>> TranslationBlock.
>>> The "protect TBContext with tb_lock" patch addresses the issue of code
>>> generation and makes all the tb_* functions thread-safe (except
>>> tb_flush).
>>> This raised the question of one or multiple caches. We chose to use
>>> one unified cache because it's easier as a first step, and since the
>>> structure of QEMU effectively has a ‘local’ cache per CPU in the form
>>> of the jump cache, we don't see the benefit of having two pools of TBs.
>>>
>>> == Dirty tracking ==
>>>
>>> Protecting the IOs:
>>> To allow all VCPU threads to run at the same time we need to drop the
>>> global_mutex as soon as possible. The IO accesses need to take the
>>> mutex. This is likely to change when
>>> http://thread.gmane.org/gmane.comp.emulators.qemu/345258 is upstreamed.
>>>
>>> Invalidation of TranslationBlocks:
>>> We can have all VCPUs running during an invalidation. Each VCPU is
>>> able to clean its jump cache itself, as it is in CPUState, so that can
>>> be handled by a simple call to async_run_on_cpu. However, tb_invalidate
>>> also writes to the TranslationBlock, which is shared as we have only
>>> one pool.
>>> Hence this part of the invalidation requires all VCPUs to exit before
>>> it can be done, and async_run_safe_work_on_cpu is introduced to handle
>>> this case.
>> What about the host MMU emulation? Is that multithreaded? It has
>> potential issues when doing things like dirty bit updates into guest
>> memory; those need to be done atomically. Also, TLB invalidations on
>> ARM and PPC are global, so they will need to invalidate the remote SW
>> TLBs as well.
>>
>> Do you have a mechanism to synchronize with another thread? I.e., make
>> it pop out of TCG if it is already in, and prevent it from getting in?
>> That way you can "remotely" invalidate its TLB...
> Yes, that's what the safe_work is doing: ask everybody to exit, prevent
> VCPUs from resuming (tcg_exec_flag), and do the work when everybody is
> outside cpu-exec.
>
>>
>>> == Atomic instructions ==
>>>
>>> For now only ARM on x64 is supported, by using a cmpxchg instruction.
>>> Specifically, the limitation of this approach is that it is harder to
>>> support 64-bit ARM on a host architecture that is multi-core but only
>>> supports 32-bit cmpxchg (we believe this could be the case for some
>>> PPC cores).
>> Right, on the other hand 64-bit will do fine. But then x86 has 2-value
>> atomics nowadays, doesn't it? And that will be hard to emulate on
>> anything. You might need to have some kind of global hashed lock list
>> used by atomics (hash the physical address) as a fallback if you don't
>> have a 1:1 match between host and guest capabilities.
> VOS did a "Slow path for atomic instruction translation" series, which
> you can find here:
> https://lists.gnu.org/archive/html/qemu-devel/2015-08/msg00971.html
>
> It will be used in the end.
>
> Thanks,
> Fred
>>
>> Cheers,
>> Ben.
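
For reference, a minimal sketch of the hashed-lock fallback Ben suggests
could look like the following. The names (NR_ATOMIC_LOCKS, atomic_lock_for,
emulated_cmpxchg64) are made up for illustration, and pthread spinlocks
stand in for QemuSpin; it only works if every emulated atomic access to a
given guest address goes through the same lock.

/* Sketch: hash the guest physical address to one of a fixed set of host
 * locks and serialize the emulated atomic under that lock. */
#include <pthread.h>
#include <stdint.h>

#define NR_ATOMIC_LOCKS 256   /* arbitrary power of two */

static pthread_spinlock_t atomic_locks[NR_ATOMIC_LOCKS];

static void atomic_locks_init(void)
{
    for (int i = 0; i < NR_ATOMIC_LOCKS; i++) {
        pthread_spin_init(&atomic_locks[i], PTHREAD_PROCESS_PRIVATE);
    }
}

static pthread_spinlock_t *atomic_lock_for(uint64_t guest_paddr)
{
    /* Drop the low bits so accesses within one cache line share a lock. */
    return &atomic_locks[(guest_paddr >> 6) & (NR_ATOMIC_LOCKS - 1)];
}

/* Example: a 64-bit compare-and-swap emulated on a host that may only have
 * 32-bit atomics, by serializing through the hashed lock. Returns the old
 * value, as the guest instruction would. */
static uint64_t emulated_cmpxchg64(uint64_t guest_paddr, uint64_t *host_ptr,
                                   uint64_t expected, uint64_t desired)
{
    pthread_spinlock_t *lock = atomic_lock_for(guest_paddr);

    pthread_spin_lock(lock);
    uint64_t old = *host_ptr;
    if (old == expected) {
        *host_ptr = desired;
    }
    pthread_spin_unlock(lock);
    return old;
}

Hashing per cache line rather than per byte is just one possible choice;
the table size only trades memory against lock contention.
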