From: Benjamin Herrenschmidt
Date: Tue, 11 Aug 2015 19:22:45 +1000
Subject: Re: [Qemu-devel] [RFC PATCH V7 00/19] Multithread TCG.
To: Alex Bennée
Cc: mttcg@greensocs.com, mark.burton@greensocs.com,
 a.rigo@virtualopensystems.com, qemu-devel@nongnu.org,
 guillaume.delbergue@greensocs.com, pbonzini@redhat.com,
 fred.konrad@greensocs.com

On Tue, 2015-08-11 at 08:54 +0100, Alex Bennée wrote:
>
> > How do you handle the memory model ? IE, ARM and PPC are out of
> > order while x86 is (mostly) in order, so emulating ARM/PPC on x86
> > is fine, but emulating x86 on ARM or PPC will lead to problems
> > unless you generate memory barriers with every load/store...
>
> This is the next chunk of work. We have Alvise's LL/SC patches, which
> allow us to do proper emulation of ARM's load/store exclusive
> behaviour, and any weakly ordered target will have to use such
> constructs.

God no! You don't want to use ll/sc for dealing with weak ordering, you
want to use barriers... ll/sc will let you deal with front-end things
such as atomic inc/dec etc.

> Currently the plan is to introduce a barrier TCG op which will
> translate to the strongest backend barrier available.

I would advocate at least two barriers, a full barrier and a write
barrier, so that at least when emulating ARM or PPC on x86 we don't
actually send fences on every load/store.

IE. the x86 memory model is *not* fully ordered, so an ARM or PPC full
barrier must translate into an x86 fence afaik (or whatever the x86
name for its full barrier is), but you don't want to translate all the
weaker ARM/PPC barriers into those.
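For illustration, here is a minimal sketch of what such a two-level
barrier could emit on each host. TCG had no barrier op at this point,
so GuestBarrier, host_barrier and the helper as a whole are names
invented for this example; a real implementation would emit the
instruction from each TCG backend rather than call a C function:

/* Hypothetical two-level barrier helper, illustrative only. */
typedef enum {
    GUEST_BARRIER_FULL,  /* e.g. ppc "sync", arm "dmb": orders everything */
    GUEST_BARRIER_WRITE, /* e.g. ppc "lwsync": store ordering is enough   */
} GuestBarrier;

static inline void host_barrier(GuestBarrier kind)
{
#if defined(__x86_64__)
    if (kind == GUEST_BARRIER_FULL) {
        /* x86 only reorders a store with a later load, so only the
         * full barrier needs a real fence. */
        __asm__ __volatile__("mfence" ::: "memory");
    } else {
        /* Stores already stay ordered under x86 TSO: a compiler
         * barrier suffices and no instruction is emitted. */
        __asm__ __volatile__("" ::: "memory");
    }
#elif defined(__powerpc64__)
    if (kind == GUEST_BARRIER_FULL) {
        __asm__ __volatile__("sync" ::: "memory");
    } else {
        __asm__ __volatile__("lwsync" ::: "memory");
    }
#else
    __sync_synchronize(); /* conservative fallback: full barrier */
#endif
}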
> Even x86 should be using barriers to ensure cross-core visibility,
> which then leaves LS re-ordering on the same core.

Only for store + load, which is afaik the only case where x86
re-orders. But in any case, expose the ordering expectations of the
source to the target (the TCG target) so that we can use whatever
facilities might be at hand to avoid some of those barriers, for
example the SAO mapping attribute I mentioned.

I'll try to look at your patch more closely when I get a chance and see
if I can produce a ppc target, but don't hold your breath, I'm a bit
swamped at the moment.

Ben.

> > At least on POWER7 and later on PPC we have the possibility of
> > setting the attribute "Strong Access Ordering" with mremap/mprotect
> > (I don't remember which one), which gives us x86-like memory
> > semantics...
> >
> > I don't know if ARM supports something similar. On the other hand,
> > when emulating ARM on PPC or vice-versa, we can probably get away
> > with no barriers.
> >
> > Do you expose some kind of guest memory model info to the TCG
> > backend so it can decide how to handle these things ?
> >
> >> == Code generation and cache ==
> >>
> >> As QEMU stands, there is no protection at all against two threads
> >> attempting to generate code at the same time or to modify a
> >> TranslationBlock.
> >> The "protect TBContext with tb_lock" patch addresses the issue of
> >> code generation and makes all the tb_* functions thread safe
> >> (except tb_flush).
> >> This raised the question of one or multiple caches. We chose to use
> >> one unified cache because it's easier as a first step, and since
> >> the structure of QEMU effectively has a 'local' cache per CPU in
> >> the form of the jump cache, we don't see the benefit of having two
> >> pools of TBs.
> >>
> >> == Dirty tracking ==
> >>
> >> Protecting the I/Os:
> >> To allow all VCPU threads to run at the same time we need to drop
> >> the global_mutex as soon as possible. The I/O accesses need to take
> >> the mutex. This is likely to change when
> >> http://thread.gmane.org/gmane.comp.emulators.qemu/345258 is
> >> upstreamed.
> >>
> >> Invalidation of TranslationBlocks:
> >> We can have all VCPUs running during an invalidation. Each VCPU is
> >> able to clean its own jump cache, as it is in CPUState, so that can
> >> be handled by a simple call to async_run_on_cpu. However,
> >> tb_invalidate also writes to the TranslationBlock, which is shared
> >> as we have only one pool.
> >> Hence this part of the invalidation requires all VCPUs to exit
> >> before it can be done, and async_run_safe_work_on_cpu is introduced
> >> to handle this case.
> >
> > What about the host MMU emulation ? Is that multithreaded ? It has
> > potential issues when doing things like dirty bit updates into guest
> > memory; those need to be done atomically. Also TLB invalidations on
> > ARM and PPC are global, so they will need to invalidate the remote
> > SW TLBs as well.
> >
> > Do you have a mechanism to synchronize with another thread ? IE,
> > make it pop out of TCG if already in, and prevent it from getting
> > in ? That way you can "remotely" invalidate its TLB...
> >
> >> == Atomic instructions ==
> >>
> >> For now only ARM on x64 is supported, by using a cmpxchg
> >> instruction. Specifically, the limitation of this approach is that
> >> it is harder to support 64-bit ARM on a host architecture that is
> >> multi-core but only supports 32-bit cmpxchg (we believe this could
> >> be the case for some PPC cores).
> >
> > Right, on the other hand 64-bit will do fine. But then x86 has
> > 2-value atomics nowadays, doesn't it ? And that will be hard to
> > emulate on anything. You might need to have some kind of global
> > hashed lock list used by atomics (hash the physical address) as a
> > fallback if you don't have a 1:1 match between host and guest
> > capabilities.
> >
> > Cheers,
> > Ben.
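To make that hashed-lock fallback concrete, here is a minimal sketch
under the same caveat: none of these names exist in QEMU, and a real
version would sit behind the softmmu helpers. Atomics the host cannot
express natively take a mutex chosen by hashing the guest physical
address:

/* Illustrative hashed-lock fallback for guest atomics the host
 * cannot do natively; all names are invented for this example. */
#include <pthread.h>
#include <stdbool.h>
#include <stdint.h>

#define N_ATOMIC_LOCKS 256 /* power of two, so the mask below works */

/* GCC range-designated initializer. */
static pthread_mutex_t atomic_locks[N_ATOMIC_LOCKS] = {
    [0 ... N_ATOMIC_LOCKS - 1] = PTHREAD_MUTEX_INITIALIZER
};

static pthread_mutex_t *atomic_lock_for(uint64_t guest_paddr)
{
    /* Hash at cache-line (64-byte) granularity so competing accesses
     * to the same line always contend on the same lock. */
    return &atomic_locks[(guest_paddr >> 6) & (N_ATOMIC_LOCKS - 1)];
}

/* Example use: a 64-bit compare-and-swap on a host that only has a
 * 32-bit cmpxchg primitive. */
static bool emulated_cmpxchg64(uint64_t guest_paddr, uint64_t *hostp,
                               uint64_t expected, uint64_t desired)
{
    pthread_mutex_t *lock = atomic_lock_for(guest_paddr);
    bool ok;

    pthread_mutex_lock(lock);
    ok = (*hostp == expected);
    if (ok) {
        *hostp = desired;
    }
    pthread_mutex_unlock(lock);
    return ok;
}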