From: "Alex Bennée" <alex.bennee@linaro.org>
To: Mark Burton <mark.burton@greensocs.com>
Cc: mttcg@listserver.greensocs.com,
"Peter Maydell" <peter.maydell@linaro.org>,
"Alexander Graf" <agraf@suse.de>,
"QEMU Developers" <qemu-devel@nongnu.org>,
"Guillaume Delbergue" <guillaume.delbergue@greensocs.com>,
pbonzini@redhat.com,
"KONRAD Frédéric" <fred.konrad@greensocs.com>
Subject: Re: [Qemu-devel] RFC Multi-threaded TCG design document
Date: Mon, 15 Jun 2015 13:36:07 +0100
Message-ID: <871thdgnd4.fsf@linaro.org>
In-Reply-To: <366DB43E-F447-4149-B7FA-CA02432741FA@greensocs.com>
Mark Burton <mark.burton@greensocs.com> writes:
> I think we SHOULD use the wiki - and keep it current. A lot of what you have is in the wiki too, but I’d like to see the wiki updated.
> We will add our stuff there too…
I'll do a pass today and update it to point to lists, discussions and
WIP trees.
>
> Cheers
> Mark.
>
>
>
>> On 15 Jun 2015, at 12:06, Alex Bennée <alex.bennee@linaro.org> wrote:
>>
>>
>> Frederic Konrad <fred.konrad@greensocs.com> writes:
>>
>>> On 12/06/2015 18:37, Alex Bennée wrote:
>>>> Hi,
>>>
>>> Hi Alex,
>>>
>>> I've completed some of the points below. We will also work on a design
>>> decisions
>>> document to add to this one.
>>>
>>> We probably want to merge that with what we did on the wiki?
>>> http://wiki.qemu.org/Features/tcg-multithread
>>
>> Well hopefully there is cross-over as I started with the wiki as a basis
>> ;-)
>>
>> Do we want to just keep the wiki as the live design document or put
>> pointers to the current drafts? I'm hoping eventually the page will just
>> point to the design in the doc directory at git.qemu.org.
>>
>>>> One thing that Peter has been asking for is a design document for the
>>>> way we are going to approach multi-threaded TCG emulation. I started
>>>> with the information that was captured on the wiki and tried to build on
>>>> that. It's almost certainly incomplete but I thought it would be worth
>>>> posting for wider discussion early rather than later.
>>>>
>>>> One obvious omission at the moment is the lack of discussion about other
>>>> non-TLB shared data structures in QEMU (I'm thinking of the various
>>>> dirty page tracking bits; I'm sure there is more).
>>>>
>>>> I've also deliberately tried to avoid documenting the design decisions
>>>> made in the current Greensoc's patch series. This is so we can
>>>> concentrate on the big picture before getting side-tracked into the
>>>> implementation details.
>>>>
>>>> I have now started digging into the Greensocs code in earnest and the
>>>> plan is that eventually the design and the implementation will converge
>>>> on a final, documented, complete solution ;-)
>>>>
>>>> Anyway as ever I look forward to the comments and discussion:
>>>>
>>>> STATUS: DRAFTING
>>>>
>>>> Introduction
>>>> ============
>>>>
>>>> This document outlines the design for multi-threaded TCG emulation.
>>>> The original TCG implementation was single-threaded and dealt with
>>>> multiple CPUs using simple round-robin scheduling. This simplified a
>>>> lot of things but became increasingly limited as systems being
>>>> emulated gained additional cores and per-core performance gains for host
>>>> systems started to level off.
>>>>
>>>> Memory Consistency
>>>> ==================
>>>>
>>>> Between emulated guests and host systems there is a range of memory
>>>> consistency models. While emulating weakly ordered systems on strongly
>>>> ordered hosts shouldn't cause any problems, the same is not true for
>>>> the reverse setup.
>>>>
>>>> The proposed design currently does not address the problem of
>>>> emulating strong ordering on a weakly ordered host, although even on
>>>> strongly ordered systems software should be using synchronisation
>>>> primitives to ensure correct operation.
>>>>
>>>> Memory Barriers
>>>> ---------------
>>>>
>>>> Barriers (sometimes known as fences) provide a mechanism for software
>>>> to enforce a particular ordering of memory operations from the point
>>>> of view of external observers (e.g. another processor core). They can
>>>> apply to all memory operations or be restricted to just loads or stores.
>>>>
>>>> The Linux kernel has an excellent write-up on the various forms of
>>>> memory barrier and the guarantees they can provide [1].
>>>>
>>>> Barriers are often wrapped around synchronisation primitives to
>>>> provide explicit memory ordering semantics. However, they can be used
>>>> by themselves to provide safe lockless access, for example by ensuring
>>>> a signal flag is only set after its payload has been written.
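>>>>
>>>> As an illustration, here is a minimal sketch of that flag/payload
>>>> pattern using C11 atomics (a host-side pseudo-example, not QEMU code):
>>>>
>>>>   #include <stdatomic.h>
>>>>
>>>>   int payload;      /* data the consumer wants */
>>>>   atomic_int flag;  /* signals that the payload is ready */
>>>>
>>>>   void produce(void)
>>>>   {
>>>>       payload = 42;
>>>>       /* release: the payload store is visible before the flag store */
>>>>       atomic_store_explicit(&flag, 1, memory_order_release);
>>>>   }
>>>>
>>>>   int consume(void)
>>>>   {
>>>>       /* acquire: once the flag is seen, the payload is visible too */
>>>>       while (!atomic_load_explicit(&flag, memory_order_acquire)) {
>>>>           ; /* spin */
>>>>       }
>>>>       return payload;
>>>>   }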
>>>>
>>>> DESIGN REQUIREMENT: Add a new tcg_memory_barrier op
>>>>
>>>> This would enforce a strong load/store ordering so all loads/stores
>>>> complete at the memory barrier. On single-core (non-SMP), strongly
>>>> ordered backends this could become a NOP.
>>>>
>>>> There may be a case for further refinement if this causes performance
>>>> bottlenecks.
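>>>>
>>>> As a sketch only (the op name is hypothetical, not an agreed API), a
>>>> frontend translating a guest barrier instruction might simply emit:
>>>>
>>>>   /* guest DMB/MFENCE etc. -> generic TCG barrier op */
>>>>   tcg_gen_memory_barrier();
>>>>
>>>> with each backend lowering the op to its own fence instruction (e.g.
>>>> mfence on x86, dmb on ARM) or to nothing where ordering is already
>>>> guaranteed.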
>>>>
>>>> Memory Control and Maintenance
>>>> ------------------------------
>>>>
>>>> This includes a class of instructions for controlling system cache
>>>> behaviour. While QEMU doesn't model cache behaviour, these instructions
>>>> are often seen when code modification has taken place to ensure the
>>>> changes take effect.
>>>>
>>>> Synchronisation Primitives
>>>> --------------------------
>>>>
>>>> There are two broad types of synchronisation primitives found in
>>>> modern ISAs: atomic instructions and exclusive regions.
>>>>
>>>> The first type offers a simple atomic instruction which will guarantee
>>>> that some sort of test and conditional store will be truly atomic w.r.t.
>>>> other cores sharing access to the memory. The classic example is the
>>>> x86 cmpxchg instruction.
>>>>
>>>> The second type offers a pair of load/store instructions which
>>>> guarantee that a region of memory has not been touched between the
>>>> load and store instructions. An example of this is ARM's ldrex/strex
>>>> pair where the strex instruction will return a flag indicating a
>>>> successful store only if no other CPU has accessed the memory region
>>>> since the ldrex.
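>>>>
>>>> A rough model of the exclusive-pair semantics as an emulator might
>>>> track them (all names below are illustrative, not a proposal):
>>>>
>>>>   #include <stdint.h>
>>>>   #include <stdbool.h>
>>>>
>>>>   typedef struct {
>>>>       uintptr_t excl_addr;  /* address tagged by load-exclusive */
>>>>       bool      excl_valid; /* is the monitor armed? */
>>>>   } ExclusiveMonitor;
>>>>
>>>>   static uint32_t load_exclusive(ExclusiveMonitor *m, uint32_t *addr)
>>>>   {
>>>>       m->excl_addr = (uintptr_t)addr;
>>>>       m->excl_valid = true;
>>>>       return *addr;
>>>>   }
>>>>
>>>>   /* Returns 0 on success and 1 on failure, mirroring strex. A real
>>>>    * implementation must make the check-and-store atomic and clear
>>>>    * the monitor when another CPU touches the region. */
>>>>   static int store_exclusive(ExclusiveMonitor *m, uint32_t *addr,
>>>>                              uint32_t val)
>>>>   {
>>>>       if (!m->excl_valid || m->excl_addr != (uintptr_t)addr) {
>>>>           return 1;
>>>>       }
>>>>       m->excl_valid = false;
>>>>       *addr = val;
>>>>       return 0;
>>>>   }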
>>>>
>>>> Traditionally TCG has generated a series of operations that work
>>>> because they are within the context of a single translation block and
>>>> so will have completed before another CPU is scheduled. However with
>>>> the ability to have multiple threads running to emulate multiple CPUs
>>>> we will need to explicitly expose these semantics.
>>>>
>>>> DESIGN REQUIREMENTS:
>>>> - atomics
>>>> - Introduce some atomic TCG ops for the common semantics
>>>> - The default fallback helper function will use qemu_atomics (see the sketch below)
>>>> - Each backend can then add a more efficient implementation
>>>> - load/store exclusive
>>>> [AJB:
>>>> There are currently a number of proposals of interest:
>>>> - Greensocs tweaks to ldst ex (using locks)
>>>> - Slow-path for atomic instruction translation [2]
>>>> - Helper-based Atomic Instruction Emulation (AIE) [3]
>>>> ]
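>>>>
>>>> As promised above, a minimal sketch of a fallback helper built on
>>>> host atomics (the helper name and signature are made up here):
>>>>
>>>>   #include <stdint.h>
>>>>
>>>>   /* Emulate a guest 32-bit compare-and-swap via a host atomic;
>>>>    * returns the previous value, as x86 cmpxchg does. */
>>>>   static uint32_t helper_atomic_cmpxchg32(uint32_t *haddr,
>>>>                                           uint32_t cmp, uint32_t new)
>>>>   {
>>>>       __atomic_compare_exchange_n(haddr, &cmp, new, false,
>>>>                                   __ATOMIC_SEQ_CST, __ATOMIC_SEQ_CST);
>>>>       return cmp; /* left holding the old value on failure */
>>>>   }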
>>>>
>>>>
>>>> Shared Data Structures
>>>> ======================
>>>>
>>>> Global TCG State
>>>> ----------------
>>>>
>>>> We need to protect the entire code generation cycle including any post
>>>> generation patching of the translated code. This also implies a shared
>>>> translation buffer which contains code running on all cores. Any
>>>> execution path that comes to the main run loop will need to hold a
>>>> mutex for code generation. This also includes times when we need to
>>>> flush code or jumps from the tb_cache.
>>>>
>>>> DESIGN REQUIREMENT: Add locking around all code generation, patching
>>>> and jump cache modification
>>> Actually from my point of view jump cache modification requires more than a
>>> lock, as other VCPU threads can be executing code during the modification.
>>>
>>> Fortunately this happens "only" with tlb_flush, tlb_page_flush, tb_flush and
>>> tb_invalidate, which need all CPUs to be halted anyway.
>>
>> How about:
>>
>> DESIGN REQUIREMENT:
>> - Code generation and patching will be protected by a lock
>> - Jump cache modification will assert all CPUs are halted
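>>
>> As a rough sketch (the lock name and the exact call are illustrative,
>> not a final API), the generation path would then look something like:
>>
>>   qemu_mutex_lock(&tb_lock);    /* serialise all code generation */
>>   tb = tb_gen_code(cpu, pc, cs_base, flags, cflags);
>>   qemu_mutex_unlock(&tb_lock);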
>>
>>>>
>>>> Memory maps and TLBs
>>>> --------------------
>>>>
>>>> The memory handling code is fairly critical to the speed of memory
>>>> access in the emulated system.
>>>>
>>>> - Memory regions (dividing up access to PIO, MMIO and RAM)
>>>> - Dirty page tracking (for code gen, migration and display)
>>>> - Virtual TLB (for translating guest address->real address)
>>>>
>>>> There is both a fast path walked by the generated code and a slow
>>>> path when resolution is required. When the TLB tables are updated we
>>>> need to ensure they are done in a safe way by bringing all executing
>>>> threads to a halt before making the modifications.
>>>>
>>>> DESIGN REQUIREMENTS:
>>>>
>>>> - TLB Flush All/Page
>>>> - can be across-CPUs
>>>> - will need all other CPUs brought to a halt
>>>> - TLB Update (update a CPUTLBEntry, via tlb_set_page_with_attrs)
>>>> - This is a per-CPU table - by definition can't race
>>>> - updated by its own thread when the slow-path is forced
>>> Actually, as we have approximately the same behaviour for all of these
>>> memory handling operations (eg: tb_flush, tb_*_invalidate, tlb_*_flush),
>>> which all play with the TranslationBlock and the jump cache across CPUs,
>>> I think we have to add a generic "exit and do something" mechanism for
>>> the CPU threads.
>>> So every VCPU thread has a list of things to do when it exits (such as
>>> clearing its own tb_jmp_cache during a tlb_flush, or waiting for the
>>> other CPUs and flushing only one entry for tb_invalidate).
>>
>> Sounds like I should write an additional section to describe the process
>> of halting CPUs and carrying out deferred per-CPU actions as well as
>> ensuring we can tell when they are all halted.
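>>
>> For the record, a rough sketch of what such a deferred-work mechanism
>> might look like (the queue fields and helper name are hypothetical):
>>
>>   typedef struct CPUWorkItem {
>>       void (*func)(CPUState *cpu, void *data);
>>       void *data;
>>       struct CPUWorkItem *next;
>>   } CPUWorkItem;
>>
>>   /* Queue work for a vCPU and kick it out of the execution loop;
>>    * the queued items run when the vCPU exits to the main loop. */
>>   static void cpu_queue_work(CPUState *cpu,
>>                              void (*func)(CPUState *, void *),
>>                              void *data)
>>   {
>>       CPUWorkItem *item = g_new0(CPUWorkItem, 1);
>>
>>       item->func = func;
>>       item->data = data;
>>       qemu_mutex_lock(&cpu->work_mutex);  /* hypothetical per-CPU lock */
>>       item->next = cpu->work_list;        /* hypothetical list head */
>>       cpu->work_list = item;
>>       qemu_mutex_unlock(&cpu->work_mutex);
>>       cpu_exit(cpu);                      /* request exit from TCG loop */
>>   }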
>>
>>>> Emulated hardware state
>>>> -----------------------
>>>>
>>>> Currently the hardware emulation has no protection against
>>>> multiple accesses. However guest systems accessing emulated hardware
>>>> should be carrying out their own locking to prevent multiple CPUs
>>>> confusing the hardware. Of course there is no guarantee that there
>>>> couldn't be a broken guest that doesn't lock, so you could get racing
>>>> accesses to the hardware.
>>>>
>>>> There is the class of paravirtualized hardware (VIRTIO) that works in
>>>> a purely MMIO mode, often setting flags directly in guest memory as a
>>>> result of a guest-triggered transaction.
>>>>
>>>> DESIGN REQUIREMENTS:
>>>>
>>>> - Access to IO Memory should be serialised by an IOMem mutex
>>>> - The mutex should be recursive (e.g. allowing the same thread to relock it)
>>> That might be done with the global mutex as it is today?
>>> We need changes here anyway to have VCPU threads running in parallel.
>>
>> I'm not sure re-using the global mutex is a good idea. I've had to hack
>> the global mutex to allow recursive locking to get around the virtio
>> hang I discovered last week. While it works, I'm uneasy making such a
>> radical change upstream given how widely the global mutex is used,
>> hence the suggestion to have an explicit IOMem mutex.
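>>
>> For reference, a recursive mutex is straightforward to set up with
>> pthreads (QEMU would presumably wrap this in its own thread
>> abstraction):
>>
>>   #include <pthread.h>
>>
>>   static pthread_mutex_t iomem_mutex;
>>
>>   static void iomem_mutex_init(void)
>>   {
>>       pthread_mutexattr_t attr;
>>
>>       pthread_mutexattr_init(&attr);
>>       pthread_mutexattr_settype(&attr, PTHREAD_MUTEX_RECURSIVE);
>>       pthread_mutex_init(&iomem_mutex, &attr);
>>       pthread_mutexattr_destroy(&attr);
>>   }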
>>
>> Actually I'm surprised the iothread mutex just re-uses the global one.
>> I guess I need to talk to the IO guys as to why they took that
>> decision.
>>
>>>
>>> Thanks,
>>
>> Thanks for your quick review :-)
>>
>>> Fred
>>>
>>>> IO Subsystem
>>>> ------------
>>>>
>>>> The I/O subsystem is heavily used by KVM and has seen a lot of
>>>> improvements to offload I/O tasks to dedicated IOThreads. There should
>>>> be no additional locking required once we reach the Block Driver.
>>>>
>>>> DESIGN REQUIREMENTS:
>>>>
>>>> - The dataplane should continue to be protected by the iothread locks
>>>>
>>>>
>>>> References
>>>> ==========
>>>>
>>>> [1] https://git.kernel.org/cgit/linux/kernel/git/torvalds/linux.git/plain/Documentation/memory-barriers.txt
>>>> [2] http://thread.gmane.org/gmane.comp.emulators.qemu/334561
>>>> [3] http://thread.gmane.org/gmane.comp.emulators.qemu/335297
>>>>
>>>>
>>>>
>>
>> --
>> Alex Bennée
>
>
> +44 (0)20 7100 3485 x 210
> +33 (0)5 33 52 01 77x 210
>
> +33 (0)603762104
> mark.burton
--
Alex Bennée