From: "Alex Bennée" <alex.bennee@linaro.org>
To: Mark Burton <mark.burton@greensocs.com>
Cc: mttcg@listserver.greensocs.com,
"Peter Maydell" <peter.maydell@linaro.org>,
"Alexander Graf" <agraf@suse.de>,
"QEMU Developers" <qemu-devel@nongnu.org>,
"Guillaume Delbergue" <guillaume.delbergue@greensocs.com>,
pbonzini@redhat.com,
"KONRAD Frédéric" <fred.konrad@greensocs.com>
Subject: Re: [Qemu-devel] RFC Multi-threaded TCG design document
Date: Mon, 15 Jun 2015 13:36:07 +0100
Message-ID: <871thdgnd4.fsf@linaro.org>
In-Reply-To: <366DB43E-F447-4149-B7FA-CA02432741FA@greensocs.com>
Mark Burton <mark.burton@greensocs.com> writes:
> I think we SHOULD use the wiki - and keep it current. A lot of what you have is in the wiki too, but I’d like to see the wiki updated.
> We will add our stuff there too…
I'll do a pass today and update it to point to lists, discussions and
WIP trees.
>
> Cheers
> Mark.
>
>
>
>> On 15 Jun 2015, at 12:06, Alex Bennée <alex.bennee@linaro.org> wrote:
>>
>>
>> Frederic Konrad <fred.konrad@greensocs.com> writes:
>>
>>> On 12/06/2015 18:37, Alex Bennée wrote:
>>>> Hi,
>>>
>>> Hi Alex,
>>>
>>> I've completed some of the points below. We will also work on a design
>>> decisions
>>> document to add to this one.
>>>
>>> We probably want to merge that with what we did on the wiki?
>>> http://wiki.qemu.org/Features/tcg-multithread
>>
>> Well hopefully there is cross-over as I started with the wiki as a basis
>> ;-)
>>
>> Do we want to just keep the wiki as the live design document or put
>> pointers to the current drafts? I'm hoping eventually the page will just
>> point to the design in the doc directory at git.qemu.org.
>>
>>>> One thing that Peter has been asking for is a design document for the
>>>> way we are going to approach multi-threaded TCG emulation. I started
>>>> with the information that was captured on the wiki and tried to build on
>>>> that. It's almost certainly incomplete but I thought it would be worth
>>>> posting for wider discussion early rather than later.
>>>>
>>>> One obvious omission at the moment is the lack of discussion about other
>>>> non-TLB shared data structures in QEMU (I'm thinking of the various
>>>> dirty page tracking bits; I'm sure there is more).
>>>>
>>>> I've also deliberately tried to avoid documenting the design decisions
>>>> made in the current Greensoc's patch series. This is so we can
>>>> concentrate on the big picture before getting side-tracked into the
>>>> implementation details.
>>>>
>>>> I have now started digging into the Greensocs code in earnest and the
>>>> plan is that eventually the design and the implementation will converge
>>>> on a final, documented, complete solution ;-)
>>>>
>>>> Anyway as ever I look forward to the comments and discussion:
>>>>
>>>> STATUS: DRAFTING
>>>>
>>>> Introduction
>>>> ============
>>>>
>>>> This document outlines the design for multi-threaded TCG emulation.
>>>> The original TCG implementation was single-threaded and dealt with
>>>> multiple CPUs using simple round-robin scheduling. This simplified a
>>>> lot of things but became increasingly limited as systems being
>>>> emulated gained additional cores and per-core performance gains for host
>>>> systems started to level off.
>>>>
>>>> Memory Consistency
>>>> ==================
>>>>
>>>> Between emulated guests and host systems there is a range of memory
>>>> consistency models. While emulating weakly ordered systems on strongly
>>>> ordered hosts shouldn't cause any problems, the same is not true for
>>>> the reverse setup.
>>>>
>>>> The proposed design currently does not address the problem of
>>>> emulating strong ordering on a weakly ordered host, although even on
>>>> strongly ordered systems software should be using synchronisation
>>>> primitives to ensure correct operation.
>>>>
>>>> Memory Barriers
>>>> ---------------
>>>>
>>>> Barriers (sometimes known as fences) provide a mechanism for software
>>>> to enforce a particular ordering of memory operations from the point
>>>> of view of external observers (e.g. another processor core). They can
>>>> apply to all memory operations or be restricted to just loads or stores.
>>>>
>>>> The Linux kernel has an excellent write-up on the various forms of
>>>> memory barrier and the guarantees they can provide [1].
>>>>
>>>> Barriers are often wrapped around synchronisation primitives to
>>>> provide explicit memory ordering semantics. However, they can be used
>>>> by themselves to provide safe lockless access, for example by ensuring
>>>> a signal flag is only set after its payload has been written.
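>>>>
>>>> As an illustration, here is a minimal sketch of that flag/payload
>>>> pattern using C11 atomics (a host-side pseudo-example, not QEMU code):
>>>>
>>>>   #include <stdatomic.h>
>>>>
>>>>   int payload;      /* data the consumer wants */
>>>>   atomic_int flag;  /* signals that the payload is ready */
>>>>
>>>>   void produce(void)
>>>>   {
>>>>       payload = 42;
>>>>       /* release: the payload store is visible before the flag store */
>>>>       atomic_store_explicit(&flag, 1, memory_order_release);
>>>>   }
>>>>
>>>>   int consume(void)
>>>>   {
>>>>       /* acquire: once the flag is seen, the payload is visible too */
>>>>       while (!atomic_load_explicit(&flag, memory_order_acquire)) {
>>>>           ; /* spin */
>>>>       }
>>>>       return payload;
>>>>   }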
>>>>
>>>> DESIGN REQUIREMENT: Add a new tcg_memory_barrier op
>>>>
>>>> This would enforce a strong load/store ordering so all loads/stores
>>>> complete at the memory barrier. On single-core (non-SMP), strongly
>>>> ordered backends this could become a NOP.
>>>>
>>>> There may be a case for further refinement if this causes performance
>>>> bottlenecks.
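>>>>
>>>> As a sketch only (the op name is hypothetical, not an agreed API), a
>>>> frontend translating a guest barrier instruction might simply emit:
>>>>
>>>>   /* guest DMB/MFENCE etc. -> generic TCG barrier op */
>>>>   tcg_gen_memory_barrier();
>>>>
>>>> with each backend lowering the op to its own fence instruction (e.g.
>>>> mfence on x86, dmb on ARM) or to nothing where ordering is already
>>>> guaranteed.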
>>>>
>>>> Memory Control and Maintenance
>>>> ------------------------------
>>>>
>>>> This includes a class of instructions for controlling system cache
>>>> behaviour. While QEMU doesn't model cache behaviour, these instructions
>>>> are often seen when code modification has taken place to ensure the
>>>> changes take effect.
>>>>
>>>> Synchronisation Primitives
>>>> --------------------------
>>>>
>>>> There are two broad types of synchronisation primitives found in
>>>> modern ISAs: atomic instructions and exclusive regions.
>>>>
>>>> The first type offers a simple atomic instruction which will guarantee
>>>> that some sort of test and conditional store will be truly atomic w.r.t.
>>>> other cores sharing access to the memory. The classic example is the
>>>> x86 cmpxchg instruction.
>>>>
>>>> The second type offers a pair of load/store instructions which
>>>> guarantee that a region of memory has not been touched between the
>>>> load and store instructions. An example of this is ARM's ldrex/strex
>>>> pair where the strex instruction will return a flag indicating a
>>>> successful store only if no other CPU has accessed the memory region
>>>> since the ldrex.
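>>>>
>>>> A rough model of the exclusive-pair semantics as an emulator might
>>>> track them (all names below are illustrative, not a proposal):
>>>>
>>>>   #include <stdint.h>
>>>>   #include <stdbool.h>
>>>>
>>>>   typedef struct {
>>>>       uintptr_t excl_addr;  /* address tagged by load-exclusive */
>>>>       bool      excl_valid; /* is the monitor armed? */
>>>>   } ExclusiveMonitor;
>>>>
>>>>   static uint32_t load_exclusive(ExclusiveMonitor *m, uint32_t *addr)
>>>>   {
>>>>       m->excl_addr = (uintptr_t)addr;
>>>>       m->excl_valid = true;
>>>>       return *addr;
>>>>   }
>>>>
>>>>   /* Returns 0 on success and 1 on failure, mirroring strex. A real
>>>>    * implementation must make the check-and-store atomic and clear
>>>>    * the monitor when another CPU touches the region. */
>>>>   static int store_exclusive(ExclusiveMonitor *m, uint32_t *addr,
>>>>                              uint32_t val)
>>>>   {
>>>>       if (!m->excl_valid || m->excl_addr != (uintptr_t)addr) {
>>>>           return 1;
>>>>       }
>>>>       m->excl_valid = false;
>>>>       *addr = val;
>>>>       return 0;
>>>>   }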
>>>>
>>>> Traditionally TCG has generated a series of operations that work
>>>> because they are within the context of a single translation block and
>>>> so will have completed before another CPU is scheduled. However with
>>>> the ability to have multiple threads running to emulate multiple CPUs
>>>> we will need to explicitly expose these semantics.
>>>>
>>>> DESIGN REQUIREMENTS:
>>>> - atomics
>>>> - Introduce some atomic TCG ops for the common semantics
>>>> - The default fallback helper function will use qemu_atomics (see the sketch below)
>>>> - Each backend can then add a more efficient implementation
>>>> - load/store exclusive
>>>> [AJB:
>>>> There are currently a number of proposals of interest:
>>>> - Greensocs tweaks to ldst ex (using locks)
>>>> - Slow-path for atomic instruction translation [2]
>>>> - Helper-based Atomic Instruction Emulation (AIE) [3]
>>>> ]
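>>>>
>>>> As promised above, a minimal sketch of a fallback helper built on
>>>> host atomics (the helper name and signature are made up here):
>>>>
>>>>   #include <stdint.h>
>>>>
>>>>   /* Emulate a guest 32-bit compare-and-swap via a host atomic;
>>>>    * returns the previous value, as x86 cmpxchg does. */
>>>>   static uint32_t helper_atomic_cmpxchg32(uint32_t *haddr,
>>>>                                           uint32_t cmp, uint32_t new)
>>>>   {
>>>>       __atomic_compare_exchange_n(haddr, &cmp, new, false,
>>>>                                   __ATOMIC_SEQ_CST, __ATOMIC_SEQ_CST);
>>>>       return cmp; /* left holding the old value on failure */
>>>>   }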
>>>>
>>>>
>>>> Shared Data Structures
>>>> ======================
>>>>
>>>> Global TCG State
>>>> ----------------
>>>>
>>>> We need to protect the entire code generation cycle including any post
>>>> generation patching of the translated code. This also implies a shared
>>>> translation buffer which contains code running on all cores. Any
>>>> execution path that comes to the main run loop will need to hold a
>>>> mutex for code generation. This also includes times when we need to
>>>> flush code or jumps from the tb_cache.
>>>>
>>>> DESIGN REQUIREMENT: Add locking around all code generation, patching
>>>> and jump cache modification
>>> Actually from my point of view jump cache modification requires more than a
>>> lock, as other VCPU threads can be executing code during the modification.
>>>
>>> Fortunately this happens "only" with tlb_flush, tlb_page_flush, tb_flush and
>>> tb_invalidate, which need all CPUs to be halted anyway.
>>
>> How about:
>>
>> DESIGN REQUIREMENT:
>> - Code generation and patching will be protected by a lock
>> - Jump cache modification will assert all CPUs are halted
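>>
>> As a rough sketch (the lock name and the exact call are illustrative,
>> not a final API), the generation path would then look something like:
>>
>>   qemu_mutex_lock(&tb_lock);    /* serialise all code generation */
>>   tb = tb_gen_code(cpu, pc, cs_base, flags, cflags);
>>   qemu_mutex_unlock(&tb_lock);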
>>
>>>>
>>>> Memory maps and TLBs
>>>> --------------------
>>>>
>>>> The memory handling code is fairly critical to the speed of memory
>>>> access in the emulated system.
>>>>
>>>> - Memory regions (dividing up access to PIO, MMIO and RAM)
>>>> - Dirty page tracking (for code gen, migration and display)
>>>> - Virtual TLB (for translating guest address->real address)
>>>>
>>>> There is both a fast path walked by the generated code and a slow
>>>> path when resolution is required. When the TLB tables are updated we
>>>> need to ensure they are done in a safe way by bringing all executing
>>>> threads to a halt before making the modifications.
>>>>
>>>> DESIGN REQUIREMENTS:
>>>>
>>>> - TLB Flush All/Page
>>>> - can be across-CPUs
>>>> - will need all other CPUs brought to a halt
>>>> - TLB Update (update a CPUTLBEntry, via tlb_set_page_with_attrs)
>>>> - This is a per-CPU table - by definition can't race
>>>> - updated by its own thread when the slow-path is forced
>>> Actually, as we have approximately the same behaviour for all of these
>>> memory handling operations (eg: tb_flush, tb_*_invalidate, tlb_*_flush),
>>> which all play with the TranslationBlock and the jump cache across CPUs,
>>> I think we have to add a generic "exit and do something" mechanism for
>>> the CPU threads.
>>> So every VCPU thread has a list of things to do when it exits (such as
>>> clearing its own tb_jmp_cache during a tlb_flush, or waiting for the
>>> other CPUs and flushing only one entry for tb_invalidate).
>>
>> Sounds like I should write an additional section to describe the process
>> of halting CPUs and carrying out deferred per-CPU actions as well as
>> ensuring we can tell when they are all halted.
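>>
>> For the record, a rough sketch of what such a deferred-work mechanism
>> might look like (the queue fields and helper name are hypothetical):
>>
>>   typedef struct CPUWorkItem {
>>       void (*func)(CPUState *cpu, void *data);
>>       void *data;
>>       struct CPUWorkItem *next;
>>   } CPUWorkItem;
>>
>>   /* Queue work for a vCPU and kick it out of the execution loop;
>>    * the queued items run when the vCPU exits to the main loop. */
>>   static void cpu_queue_work(CPUState *cpu,
>>                              void (*func)(CPUState *, void *),
>>                              void *data)
>>   {
>>       CPUWorkItem *item = g_new0(CPUWorkItem, 1);
>>
>>       item->func = func;
>>       item->data = data;
>>       qemu_mutex_lock(&cpu->work_mutex);  /* hypothetical per-CPU lock */
>>       item->next = cpu->work_list;        /* hypothetical list head */
>>       cpu->work_list = item;
>>       qemu_mutex_unlock(&cpu->work_mutex);
>>       cpu_exit(cpu);                      /* request exit from TCG loop */
>>   }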
>>
>>>> Emulated hardware state
>>>> -----------------------
>>>>
>>>> Currently the hardware emulation has no protection against
>>>> multiple accesses. However guest systems accessing emulated hardware
>>>> should be carrying out their own locking to prevent multiple CPUs
>>>> confusing the hardware. Of course there is no guarantee that there
>>>> couldn't be a broken guest that doesn't lock, so you could get racing
>>>> accesses to the hardware.
>>>>
>>>> There is the class of paravirtualized hardware (VIRTIO) that works in
>>>> a purely MMIO mode, often setting flags directly in guest memory as a
>>>> result of a guest-triggered transaction.
>>>>
>>>> DESIGN REQUIREMENTS:
>>>>
>>>> - Access to IO Memory should be serialised by an IOMem mutex
>>>> - The mutex should be recursive (e.g. allowing the same thread to relock it)
>>> That might be done with the global mutex as it is today?
>>> We need changes here anyway to have VCPU threads running in parallel.
>>
>> I'm not sure re-using the global mutex is a good idea. I've had to hack
>> the global mutex to allow recursive locking to get around the virtio
>> hang I discovered last week. While it works, I'm uneasy making such a
>> radical change upstream given how widely the global mutex is used,
>> hence the suggestion to have an explicit IOMem mutex.
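>>
>> For reference, a recursive mutex is straightforward to set up with
>> pthreads (QEMU would presumably wrap this in its own thread
>> abstraction):
>>
>>   #include <pthread.h>
>>
>>   static pthread_mutex_t iomem_mutex;
>>
>>   static void iomem_mutex_init(void)
>>   {
>>       pthread_mutexattr_t attr;
>>
>>       pthread_mutexattr_init(&attr);
>>       pthread_mutexattr_settype(&attr, PTHREAD_MUTEX_RECURSIVE);
>>       pthread_mutex_init(&iomem_mutex, &attr);
>>       pthread_mutexattr_destroy(&attr);
>>   }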
>>
>> Actually I'm surprised the iothread mutex just re-uses the global one.
>> I guess I need to talk to the IO guys as to why they took that
>> decision.
>>
>>>
>>> Thanks,
>>
>> Thanks for your quick review :-)
>>
>>> Fred
>>>
>>>> IO Subsystem
>>>> ------------
>>>>
>>>> The I/O subsystem is heavily used by KVM and has seen a lot of
>>>> improvements to offload I/O tasks to dedicated IOThreads. There should
>>>> be no additional locking required once we reach the Block Driver.
>>>>
>>>> DESIGN REQUIREMENTS:
>>>>
>>>> - The dataplane should continue to be protected by the iothread locks
>>>>
>>>>
>>>> References
>>>> ==========
>>>>
>>>> [1] https://git.kernel.org/cgit/linux/kernel/git/torvalds/linux.git/plain/Documentation/memory-barriers.txt
>>>> [2] http://thread.gmane.org/gmane.comp.emulators.qemu/334561
>>>> [3] http://thread.gmane.org/gmane.comp.emulators.qemu/335297
>>>>
>>>>
>>>>
>>
>> --
>> Alex Bennée
>
>
> +44 (0)20 7100 3485 x 210
> +33 (0)5 33 52 01 77x 210
>
> +33 (0)603762104
> mark.burton
--
Alex Bennée