* [Qemu-devel] [RFC 00/38] MTTCG: i386, user+system mode
@ 2015-08-24  0:23 Emilio G. Cota
  2015-08-24  0:23 ` [Qemu-devel] [RFC 01/38] cpu-exec: add missing mmap_lock in tb_find_slow Emilio G. Cota
                   ` (39 more replies)
  0 siblings, 40 replies; 110+ messages in thread
From: Emilio G. Cota @ 2015-08-24  0:23 UTC (permalink / raw)
  To: qemu-devel, mttcg
  Cc: mark.burton, a.rigo, guillaume.delbergue, pbonzini, alex.bennee,
	Frederic Konrad

Hi all,

Here is MTTCG code I've been working on out-of-tree for the last few months.

The patchset applies on top of pbonzini's mttcg branch, commit ca56de6f.
Fetch the branch from: https://github.com/bonzini/qemu/commits/mttcg

The highlights of the patchset are as follows:

- The first 5 patches are direct fixes to bugs only in the mttcg
  branch.

- Patches 6-12 fix issues in the master branch.

- The remaining patches are really the meat of this patchset.
  The main features are:

  * Support of MTTCG for both user and system mode.

  * Design: per-CPU TB jump list protected by a seqlock;
    if the TB is not found there, check the global, RCU-protected 'hash table'
    (i.e. fixed number of buckets); if it's not there either, grab the lock,
    check again, and if it's still not there, generate the code and add the
    TB to the hash table.

    It makes sense that Paolo's recent work on the mttcg branch ended up
    being almost identical to this--it's simple and it scales well.

  * tb_lock must be held every time code is generated. The rationale is
    that most of the time QEMU is executing code, not generating it.

  * tb_flush: do it once all other CPUs have been put to sleep, by calling
    synchronize_rcu().
    We also instrument tb_lock to make sure that only one tb_flush request can
    happen at a given time.  For this a mechanism to schedule work is added to
    supersede cpu_sched_safe_work, which cannot work in usermode.  Here I've
    toyed with an alternative version that doesn't force the flushing CPU to
    exit, but to make this work we have to save/restore the RCU read
    lock while tb_lock is held in order to avoid deadlocks. This isn't too
    pretty, but it's good to know that the option is there.

  * I focused on x86 since it is a complex ISA and we support many cores via -smp.
    I work on a 64-core machine so concurrency bugs show up relatively easily.

    Atomics are modeled using spinlocks, i.e. one host lock per guest cache line.
    Note that spinlocks are way better than mutexes for this--perf on 64-cores
    is 2X with spinlocks on highly concurrent workloads (synchrobench, see below).

    Advantages:

    + Scalability. No unrelated atomics (e.g. atomics on the same page)
      can interfere with each other. Of course if the guest code
      has false sharing (i.e. atomics on the same cache line), then
      there's not much the host can do about that.
      This is an improved version over what I sent in May:
        https://lists.gnu.org/archive/html/qemu-devel/2015-05/msg01641.html
      Performance numbers are below.

    + No requirements on the capabilities of the host machine, e.g.
      no need for a host cmpxchg instruction. That is, we'd have no problem
      running x86 code on a weaker host (say ARM/PPC) although of course we'd
      have to sprinkle quite a few memory barriers.  Note that the current
      MTTCG relies on cmpxchg(), which would be insufficient to run x86 code
      on ARM/PPC since that cmpxchg could very well race with a regular store
      (whereas in x86 it cannot).

    + Works unchanged for both system and user modes. As far as I can
      tell the TLB-based approach that Alvise is working on couldn't
      be used without the TLB--correct me if I'm wrong, it's been
      quite some time since I looked at that work.

    Disadvantages:
    - Overhead is added to every guest store. Depending on how frequent
      stores are, this can end up being significant single-threaded
      overhead (I've measured from a few % to up to ~50%).

      Note that this overhead applies to strong memory models such
      as x86, since the ISA can deal with concurrent stores and atomic
      instructions. Weaker memory models such as ARM/PPC's wouldn't have this
      overhead.

  * Performance
    I've used four C/C++ benchmarks from synchrobench:
      https://github.com/gramoli/synchrobench
    I'm running them with these arguments: -u 0 -f 1 -d 10000 -t $n_threads
    Here are two comparisons:
    * usermode vs. native     http://imgur.com/RggzgyU
    * qemu-system vs qemu-KVM http://imgur.com/H9iH06B
    (full-system is run with -m 4096).

    Throughput is normalised for each of the four configurations over their
    throughput with 1 thread.

    For the single-thread performance overhead of instrumenting writes I used
    two apps from PARSEC, both with the 'large' input:

    [Note that for the multithreaded tests I did not use PARSEC; it doesn't
     scale at all on large systems]

    blackscholes, 1 thread, stores are ~8% of executed instructions:
    pbonzini/mttcg+Patches1-5:	62.922099012 seconds ( +-  0.05% )
    +entire patchset:		67.680987626 seconds ( +-  0.35% )
    That's about an 8% perf overhead.

    swaptions, 1 thread, stores are ~7% of executed instructions:
    pbonzini/mttcg+Patches1-5:	144.542495834 seconds ( +-  0.49% )
    +entire patchset:		157.673401200 seconds ( +-  0.25% )
    That's about a 9% perf overhead.

    All tests use taskset appropriately to pack threads into CPUs in the
    same NUMA node, if possible.
    All tests are run on a 64-core (4x16) AMD Opteron 6376 with turbo core
    disabled.

  * Known Issues
    - In system mode, when run with a high number of threads, segfaults in
      translated code happen every now and then.
      Is there anything useful I can do with the segfaulting address? For example:
      (gdb) bt
      #0  0x00007fbf8013d89f in ?? ()
      #1  0x0000000000000000 in ?? ()

      Also, are there any things that should be protected by tb_lock but
      aren't? The only potential issue I've thought of so far is direct jumps
      racing with tb_phys_invalidate, but I need to analyze it in more detail.

  * Future work
  - Run on PowerPC host to look at how bad the barrier sprinkling has to be.
    I have access to a host so I should do this in the next few days. However,
    ppc-usermode doesn't work with multithreaded programs--help would be
    appreciated; see this thread:
      http://lists.gnu.org/archive/html/qemu-ppc/2015-06/msg00164.html

  - Support more ISAs. I have done ARM, SPARC and PPC, but haven't
    tested them much so I'm keeping them out of this patchset.

Thanks,

		Emilio


Thread overview: 110+ messages
2015-08-24  0:23 [Qemu-devel] [RFC 00/38] MTTCG: i386, user+system mode Emilio G. Cota
2015-08-24  0:23 ` [Qemu-devel] [RFC 01/38] cpu-exec: add missing mmap_lock in tb_find_slow Emilio G. Cota
2015-09-07 15:33   ` Alex Bennée
2015-08-24  0:23 ` [Qemu-devel] [RFC 02/38] hw/i386/kvmvapic: add missing include of tcg.h Emilio G. Cota
2015-09-07 15:49   ` Alex Bennée
2015-09-07 16:11     ` Paolo Bonzini
2015-08-24  0:23 ` [Qemu-devel] [RFC 03/38] cpu-exec: set current_cpu at cpu_exec() Emilio G. Cota
2015-08-24  1:03   ` Paolo Bonzini
2015-08-25  0:41     ` [Qemu-devel] [PATCH 1/4] cpus: add qemu_cpu_thread_init_common() to avoid code duplication Emilio G. Cota
2015-08-25  0:41       ` [Qemu-devel] [PATCH 2/4] linux-user: add helper to set current_cpu before cpu_loop() Emilio G. Cota
2015-08-25  0:41       ` [Qemu-devel] [PATCH 3/4] linux-user: call rcu_(un)register_thread on thread creation/deletion Emilio G. Cota
2015-08-26  0:22         ` Paolo Bonzini
2015-08-25  0:41       ` [Qemu-devel] [PATCH 4/4] bsd-user: add helper to set current_cpu before cpu_loop() Emilio G. Cota
2015-08-25 18:07         ` Emilio G. Cota
2015-08-24  0:23 ` [Qemu-devel] [RFC 04/38] translate-all: remove volatile from have_tb_lock Emilio G. Cota
2015-09-07 15:50   ` Alex Bennée
2015-09-07 16:12     ` Paolo Bonzini
2015-08-24  0:23 ` [Qemu-devel] [RFC 05/38] thread-posix: inline qemu_spin functions Emilio G. Cota
2015-08-24  1:04   ` Paolo Bonzini
2015-08-25  2:30     ` Emilio G. Cota
2015-08-25 19:30       ` Emilio G. Cota
2015-08-25 22:53         ` Paolo Bonzini
2015-08-24  0:23 ` [Qemu-devel] [RFC 06/38] seqlock: add missing 'inline' to seqlock_read_retry Emilio G. Cota
2015-09-07 15:50   ` Alex Bennée
2015-08-24  0:23 ` [Qemu-devel] [RFC 07/38] seqlock: read sequence number atomically Emilio G. Cota
2015-09-07 15:53   ` Alex Bennée
2015-09-07 16:13     ` Paolo Bonzini
2015-08-24  0:23 ` [Qemu-devel] [RFC 08/38] rcu: init rcu_registry_lock after fork Emilio G. Cota
2015-09-08 17:34   ` Alex Bennée
2015-09-08 19:03     ` Emilio G. Cota
2015-09-09  9:35       ` Alex Bennée
2015-08-24  0:23 ` [Qemu-devel] [RFC 09/38] rcu: fix comment with s/rcu_gp_lock/rcu_registry_lock/ Emilio G. Cota
2015-09-10 11:18   ` Alex Bennée
2015-08-24  0:23 ` [Qemu-devel] [RFC 10/38] translate-all: remove obsolete comment about l1_map Emilio G. Cota
2015-09-10 11:59   ` Alex Bennée
2015-08-24  0:23 ` [Qemu-devel] [RFC 11/38] qemu-thread: handle spurious futex_wait wakeups Emilio G. Cota
2015-09-10 13:22   ` Alex Bennée
2015-09-10 17:46     ` Emilio G. Cota
2015-08-24  0:23 ` [Qemu-devel] [RFC 12/38] linux-user: call rcu_(un)register_thread on pthread_(exit|create) Emilio G. Cota
2015-08-25  0:45   ` Emilio G. Cota
2015-08-24  0:23 ` [Qemu-devel] [RFC 13/38] cputlb: add physical address to CPUTLBEntry Emilio G. Cota
2015-09-10 13:49   ` Alex Bennée
2015-09-10 17:50     ` Emilio G. Cota
2015-09-21  5:01   ` Paolo Bonzini
2015-08-24  0:23 ` [Qemu-devel] [RFC 14/38] softmmu: add helpers to get ld/st physical addresses Emilio G. Cota
2015-08-24  2:02   ` Paolo Bonzini
2015-08-25  2:47     ` Emilio G. Cota
2015-08-24  0:23 ` [Qemu-devel] [RFC 15/38] radix-tree: add generic lockless radix tree module Emilio G. Cota
2015-09-10 14:25   ` Alex Bennée
2015-09-10 18:00     ` Emilio G. Cota
2015-08-24  0:23 ` [Qemu-devel] [RFC 16/38] aie: add module for Atomic Instruction Emulation Emilio G. Cota
2015-08-24  0:23 ` [Qemu-devel] [RFC 17/38] aie: add target helpers Emilio G. Cota
2015-09-17 15:14   ` Alex Bennée
2015-09-21  5:18   ` Paolo Bonzini
2015-09-21 20:59     ` Alex Bennée
2015-08-24  0:23 ` [Qemu-devel] [RFC 18/38] tcg: add fences Emilio G. Cota
2015-09-10 15:28   ` Alex Bennée
2015-08-24  0:23 ` [Qemu-devel] [RFC 19/38] tcg: add tcg_gen_smp_rmb() Emilio G. Cota
2015-09-10 16:01   ` Alex Bennée
2015-09-10 18:05     ` Emilio G. Cota
2015-08-24  0:23 ` [Qemu-devel] [RFC 20/38] tcg/i386: implement fences Emilio G. Cota
2015-08-24  1:32   ` Paolo Bonzini
2015-08-25  3:02     ` Emilio G. Cota
2015-08-25 22:55       ` Paolo Bonzini
2015-08-24  0:23 ` [Qemu-devel] [RFC 21/38] target-i386: emulate atomic instructions + barriers using AIE Emilio G. Cota
2015-09-17 15:30   ` Alex Bennée
2015-08-24  0:23 ` [Qemu-devel] [RFC 22/38] cpu: update interrupt_request atomically Emilio G. Cota
2015-08-24  1:09   ` Paolo Bonzini
2015-08-25 20:36     ` Emilio G. Cota
2015-08-25 22:52       ` Paolo Bonzini
2015-08-24  0:23 ` [Qemu-devel] [RFC 23/38] cpu-exec: grab iothread lock during interrupt handling Emilio G. Cota
2015-09-09 10:13   ` Paolo Bonzini
2015-08-24  0:23 ` [Qemu-devel] [RFC 24/38] cpu-exec: reset mmap_lock after exiting the CPU loop Emilio G. Cota
2015-08-24  2:01   ` Paolo Bonzini
2015-08-25 21:16     ` Emilio G. Cota
2015-08-24  0:23 ` [Qemu-devel] [RFC 25/38] cpu: add barriers around cpu->tcg_exit_req Emilio G. Cota
2015-08-24  2:01   ` Paolo Bonzini
2015-08-24  0:23 ` [Qemu-devel] [RFC 26/38] cpu: protect tb_jmp_cache with seqlock Emilio G. Cota
2015-08-24  1:14   ` Paolo Bonzini
2015-08-25 21:46     ` Emilio G. Cota
2015-08-25 22:49       ` Paolo Bonzini
2015-09-04  8:50   ` Paolo Bonzini
2015-09-04 10:04     ` Paolo Bonzini
2015-08-24  0:23 ` [Qemu-devel] [RFC 27/38] cpu-exec: convert tb_invalidated_flag into a per-TB flag Emilio G. Cota
2015-08-24  0:23 ` [Qemu-devel] [RFC 28/38] cpu-exec: use RCU to perform lockless TB lookups Emilio G. Cota
2015-08-24  0:23 ` [Qemu-devel] [RFC 29/38] tcg: export have_tb_lock Emilio G. Cota
2015-08-24  0:23 ` [Qemu-devel] [RFC 30/38] translate-all: add tb_lock assertions Emilio G. Cota
2015-08-24  0:24 ` [Qemu-devel] [RFC 31/38] cpu: protect l1_map with tb_lock in full-system mode Emilio G. Cota
2015-08-24  1:07   ` Paolo Bonzini
2015-08-25 21:54     ` Emilio G. Cota
2015-08-24  0:24 ` [Qemu-devel] [RFC 32/38] cpu list: convert to RCU QLIST Emilio G. Cota
2015-08-24  0:24 ` [Qemu-devel] [RFC 33/38] cpu: introduce cpu_tcg_sched_work to run work while other CPUs sleep Emilio G. Cota
2015-08-24  1:24   ` Paolo Bonzini
2015-08-25 22:18     ` Emilio G. Cota
2015-08-24  0:24 ` [Qemu-devel] [RFC 34/38] translate-all: use tcg_sched_work for tb_flush Emilio G. Cota
2015-08-24  0:24 ` [Qemu-devel] [RFC 35/38] cputlb: use cpu_tcg_sched_work for tlb_flush_all Emilio G. Cota
2015-08-24  1:29   ` Paolo Bonzini
2015-08-25 22:31     ` Emilio G. Cota
2015-08-26  0:25       ` Paolo Bonzini
2015-09-01 16:10   ` Alex Bennée
2015-09-01 19:38     ` Emilio G. Cota
2015-09-01 20:18       ` Peter Maydell
2015-08-24  0:24 ` [Qemu-devel] [RFC 36/38] cputlb: use tcg_sched_work for tlb_flush_page_all Emilio G. Cota
2015-08-24  0:24 ` [Qemu-devel] [RFC 37/38] cpus: remove async_run_safe_work_on_cpu Emilio G. Cota
2015-08-24  0:24 ` [Qemu-devel] [RFC 38/38] Revert "target-i386: yield to another VCPU on PAUSE" Emilio G. Cota
2015-08-24  1:29   ` Paolo Bonzini
2015-08-24  2:01 ` [Qemu-devel] [RFC 00/38] MTTCG: i386, user+system mode Paolo Bonzini
2015-08-25 22:36   ` Emilio G. Cota
2015-08-24 16:08 ` Artyom Tarasenko
2015-08-24 20:16   ` Emilio G. Cota
