* [Qemu-devel] [RFC 00/38] MTTCG: i386, user+system mode
@ 2015-08-24  0:23 Emilio G. Cota
  2015-08-24  0:23 ` [Qemu-devel] [RFC 01/38] cpu-exec: add missing mmap_lock in tb_find_slow Emilio G. Cota
                   ` (39 more replies)
  0 siblings, 40 replies; 110+ messages in thread
From: Emilio G. Cota @ 2015-08-24  0:23 UTC (permalink / raw)
  To: qemu-devel, mttcg
  Cc: mark.burton, a.rigo, guillaume.delbergue, pbonzini, alex.bennee,
	Frederic Konrad

Hi all,

Here is MTTCG code I've been working on out-of-tree for the last few months.

The patchset applies on top of pbonzini's mttcg branch, commit ca56de6f.
Fetch the branch from: https://github.com/bonzini/qemu/commits/mttcg

The highlights of the patchset are as follows:

- The first 5 patches are direct fixes to bugs only in the mttcg
  branch.

- Patches 6-12 fix issues in the master branch.

- The remaining patches are really the meat of this patchset.
  The main features are:

  * Support of MTTCG for both user and system mode.

  * Design: per-CPU TB jump list protected by a seqlock;
    if the TB is not found there, check the global, RCU-protected 'hash table'
    (i.e. fixed number of buckets); if it's not there either, grab the lock,
    check again, and if it's still not there, generate the code and add the
    TB to the hash table.

    It makes sense that Paolo's recent work on the mttcg branch ended up
    being almost identical to this--it's simple and it scales well.

  * tb_lock must be held every time code is generated. The rationale is
    that most of the time QEMU is executing code, not generating it.

  * tb_flush: do it once all other CPUs have been put to sleep, by calling
    synchronize_rcu().
    We also instrument tb_lock to make sure that only one tb_flush request can
    happen at a given time.  For this a mechanism to schedule work is added to
    supersede cpu_sched_safe_work, which cannot work in usermode.  Here I've
    toyed with an alternative version that doesn't force the flushing CPU to
    exit, but to make this work we have to save/restore the RCU read
    lock while tb_lock is held in order to avoid deadlocks. This isn't too
    pretty, but it's good to know that the option is there.

  * I focused on x86 since it is a complex ISA and we support many cores via -smp.
    I work on a 64-core machine so concurrency bugs show up relatively easily.

    Atomics are modeled using spinlocks, i.e. one host lock per guest cache line.
    Note that spinlocks are way better than mutexes for this--perf on 64-cores
    is 2X with spinlocks on highly concurrent workloads (synchrobench, see below).

    Advantages:

    + Scalability. No unrelated atomics (e.g. atomics on the same page)
      can interfere with each other. Of course if the guest code
      has false sharing (i.e. atomics on the same cache line), then
      there's not much the host can do about that.
      This is an improved version over what I sent in May:
        https://lists.gnu.org/archive/html/qemu-devel/2015-05/msg01641.html
      Performance numbers are below.

    + No requirements on the capabilities of the host machine, e.g.
      no need for a host cmpxchg instruction. That is, we'd have no problem
      running x86 code on a weaker host (say ARM/PPC) although of course we'd
      have to sprinkle quite a few memory barriers.  Note that the current
      MTTCG relies on cmpxchg(), which would be insufficient to run x86 code
      on ARM/PPC since that cmpxchg could very well race with a regular store
      (whereas in x86 it cannot).

    + Works unchanged for both system and user modes. As far as I can
      tell the TLB-based approach that Alvise is working on couldn't
      be used without the TLB--correct me if I'm wrong, it's been
      quite some time since I looked at that work.

    Disadvantages:
    - Overhead is added to every guest store. Depending on how frequent
      stores are, this can end up being significant single-threaded
      overhead (I've measured from a few % to up to ~50%).

      Note that this overhead applies to strong memory models such
      as x86, since the ISA can deal with concurrent stores and atomic
      instructions. Weaker memory models such as ARM/PPC's wouldn't have this
      overhead.

  * Performance
    I've used four C/C++ benchmarks from synchrobench:
      https://github.com/gramoli/synchrobench
    I'm running them with these arguments: -u 0 -f 1 -d 10000 -t $n_threads
    Here are two comparisons:
    * usermode vs. native     http://imgur.com/RggzgyU
    * qemu-system vs qemu-KVM http://imgur.com/H9iH06B
    (full-system is run with -m 4096).

    Throughput is normalised for each of the four configurations over their
    throughput with 1 thread.

    For the single-thread performance overhead of instrumenting writes I used
    two apps from PARSEC, both with the 'large' input:

    [Note that for the multithreaded tests I did not use PARSEC; it doesn't
     scale at all on large systems]

    blackscholes, 1 thread, stores are ~8% of executed instructions:
    pbonzini/mttcg+Patches1-5:	62.922099012 seconds ( +-  0.05% )
    +entire patchset:		67.680987626 seconds ( +-  0.35% )
    That's about an 8% perf overhead.

    swaptions, 1 thread, stores are ~7% of executed instructions:
    pbonzini/mttcg+Patches1-5:	144.542495834 seconds ( +-  0.49% )
    +entire patchset:		157.673401200 seconds ( +-  0.25% )
    That's about a 9% perf overhead.

    All tests use taskset appropriately to pack threads into CPUs in the
    same NUMA node, if possible.
    All tests are run on a 64-core (4x16) AMD Opteron 6376 with turbo core
    disabled.

  * Known Issues
    - In system mode, when run with a high number of threads, segfaults in
      translated code happen every now and then.
      Is there anything useful I can do with the segfaulting address? For example:
      (gdb) bt
      #0  0x00007fbf8013d89f in ?? ()
      #1  0x0000000000000000 in ?? ()

      Also, are there any things that should be protected by tb_lock but
      aren't? The only potential issue I've thought of so far is direct jumps
      racing with tb_phys_invalidate, but I need to analyze it in more detail.

  * Future work
  - Run on PowerPC host to look at how bad the barrier sprinkling has to be.
    I have access to a host so I should do this in the next few days. However,
    ppc-usermode doesn't work with multithreaded programs--help would be
    appreciated; see this thread:
      http://lists.gnu.org/archive/html/qemu-ppc/2015-06/msg00164.html

  - Support more ISAs. I have done ARM, SPARC and PPC, but haven't
    tested them much so I'm keeping them out of this patchset.

Thanks,

		Emilio


Thread overview: 110+ messages
2015-08-24  0:23 [Qemu-devel] [RFC 00/38] MTTCG: i386, user+system mode Emilio G. Cota
2015-08-24  0:23 ` [Qemu-devel] [RFC 01/38] cpu-exec: add missing mmap_lock in tb_find_slow Emilio G. Cota
2015-09-07 15:33   ` Alex Bennée
2015-08-24  0:23 ` [Qemu-devel] [RFC 02/38] hw/i386/kvmvapic: add missing include of tcg.h Emilio G. Cota
2015-09-07 15:49   ` Alex Bennée
2015-09-07 16:11     ` Paolo Bonzini
2015-08-24  0:23 ` [Qemu-devel] [RFC 03/38] cpu-exec: set current_cpu at cpu_exec() Emilio G. Cota
2015-08-24  1:03   ` Paolo Bonzini
2015-08-25  0:41     ` [Qemu-devel] [PATCH 1/4] cpus: add qemu_cpu_thread_init_common() to avoid code duplication Emilio G. Cota
2015-08-25  0:41       ` [Qemu-devel] [PATCH 2/4] linux-user: add helper to set current_cpu before cpu_loop() Emilio G. Cota
2015-08-25  0:41       ` [Qemu-devel] [PATCH 3/4] linux-user: call rcu_(un)register_thread on thread creation/deletion Emilio G. Cota
2015-08-26  0:22         ` Paolo Bonzini
2015-08-25  0:41       ` [Qemu-devel] [PATCH 4/4] bsd-user: add helper to set current_cpu before cpu_loop() Emilio G. Cota
2015-08-25 18:07         ` Emilio G. Cota
2015-08-24  0:23 ` [Qemu-devel] [RFC 04/38] translate-all: remove volatile from have_tb_lock Emilio G. Cota
2015-09-07 15:50   ` Alex Bennée
2015-09-07 16:12     ` Paolo Bonzini
2015-08-24  0:23 ` [Qemu-devel] [RFC 05/38] thread-posix: inline qemu_spin functions Emilio G. Cota
2015-08-24  1:04   ` Paolo Bonzini
2015-08-25  2:30     ` Emilio G. Cota
2015-08-25 19:30       ` Emilio G. Cota
2015-08-25 22:53         ` Paolo Bonzini
2015-08-24  0:23 ` [Qemu-devel] [RFC 06/38] seqlock: add missing 'inline' to seqlock_read_retry Emilio G. Cota
2015-09-07 15:50   ` Alex Bennée
2015-08-24  0:23 ` [Qemu-devel] [RFC 07/38] seqlock: read sequence number atomically Emilio G. Cota
2015-09-07 15:53   ` Alex Bennée
2015-09-07 16:13     ` Paolo Bonzini
2015-08-24  0:23 ` [Qemu-devel] [RFC 08/38] rcu: init rcu_registry_lock after fork Emilio G. Cota
2015-09-08 17:34   ` Alex Bennée
2015-09-08 19:03     ` Emilio G. Cota
2015-09-09  9:35       ` Alex Bennée
2015-08-24  0:23 ` [Qemu-devel] [RFC 09/38] rcu: fix comment with s/rcu_gp_lock/rcu_registry_lock/ Emilio G. Cota
2015-09-10 11:18   ` Alex Bennée
2015-08-24  0:23 ` [Qemu-devel] [RFC 10/38] translate-all: remove obsolete comment about l1_map Emilio G. Cota
2015-09-10 11:59   ` Alex Bennée
2015-08-24  0:23 ` [Qemu-devel] [RFC 11/38] qemu-thread: handle spurious futex_wait wakeups Emilio G. Cota
2015-09-10 13:22   ` Alex Bennée
2015-09-10 17:46     ` Emilio G. Cota
2015-08-24  0:23 ` [Qemu-devel] [RFC 12/38] linux-user: call rcu_(un)register_thread on pthread_(exit|create) Emilio G. Cota
2015-08-25  0:45   ` Emilio G. Cota
2015-08-24  0:23 ` [Qemu-devel] [RFC 13/38] cputlb: add physical address to CPUTLBEntry Emilio G. Cota
2015-09-10 13:49   ` Alex Bennée
2015-09-10 17:50     ` Emilio G. Cota
2015-09-21  5:01   ` Paolo Bonzini
2015-08-24  0:23 ` [Qemu-devel] [RFC 14/38] softmmu: add helpers to get ld/st physical addresses Emilio G. Cota
2015-08-24  2:02   ` Paolo Bonzini
2015-08-25  2:47     ` Emilio G. Cota
2015-08-24  0:23 ` [Qemu-devel] [RFC 15/38] radix-tree: add generic lockless radix tree module Emilio G. Cota
2015-09-10 14:25   ` Alex Bennée
2015-09-10 18:00     ` Emilio G. Cota
2015-08-24  0:23 ` [Qemu-devel] [RFC 16/38] aie: add module for Atomic Instruction Emulation Emilio G. Cota
2015-08-24  0:23 ` [Qemu-devel] [RFC 17/38] aie: add target helpers Emilio G. Cota
2015-09-17 15:14   ` Alex Bennée
2015-09-21  5:18   ` Paolo Bonzini
2015-09-21 20:59     ` Alex Bennée
2015-08-24  0:23 ` [Qemu-devel] [RFC 18/38] tcg: add fences Emilio G. Cota
2015-09-10 15:28   ` Alex Bennée
2015-08-24  0:23 ` [Qemu-devel] [RFC 19/38] tcg: add tcg_gen_smp_rmb() Emilio G. Cota
2015-09-10 16:01   ` Alex Bennée
2015-09-10 18:05     ` Emilio G. Cota
2015-08-24  0:23 ` [Qemu-devel] [RFC 20/38] tcg/i386: implement fences Emilio G. Cota
2015-08-24  1:32   ` Paolo Bonzini
2015-08-25  3:02     ` Emilio G. Cota
2015-08-25 22:55       ` Paolo Bonzini
2015-08-24  0:23 ` [Qemu-devel] [RFC 21/38] target-i386: emulate atomic instructions + barriers using AIE Emilio G. Cota
2015-09-17 15:30   ` Alex Bennée
2015-08-24  0:23 ` [Qemu-devel] [RFC 22/38] cpu: update interrupt_request atomically Emilio G. Cota
2015-08-24  1:09   ` Paolo Bonzini
2015-08-25 20:36     ` Emilio G. Cota
2015-08-25 22:52       ` Paolo Bonzini
2015-08-24  0:23 ` [Qemu-devel] [RFC 23/38] cpu-exec: grab iothread lock during interrupt handling Emilio G. Cota
2015-09-09 10:13   ` Paolo Bonzini
2015-08-24  0:23 ` [Qemu-devel] [RFC 24/38] cpu-exec: reset mmap_lock after exiting the CPU loop Emilio G. Cota
2015-08-24  2:01   ` Paolo Bonzini
2015-08-25 21:16     ` Emilio G. Cota
2015-08-24  0:23 ` [Qemu-devel] [RFC 25/38] cpu: add barriers around cpu->tcg_exit_req Emilio G. Cota
2015-08-24  2:01   ` Paolo Bonzini
2015-08-24  0:23 ` [Qemu-devel] [RFC 26/38] cpu: protect tb_jmp_cache with seqlock Emilio G. Cota
2015-08-24  1:14   ` Paolo Bonzini
2015-08-25 21:46     ` Emilio G. Cota
2015-08-25 22:49       ` Paolo Bonzini
2015-09-04  8:50   ` Paolo Bonzini
2015-09-04 10:04     ` Paolo Bonzini
2015-08-24  0:23 ` [Qemu-devel] [RFC 27/38] cpu-exec: convert tb_invalidated_flag into a per-TB flag Emilio G. Cota
2015-08-24  0:23 ` [Qemu-devel] [RFC 28/38] cpu-exec: use RCU to perform lockless TB lookups Emilio G. Cota
2015-08-24  0:23 ` [Qemu-devel] [RFC 29/38] tcg: export have_tb_lock Emilio G. Cota
2015-08-24  0:23 ` [Qemu-devel] [RFC 30/38] translate-all: add tb_lock assertions Emilio G. Cota
2015-08-24  0:24 ` [Qemu-devel] [RFC 31/38] cpu: protect l1_map with tb_lock in full-system mode Emilio G. Cota
2015-08-24  1:07   ` Paolo Bonzini
2015-08-25 21:54     ` Emilio G. Cota
2015-08-24  0:24 ` [Qemu-devel] [RFC 32/38] cpu list: convert to RCU QLIST Emilio G. Cota
2015-08-24  0:24 ` [Qemu-devel] [RFC 33/38] cpu: introduce cpu_tcg_sched_work to run work while other CPUs sleep Emilio G. Cota
2015-08-24  1:24   ` Paolo Bonzini
2015-08-25 22:18     ` Emilio G. Cota
2015-08-24  0:24 ` [Qemu-devel] [RFC 34/38] translate-all: use tcg_sched_work for tb_flush Emilio G. Cota
2015-08-24  0:24 ` [Qemu-devel] [RFC 35/38] cputlb: use cpu_tcg_sched_work for tlb_flush_all Emilio G. Cota
2015-08-24  1:29   ` Paolo Bonzini
2015-08-25 22:31     ` Emilio G. Cota
2015-08-26  0:25       ` Paolo Bonzini
2015-09-01 16:10   ` Alex Bennée
2015-09-01 19:38     ` Emilio G. Cota
2015-09-01 20:18       ` Peter Maydell
2015-08-24  0:24 ` [Qemu-devel] [RFC 36/38] cputlb: use tcg_sched_work for tlb_flush_page_all Emilio G. Cota
2015-08-24  0:24 ` [Qemu-devel] [RFC 37/38] cpus: remove async_run_safe_work_on_cpu Emilio G. Cota
2015-08-24  0:24 ` [Qemu-devel] [RFC 38/38] Revert "target-i386: yield to another VCPU on PAUSE" Emilio G. Cota
2015-08-24  1:29   ` Paolo Bonzini
2015-08-24  2:01 ` [Qemu-devel] [RFC 00/38] MTTCG: i386, user+system mode Paolo Bonzini
2015-08-25 22:36   ` Emilio G. Cota
2015-08-24 16:08 ` Artyom Tarasenko
2015-08-24 20:16   ` Emilio G. Cota
