* [Qemu-devel] [RFC v1 00/12] Enable MTTCG for 32 bit arm on x86
2016-04-15 14:23 [Qemu-devel] [RFC v1 00/12] Enable MTTCG for 32 bit arm on x86 Alex Bennée
@ 2016-04-15 14:23 ` Alex Bennée
2016-04-15 14:23 ` [Qemu-devel] [RFC v1 01/12] include: move CPU-related definitions out of qemu-common.h Alex Bennée
` (12 subsequent siblings)
13 siblings, 0 replies; 41+ messages in thread
From: Alex Bennée @ 2016-04-15 14:23 UTC (permalink / raw)
To: mttcg, fred.konrad, a.rigo, serge.fdrv, cota
Cc: qemu-devel, mark.burton, pbonzini, jan.kiszka, rth, peter.maydell,
claudio.fontana, Alex Bennée
Hi,
This finally completes the re-build of Fred's multi_tcg_v8 tree by
enabling MTTCG for ARMv7 guests on x86 hosts. It applies on top of
the previous series:
- [RFC v2 00/11] Base enabling patches for MTTCG
You can find the final tree at:
https://github.com/stsquad/qemu/tree/mttcg/enable-mttcg-for-armv7-v1
which builds on:
https://github.com/stsquad/qemu/tree/mttcg/base-patches-v2
which includes Sergey's:
https://github.com/stsquad/qemu/tree/mttcg/tb-and-tcg-cleanups
I've tested this with a Debian Jessie guest as well as my extensive
MTTCG-focused torture tests built on kvm-unit-tests.
Series Breakdown
================
The first 3 patches have been cherry-picked from other series and can
be skipped while reviewing. Paolo's is just some simple header
house-cleaning that made the run_on_cpu changes easier. Sergey's
thread-safe patching is being reviewed elsewhere but does prevent a
crash I could trigger with heavy TB invalidation. The third is a
squash patch from Emilio's QHT tree which provides the QemuSpin
spinlock used by the atomic patch later on.
The next 2 introduce a few more atomic primitives which I use later
on in the series.
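As a flavour of how they get used (a minimal, illustrative sketch only,
assuming "qemu/atomic.h"; the helper functions below are made up and
not part of the series):

  static int pending_cpus;

  /* true only for the caller that brings the count down to zero */
  static bool last_cpu_to_finish(void)
  {
      return atomic_dec_fetch(&pending_cpus) == 0;
  }

  /* true if we atomically moved *flag from 0 to 1 */
  static bool claim_flag(int *flag)
  {
      return atomic_bool_cmpxchg(flag, 0, 1);
  }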
The next 2 patches are concerned with async work. The first cleans up
the existing async work to pass CPUState, which minimises the need to
malloc structures later on. The new async_safe_run_on_cpu has been
changed a bit from Fred's tree - it operates from a single queue in an
effort to ensure all deferred operations are handled in a timely
manner. There has also been an attempt to minimise the amount of
dynamic allocation by using a pre-allocated array combined with the
CPUState passing introduced earlier.
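As a rough sketch of the intended usage (illustrative only - the helper
and caller below are made up, async_safe_run_on_cpu is the API added by
this series):

  /* runs once every vCPU has dropped out of the execution loop */
  static void do_flush_all_safe_work(CPUState *cpu, void *data)
  {
      tlb_flush(cpu, 1);
  }

  static void queue_global_flush(CPUState *cpu)
  {
      async_safe_run_on_cpu(cpu, do_flush_all_safe_work, NULL);
  }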
Then there are 2 patches which take advantage of this functionality:
a few cputlb flush routines are converted, as is the translation
buffer overflow case.
The final patches involve architecture-specific changes to ensure the
ARM flush operations use the async'd cputlb functions. The STREX
patches are a temporary fix for atomicity which I've put at the end of
the series so Alvise can easily drop them for his LL/SC based approach.
The last patch makes MTTCG the default for the common case of running
ARMv7 on an x86 backend. I know there is debate about the benefit of
having a control knob for MTTCG, but it is certainly handy to have
while developing. There are also cases where MTTCG will be
incompatible with other features such as record/replay.
Benchmarks
==========
The benchmark is a simple boot and build test which builds stress-ng
with -j ${NR_CPUS} and shuts down to facilitate easy repetition.
arm-softmmu/qemu-system-arm -machine type=virt -display none -m 4096 \
-cpu cortex-a15 -serial telnet:127.0.0.1:4444 \
-monitor stdio -netdev user,id=unet,hostfwd=tcp::2222-:22 \
-device virtio-net-device,netdev=unet \
-drive file=/home/alex/lsrc/qemu/images/jessie-arm32.qcow2,id=myblock,index=0,if=none \
-device virtio-blk-device,drive=myblock \
-append "console=ttyAMA0 systemd.unit=benchmark-build.service root=/dev/vda1" \
-kernel /home/alex/lsrc/qemu/images/aarch32-current-linux-kernel-only.img
| -smp 1 (mttcg=off) | -smp 4 (mttcg=off) | -smp 4 (mttcg=on) |
|--------------------+--------------------+-------------------|
| 301.60 (5 runs) | 312.27 (4 runs) | 573.26 (5 runs) |
As the results show, the performance for MTTCG is currently worse
than the single-threaded version. However this tree doesn't have the
lockless tb_find_fast, which means every time there is a transition
from one page to the next the lock needs to be taken. There is still
work to be done on performance ;-)
Alex Bennée (5):
qemu-thread: add simple test-and-set spinlock
atomic: introduce atomic_dec_fetch.
atomic: introduce cmpxchg_bool
cpus: pass CPUState to run_on_cpu helpers
cpus: default MTTCG to on for 32 bit ARM on x86
KONRAD Frederic (5):
cpus: introduce async_safe_run_on_cpu.
cputlb: introduce tlb_flush_* async work.
translate-all: introduces tb_flush_safe.
arm: use tlb_flush_page_all for tlbimva[a]
arm: atomically check the exclusive value in a STREX
Paolo Bonzini (1):
include: move CPU-related definitions out of qemu-common.h
Sergey Fedorov (1):
tcg/i386: Make direct jump patching thread-safe
cpu-exec-common.c | 1 +
cpu-exec.c | 11 ++++
cpus.c | 137 +++++++++++++++++++++++++++++++++++++++++-----
cputlb.c | 61 ++++++++++++++++-----
hw/i386/kvm/apic.c | 3 +-
hw/i386/kvmvapic.c | 8 +--
hw/ppc/ppce500_spin.c | 3 +-
hw/ppc/spapr.c | 6 +-
hw/ppc/spapr_hcall.c | 12 ++--
include/exec/exec-all.h | 7 ++-
include/qemu-common.h | 24 --------
include/qemu/atomic.h | 15 +++++
include/qemu/processor.h | 28 ++++++++++
include/qemu/thread.h | 34 ++++++++++++
include/qemu/timer.h | 1 +
include/qom/cpu.h | 34 +++++++++++-
include/sysemu/cpus.h | 13 +++++
kvm-all.c | 20 +++----
stubs/cpu-get-icount.c | 1 +
target-arm/cpu.c | 21 +++++++
target-arm/cpu.h | 6 ++
target-arm/helper.c | 28 ++++++----
target-arm/helper.h | 6 ++
target-arm/op_helper.c | 130 ++++++++++++++++++++++++++++++++++++++++++-
target-arm/translate.c | 96 ++++++--------------------------
target-i386/helper.c | 3 +-
target-i386/kvm.c | 6 +-
target-s390x/cpu.c | 4 +-
target-s390x/cpu.h | 7 +--
tcg/i386/tcg-target.inc.c | 17 ++++++
translate-all.c | 34 +++++++++---
translate-common.c | 1 +
vl.c | 1 +
33 files changed, 582 insertions(+), 197 deletions(-)
create mode 100644 include/qemu/processor.h
--
2.7.4
* [Qemu-devel] [RFC v1 01/12] include: move CPU-related definitions out of qemu-common.h
2016-04-15 14:23 [Qemu-devel] [RFC v1 00/12] Enable MTTCG for 32 bit arm on x86 Alex Bennée
2016-04-15 14:23 ` Alex Bennée
@ 2016-04-15 14:23 ` Alex Bennée
2016-04-15 14:23 ` [Qemu-devel] [RFC v1 02/12] tcg/i386: Make direct jump patching thread-safe Alex Bennée
` (11 subsequent siblings)
13 siblings, 0 replies; 41+ messages in thread
From: Alex Bennée @ 2016-04-15 14:23 UTC (permalink / raw)
To: mttcg, fred.konrad, a.rigo, serge.fdrv, cota
Cc: qemu-devel, mark.burton, pbonzini, jan.kiszka, rth, peter.maydell,
claudio.fontana, Andreas Färber, Peter Crosthwaite
From: Paolo Bonzini <pbonzini@redhat.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
---
include/qemu-common.h | 24 ------------------------
include/qemu/timer.h | 1 +
include/qom/cpu.h | 9 +++++++++
include/sysemu/cpus.h | 13 +++++++++++++
stubs/cpu-get-icount.c | 1 +
translate-common.c | 1 +
vl.c | 1 +
7 files changed, 26 insertions(+), 24 deletions(-)
diff --git a/include/qemu-common.h b/include/qemu-common.h
index 163bcbb..f0d74076 100644
--- a/include/qemu-common.h
+++ b/include/qemu-common.h
@@ -23,17 +23,6 @@
#include "qemu/option.h"
#include "qemu/host-utils.h"
-void cpu_ticks_init(void);
-
-/* icount */
-void configure_icount(QemuOpts *opts, Error **errp);
-extern int use_icount;
-extern int icount_align_option;
-/* drift information for info jit command */
-extern int64_t max_delay;
-extern int64_t max_advance;
-void dump_drift_info(FILE *f, fprintf_function cpu_fprintf);
-
#include "qemu/bswap.h"
/* FIXME: Remove NEED_CPU_H. */
@@ -100,19 +89,6 @@ bool tcg_enabled(void);
void cpu_exec_init_all(void);
-/* Unblock cpu */
-void qemu_cpu_kick_self(void);
-
-/* work queue */
-struct qemu_work_item {
- struct qemu_work_item *next;
- void (*func)(void *data);
- void *data;
- int done;
- bool free;
-};
-
-
/**
* Sends a (part of) iovec down a socket, yielding when the socket is full, or
* Receives data into a (part of) iovec from a socket,
diff --git a/include/qemu/timer.h b/include/qemu/timer.h
index 471969a..309f3d0 100644
--- a/include/qemu/timer.h
+++ b/include/qemu/timer.h
@@ -4,6 +4,7 @@
#include "qemu-common.h"
#include "qemu/notify.h"
#include "qemu/host-utils.h"
+#include "sysemu/cpus.h"
#define NANOSECONDS_PER_SECOND 1000000000LL
diff --git a/include/qom/cpu.h b/include/qom/cpu.h
index 5e3826c..ab08f1a 100644
--- a/include/qom/cpu.h
+++ b/include/qom/cpu.h
@@ -222,6 +222,15 @@ struct kvm_run;
#define TB_JMP_CACHE_BITS 12
#define TB_JMP_CACHE_SIZE (1 << TB_JMP_CACHE_BITS)
+/* work queue */
+struct qemu_work_item {
+ struct qemu_work_item *next;
+ void (*func)(void *data);
+ void *data;
+ int done;
+ bool free;
+};
+
/**
* CPUState:
* @cpu_index: CPU index (informative).
diff --git a/include/sysemu/cpus.h b/include/sysemu/cpus.h
index 606426f..4948c40 100644
--- a/include/sysemu/cpus.h
+++ b/include/sysemu/cpus.h
@@ -7,6 +7,19 @@ void qemu_init_cpu_loop(void);
void resume_all_vcpus(void);
void pause_all_vcpus(void);
void cpu_stop_current(void);
+void cpu_ticks_init(void);
+
+void configure_icount(QemuOpts *opts, Error **errp);
+extern int use_icount;
+extern int icount_align_option;
+
+/* drift information for info jit command */
+extern int64_t max_delay;
+extern int64_t max_advance;
+void dump_drift_info(FILE *f, fprintf_function cpu_fprintf);
+
+/* Unblock cpu */
+void qemu_cpu_kick_self(void);
void cpu_synchronize_all_states(void);
void cpu_synchronize_all_post_reset(void);
diff --git a/stubs/cpu-get-icount.c b/stubs/cpu-get-icount.c
index 3a6f2ab..2e8b63b 100644
--- a/stubs/cpu-get-icount.c
+++ b/stubs/cpu-get-icount.c
@@ -1,6 +1,7 @@
#include "qemu/osdep.h"
#include "qemu-common.h"
#include "qemu/timer.h"
+#include "sysemu/cpus.h"
int use_icount;
diff --git a/translate-common.c b/translate-common.c
index ffbfe85..5e989cd 100644
--- a/translate-common.c
+++ b/translate-common.c
@@ -20,6 +20,7 @@
#include "qemu/osdep.h"
#include "qemu-common.h"
#include "qom/cpu.h"
+#include "sysemu/cpus.h"
uintptr_t qemu_real_host_page_size;
intptr_t qemu_real_host_page_mask;
diff --git a/vl.c b/vl.c
index 51bbdbc..f6f6ae9 100644
--- a/vl.c
+++ b/vl.c
@@ -87,6 +87,7 @@ int main(int argc, char **argv)
#include "sysemu/dma.h"
#include "audio/audio.h"
#include "migration/migration.h"
+#include "sysemu/cpus.h"
#include "sysemu/kvm.h"
#include "qapi/qmp/qjson.h"
#include "qemu/option.h"
--
2.7.4
* [Qemu-devel] [RFC v1 02/12] tcg/i386: Make direct jump patching thread-safe
2016-04-15 14:23 [Qemu-devel] [RFC v1 00/12] Enable MTTCG for 32 bit arm on x86 Alex Bennée
2016-04-15 14:23 ` Alex Bennée
2016-04-15 14:23 ` [Qemu-devel] [RFC v1 01/12] include: move CPU-related definitions out of qemu-common.h Alex Bennée
@ 2016-04-15 14:23 ` Alex Bennée
2016-04-15 14:23 ` [Qemu-devel] [RFC v1 03/12] qemu-thread: add simple test-and-set spinlock Alex Bennée
` (10 subsequent siblings)
13 siblings, 0 replies; 41+ messages in thread
From: Alex Bennée @ 2016-04-15 14:23 UTC (permalink / raw)
To: mttcg, fred.konrad, a.rigo, serge.fdrv, cota
Cc: qemu-devel, mark.burton, pbonzini, jan.kiszka, rth, peter.maydell,
claudio.fontana, Sergey Fedorov, Peter Crosthwaite
From: Sergey Fedorov <serge.fdrv@gmail.com>
Ensure direct jump patching in i386 is atomic by:
* naturally aligning the location of the direct jump address;
* using atomic_read()/atomic_set() for code patching.
Signed-off-by: Sergey Fedorov <serge.fdrv@gmail.com>
Signed-off-by: Sergey Fedorov <sergey.fedorov@linaro.org>
---
include/exec/exec-all.h | 2 +-
tcg/i386/tcg-target.inc.c | 17 +++++++++++++++++
2 files changed, 18 insertions(+), 1 deletion(-)
diff --git a/include/exec/exec-all.h b/include/exec/exec-all.h
index cde4b7a..9144ee0 100644
--- a/include/exec/exec-all.h
+++ b/include/exec/exec-all.h
@@ -324,7 +324,7 @@ void ppc_tb_set_jmp_target(uintptr_t jmp_addr, uintptr_t addr);
static inline void tb_set_jmp_target1(uintptr_t jmp_addr, uintptr_t addr)
{
/* patch the branch destination */
- stl_le_p((void*)jmp_addr, addr - (jmp_addr + 4));
+ atomic_set((int32_t *)jmp_addr, addr - (jmp_addr + 4));
/* no need to flush icache explicitly */
}
#elif defined(__s390x__)
diff --git a/tcg/i386/tcg-target.inc.c b/tcg/i386/tcg-target.inc.c
index 2f98cae..8fd37f4 100644
--- a/tcg/i386/tcg-target.inc.c
+++ b/tcg/i386/tcg-target.inc.c
@@ -1123,6 +1123,19 @@ static void tcg_out_jmp(TCGContext *s, tcg_insn_unit *dest)
tcg_out_branch(s, 0, dest);
}
+static void tcg_out_nopn(TCGContext *s, int n)
+{
+ static const uint8_t nop1[] = { 0x90 };
+ static const uint8_t nop2[] = { 0x66, 0x90 };
+ static const uint8_t nop3[] = { 0x8d, 0x76, 0x00 };
+ static const uint8_t *const nopn[] = { nop1, nop2, nop3 };
+ int i;
+ assert(n <= ARRAY_SIZE(nopn));
+ for (i = 0; i < n; ++i) {
+ tcg_out8(s, nopn[n - 1][i]);
+ }
+}
+
#if defined(CONFIG_SOFTMMU)
/* helper signature: helper_ret_ld_mmu(CPUState *env, target_ulong addr,
* int mmu_idx, uintptr_t ra)
@@ -1777,6 +1790,10 @@ static inline void tcg_out_op(TCGContext *s, TCGOpcode opc,
case INDEX_op_goto_tb:
if (s->tb_jmp_insn_offset) {
/* direct jump method */
+ /* align jump displacement for atomic patching */
+ if (((uintptr_t)s->code_ptr & 3) != 3) {
+ tcg_out_nopn(s, 3 - ((uintptr_t)s->code_ptr & 3));
+ }
tcg_out8(s, OPC_JMP_long); /* jmp im */
s->tb_jmp_insn_offset[args[0]] = tcg_current_code_size(s);
tcg_out32(s, 0);
--
2.7.4
* [Qemu-devel] [RFC v1 03/12] qemu-thread: add simple test-and-set spinlock
2016-04-15 14:23 [Qemu-devel] [RFC v1 00/12] Enable MTTCG for 32 bit arm on x86 Alex Bennée
` (2 preceding siblings ...)
2016-04-15 14:23 ` [Qemu-devel] [RFC v1 02/12] tcg/i386: Make direct jump patching thread-safe Alex Bennée
@ 2016-04-15 14:23 ` Alex Bennée
2016-04-15 14:23 ` [Qemu-devel] [RFC v1 04/12] atomic: introduce atomic_dec_fetch Alex Bennée
` (9 subsequent siblings)
13 siblings, 0 replies; 41+ messages in thread
From: Alex Bennée @ 2016-04-15 14:23 UTC (permalink / raw)
To: mttcg, fred.konrad, a.rigo, serge.fdrv, cota
Cc: qemu-devel, mark.burton, pbonzini, jan.kiszka, rth, peter.maydell,
claudio.fontana, Alex Bennée, Guillaume Delbergue
This is a temporary squashed patch of the following patches from
Emilio's qht series. This includes:
- include/processor.h: define cpu_relax()
- qemu-thread: add simple test-and-set spinlock
- qemu-thread: call cpu_relax() while spinning
- qemu-thread: optimize spin_lock for uncontended locks
The sign-offs were from:
Signed-off-by: Emilio G. Cota <cota@braap.org>
Signed-off-by: Guillaume Delbergue <guillaume.delbergue@greensocs.com>
[Rewritten. - Paolo]
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Signed-off-by: Alex Bennée <alex.bennee@linaro.org>
---
include/qemu/processor.h | 28 ++++++++++++++++++++++++++++
include/qemu/thread.h | 34 ++++++++++++++++++++++++++++++++++
2 files changed, 62 insertions(+)
create mode 100644 include/qemu/processor.h
diff --git a/include/qemu/processor.h b/include/qemu/processor.h
new file mode 100644
index 0000000..675a00a
--- /dev/null
+++ b/include/qemu/processor.h
@@ -0,0 +1,28 @@
+#ifndef QEMU_PROCESSOR_H
+#define QEMU_PROCESSOR_H
+
+#include "qemu/atomic.h"
+
+#if defined(__i386__) || defined(__x86_64__)
+#define cpu_relax() asm volatile("rep; nop" ::: "memory")
+#endif
+
+#ifdef __ia64__
+#define cpu_relax() asm volatile("hint @pause" ::: "memory")
+#endif
+
+#ifdef __aarch64__
+#define cpu_relax() asm volatile("yield" ::: "memory")
+#endif
+
+#if defined(__powerpc64__)
+/* set Hardware Multi-Threading (HMT) priority to low; then back to medium */
+#define cpu_relax() asm volatile("or 1, 1, 1;"
+ "or 2, 2, 2;" ::: "memory")
+#endif
+
+#ifndef cpu_relax
+#define cpu_relax() barrier()
+#endif
+
+#endif /* QEMU_PROCESSOR_H */
diff --git a/include/qemu/thread.h b/include/qemu/thread.h
index bdae6df..e2af57c 100644
--- a/include/qemu/thread.h
+++ b/include/qemu/thread.h
@@ -1,6 +1,9 @@
#ifndef __QEMU_THREAD_H
#define __QEMU_THREAD_H 1
+#include <errno.h>
+#include "qemu/processor.h"
+#include "qemu/atomic.h"
typedef struct QemuMutex QemuMutex;
typedef struct QemuCond QemuCond;
@@ -60,4 +63,35 @@ struct Notifier;
void qemu_thread_atexit_add(struct Notifier *notifier);
void qemu_thread_atexit_remove(struct Notifier *notifier);
+typedef struct QemuSpin {
+ int value;
+} QemuSpin;
+
+static inline void qemu_spin_init(QemuSpin *spin)
+{
+ spin->value = 0;
+}
+
+static inline void qemu_spin_lock(QemuSpin *spin)
+{
+ while (atomic_xchg(&spin->value, true)) {
+ while (atomic_read(&spin->value)) {
+ cpu_relax();
+ }
+ }
+}
+
+static inline int qemu_spin_trylock(QemuSpin *spin)
+{
+ if (atomic_read(&spin->value) || atomic_xchg(&spin->value, true)) {
+ return -EBUSY;
+ }
+ return 0;
+}
+
+static inline void qemu_spin_unlock(QemuSpin *spin)
+{
+ atomic_mb_set(&spin->value, 0);
+}
+
#endif
--
2.7.4
* [Qemu-devel] [RFC v1 04/12] atomic: introduce atomic_dec_fetch.
2016-04-15 14:23 [Qemu-devel] [RFC v1 00/12] Enable MTTCG for 32 bit arm on x86 Alex Bennée
` (3 preceding siblings ...)
2016-04-15 14:23 ` [Qemu-devel] [RFC v1 03/12] qemu-thread: add simple test-and-set spinlock Alex Bennée
@ 2016-04-15 14:23 ` Alex Bennée
2016-06-02 20:34 ` Sergey Fedorov
2016-04-15 14:23 ` [Qemu-devel] [RFC v1 05/12] atomic: introduce cmpxchg_bool Alex Bennée
` (8 subsequent siblings)
13 siblings, 1 reply; 41+ messages in thread
From: Alex Bennée @ 2016-04-15 14:23 UTC (permalink / raw)
To: mttcg, fred.konrad, a.rigo, serge.fdrv, cota
Cc: qemu-devel, mark.burton, pbonzini, jan.kiszka, rth, peter.maydell,
claudio.fontana, Alex Bennée
Useful for counting down.
Signed-off-by: Alex Bennée <alex.bennee@linaro.org>
---
include/qemu/atomic.h | 4 ++++
1 file changed, 4 insertions(+)
diff --git a/include/qemu/atomic.h b/include/qemu/atomic.h
index 8f1d8d9..5dba7db 100644
--- a/include/qemu/atomic.h
+++ b/include/qemu/atomic.h
@@ -131,6 +131,8 @@
#define atomic_fetch_and(ptr, n) __atomic_fetch_and(ptr, n, __ATOMIC_SEQ_CST)
#define atomic_fetch_or(ptr, n) __atomic_fetch_or(ptr, n, __ATOMIC_SEQ_CST)
+#define atomic_dec_fetch(ptr) __atomic_sub_fetch(ptr, 1, __ATOMIC_SEQ_CST)
+
/* And even shorter names that return void. */
#define atomic_inc(ptr) ((void) __atomic_fetch_add(ptr, 1, __ATOMIC_SEQ_CST))
#define atomic_dec(ptr) ((void) __atomic_fetch_sub(ptr, 1, __ATOMIC_SEQ_CST))
@@ -326,6 +328,8 @@
#define atomic_fetch_or __sync_fetch_and_or
#define atomic_cmpxchg __sync_val_compare_and_swap
+#define atomic_dec_fetch(ptr) __sync_sub_and_fetch(ptr, 1)
+
/* And even shorter names that return void. */
#define atomic_inc(ptr) ((void) __sync_fetch_and_add(ptr, 1))
#define atomic_dec(ptr) ((void) __sync_fetch_and_add(ptr, -1))
--
2.7.4
* Re: [Qemu-devel] [RFC v1 04/12] atomic: introduce atomic_dec_fetch.
2016-04-15 14:23 ` [Qemu-devel] [RFC v1 04/12] atomic: introduce atomic_dec_fetch Alex Bennée
@ 2016-06-02 20:34 ` Sergey Fedorov
0 siblings, 0 replies; 41+ messages in thread
From: Sergey Fedorov @ 2016-06-02 20:34 UTC (permalink / raw)
To: Alex Bennée, mttcg, fred.konrad, a.rigo, cota
Cc: qemu-devel, mark.burton, pbonzini, jan.kiszka, rth, peter.maydell,
claudio.fontana
On 15/04/16 17:23, Alex Bennée wrote:
> Useful for counting down.
>
> Signed-off-by: Alex Bennée <alex.bennee@linaro.org>
Reviewed-by: Sergey Fedorov <sergey.fedorov@linaro.org>
> ---
> include/qemu/atomic.h | 4 ++++
> 1 file changed, 4 insertions(+)
>
> diff --git a/include/qemu/atomic.h b/include/qemu/atomic.h
> index 8f1d8d9..5dba7db 100644
> --- a/include/qemu/atomic.h
> +++ b/include/qemu/atomic.h
> @@ -131,6 +131,8 @@
> #define atomic_fetch_and(ptr, n) __atomic_fetch_and(ptr, n, __ATOMIC_SEQ_CST)
> #define atomic_fetch_or(ptr, n) __atomic_fetch_or(ptr, n, __ATOMIC_SEQ_CST)
>
> +#define atomic_dec_fetch(ptr) __atomic_sub_fetch(ptr, 1, __ATOMIC_SEQ_CST)
> +
> /* And even shorter names that return void. */
> #define atomic_inc(ptr) ((void) __atomic_fetch_add(ptr, 1, __ATOMIC_SEQ_CST))
> #define atomic_dec(ptr) ((void) __atomic_fetch_sub(ptr, 1, __ATOMIC_SEQ_CST))
> @@ -326,6 +328,8 @@
> #define atomic_fetch_or __sync_fetch_and_or
> #define atomic_cmpxchg __sync_val_compare_and_swap
>
> +#define atomic_dec_fetch(ptr) __sync_sub_and_fetch(ptr, 1)
> +
> /* And even shorter names that return void. */
> #define atomic_inc(ptr) ((void) __sync_fetch_and_add(ptr, 1))
> #define atomic_dec(ptr) ((void) __sync_fetch_and_add(ptr, -1))
* [Qemu-devel] [RFC v1 05/12] atomic: introduce cmpxchg_bool
2016-04-15 14:23 [Qemu-devel] [RFC v1 00/12] Enable MTTCG for 32 bit arm on x86 Alex Bennée
` (4 preceding siblings ...)
2016-04-15 14:23 ` [Qemu-devel] [RFC v1 04/12] atomic: introduce atomic_dec_fetch Alex Bennée
@ 2016-04-15 14:23 ` Alex Bennée
2016-04-15 16:22 ` Richard Henderson
2016-06-03 16:45 ` Sergey Fedorov
2016-04-15 14:23 ` [Qemu-devel] [RFC v1 06/12] cpus: pass CPUState to run_on_cpu helpers Alex Bennée
` (7 subsequent siblings)
13 siblings, 2 replies; 41+ messages in thread
From: Alex Bennée @ 2016-04-15 14:23 UTC (permalink / raw)
To: mttcg, fred.konrad, a.rigo, serge.fdrv, cota
Cc: qemu-devel, mark.burton, pbonzini, jan.kiszka, rth, peter.maydell,
claudio.fontana, Alex Bennée
This allows for slightly neater code when checking for success of a
cmpxchg operation.
Signed-off-by: Alex Bennée <alex.bennee@linaro.org>
---
include/qemu/atomic.h | 11 +++++++++++
1 file changed, 11 insertions(+)
diff --git a/include/qemu/atomic.h b/include/qemu/atomic.h
index 5dba7db..94e7110 100644
--- a/include/qemu/atomic.h
+++ b/include/qemu/atomic.h
@@ -123,6 +123,16 @@
_old; \
})
+#define atomic_bool_cmpxchg(ptr, old, new) \
+ ({ \
+ typeof(*ptr) _old = (old), _new = (new); \
+ bool r; \
+ r = __atomic_compare_exchange(ptr, &_old, &_new, false, \
+ __ATOMIC_SEQ_CST, __ATOMIC_SEQ_CST); \
+ r; \
+ })
+
+
/* Provide shorter names for GCC atomic builtins, return old value */
#define atomic_fetch_inc(ptr) __atomic_fetch_add(ptr, 1, __ATOMIC_SEQ_CST)
#define atomic_fetch_dec(ptr) __atomic_fetch_sub(ptr, 1, __ATOMIC_SEQ_CST)
@@ -327,6 +337,7 @@
#define atomic_fetch_and __sync_fetch_and_and
#define atomic_fetch_or __sync_fetch_and_or
#define atomic_cmpxchg __sync_val_compare_and_swap
+#define atomic_bool_cmpxchg __sync_bool_compare_and_swap
#define atomic_dec_fetch(ptr) __sync_sub_and_fetch(ptr, 1)
--
2.7.4
* Re: [Qemu-devel] [RFC v1 05/12] atomic: introduce cmpxchg_bool
2016-04-15 14:23 ` [Qemu-devel] [RFC v1 05/12] atomic: introduce cmpxchg_bool Alex Bennée
@ 2016-04-15 16:22 ` Richard Henderson
2016-04-15 17:06 ` Alex Bennée
2016-06-03 16:45 ` Sergey Fedorov
1 sibling, 1 reply; 41+ messages in thread
From: Richard Henderson @ 2016-04-15 16:22 UTC (permalink / raw)
To: Alex Bennée, mttcg, fred.konrad, a.rigo, serge.fdrv, cota
Cc: qemu-devel, mark.burton, pbonzini, jan.kiszka, peter.maydell,
claudio.fontana
On 04/15/2016 07:23 AM, Alex Bennée wrote:
> +#define atomic_bool_cmpxchg(ptr, old, new) \
> + ({ \
> + typeof(*ptr) _old = (old), _new = (new); \
> + bool r; \
> + r = __atomic_compare_exchange(ptr, &_old, &_new, false, \
> + __ATOMIC_SEQ_CST, __ATOMIC_SEQ_CST); \
> + r; \
> + })
How are you thinking this will be used? If a loop like
do {
old = atomic_read (ptr);
new = f(old);
} while (!atomic_bool_cmpxchg(ptr, old, new));
then it's usually helpful to use a weak compare_exchange (s/false/true/ above).
This will produce one loop for ll/sc architectures instead of two.
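For illustration, a weak flavour (hypothetical, not something in this
series) would just flip that argument:

  /* may fail spuriously, so only use it inside a retry loop as above */
  #define atomic_bool_cmpxchg_weak(ptr, old, new)                       \
      ({                                                                \
          typeof(*ptr) _old = (old), _new = (new);                      \
          __atomic_compare_exchange(ptr, &_old, &_new, true,            \
                                    __ATOMIC_SEQ_CST,                   \
                                    __ATOMIC_SEQ_CST);                  \
      })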
r~
* Re: [Qemu-devel] [RFC v1 05/12] atomic: introduce cmpxchg_bool
2016-04-15 16:22 ` Richard Henderson
@ 2016-04-15 17:06 ` Alex Bennée
0 siblings, 0 replies; 41+ messages in thread
From: Alex Bennée @ 2016-04-15 17:06 UTC (permalink / raw)
To: Richard Henderson
Cc: mttcg, fred.konrad, a.rigo, serge.fdrv, cota, qemu-devel,
mark.burton, pbonzini, jan.kiszka, peter.maydell, claudio.fontana
Richard Henderson <rth@twiddle.net> writes:
> On 04/15/2016 07:23 AM, Alex Bennée wrote:
>> +#define atomic_bool_cmpxchg(ptr, old, new) \
>> + ({ \
>> + typeof(*ptr) _old = (old), _new = (new); \
>> + bool r; \
>> + r = __atomic_compare_exchange(ptr, &_old, &_new, false, \
>> + __ATOMIC_SEQ_CST, __ATOMIC_SEQ_CST); \
>> + r; \
>> + })
>
> How are you thinking this will be used? If a loop like
>
> do {
> old = atomic_read (ptr);
> new = f(old);
> } while (!atomic_bool_cmpxchg(ptr, old, new));
>
> then it's usually helpful to use a weak compare_exchange (s/false/true/ above).
> This will produce one loop for ll/sc architectures instead of two.
I used it to make Fred's STREX code a little neater:
if (len == 1 << size) {
oldval = (uint8_t)env->exclusive_val;
result = atomic_bool_cmpxchg(p, oldval, (uint8_t)newval);
}
address_space_unmap(cs->as, p, len, true, result ? len : 0);
Instead of:
if (len == 1 << size) {
oldval = (uint8_t)env->exclusive_val;
result = (atomic_cmpxchg(p, oldval, (uint8_t)newval) == oldval);
}
address_space_unmap(cs->as, p, len, true, result ? len : 0);
But now I'm wondering if there is a race in there. I'll have to look
closer.
>
>
> r~
--
Alex Bennée
* Re: [Qemu-devel] [RFC v1 05/12] atomic: introduce cmpxchg_bool
2016-04-15 14:23 ` [Qemu-devel] [RFC v1 05/12] atomic: introduce cmpxchg_bool Alex Bennée
2016-04-15 16:22 ` Richard Henderson
@ 2016-06-03 16:45 ` Sergey Fedorov
2016-06-03 19:12 ` Alex Bennée
1 sibling, 1 reply; 41+ messages in thread
From: Sergey Fedorov @ 2016-06-03 16:45 UTC (permalink / raw)
To: Alex Bennée, mttcg, fred.konrad, a.rigo, cota
Cc: qemu-devel, mark.burton, pbonzini, jan.kiszka, rth, peter.maydell,
claudio.fontana
On 15/04/16 17:23, Alex Bennée wrote:
> diff --git a/include/qemu/atomic.h b/include/qemu/atomic.h
> index 5dba7db..94e7110 100644
> --- a/include/qemu/atomic.h
> +++ b/include/qemu/atomic.h
> @@ -123,6 +123,16 @@
> _old; \
> })
>
> +#define atomic_bool_cmpxchg(ptr, old, new) \
> + ({ \
> + typeof(*ptr) _old = (old), _new = (new); \
> + bool r; \
> + r = __atomic_compare_exchange(ptr, &_old, &_new, false, \
> + __ATOMIC_SEQ_CST, __ATOMIC_SEQ_CST); \
> + r; \
> + })
> +
> +
Could be more simple:
#define atomic_bool_cmpxchg(ptr, old, new) \
({ \
typeof(*ptr) _old = (old), _new = (new); \
__atomic_compare_exchange(ptr, &_old, &_new, false, \
__ATOMIC_SEQ_CST, __ATOMIC_SEQ_CST); \
})
Kind regards,
Sergey
* Re: [Qemu-devel] [RFC v1 05/12] atomic: introduce cmpxchg_bool
2016-06-03 16:45 ` Sergey Fedorov
@ 2016-06-03 19:12 ` Alex Bennée
2016-06-03 19:20 ` Eric Blake
0 siblings, 1 reply; 41+ messages in thread
From: Alex Bennée @ 2016-06-03 19:12 UTC (permalink / raw)
To: Sergey Fedorov
Cc: mttcg, fred.konrad, a.rigo, cota, qemu-devel, mark.burton,
pbonzini, jan.kiszka, rth, peter.maydell, claudio.fontana
Sergey Fedorov <serge.fdrv@gmail.com> writes:
> On 15/04/16 17:23, Alex Bennée wrote:
>> diff --git a/include/qemu/atomic.h b/include/qemu/atomic.h
>> index 5dba7db..94e7110 100644
>> --- a/include/qemu/atomic.h
>> +++ b/include/qemu/atomic.h
>> @@ -123,6 +123,16 @@
>> _old; \
>> })
>>
>> +#define atomic_bool_cmpxchg(ptr, old, new) \
>> + ({ \
>> + typeof(*ptr) _old = (old), _new = (new); \
>> + bool r; \
>> + r = __atomic_compare_exchange(ptr, &_old, &_new, false, \
>> + __ATOMIC_SEQ_CST, __ATOMIC_SEQ_CST); \
>> + r; \
>> + })
>> +
>> +
>
> Could be more simple:
>
> #define atomic_bool_cmpxchg(ptr, old, new) \
> ({ \
> typeof(*ptr) _old = (old), _new = (new); \
> __atomic_compare_exchange(ptr, &_old, &_new, false, \
> __ATOMIC_SEQ_CST, __ATOMIC_SEQ_CST); \
> })
OK that makes sense. I'll have to ask my toolchain colleague what the
rules are for results from {} blocks.
--
Alex Bennée
* Re: [Qemu-devel] [RFC v1 05/12] atomic: introduce cmpxchg_bool
2016-06-03 19:12 ` Alex Bennée
@ 2016-06-03 19:20 ` Eric Blake
0 siblings, 0 replies; 41+ messages in thread
From: Eric Blake @ 2016-06-03 19:20 UTC (permalink / raw)
To: Alex Bennée, Sergey Fedorov
Cc: mttcg, peter.maydell, claudio.fontana, jan.kiszka, mark.burton,
qemu-devel, a.rigo, cota, pbonzini, rth, fred.konrad
On 06/03/2016 01:12 PM, Alex Bennée wrote:
>>> +#define atomic_bool_cmpxchg(ptr, old, new) \
>>> + ({ \
>>> + typeof(*ptr) _old = (old), _new = (new); \
>>> + bool r; \
>>> + r = __atomic_compare_exchange(ptr, &_old, &_new, false, \
>>> + __ATOMIC_SEQ_CST, __ATOMIC_SEQ_CST); \
>>> + r; \
>>> + })
>>> +
>>> +
>>
>> Could be more simple:
>>
>> #define atomic_bool_cmpxchg(ptr, old, new) \
>> ({ \
>> typeof(*ptr) _old = (old), _new = (new); \
>> __atomic_compare_exchange(ptr, &_old, &_new, false, \
>> __ATOMIC_SEQ_CST, __ATOMIC_SEQ_CST); \
>> })
>
> OK that makes sense. I'll have to ask my toolchain colleague what the
> rules are for results from {} blocks.
It's a gcc extension, and the rule is that the value of the overall ({})
is the value of the last statement in the block. No need for a
temporary variable 'r' if you can just use __atomic_compare_exchange()
as the last statement.
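A tiny standalone illustration of that rule (plain C, nothing
QEMU-specific):

  #include <stdio.h>

  /* the value of ({ ... }) is the value of its last statement */
  #define square_plus_one(x) ({ int _t = (x); _t * _t + 1; })

  int main(void)
  {
      printf("%d\n", square_plus_one(3));   /* prints 10 */
      return 0;
  }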
--
Eric Blake eblake redhat com +1-919-301-3266
Libvirt virtualization library http://libvirt.org
* [Qemu-devel] [RFC v1 06/12] cpus: pass CPUState to run_on_cpu helpers
2016-04-15 14:23 [Qemu-devel] [RFC v1 00/12] Enable MTTCG for 32 bit arm on x86 Alex Bennée
` (5 preceding siblings ...)
2016-04-15 14:23 ` [Qemu-devel] [RFC v1 05/12] atomic: introduce cmpxchg_bool Alex Bennée
@ 2016-04-15 14:23 ` Alex Bennée
2016-04-20 18:59 ` Eduardo Habkost
2016-04-15 14:23 ` [Qemu-devel] [RFC v1 07/12] cpus: introduce async_safe_run_on_cpu Alex Bennée
` (6 subsequent siblings)
13 siblings, 1 reply; 41+ messages in thread
From: Alex Bennée @ 2016-04-15 14:23 UTC (permalink / raw)
To: mttcg, fred.konrad, a.rigo, serge.fdrv, cota
Cc: qemu-devel, mark.burton, pbonzini, jan.kiszka, rth, peter.maydell,
claudio.fontana, Alex Bennée, Peter Crosthwaite,
Eduardo Habkost, Michael S. Tsirkin, Alexander Graf, David Gibson,
Andreas Färber, Marcelo Tosatti, open list:PowerPC,
open list:Overall
CPUState is a fairly common pointer to pass to these helpers. This means
if you need other arguments for the async_run_on_cpu case you end up
having to do a g_malloc to stuff additional data into the routine. For
the current users this isn't a massive deal but for MTTCG this gets
cumbersome when the only other parameter is often an address.
This adds the typedef run_on_cpu_func for helper functions which has an
explicit CPUState * passed as the first parameter. All the users of
run_on_cpu and async_run_on_cpu have had their helpers updated to use
CPUState where available.
Signed-off-by: Alex Bennée <alex.bennee@linaro.org>
---
cpus.c | 15 +++++++--------
hw/i386/kvm/apic.c | 3 +--
hw/i386/kvmvapic.c | 8 ++++----
hw/ppc/ppce500_spin.c | 3 +--
hw/ppc/spapr.c | 6 ++----
hw/ppc/spapr_hcall.c | 12 +++++-------
include/qom/cpu.h | 8 +++++---
kvm-all.c | 20 +++++++-------------
target-i386/helper.c | 3 +--
target-i386/kvm.c | 6 ++----
target-s390x/cpu.c | 4 ++--
target-s390x/cpu.h | 7 ++-----
12 files changed, 39 insertions(+), 56 deletions(-)
diff --git a/cpus.c b/cpus.c
index f7c7359..9177161 100644
--- a/cpus.c
+++ b/cpus.c
@@ -583,9 +583,8 @@ static const VMStateDescription vmstate_timers = {
}
};
-static void cpu_throttle_thread(void *opaque)
+static void cpu_throttle_thread(CPUState *cpu, void *opaque)
{
- CPUState *cpu = opaque;
double pct;
double throttle_ratio;
long sleeptime_ns;
@@ -615,7 +614,7 @@ static void cpu_throttle_timer_tick(void *opaque)
}
CPU_FOREACH(cpu) {
if (!atomic_xchg(&cpu->throttle_thread_scheduled, 1)) {
- async_run_on_cpu(cpu, cpu_throttle_thread, cpu);
+ async_run_on_cpu(cpu, cpu_throttle_thread, NULL);
}
}
@@ -940,12 +939,12 @@ void qemu_init_cpu_loop(void)
qemu_thread_get_self(&io_thread);
}
-void run_on_cpu(CPUState *cpu, void (*func)(void *data), void *data)
+void run_on_cpu(CPUState *cpu, run_on_cpu_func func, void *data)
{
struct qemu_work_item wi;
if (qemu_cpu_is_self(cpu)) {
- func(data);
+ func(cpu, data);
return;
}
@@ -970,12 +969,12 @@ void run_on_cpu(CPUState *cpu, void (*func)(void *data), void *data)
}
}
-void async_run_on_cpu(CPUState *cpu, void (*func)(void *data), void *data)
+void async_run_on_cpu(CPUState *cpu, run_on_cpu_func func, void *data)
{
struct qemu_work_item *wi;
if (qemu_cpu_is_self(cpu)) {
- func(data);
+ func(cpu, data);
return;
}
@@ -1014,7 +1013,7 @@ static void flush_queued_work(CPUState *cpu)
cpu->queued_work_last = NULL;
}
qemu_mutex_unlock(&cpu->work_mutex);
- wi->func(wi->data);
+ wi->func(cpu, wi->data);
qemu_mutex_lock(&cpu->work_mutex);
if (wi->free) {
g_free(wi);
diff --git a/hw/i386/kvm/apic.c b/hw/i386/kvm/apic.c
index 3c7c8fa..d019724 100644
--- a/hw/i386/kvm/apic.c
+++ b/hw/i386/kvm/apic.c
@@ -123,10 +123,9 @@ static void kvm_apic_vapic_base_update(APICCommonState *s)
}
}
-static void do_inject_external_nmi(void *data)
+static void do_inject_external_nmi(CPUState *cpu, void *data)
{
APICCommonState *s = data;
- CPUState *cpu = CPU(s->cpu);
uint32_t lvt;
int ret;
diff --git a/hw/i386/kvmvapic.c b/hw/i386/kvmvapic.c
index a0439a8..0ec3a96 100644
--- a/hw/i386/kvmvapic.c
+++ b/hw/i386/kvmvapic.c
@@ -487,7 +487,7 @@ typedef struct VAPICEnableTPRReporting {
bool enable;
} VAPICEnableTPRReporting;
-static void vapic_do_enable_tpr_reporting(void *data)
+static void vapic_do_enable_tpr_reporting(CPUState *cpu, void *data)
{
VAPICEnableTPRReporting *info = data;
@@ -738,15 +738,15 @@ static void vapic_realize(DeviceState *dev, Error **errp)
nb_option_roms++;
}
-static void do_vapic_enable(void *data)
+static void do_vapic_enable(CPUState *cpu, void *data)
{
VAPICROMState *s = data;
- X86CPU *cpu = X86_CPU(first_cpu);
+ X86CPU *x86_cpu = X86_CPU(cpu);
static const uint8_t enabled = 1;
cpu_physical_memory_write(s->vapic_paddr + offsetof(VAPICState, enabled),
&enabled, sizeof(enabled));
- apic_enable_vapic(cpu->apic_state, s->vapic_paddr);
+ apic_enable_vapic(x86_cpu->apic_state, s->vapic_paddr);
s->state = VAPIC_ACTIVE;
}
diff --git a/hw/ppc/ppce500_spin.c b/hw/ppc/ppce500_spin.c
index 76bd78b..e2aeef3 100644
--- a/hw/ppc/ppce500_spin.c
+++ b/hw/ppc/ppce500_spin.c
@@ -94,10 +94,9 @@ static void mmubooke_create_initial_mapping(CPUPPCState *env,
env->tlb_dirty = true;
}
-static void spin_kick(void *data)
+static void spin_kick(CPUState *cpu, void *data)
{
SpinKick *kick = data;
- CPUState *cpu = CPU(kick->cpu);
CPUPPCState *env = &kick->cpu->env;
SpinInfo *curspin = kick->spin;
hwaddr map_size = 64 * 1024 * 1024;
diff --git a/hw/ppc/spapr.c b/hw/ppc/spapr.c
index e7be21e..ad1c261 100644
--- a/hw/ppc/spapr.c
+++ b/hw/ppc/spapr.c
@@ -2106,10 +2106,8 @@ static void spapr_machine_finalizefn(Object *obj)
g_free(spapr->kvm_type);
}
-static void ppc_cpu_do_nmi_on_cpu(void *arg)
+static void ppc_cpu_do_nmi_on_cpu(CPUState *cs, void *arg)
{
- CPUState *cs = arg;
-
cpu_synchronize_state(cs);
ppc_cpu_do_system_reset(cs);
}
@@ -2119,7 +2117,7 @@ static void spapr_nmi(NMIState *n, int cpu_index, Error **errp)
CPUState *cs;
CPU_FOREACH(cs) {
- async_run_on_cpu(cs, ppc_cpu_do_nmi_on_cpu, cs);
+ async_run_on_cpu(cs, ppc_cpu_do_nmi_on_cpu, NULL);
}
}
diff --git a/hw/ppc/spapr_hcall.c b/hw/ppc/spapr_hcall.c
index 2dcb676..4b7fbb7 100644
--- a/hw/ppc/spapr_hcall.c
+++ b/hw/ppc/spapr_hcall.c
@@ -10,19 +10,18 @@
#include "kvm_ppc.h"
struct SPRSyncState {
- CPUState *cs;
int spr;
target_ulong value;
target_ulong mask;
};
-static void do_spr_sync(void *arg)
+static void do_spr_sync(CPUState *cs, void *arg)
{
struct SPRSyncState *s = arg;
- PowerPCCPU *cpu = POWERPC_CPU(s->cs);
+ PowerPCCPU *cpu = POWERPC_CPU(cs);
CPUPPCState *env = &cpu->env;
- cpu_synchronize_state(s->cs);
+ cpu_synchronize_state(cs);
env->spr[s->spr] &= ~s->mask;
env->spr[s->spr] |= s->value;
}
@@ -31,7 +30,6 @@ static void set_spr(CPUState *cs, int spr, target_ulong value,
target_ulong mask)
{
struct SPRSyncState s = {
- .cs = cs,
.spr = spr,
.value = value,
.mask = mask
@@ -911,11 +909,11 @@ typedef struct {
Error *err;
} SetCompatState;
-static void do_set_compat(void *arg)
+static void do_set_compat(CPUState *cs, void *arg)
{
SetCompatState *s = arg;
- cpu_synchronize_state(CPU(s->cpu));
+ cpu_synchronize_state(cs);
ppc_set_compat(s->cpu, s->cpu_version, &s->err);
}
diff --git a/include/qom/cpu.h b/include/qom/cpu.h
index ab08f1a..385d5bb 100644
--- a/include/qom/cpu.h
+++ b/include/qom/cpu.h
@@ -223,9 +223,11 @@ struct kvm_run;
#define TB_JMP_CACHE_SIZE (1 << TB_JMP_CACHE_BITS)
/* work queue */
+typedef void (*run_on_cpu_func)(CPUState *cpu, void *data);
+
struct qemu_work_item {
struct qemu_work_item *next;
- void (*func)(void *data);
+ run_on_cpu_func func;
void *data;
int done;
bool free;
@@ -627,7 +629,7 @@ bool cpu_is_stopped(CPUState *cpu);
*
* Schedules the function @func for execution on the vCPU @cpu.
*/
-void run_on_cpu(CPUState *cpu, void (*func)(void *data), void *data);
+void run_on_cpu(CPUState *cpu, run_on_cpu_func func, void *data);
/**
* async_run_on_cpu:
@@ -637,7 +639,7 @@ void run_on_cpu(CPUState *cpu, void (*func)(void *data), void *data);
*
* Schedules the function @func for execution on the vCPU @cpu asynchronously.
*/
-void async_run_on_cpu(CPUState *cpu, void (*func)(void *data), void *data);
+void async_run_on_cpu(CPUState *cpu, run_on_cpu_func func, void *data);
/**
* qemu_get_cpu:
diff --git a/kvm-all.c b/kvm-all.c
index e7b66df..b31fbba 100644
--- a/kvm-all.c
+++ b/kvm-all.c
@@ -1755,10 +1755,8 @@ void kvm_flush_coalesced_mmio_buffer(void)
s->coalesced_flush_in_progress = false;
}
-static void do_kvm_cpu_synchronize_state(void *arg)
+static void do_kvm_cpu_synchronize_state(CPUState *cpu, void *arg)
{
- CPUState *cpu = arg;
-
if (!cpu->kvm_vcpu_dirty) {
kvm_arch_get_registers(cpu);
cpu->kvm_vcpu_dirty = true;
@@ -1768,34 +1766,30 @@ static void do_kvm_cpu_synchronize_state(void *arg)
void kvm_cpu_synchronize_state(CPUState *cpu)
{
if (!cpu->kvm_vcpu_dirty) {
- run_on_cpu(cpu, do_kvm_cpu_synchronize_state, cpu);
+ run_on_cpu(cpu, do_kvm_cpu_synchronize_state, NULL);
}
}
-static void do_kvm_cpu_synchronize_post_reset(void *arg)
+static void do_kvm_cpu_synchronize_post_reset(CPUState *cpu, void *arg)
{
- CPUState *cpu = arg;
-
kvm_arch_put_registers(cpu, KVM_PUT_RESET_STATE);
cpu->kvm_vcpu_dirty = false;
}
void kvm_cpu_synchronize_post_reset(CPUState *cpu)
{
- run_on_cpu(cpu, do_kvm_cpu_synchronize_post_reset, cpu);
+ run_on_cpu(cpu, do_kvm_cpu_synchronize_post_reset, NULL);
}
-static void do_kvm_cpu_synchronize_post_init(void *arg)
+static void do_kvm_cpu_synchronize_post_init(CPUState *cpu, void *arg)
{
- CPUState *cpu = arg;
-
kvm_arch_put_registers(cpu, KVM_PUT_FULL_STATE);
cpu->kvm_vcpu_dirty = false;
}
void kvm_cpu_synchronize_post_init(CPUState *cpu)
{
- run_on_cpu(cpu, do_kvm_cpu_synchronize_post_init, cpu);
+ run_on_cpu(cpu, do_kvm_cpu_synchronize_post_init, NULL);
}
int kvm_cpu_exec(CPUState *cpu)
@@ -2137,7 +2131,7 @@ struct kvm_set_guest_debug_data {
int err;
};
-static void kvm_invoke_set_guest_debug(void *data)
+static void kvm_invoke_set_guest_debug(CPUState *unused_cpu, void *data)
{
struct kvm_set_guest_debug_data *dbg_data = data;
diff --git a/target-i386/helper.c b/target-i386/helper.c
index 5755839..1b50f59 100644
--- a/target-i386/helper.c
+++ b/target-i386/helper.c
@@ -1117,11 +1117,10 @@ typedef struct MCEInjectionParams {
int flags;
} MCEInjectionParams;
-static void do_inject_x86_mce(void *data)
+static void do_inject_x86_mce(CPUState *cpu, void *data)
{
MCEInjectionParams *params = data;
CPUX86State *cenv = &params->cpu->env;
- CPUState *cpu = CPU(params->cpu);
uint64_t *banks = cenv->mce_banks + 4 * params->bank;
cpu_synchronize_state(cpu);
diff --git a/target-i386/kvm.c b/target-i386/kvm.c
index 87ab969..4ee678c 100644
--- a/target-i386/kvm.c
+++ b/target-i386/kvm.c
@@ -145,10 +145,8 @@ static int kvm_get_tsc(CPUState *cs)
return 0;
}
-static inline void do_kvm_synchronize_tsc(void *arg)
+static inline void do_kvm_synchronize_tsc(CPUState *cpu, void *arg)
{
- CPUState *cpu = arg;
-
kvm_get_tsc(cpu);
}
@@ -158,7 +156,7 @@ void kvm_synchronize_all_tsc(void)
if (kvm_enabled()) {
CPU_FOREACH(cpu) {
- run_on_cpu(cpu, do_kvm_synchronize_tsc, cpu);
+ run_on_cpu(cpu, do_kvm_synchronize_tsc, NULL);
}
}
}
diff --git a/target-s390x/cpu.c b/target-s390x/cpu.c
index 4bfff34..7dc8e82 100644
--- a/target-s390x/cpu.c
+++ b/target-s390x/cpu.c
@@ -186,7 +186,7 @@ static void s390_cpu_machine_reset_cb(void *opaque)
{
S390CPU *cpu = opaque;
- run_on_cpu(CPU(cpu), s390_do_cpu_full_reset, CPU(cpu));
+ run_on_cpu(CPU(cpu), s390_do_cpu_full_reset, NULL);
}
#endif
@@ -236,7 +236,7 @@ static void s390_cpu_realizefn(DeviceState *dev, Error **errp)
s390_cpu_gdb_init(cs);
qemu_init_vcpu(cs);
#if !defined(CONFIG_USER_ONLY)
- run_on_cpu(cs, s390_do_cpu_full_reset, cs);
+ run_on_cpu(cs, s390_do_cpu_full_reset, NULL);
#else
cpu_reset(cs);
#endif
diff --git a/target-s390x/cpu.h b/target-s390x/cpu.h
index 6d97c08..2112994 100644
--- a/target-s390x/cpu.h
+++ b/target-s390x/cpu.h
@@ -454,17 +454,14 @@ static inline hwaddr decode_basedisp_s(CPUS390XState *env, uint32_t ipb,
#define decode_basedisp_rs decode_basedisp_s
/* helper functions for run_on_cpu() */
-static inline void s390_do_cpu_reset(void *arg)
+static inline void s390_do_cpu_reset(CPUState *cs, void *arg)
{
- CPUState *cs = arg;
S390CPUClass *scc = S390_CPU_GET_CLASS(cs);
scc->cpu_reset(cs);
}
-static inline void s390_do_cpu_full_reset(void *arg)
+static inline void s390_do_cpu_full_reset(CPUState *cs, void *arg)
{
- CPUState *cs = arg;
-
cpu_reset(cs);
}
--
2.7.4
* Re: [Qemu-devel] [RFC v1 06/12] cpus: pass CPUState to run_on_cpu helpers
2016-04-15 14:23 ` [Qemu-devel] [RFC v1 06/12] cpus: pass CPUState to run_on_cpu helpers Alex Bennée
@ 2016-04-20 18:59 ` Eduardo Habkost
2016-04-20 19:50 ` Alex Bennée
0 siblings, 1 reply; 41+ messages in thread
From: Eduardo Habkost @ 2016-04-20 18:59 UTC (permalink / raw)
To: Alex Bennée
Cc: mttcg, fred.konrad, a.rigo, serge.fdrv, cota, qemu-devel,
mark.burton, pbonzini, jan.kiszka, rth, peter.maydell,
claudio.fontana, Peter Crosthwaite, Michael S. Tsirkin,
Alexander Graf, David Gibson, Andreas Färber,
Marcelo Tosatti, open list:PowerPC, open list:Overall
On Fri, Apr 15, 2016 at 03:23:45PM +0100, Alex Bennée wrote:
> CPUState is a fairly common pointer to pass to these helpers. This means
> if you need other arguments for the async_run_on_cpu case you end up
> having to do a g_malloc to stuff additional data into the routine. For
> the current users this isn't a massive deal but for MTTCG this gets
> cumbersome when the only other parameter is often an address.
>
> This adds the typedef run_on_cpu_func for helper functions which has an
> explicit CPUState * passed as the first parameter. All the users of
> run_on_cpu and async_run_on_cpu have had their helpers updated to use
> CPUState where available.
>
> Signed-off-by: Alex Bennée <alex.bennee@linaro.org>
> ---
> cpus.c | 15 +++++++--------
> hw/i386/kvm/apic.c | 3 +--
> hw/i386/kvmvapic.c | 8 ++++----
> hw/ppc/ppce500_spin.c | 3 +--
> hw/ppc/spapr.c | 6 ++----
> hw/ppc/spapr_hcall.c | 12 +++++-------
> include/qom/cpu.h | 8 +++++---
> kvm-all.c | 20 +++++++-------------
> target-i386/helper.c | 3 +--
> target-i386/kvm.c | 6 ++----
> target-s390x/cpu.c | 4 ++--
> target-s390x/cpu.h | 7 ++-----
What about target-s390x/kvm.c and target-s390x/misc_helper.c?
A few other minor comments/questions below. But the patch looks
good overall.
> 12 files changed, 39 insertions(+), 56 deletions(-)
>
[...]
> diff --git a/hw/ppc/ppce500_spin.c b/hw/ppc/ppce500_spin.c
> index 76bd78b..e2aeef3 100644
> --- a/hw/ppc/ppce500_spin.c
> +++ b/hw/ppc/ppce500_spin.c
> @@ -94,10 +94,9 @@ static void mmubooke_create_initial_mapping(CPUPPCState *env,
> env->tlb_dirty = true;
> }
>
> -static void spin_kick(void *data)
> +static void spin_kick(CPUState *cpu, void *data)
> {
> SpinKick *kick = data;
> - CPUState *cpu = CPU(kick->cpu);
> CPUPPCState *env = &kick->cpu->env;
I would set env = &cpu->env to avoid confusion. Then the SpinKick
struct can be removed completely.
> SpinInfo *curspin = kick->spin;
> hwaddr map_size = 64 * 1024 * 1024;
[...]
> diff --git a/hw/ppc/spapr_hcall.c b/hw/ppc/spapr_hcall.c
> index 2dcb676..4b7fbb7 100644
> --- a/hw/ppc/spapr_hcall.c
> +++ b/hw/ppc/spapr_hcall.c
> @@ -10,19 +10,18 @@
> #include "kvm_ppc.h"
>
> struct SPRSyncState {
> - CPUState *cs;
> int spr;
> target_ulong value;
> target_ulong mask;
> };
>
> -static void do_spr_sync(void *arg)
> +static void do_spr_sync(CPUState *cs, void *arg)
> {
> struct SPRSyncState *s = arg;
> - PowerPCCPU *cpu = POWERPC_CPU(s->cs);
> + PowerPCCPU *cpu = POWERPC_CPU(cs);
> CPUPPCState *env = &cpu->env;
>
> - cpu_synchronize_state(s->cs);
> + cpu_synchronize_state(cs);
> env->spr[s->spr] &= ~s->mask;
> env->spr[s->spr] |= s->value;
> }
> @@ -31,7 +30,6 @@ static void set_spr(CPUState *cs, int spr, target_ulong value,
> target_ulong mask)
> {
> struct SPRSyncState s = {
> - .cs = cs,
> .spr = spr,
> .value = value,
> .mask = mask
> @@ -911,11 +909,11 @@ typedef struct {
> Error *err;
> } SetCompatState;
>
> -static void do_set_compat(void *arg)
> +static void do_set_compat(CPUState *cs, void *arg)
> {
> SetCompatState *s = arg;
>
> - cpu_synchronize_state(CPU(s->cpu));
> + cpu_synchronize_state(cs);
> ppc_set_compat(s->cpu, s->cpu_version, &s->err);
This is not incorrect, but inconsistent: you replaced s->cpu
usage on the first line, but kept using s->cpu in the second
line.
Is there a specific reason you removed SPRSyncState.cs but kept
SetCompatState.cpu?
> }
>
[...]
> diff --git a/target-i386/helper.c b/target-i386/helper.c
> index 5755839..1b50f59 100644
> --- a/target-i386/helper.c
> +++ b/target-i386/helper.c
> @@ -1117,11 +1117,10 @@ typedef struct MCEInjectionParams {
> int flags;
> } MCEInjectionParams;
>
> -static void do_inject_x86_mce(void *data)
> +static void do_inject_x86_mce(CPUState *cpu, void *data)
> {
> MCEInjectionParams *params = data;
> CPUX86State *cenv = &params->cpu->env;
I would eliminate MCEInjectionParams.cpu and set cenv = &cpu->env
above, to avoid confusion.
> - CPUState *cpu = CPU(params->cpu);
> uint64_t *banks = cenv->mce_banks + 4 * params->bank;
>
> cpu_synchronize_state(cpu);
[...]
> diff --git a/target-s390x/cpu.h b/target-s390x/cpu.h
> index 6d97c08..2112994 100644
> --- a/target-s390x/cpu.h
> +++ b/target-s390x/cpu.h
> @@ -454,17 +454,14 @@ static inline hwaddr decode_basedisp_s(CPUS390XState *env, uint32_t ipb,
> #define decode_basedisp_rs decode_basedisp_s
>
> /* helper functions for run_on_cpu() */
> -static inline void s390_do_cpu_reset(void *arg)
> +static inline void s390_do_cpu_reset(CPUState *cs, void *arg)
> {
> - CPUState *cs = arg;
> S390CPUClass *scc = S390_CPU_GET_CLASS(cs);
>
> scc->cpu_reset(cs);
> }
> -static inline void s390_do_cpu_full_reset(void *arg)
> +static inline void s390_do_cpu_full_reset(CPUState *cs, void *arg)
> {
> - CPUState *cs = arg;
> -
> cpu_reset(cs);
> }
The run_on_cpu callers at target-s390x/misc_helper.c still pass
an unnecessary non-NULL void* argument, making the code
confusing.
--
Eduardo
* Re: [Qemu-devel] [RFC v1 06/12] cpus: pass CPUState to run_on_cpu helpers
2016-04-20 18:59 ` Eduardo Habkost
@ 2016-04-20 19:50 ` Alex Bennée
0 siblings, 0 replies; 41+ messages in thread
From: Alex Bennée @ 2016-04-20 19:50 UTC (permalink / raw)
To: Eduardo Habkost
Cc: mttcg, fred.konrad, a.rigo, serge.fdrv, cota, qemu-devel,
mark.burton, pbonzini, jan.kiszka, rth, peter.maydell,
claudio.fontana, Peter Crosthwaite, Michael S. Tsirkin,
Alexander Graf, David Gibson, Andreas Färber,
Marcelo Tosatti, open list:PowerPC, open list:Overall
Eduardo Habkost <ehabkost@redhat.com> writes:
> On Fri, Apr 15, 2016 at 03:23:45PM +0100, Alex Bennée wrote:
>> CPUState is a fairly common pointer to pass to these helpers. This means
>> if you need other arguments for the async_run_on_cpu case you end up
>> having to do a g_malloc to stuff additional data into the routine. For
>> the current users this isn't a massive deal but for MTTCG this gets
>> cumbersome when the only other parameter is often an address.
>>
>> This adds the typedef run_on_cpu_func for helper functions which has an
>> explicit CPUState * passed as the first parameter. All the users of
>> run_on_cpu and async_run_on_cpu have had their helpers updated to use
>> CPUState where available.
>>
>> Signed-off-by: Alex Bennée <alex.bennee@linaro.org>
>> ---
>> cpus.c | 15 +++++++--------
>> hw/i386/kvm/apic.c | 3 +--
>> hw/i386/kvmvapic.c | 8 ++++----
>> hw/ppc/ppce500_spin.c | 3 +--
>> hw/ppc/spapr.c | 6 ++----
>> hw/ppc/spapr_hcall.c | 12 +++++-------
>> include/qom/cpu.h | 8 +++++---
>> kvm-all.c | 20 +++++++-------------
>> target-i386/helper.c | 3 +--
>> target-i386/kvm.c | 6 ++----
>> target-s390x/cpu.c | 4 ++--
>> target-s390x/cpu.h | 7 ++-----
>
> What about target-s390x/kvm.c and target-s390x/misc_helper.c?
>
> A few other minor comments/questions below. But the patch looks
> good overall.
Good catch, I should have cross-built to ensure I got all the kvm based
helpers. I'll fix that up.
>
>> 12 files changed, 39 insertions(+), 56 deletions(-)
>>
> [...]
>> diff --git a/hw/ppc/ppce500_spin.c b/hw/ppc/ppce500_spin.c
>> index 76bd78b..e2aeef3 100644
>> --- a/hw/ppc/ppce500_spin.c
>> +++ b/hw/ppc/ppce500_spin.c
>> @@ -94,10 +94,9 @@ static void mmubooke_create_initial_mapping(CPUPPCState *env,
>> env->tlb_dirty = true;
>> }
>>
>> -static void spin_kick(void *data)
>> +static void spin_kick(CPUState *cpu, void *data)
>> {
>> SpinKick *kick = data;
>> - CPUState *cpu = CPU(kick->cpu);
>> CPUPPCState *env = &kick->cpu->env;
>
> I would set env = &cpu->env to avoid confusion. Then the SpinKick
> struct can be removed completely.
ok
>
>> SpinInfo *curspin = kick->spin;
>> hwaddr map_size = 64 * 1024 * 1024;
> [...]
>> diff --git a/hw/ppc/spapr_hcall.c b/hw/ppc/spapr_hcall.c
>> index 2dcb676..4b7fbb7 100644
>> --- a/hw/ppc/spapr_hcall.c
>> +++ b/hw/ppc/spapr_hcall.c
>> @@ -10,19 +10,18 @@
>> #include "kvm_ppc.h"
>>
>> struct SPRSyncState {
>> - CPUState *cs;
>> int spr;
>> target_ulong value;
>> target_ulong mask;
>> };
>>
>> -static void do_spr_sync(void *arg)
>> +static void do_spr_sync(CPUState *cs, void *arg)
>> {
>> struct SPRSyncState *s = arg;
>> - PowerPCCPU *cpu = POWERPC_CPU(s->cs);
>> + PowerPCCPU *cpu = POWERPC_CPU(cs);
>> CPUPPCState *env = &cpu->env;
>>
>> - cpu_synchronize_state(s->cs);
>> + cpu_synchronize_state(cs);
>> env->spr[s->spr] &= ~s->mask;
>> env->spr[s->spr] |= s->value;
>> }
>> @@ -31,7 +30,6 @@ static void set_spr(CPUState *cs, int spr, target_ulong value,
>> target_ulong mask)
>> {
>> struct SPRSyncState s = {
>> - .cs = cs,
>> .spr = spr,
>> .value = value,
>> .mask = mask
>> @@ -911,11 +909,11 @@ typedef struct {
>> Error *err;
>> } SetCompatState;
>>
>> -static void do_set_compat(void *arg)
>> +static void do_set_compat(CPUState *cs, void *arg)
>> {
>> SetCompatState *s = arg;
>>
>> - cpu_synchronize_state(CPU(s->cpu));
>> + cpu_synchronize_state(cs);
>> ppc_set_compat(s->cpu, s->cpu_version, &s->err);
>
> This is not incorrect, but inconsistent: you replaced s->cpu
> usage on the first line, but kept using s->cpu in the second
> line.
>
> Is there a specific reason you removed SPRSyncState.cs but kept
> SetCompatState.cpu?
I was trying for minimal change but you're right, I can do better.
>
>
>> }
>>
> [...]
>> diff --git a/target-i386/helper.c b/target-i386/helper.c
>> index 5755839..1b50f59 100644
>> --- a/target-i386/helper.c
>> +++ b/target-i386/helper.c
>> @@ -1117,11 +1117,10 @@ typedef struct MCEInjectionParams {
>> int flags;
>> } MCEInjectionParams;
>>
>> -static void do_inject_x86_mce(void *data)
>> +static void do_inject_x86_mce(CPUState *cpu, void *data)
>> {
>> MCEInjectionParams *params = data;
>> CPUX86State *cenv = &params->cpu->env;
>
> I would eliminate MCEInjectionParams.cpu and set cenv = &cpu->env
> above, to avoid confusion.
>
>> - CPUState *cpu = CPU(params->cpu);
>> uint64_t *banks = cenv->mce_banks + 4 * params->bank;
>>
>> cpu_synchronize_state(cpu);
> [...]
>> diff --git a/target-s390x/cpu.h b/target-s390x/cpu.h
>> index 6d97c08..2112994 100644
>> --- a/target-s390x/cpu.h
>> +++ b/target-s390x/cpu.h
>> @@ -454,17 +454,14 @@ static inline hwaddr decode_basedisp_s(CPUS390XState *env, uint32_t ipb,
>> #define decode_basedisp_rs decode_basedisp_s
>>
>> /* helper functions for run_on_cpu() */
>> -static inline void s390_do_cpu_reset(void *arg)
>> +static inline void s390_do_cpu_reset(CPUState *cs, void *arg)
>> {
>> - CPUState *cs = arg;
>> S390CPUClass *scc = S390_CPU_GET_CLASS(cs);
>>
>> scc->cpu_reset(cs);
>> }
>> -static inline void s390_do_cpu_full_reset(void *arg)
>> +static inline void s390_do_cpu_full_reset(CPUState *cs, void *arg)
>> {
>> - CPUState *cs = arg;
>> -
>> cpu_reset(cs);
>> }
>
> The run_on_cpu callers at target-s390x/misc_helper.c still pass
> an unnecessary non-NULL void* argument, making the code
> confusing.
Will fix.
--
Alex Bennée
* [Qemu-devel] [RFC v1 07/12] cpus: introduce async_safe_run_on_cpu.
2016-04-15 14:23 [Qemu-devel] [RFC v1 00/12] Enable MTTCG for 32 bit arm on x86 Alex Bennée
` (6 preceding siblings ...)
2016-04-15 14:23 ` [Qemu-devel] [RFC v1 06/12] cpus: pass CPUState to run_on_cpu helpers Alex Bennée
@ 2016-04-15 14:23 ` Alex Bennée
2016-06-05 16:01 ` Sergey Fedorov
2016-06-05 16:44 ` Sergey Fedorov
2016-04-15 14:23 ` [Qemu-devel] [RFC v1 08/12] cputlb: introduce tlb_flush_* async work Alex Bennée
` (5 subsequent siblings)
13 siblings, 2 replies; 41+ messages in thread
From: Alex Bennée @ 2016-04-15 14:23 UTC (permalink / raw)
To: mttcg, fred.konrad, a.rigo, serge.fdrv, cota
Cc: qemu-devel, mark.burton, pbonzini, jan.kiszka, rth, peter.maydell,
claudio.fontana, Alex Bennée, Peter Crosthwaite,
Andreas Färber
From: KONRAD Frederic <fred.konrad@greensocs.com>
We already had async_run_on_cpu but for some tasks we need to ensure all
vCPUs have stopped running before updating shared structures. We call
this safe work as it is safe to make the modifications.
Work is scheduled with async_safe_run_on_cpu() which is passed the
CPUState and an anonymous structure with the relevant data for the task.
Once this is done the vCPU is kicked to bring it out of the main
execution loop.
The main difference with the other run_on_cpu functions is it operates
out of a single queue. This ensures fairness as all pending tasks will
get drained whichever vCPU is nominally doing the work. The internal
implementation is also a GArray so the need to malloc memory is
minimised while adding tasks to the queue.
When async_safe_work_pending() is true, cpu_exec returns early and the
vCPUs cannot enter the execution loop. Once all scheduled vCPUs have
exited the loop, the last one to exit processes the queued safe work.
Signed-off-by: KONRAD Frederic <fred.konrad@greensocs.com>
[AJB: Name change, single queue, atomic counter for active vCPUs]
Signed-off-by: Alex Bennée <alex.bennee@linaro.org>
---
v1 (arm-v1)
- now async_safe_run_on_cpu
- single GArray based queue
- use atomic counter to bring all vCPUs to a halt
- wording for async safe_work
---
cpu-exec-common.c | 1 +
cpu-exec.c | 11 ++++++
cpus.c | 102 ++++++++++++++++++++++++++++++++++++++++++++++++++++--
include/qom/cpu.h | 19 ++++++++++
4 files changed, 131 insertions(+), 2 deletions(-)
diff --git a/cpu-exec-common.c b/cpu-exec-common.c
index 3d7eaa3..c2f7c29 100644
--- a/cpu-exec-common.c
+++ b/cpu-exec-common.c
@@ -79,3 +79,4 @@ void cpu_loop_exit_restore(CPUState *cpu, uintptr_t pc)
cpu->current_tb = NULL;
siglongjmp(cpu->jmp_env, 1);
}
+
diff --git a/cpu-exec.c b/cpu-exec.c
index 42cec05..2f362f8 100644
--- a/cpu-exec.c
+++ b/cpu-exec.c
@@ -365,6 +365,17 @@ int cpu_exec(CPUState *cpu)
uintptr_t next_tb;
SyncClocks sc;
+ /*
+ * This happens when somebody doesn't want this CPU to start
+ * in case of MTTCG.
+ */
+#ifdef CONFIG_SOFTMMU
+ if (async_safe_work_pending()) {
+ cpu->exit_request = 1;
+ return 0;
+ }
+#endif
+
/* replay_interrupt may need current_cpu */
current_cpu = cpu;
diff --git a/cpus.c b/cpus.c
index 9177161..860e2a9 100644
--- a/cpus.c
+++ b/cpus.c
@@ -928,6 +928,19 @@ static QemuCond qemu_cpu_cond;
static QemuCond qemu_pause_cond;
static QemuCond qemu_work_cond;
+/* safe work */
+static int safe_work_pending;
+static int tcg_scheduled_cpus;
+
+typedef struct {
+ CPUState *cpu; /* CPU affected */
+ run_on_cpu_func func; /* Helper function */
+ void *data; /* Helper data */
+} qemu_safe_work_item;
+
+static GArray *safe_work; /* array of qemu_safe_work_items */
+static QemuMutex safe_work_mutex;
+
void qemu_init_cpu_loop(void)
{
qemu_init_sigbus();
@@ -937,6 +950,9 @@ void qemu_init_cpu_loop(void)
qemu_mutex_init(&qemu_global_mutex);
qemu_thread_get_self(&io_thread);
+
+ safe_work = g_array_sized_new(TRUE, TRUE, sizeof(qemu_safe_work_item), 128);
+ qemu_mutex_init(&safe_work_mutex);
}
void run_on_cpu(CPUState *cpu, run_on_cpu_func func, void *data)
@@ -997,6 +1013,81 @@ void async_run_on_cpu(CPUState *cpu, run_on_cpu_func func, void *data)
qemu_cpu_kick(cpu);
}
+/*
+ * Safe work interface
+ *
+ * Safe work is defined as work that requires the system to be
+ * quiescent before making changes.
+ */
+
+void async_safe_run_on_cpu(CPUState *cpu, run_on_cpu_func func, void *data)
+{
+ CPUState *iter;
+ qemu_safe_work_item wi;
+ wi.cpu = cpu;
+ wi.func = func;
+ wi.data = data;
+
+ qemu_mutex_lock(&safe_work_mutex);
+ g_array_append_val(safe_work, wi);
+ atomic_inc(&safe_work_pending);
+ qemu_mutex_unlock(&safe_work_mutex);
+
+ /* Signal all vCPUs to halt */
+ CPU_FOREACH(iter) {
+ qemu_cpu_kick(iter);
+ }
+}
+
+/**
+ * flush_queued_safe_work:
+ *
+ * @scheduled_cpu_count
+ *
+ * If not 0 will signal the other vCPUs and sleep. The last vCPU to
+ * get to the function then drains the queue while the system is in a
+ * quiescent state. This allows the operations to change shared
+ * structures.
+ *
+ * @see async_run_safe_work_on_cpu
+ */
+static void flush_queued_safe_work(int scheduled_cpu_count)
+{
+ qemu_safe_work_item *wi;
+ int i;
+
+ /* bail out if there is nothing to do */
+ if (!async_safe_work_pending()) {
+ return;
+ }
+
+ if (scheduled_cpu_count) {
+
+ /* Nothing to do but sleep */
+ qemu_cond_wait(&qemu_work_cond, &qemu_global_mutex);
+
+ } else {
+
+ /* We can now do the work */
+ qemu_mutex_lock(&safe_work_mutex);
+ for (i = 0; i < safe_work->len; i++) {
+ wi = &g_array_index(safe_work, qemu_safe_work_item, i);
+ wi->func(wi->cpu, wi->data);
+ }
+ g_array_remove_range(safe_work, 0, safe_work->len);
+ atomic_set(&safe_work_pending, 0);
+ qemu_mutex_unlock(&safe_work_mutex);
+
+ /* Wake everyone up */
+ qemu_cond_broadcast(&qemu_work_cond);
+ }
+}
+
+bool async_safe_work_pending(void)
+{
+ return (atomic_read(&safe_work_pending) != 0);
+}
+
static void flush_queued_work(CPUState *cpu)
{
struct qemu_work_item *wi;
@@ -1259,6 +1350,7 @@ static void *qemu_tcg_single_cpu_thread_fn(void *arg)
if (cpu) {
g_assert(cpu->exit_request);
+ flush_queued_safe_work(0);
/* Pairs with smp_wmb in qemu_cpu_kick. */
atomic_mb_set(&cpu->exit_request, 0);
qemu_tcg_wait_io_event(cpu);
@@ -1300,8 +1392,13 @@ static void *qemu_tcg_cpu_thread_fn(void *arg)
while (1) {
bool sleep = false;
- if (cpu_can_run(cpu)) {
- int r = tcg_cpu_exec(cpu);
+ if (cpu_can_run(cpu) && !async_safe_work_pending()) {
+ int r;
+
+ atomic_inc(&tcg_scheduled_cpus);
+ r = tcg_cpu_exec(cpu);
+ flush_queued_safe_work(atomic_dec_fetch(&tcg_scheduled_cpus));
+
switch (r)
{
case EXCP_DEBUG:
@@ -1319,6 +1416,7 @@ static void *qemu_tcg_cpu_thread_fn(void *arg)
/* Ignore everything else? */
break;
}
+
} else {
sleep = true;
}
diff --git a/include/qom/cpu.h b/include/qom/cpu.h
index 385d5bb..8ab969e 100644
--- a/include/qom/cpu.h
+++ b/include/qom/cpu.h
@@ -642,6 +642,25 @@ void run_on_cpu(CPUState *cpu, run_on_cpu_func func, void *data);
void async_run_on_cpu(CPUState *cpu, run_on_cpu_func func, void *data);
/**
+ * async_safe_run_on_cpu:
+ * @cpu: The vCPU to run on.
+ * @func: The function to be executed.
+ * @data: Data to pass to the function.
+ *
+ * Schedules the function @func for execution on the vCPU @cpu asynchronously
+ * when all the VCPUs are outside their loop.
+ */
+void async_safe_run_on_cpu(CPUState *cpu, run_on_cpu_func func, void *data);
+
+/**
+ * async_safe_work_pending:
+ *
+ * Check whether any safe work is pending on any VCPUs.
+ * Returns: @true if a safe work is pending, @false otherwise.
+ */
+bool async_safe_work_pending(void);
+
+/**
* qemu_get_cpu:
* @index: The CPUState@cpu_index value of the CPU to obtain.
*
--
2.7.4
^ permalink raw reply related [flat|nested] 41+ messages in thread
* Re: [Qemu-devel] [RFC v1 07/12] cpus: introduce async_safe_run_on_cpu.
2016-04-15 14:23 ` [Qemu-devel] [RFC v1 07/12] cpus: introduce async_safe_run_on_cpu Alex Bennée
@ 2016-06-05 16:01 ` Sergey Fedorov
2016-06-06 8:50 ` Alex Bennée
2016-06-05 16:44 ` Sergey Fedorov
1 sibling, 1 reply; 41+ messages in thread
From: Sergey Fedorov @ 2016-06-05 16:01 UTC (permalink / raw)
To: Alex Bennée, mttcg, fred.konrad, a.rigo, cota
Cc: qemu-devel, mark.burton, pbonzini, jan.kiszka, rth, peter.maydell,
claudio.fontana, Peter Crosthwaite, Andreas Färber
On 15/04/16 17:23, Alex Bennée wrote:
> diff --git a/cpu-exec-common.c b/cpu-exec-common.c
> index 3d7eaa3..c2f7c29 100644
> --- a/cpu-exec-common.c
> +++ b/cpu-exec-common.c
> @@ -79,3 +79,4 @@ void cpu_loop_exit_restore(CPUState *cpu, uintptr_t pc)
> cpu->current_tb = NULL;
> siglongjmp(cpu->jmp_env, 1);
> }
> +
Do we really want this extra line at the end of the file? :)
> diff --git a/cpu-exec.c b/cpu-exec.c
> index 42cec05..2f362f8 100644
> --- a/cpu-exec.c
> +++ b/cpu-exec.c
> @@ -365,6 +365,17 @@ int cpu_exec(CPUState *cpu)
> uintptr_t next_tb;
> SyncClocks sc;
>
> + /*
> + * This happen when somebody doesn't want this CPU to start
> + * In case of MTTCG.
> + */
> +#ifdef CONFIG_SOFTMMU
> + if (async_safe_work_pending()) {
> + cpu->exit_request = 1;
> + return 0;
> + }
> +#endif
> +
I can't see the point of this change. We check for
"async_safe_work_pending()" each time round the loop in
qemu_tcg_cpu_thread_fn() before entering tcg_cpu_exec() and we do that
after 'cpu->exit_request' gets reset.
> /* replay_interrupt may need current_cpu */
> current_cpu = cpu;
>
> diff --git a/cpus.c b/cpus.c
> index 9177161..860e2a9 100644
> --- a/cpus.c
> +++ b/cpus.c
> @@ -928,6 +928,19 @@ static QemuCond qemu_cpu_cond;
> static QemuCond qemu_pause_cond;
> static QemuCond qemu_work_cond;
>
> +/* safe work */
> +static int safe_work_pending;
> +static int tcg_scheduled_cpus;
> +
> +typedef struct {
> + CPUState *cpu; /* CPU affected */
Do we really need to pass CPU state to the safe work? Async safe work
seems to be supposed to handle cross-CPU tasks rather than CPU-specific
ones. TB flush, for example, doesn't really require CPU to be passed to
the async safe work handler: we should be able to check for code buffer
overflow before scheduling the job. Because of this, it seems to be
reasonable to rename async_safe_run_on_cpu() to something like
request_async_cpu_safe_work().
> + run_on_cpu_func func; /* Helper function */
> + void *data; /* Helper data */
> +} qemu_safe_work_item;
> +
> +static GArray *safe_work; /* array of qemu_safe_work_items */
I think we shouldn't bother with "static" memory allocation for the list
of work items. We should be fine with dynamic allocation in
async_safe_run_on_cpu(). Halting all the CPUs doesn't seem to be cheap
anyhow, thus it shouldn't be used frequently. The only use so far is to
make tb_flush() safe. If we look at how tb_flush() is used we can see
that this is either code buffer exhaustion or some rare operation which
needs all the previous translations to be flushed. The former shouldn't
happen often, the latter isn't supposed to be cheap and fast anyway.
> +static QemuMutex safe_work_mutex;
> +
> void qemu_init_cpu_loop(void)
> {
> qemu_init_sigbus();
> @@ -937,6 +950,9 @@ void qemu_init_cpu_loop(void)
> qemu_mutex_init(&qemu_global_mutex);
>
> qemu_thread_get_self(&io_thread);
> +
> + safe_work = g_array_sized_new(TRUE, TRUE, sizeof(qemu_safe_work_item), 128);
> + qemu_mutex_init(&safe_work_mutex);
> }
>
> void run_on_cpu(CPUState *cpu, run_on_cpu_func func, void *data)
> @@ -997,6 +1013,81 @@ void async_run_on_cpu(CPUState *cpu, run_on_cpu_func func, void *data)
> qemu_cpu_kick(cpu);
> }
>
> +/*
> + * Safe work interface
> + *
> + * Safe work is defined as work that requires the system to be
> + * quiescent before making changes.
> + */
> +
> +void async_safe_run_on_cpu(CPUState *cpu, run_on_cpu_func func, void *data)
> +{
> + CPUState *iter;
> + qemu_safe_work_item wi;
> + wi.cpu = cpu;
> + wi.func = func;
> + wi.data = data;
> +
> + qemu_mutex_lock(&safe_work_mutex);
> + g_array_append_val(safe_work, wi);
> + atomic_inc(&safe_work_pending);
> + qemu_mutex_unlock(&safe_work_mutex);
> +
> + /* Signal all vCPUs to halt */
> + CPU_FOREACH(iter) {
> + qemu_cpu_kick(iter);
> + }
> +}
> +
> +/**
> + * flush_queued_safe_work:
> + *
> + * @scheduled_cpu_count
> + *
> + * If not 0 will signal the other vCPUs and sleep. The last vCPU to
> + * get to the function then drains the queue while the system is in a
> + * quiescent state. This allows the operations to change shared
> + * structures.
> + *
> + * @see async_run_safe_work_on_cpu
> + */
> +static void flush_queued_safe_work(int scheduled_cpu_count)
> +{
> + qemu_safe_work_item *wi;
> + int i;
> +
> + /* bail out if there is nothing to do */
> + if (!async_safe_work_pending()) {
> + return;
> + }
> +
> + if (scheduled_cpu_count) {
> +
> + /* Nothing to do but sleep */
> + qemu_cond_wait(&qemu_work_cond, &qemu_global_mutex);
We'll not be able to extend this to user-mode emulation because we don't
have BQL there. I think we should consider something like
exclusive_idle()/start_exclusive()/end_exclusive()/cpu_exec_start()/cpu_exec_end().
Kind regards,
Sergey
> +
> + } else {
> +
> + /* We can now do the work */
> + qemu_mutex_lock(&safe_work_mutex);
> + for (i = 0; i < safe_work->len; i++) {
> + wi = &g_array_index(safe_work, qemu_safe_work_item, i);
> + wi->func(wi->cpu, wi->data);
> + }
> + g_array_remove_range(safe_work, 0, safe_work->len);
> + atomic_set(&safe_work_pending, 0);
> + qemu_mutex_unlock(&safe_work_mutex);
> +
> + /* Wake everyone up */
> + qemu_cond_broadcast(&qemu_work_cond);
> + }
> +}
> +
> +bool async_safe_work_pending(void)
> +{
> + return (atomic_read(&safe_work_pending) != 0);
> +}
> +
> static void flush_queued_work(CPUState *cpu)
> {
> struct qemu_work_item *wi;
^ permalink raw reply [flat|nested] 41+ messages in thread
* Re: [Qemu-devel] [RFC v1 07/12] cpus: introduce async_safe_run_on_cpu.
2016-06-05 16:01 ` Sergey Fedorov
@ 2016-06-06 8:50 ` Alex Bennée
2016-06-06 9:38 ` Sergey Fedorov
0 siblings, 1 reply; 41+ messages in thread
From: Alex Bennée @ 2016-06-06 8:50 UTC (permalink / raw)
To: Sergey Fedorov
Cc: mttcg, fred.konrad, a.rigo, cota, qemu-devel, mark.burton,
pbonzini, jan.kiszka, rth, peter.maydell, claudio.fontana,
Peter Crosthwaite, Andreas Färber
Sergey Fedorov <serge.fdrv@gmail.com> writes:
> On 15/04/16 17:23, Alex Bennée wrote:
>> diff --git a/cpu-exec-common.c b/cpu-exec-common.c
>> index 3d7eaa3..c2f7c29 100644
>> --- a/cpu-exec-common.c
>> +++ b/cpu-exec-common.c
>> @@ -79,3 +79,4 @@ void cpu_loop_exit_restore(CPUState *cpu, uintptr_t pc)
>> cpu->current_tb = NULL;
>> siglongjmp(cpu->jmp_env, 1);
>> }
>> +
>
> Do we really want this extra line at the end of the file? :)
>
>> diff --git a/cpu-exec.c b/cpu-exec.c
>> index 42cec05..2f362f8 100644
>> --- a/cpu-exec.c
>> +++ b/cpu-exec.c
>> @@ -365,6 +365,17 @@ int cpu_exec(CPUState *cpu)
>> uintptr_t next_tb;
>> SyncClocks sc;
>>
>> + /*
>> + * This happen when somebody doesn't want this CPU to start
>> + * In case of MTTCG.
>> + */
>> +#ifdef CONFIG_SOFTMMU
>> + if (async_safe_work_pending()) {
>> + cpu->exit_request = 1;
>> + return 0;
>> + }
>> +#endif
>> +
>
> I can't see the point of this change. We check for
> "async_safe_work_pending()" each time round the loop in
> qemu_tcg_cpu_thread_fn() before entering tcg_cpu_exec() and we do that
> after 'cpu->exit_request' gets reset.
I can't remember what this was meant to fix now so I'll remove it as suggested.
>
>> /* replay_interrupt may need current_cpu */
>> current_cpu = cpu;
>>
>> diff --git a/cpus.c b/cpus.c
>> index 9177161..860e2a9 100644
>> --- a/cpus.c
>> +++ b/cpus.c
>> @@ -928,6 +928,19 @@ static QemuCond qemu_cpu_cond;
>> static QemuCond qemu_pause_cond;
>> static QemuCond qemu_work_cond;
>>
>> +/* safe work */
>> +static int safe_work_pending;
>> +static int tcg_scheduled_cpus;
>> +
>> +typedef struct {
>> + CPUState *cpu; /* CPU affected */
>
> Do we really need to pass CPU state to the safe work? Async safe work
> seems to be supposed to handle cross-CPU tasks rather than CPU-specific
> ones. TB flush, for example, doesn't really require CPU to be passed to
> the async safe work handler: we should be able to check for code buffer
> overflow before scheduling the job. Because of this, it seems to be
> reasonable to rename async_safe_run_on_cpu() to something like
> request_async_cpu_safe_work().
CPUState is common enough to be passed around and this is a shared array
for all queued work. I wanted to ensure the second most common operation
of CPUState + one extra bit of information was covered without having to
resort to memory allocation.
The previous iterations of this work had allocations for structures and
the linked list and I was working hard to make the common case
allocation free.
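A sketch of the allocation-free pattern being described (the work function
and its caller are hypothetical; GINT_TO_POINTER/GPOINTER_TO_INT are the
usual GLib macros used elsewhere in the series):

static void update_flag_work(CPUState *cpu, void *data)
{
    int flag = GPOINTER_TO_INT(data);  /* value travels in the pointer: no malloc */
    /* ... update shared state for this vCPU using 'flag' ... */
}

static void request_update(CPUState *cpu, int flag)
{
    async_safe_run_on_cpu(cpu, update_flag_work, GINT_TO_POINTER(flag));
}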
>
>> + run_on_cpu_func func; /* Helper function */
>> + void *data; /* Helper data */
>> +} qemu_safe_work_item;
>> +
>> +static GArray *safe_work; /* array of qemu_safe_work_items */
>
> I think we shouldn't bother with "static" memory allocation for the list
> of work items. We should be fine with dynamic allocation in
> async_safe_run_on_cpu().
Why? Under heavy load these add up fast: some TLB flush operations can
queue CPU x ASID deferred operations. It's not exactly a large array to
allocate at the start. We could make the initial size smaller if you
want.
> Halting all the CPUs doesn't seem to be cheap
> anyhow, thus it shouldn't be used frequently. The only use so far is to
> make tb_flush() safe. If we look at how tb_flush() is used we can see
> that this is either code buffer exhaustion or some rare operation which
> needs all the previous translations to be flushed. The former shouldn't
> happen often, the latter isn't supposed to be cheap and fast anyway.
The various flushes are more common under ARMv8; we are not just talking
about the big tb_flush using this mechanism.
>
>> +static QemuMutex safe_work_mutex;
>> +
>> void qemu_init_cpu_loop(void)
>> {
>> qemu_init_sigbus();
>> @@ -937,6 +950,9 @@ void qemu_init_cpu_loop(void)
>> qemu_mutex_init(&qemu_global_mutex);
>>
>> qemu_thread_get_self(&io_thread);
>> +
>> + safe_work = g_array_sized_new(TRUE, TRUE, sizeof(qemu_safe_work_item), 128);
>> + qemu_mutex_init(&safe_work_mutex);
>> }
>>
>> void run_on_cpu(CPUState *cpu, run_on_cpu_func func, void *data)
>> @@ -997,6 +1013,81 @@ void async_run_on_cpu(CPUState *cpu, run_on_cpu_func func, void *data)
>> qemu_cpu_kick(cpu);
>> }
>>
>> +/*
>> + * Safe work interface
>> + *
>> + * Safe work is defined as work that requires the system to be
>> + * quiescent before making changes.
>> + */
>> +
>> +void async_safe_run_on_cpu(CPUState *cpu, run_on_cpu_func func, void *data)
>> +{
>> + CPUState *iter;
>> + qemu_safe_work_item wi;
>> + wi.cpu = cpu;
>> + wi.func = func;
>> + wi.data = data;
>> +
>> + qemu_mutex_lock(&safe_work_mutex);
>> + g_array_append_val(safe_work, wi);
>> + atomic_inc(&safe_work_pending);
>> + qemu_mutex_unlock(&safe_work_mutex);
>> +
>> + /* Signal all vCPUs to halt */
>> + CPU_FOREACH(iter) {
>> + qemu_cpu_kick(iter);
>> + }
>> +}
>> +
>> +/**
>> + * flush_queued_safe_work:
>> + *
>> + * @scheduled_cpu_count
>> + *
>> + * If not 0 will signal the other vCPUs and sleep. The last vCPU to
>> + * get to the function then drains the queue while the system is in a
>> + * quiescent state. This allows the operations to change shared
>> + * structures.
>> + *
>> + * @see async_run_safe_work_on_cpu
>> + */
>> +static void flush_queued_safe_work(int scheduled_cpu_count)
>> +{
>> + qemu_safe_work_item *wi;
>> + int i;
>> +
>> + /* bail out if there is nothing to do */
>> + if (!async_safe_work_pending()) {
>> + return;
>> + }
>> +
>> + if (scheduled_cpu_count) {
>> +
>> + /* Nothing to do but sleep */
>> + qemu_cond_wait(&qemu_work_cond, &qemu_global_mutex);
>
> We'll not be able to extend this to user-mode emulation because we don't
> have BQL there. I think we should consider something like
> exclusive_idle()/start_exclusive()/end_exclusive()/cpu_exec_start()/cpu_exec_end().
This is true, sacrificed on the altar of making things simpler. It would
be nice if we had a single simple dispatch point for linux-user too so we
could make things sane there as well.
I'll have another look at the exclusive code, although at first blush it
seemed a little more complex because it didn't have the luxury of
knowing how many threads are running.
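For reference, the user-mode pattern being referred to looks roughly like
this (start_exclusive()/end_exclusive() are the existing linux-user
helpers; how safe work would hook into them is only a guess here):

/* Each vCPU thread brackets its run loop with cpu_exec_start(cpu) /
 * cpu_exec_end(cpu); a thread wanting exclusive access then does: */
static void run_safe_work_exclusively(void)
{
    start_exclusive();      /* wait until every other vCPU is parked */
    /* ... drain the safe work queue here ... */
    end_exclusive();        /* let the other vCPUs resume */
}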
>
> Kind regards,
> Sergey
>
>> +
>> + } else {
>> +
>> + /* We can now do the work */
>> + qemu_mutex_lock(&safe_work_mutex);
>> + for (i = 0; i < safe_work->len; i++) {
>> + wi = &g_array_index(safe_work, qemu_safe_work_item, i);
>> + wi->func(wi->cpu, wi->data);
>> + }
>> + g_array_remove_range(safe_work, 0, safe_work->len);
>> + atomic_set(&safe_work_pending, 0);
>> + qemu_mutex_unlock(&safe_work_mutex);
>> +
>> + /* Wake everyone up */
>> + qemu_cond_broadcast(&qemu_work_cond);
>> + }
>> +}
>> +
>> +bool async_safe_work_pending(void)
>> +{
>> + return (atomic_read(&safe_work_pending) != 0);
>> +}
>> +
>> static void flush_queued_work(CPUState *cpu)
>> {
>> struct qemu_work_item *wi;
--
Alex Bennée
^ permalink raw reply [flat|nested] 41+ messages in thread
* Re: [Qemu-devel] [RFC v1 07/12] cpus: introduce async_safe_run_on_cpu.
2016-06-06 8:50 ` Alex Bennée
@ 2016-06-06 9:38 ` Sergey Fedorov
0 siblings, 0 replies; 41+ messages in thread
From: Sergey Fedorov @ 2016-06-06 9:38 UTC (permalink / raw)
To: Alex Bennée
Cc: mttcg, fred.konrad, a.rigo, cota, qemu-devel, mark.burton,
pbonzini, jan.kiszka, rth, peter.maydell, claudio.fontana,
Peter Crosthwaite, Andreas Färber
On 06/06/16 11:50, Alex Bennée wrote:
> Sergey Fedorov <serge.fdrv@gmail.com> writes:
>
>> On 15/04/16 17:23, Alex Bennée wrote:
>>> diff --git a/cpus.c b/cpus.c
>>> index 9177161..860e2a9 100644
>>> --- a/cpus.c
>>> +++ b/cpus.c
>>> @@ -928,6 +928,19 @@ static QemuCond qemu_cpu_cond;
>>> static QemuCond qemu_pause_cond;
>>> static QemuCond qemu_work_cond;
>>>
>>> +/* safe work */
>>> +static int safe_work_pending;
>>> +static int tcg_scheduled_cpus;
>>> +
>>> +typedef struct {
>>> + CPUState *cpu; /* CPU affected */
>> Do we really need to pass CPU state to the safe work? Async safe work
>> seems to be supposed to handle cross-CPU tasks rather than CPU-specific
>> ones. TB flush, for example, doesn't really require CPU to be passed to
>> the async safe work handler: we should be able to check for code buffer
>> overflow before scheduling the job. Because of this, it seems to be
>> reasonable to rename async_safe_run_on_cpu() to something like
>> request_async_cpu_safe_work().
> CPUState is common enough to be passed around and this is a shared array
> for all queued work. I wanted to ensure the second most common operation
> of CPUState + one extra bit of information was covered without having to
> resort to memory allocation.
>
> The previous iterations of this work had allocations for structures and
> the linked list and I was working hard to make the common case
> allocation free.
The point was that this mechanism is supposed to be used for operations
requiring all CPUs to be halted, so there should be little chance that it
would require a specific CPUState to be passed.
>
>>> + run_on_cpu_func func; /* Helper function */
>>> + void *data; /* Helper data */
>>> +} qemu_safe_work_item;
>>> +
>>> +static GArray *safe_work; /* array of qemu_safe_work_items */
>> I think we shouldn't bother with "static" memory allocation for the list
>> of work items. We should be fine with dynamic allocation in
>> async_safe_run_on_cpu().
> Why? Under heavy load these add up fast: some TLB flush operations can
> queue CPU x ASID deferred operations. It's not exactly a large array to
> allocate at the start. We could make the initial size smaller if you
> want.
We are going to use async_run_on_cpu() for TLB invalidation, aren't we?
>
>> Halting all the CPUs doesn't seem to be cheap
>> anyhow, thus it shouldn't be used frequently. The only use so far is to
>> make tb_flush() safe. If we look at how tb_flush() is used we can see
>> that this is either code buffer exhaustion or some rare operation which
>> needs all the previous translations to be flushed. The former shouldn't
>> happen often, the latter isn't supposed to be cheap and fast anyway.
> The various flushes are more common under ARMv8; we are not just talking
> about the big tb_flush using this mechanism.
Which of them would really require a quiescent state? I suppose all
frequent operations are to be handled in a per-CPU manner and don't
require this expensive "halt all the CPUs" mechanism.
>
>>
>>> +static QemuMutex safe_work_mutex;
>>> +
>>> void qemu_init_cpu_loop(void)
>>> {
>>> qemu_init_sigbus();
(snip)
>>> @@ -997,6 +1013,81 @@ void async_run_on_cpu(CPUState *cpu, run_on_cpu_func func, void *data)
>>> qemu_cpu_kick(cpu);
>>> }
>>>
>>> +/*
>>> + * Safe work interface
>>> + *
>>> + * Safe work is defined as work that requires the system to be
>>> + * quiescent before making changes.
>>> + */
>>> +
>>> +void async_safe_run_on_cpu(CPUState *cpu, run_on_cpu_func func, void *data)
>>> +{
>>> + CPUState *iter;
>>> + qemu_safe_work_item wi;
>>> + wi.cpu = cpu;
>>> + wi.func = func;
>>> + wi.data = data;
>>> +
>>> + qemu_mutex_lock(&safe_work_mutex);
>>> + g_array_append_val(safe_work, wi);
>>> + atomic_inc(&safe_work_pending);
>>> + qemu_mutex_unlock(&safe_work_mutex);
>>> +
>>> + /* Signal all vCPUs to halt */
>>> + CPU_FOREACH(iter) {
>>> + qemu_cpu_kick(iter);
>>> + }
>>> +}
>>> +
>>> +/**
>>> + * flush_queued_safe_work:
>>> + *
>>> + * @scheduled_cpu_count
>>> + *
>>> + * If not 0 will signal the other vCPUs and sleep. The last vCPU to
>>> + * get to the function then drains the queue while the system is in a
>>> + * quiescent state. This allows the operations to change shared
>>> + * structures.
>>> + *
>>> + * @see async_run_safe_work_on_cpu
>>> + */
>>> +static void flush_queued_safe_work(int scheduled_cpu_count)
>>> +{
>>> + qemu_safe_work_item *wi;
>>> + int i;
>>> +
>>> + /* bail out if there is nothing to do */
>>> + if (!async_safe_work_pending()) {
>>> + return;
>>> + }
>>> +
>>> + if (scheduled_cpu_count) {
>>> +
>>> + /* Nothing to do but sleep */
>>> + qemu_cond_wait(&qemu_work_cond, &qemu_global_mutex);
>> We'll not be able to extend this to user-mode emulation because we don't
>> have BQL there. I think we should consider something like
>> exclusive_idle()/start_exclusive()/end_exclusive()/cpu_exec_start()/cpu_exec_end().
> This is true, sacrificed on the altar of making things simpler. It would
> be nice if we had a single simple dispatch point for linux-user too so we
> could make things sane there as well.
>
> I'll have another look at the exclusive code, although at first blush it
> seemed a little more complex because it didn't have the luxury of
> knowing how many threads are running.
Actually, it does know that, by making use of the user-mode-only
'CPUState::running' flag and the global counter 'pending_cpus'.
Kind regards,
Sergey
^ permalink raw reply [flat|nested] 41+ messages in thread
* Re: [Qemu-devel] [RFC v1 07/12] cpus: introduce async_safe_run_on_cpu.
2016-04-15 14:23 ` [Qemu-devel] [RFC v1 07/12] cpus: introduce async_safe_run_on_cpu Alex Bennée
2016-06-05 16:01 ` Sergey Fedorov
@ 2016-06-05 16:44 ` Sergey Fedorov
1 sibling, 0 replies; 41+ messages in thread
From: Sergey Fedorov @ 2016-06-05 16:44 UTC (permalink / raw)
To: Alex Bennée, mttcg, fred.konrad, a.rigo, cota
Cc: qemu-devel, mark.burton, pbonzini, jan.kiszka, rth, peter.maydell,
claudio.fontana, Peter Crosthwaite, Andreas Färber
On 15/04/16 17:23, Alex Bennée wrote:
> +/*
> + * Safe work interface
> + *
> + * Safe work is defined as work that requires the system to be
> + * quiescent before making changes.
> + */
> +
> +void async_safe_run_on_cpu(CPUState *cpu, run_on_cpu_func func, void *data)
> +{
> + CPUState *iter;
> + qemu_safe_work_item wi;
We can call the function right away if MTTCG is disabled or there's only
one CPU emulated.
Kind regards,
Sergey
> + wi.cpu = cpu;
> + wi.func = func;
> + wi.data = data;
> +
> + qemu_mutex_lock(&safe_work_mutex);
> + g_array_append_val(safe_work, wi);
> + atomic_inc(&safe_work_pending);
> + qemu_mutex_unlock(&safe_work_mutex);
> +
> + /* Signal all vCPUs to halt */
> + CPU_FOREACH(iter) {
> + qemu_cpu_kick(iter);
> + }
> +}
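A minimal sketch of that fast path (mttcg_enabled() and smp_cpus stand in
for whatever single-thread check the series ends up using; they are
assumptions, not part of the patch):

void async_safe_run_on_cpu(CPUState *cpu, run_on_cpu_func func, void *data)
{
    /* Assumed check for single-threaded TCG or a single emulated CPU:
     * in that case we are already quiescent. */
    if (!mttcg_enabled() || smp_cpus == 1) {
        func(cpu, data);        /* run the work right away */
        return;
    }
    /* ... otherwise append to safe_work and kick all vCPUs as before ... */
}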
^ permalink raw reply [flat|nested] 41+ messages in thread
* [Qemu-devel] [RFC v1 08/12] cputlb: introduce tlb_flush_* async work.
2016-04-15 14:23 [Qemu-devel] [RFC v1 00/12] Enable MTTCG for 32 bit arm on x86 Alex Bennée
` (7 preceding siblings ...)
2016-04-15 14:23 ` [Qemu-devel] [RFC v1 07/12] cpus: introduce async_safe_run_on_cpu Alex Bennée
@ 2016-04-15 14:23 ` Alex Bennée
2016-06-05 16:39 ` Sergey Fedorov
2016-04-15 14:23 ` [Qemu-devel] [RFC v1 09/12] translate-all: introduces tb_flush_safe Alex Bennée
` (4 subsequent siblings)
13 siblings, 1 reply; 41+ messages in thread
From: Alex Bennée @ 2016-04-15 14:23 UTC (permalink / raw)
To: mttcg, fred.konrad, a.rigo, serge.fdrv, cota
Cc: qemu-devel, mark.burton, pbonzini, jan.kiszka, rth, peter.maydell,
claudio.fontana, Alex Bennée, Peter Crosthwaite
From: KONRAD Frederic <fred.konrad@greensocs.com>
Some architectures allow flushing the TLB of other vCPUs. This is not a
problem when we have only one thread for all vCPUs, but it definitely needs
to be asynchronous work when we are truly multi-threaded.
This patch doesn't do anything to protect other cputlb functions from being
called in MTTCG mode and making cross-vCPU changes.
Signed-off-by: KONRAD Frederic <fred.konrad@greensocs.com>
[AJB: remove need for g_malloc on defer, make check fixes]
Signed-off-by: Alex Bennée <alex.bennee@linaro.org>
---
v1
- Remove tlb_flush_all; just do the check in tlb_flush.
- remove the need for g_malloc
- tlb_flush calls directly if !cpu->created
---
cputlb.c | 61 ++++++++++++++++++++++++++++++++++++++-----------
include/exec/exec-all.h | 1 +
2 files changed, 49 insertions(+), 13 deletions(-)
diff --git a/cputlb.c b/cputlb.c
index 1412049..42a3b07 100644
--- a/cputlb.c
+++ b/cputlb.c
@@ -56,22 +56,14 @@
} \
} while (0)
+/* We need a solution for stuffing 64 bit pointers in 32 bit ones if
+ * we care about this combination */
+QEMU_BUILD_BUG_ON(sizeof(target_ulong) > sizeof(void *));
+
/* statistics */
int tlb_flush_count;
-/* NOTE:
- * If flush_global is true (the usual case), flush all tlb entries.
- * If flush_global is false, flush (at least) all tlb entries not
- * marked global.
- *
- * Since QEMU doesn't currently implement a global/not-global flag
- * for tlb entries, at the moment tlb_flush() will also flush all
- * tlb entries in the flush_global == false case. This is OK because
- * CPU architectures generally permit an implementation to drop
- * entries from the TLB at any time, so flushing more entries than
- * required is only an efficiency issue, not a correctness issue.
- */
-void tlb_flush(CPUState *cpu, int flush_global)
+static void tlb_flush_nocheck(CPUState *cpu, int flush_global)
{
CPUArchState *env = cpu->env_ptr;
@@ -89,6 +81,34 @@ void tlb_flush(CPUState *cpu, int flush_global)
env->tlb_flush_addr = -1;
env->tlb_flush_mask = 0;
tlb_flush_count++;
+ /* atomic_mb_set(&cpu->pending_tlb_flush, 0); */
+}
+
+static void tlb_flush_global_async_work(CPUState *cpu, void *opaque)
+{
+ tlb_flush_nocheck(cpu, GPOINTER_TO_INT(opaque));
+}
+
+/* NOTE:
+ * If flush_global is true (the usual case), flush all tlb entries.
+ * If flush_global is false, flush (at least) all tlb entries not
+ * marked global.
+ *
+ * Since QEMU doesn't currently implement a global/not-global flag
+ * for tlb entries, at the moment tlb_flush() will also flush all
+ * tlb entries in the flush_global == false case. This is OK because
+ * CPU architectures generally permit an implementation to drop
+ * entries from the TLB at any time, so flushing more entries than
+ * required is only an efficiency issue, not a correctness issue.
+ */
+void tlb_flush(CPUState *cpu, int flush_global)
+{
+ if (cpu->created) {
+ async_run_on_cpu(cpu, tlb_flush_global_async_work,
+ GINT_TO_POINTER(flush_global));
+ } else {
+ tlb_flush_nocheck(cpu, flush_global);
+ }
}
static inline void v_tlb_flush_by_mmuidx(CPUState *cpu, va_list argp)
@@ -222,6 +242,21 @@ void tlb_flush_page_by_mmuidx(CPUState *cpu, target_ulong addr, ...)
tb_flush_jmp_cache(cpu, addr);
}
+static void tlb_flush_page_async_work(CPUState *cpu, void *opaque)
+{
+ tlb_flush_page(cpu, GPOINTER_TO_UINT(opaque));
+}
+
+void tlb_flush_page_all(target_ulong addr)
+{
+ CPUState *cpu;
+
+ CPU_FOREACH(cpu) {
+ async_run_on_cpu(cpu, tlb_flush_page_async_work,
+ GUINT_TO_POINTER(addr));
+ }
+}
+
/* update the TLBs so that writes to code in the virtual page 'addr'
can be detected */
void tlb_protect_code(ram_addr_t ram_addr)
diff --git a/include/exec/exec-all.h b/include/exec/exec-all.h
index 9144ee0..f695577 100644
--- a/include/exec/exec-all.h
+++ b/include/exec/exec-all.h
@@ -190,6 +190,7 @@ void tlb_set_page(CPUState *cpu, target_ulong vaddr,
void tb_invalidate_phys_addr(AddressSpace *as, hwaddr addr);
void probe_write(CPUArchState *env, target_ulong addr, int mmu_idx,
uintptr_t retaddr);
+void tlb_flush_page_all(target_ulong addr);
#else
static inline void tlb_flush_page(CPUState *cpu, target_ulong addr)
{
--
2.7.4
^ permalink raw reply related [flat|nested] 41+ messages in thread
* Re: [Qemu-devel] [RFC v1 08/12] cputlb: introduce tlb_flush_* async work.
2016-04-15 14:23 ` [Qemu-devel] [RFC v1 08/12] cputlb: introduce tlb_flush_* async work Alex Bennée
@ 2016-06-05 16:39 ` Sergey Fedorov
2016-06-06 8:54 ` Alex Bennée
0 siblings, 1 reply; 41+ messages in thread
From: Sergey Fedorov @ 2016-06-05 16:39 UTC (permalink / raw)
To: Alex Bennée, mttcg, fred.konrad, a.rigo, cota
Cc: qemu-devel, mark.burton, pbonzini, jan.kiszka, rth, peter.maydell,
claudio.fontana, Peter Crosthwaite
On 15/04/16 17:23, Alex Bennée wrote:
> diff --git a/cputlb.c b/cputlb.c
> index 1412049..42a3b07 100644
> --- a/cputlb.c
> +++ b/cputlb.c
> @@ -56,22 +56,14 @@
> } \
> } while (0)
>
> +/* We need a solution for stuffing 64 bit pointers in 32 bit ones if
> + * we care about this combination */
> +QEMU_BUILD_BUG_ON(sizeof(target_ulong) > sizeof(void *));
> +
> /* statistics */
> int tlb_flush_count;
>
> -/* NOTE:
> - * If flush_global is true (the usual case), flush all tlb entries.
> - * If flush_global is false, flush (at least) all tlb entries not
> - * marked global.
> - *
> - * Since QEMU doesn't currently implement a global/not-global flag
> - * for tlb entries, at the moment tlb_flush() will also flush all
> - * tlb entries in the flush_global == false case. This is OK because
> - * CPU architectures generally permit an implementation to drop
> - * entries from the TLB at any time, so flushing more entries than
> - * required is only an efficiency issue, not a correctness issue.
> - */
> -void tlb_flush(CPUState *cpu, int flush_global)
> +static void tlb_flush_nocheck(CPUState *cpu, int flush_global)
> {
> CPUArchState *env = cpu->env_ptr;
>
> @@ -89,6 +81,34 @@ void tlb_flush(CPUState *cpu, int flush_global)
> env->tlb_flush_addr = -1;
> env->tlb_flush_mask = 0;
> tlb_flush_count++;
> + /* atomic_mb_set(&cpu->pending_tlb_flush, 0); */
> +}
> +
> +static void tlb_flush_global_async_work(CPUState *cpu, void *opaque)
> +{
> + tlb_flush_nocheck(cpu, GPOINTER_TO_INT(opaque));
> +}
> +
> +/* NOTE:
> + * If flush_global is true (the usual case), flush all tlb entries.
> + * If flush_global is false, flush (at least) all tlb entries not
> + * marked global.
> + *
> + * Since QEMU doesn't currently implement a global/not-global flag
> + * for tlb entries, at the moment tlb_flush() will also flush all
> + * tlb entries in the flush_global == false case. This is OK because
> + * CPU architectures generally permit an implementation to drop
> + * entries from the TLB at any time, so flushing more entries than
> + * required is only an efficiency issue, not a correctness issue.
> + */
> +void tlb_flush(CPUState *cpu, int flush_global)
> +{
> + if (cpu->created) {
Why do we check for 'cpu->created' here? And why don't we do that in
tlb_flush_page_all()?
> + async_run_on_cpu(cpu, tlb_flush_global_async_work,
> + GINT_TO_POINTER(flush_global));
> + } else {
> + tlb_flush_nocheck(cpu, flush_global);
> + }
> }
>
> static inline void v_tlb_flush_by_mmuidx(CPUState *cpu, va_list argp)
> @@ -222,6 +242,21 @@ void tlb_flush_page_by_mmuidx(CPUState *cpu, target_ulong addr, ...)
> tb_flush_jmp_cache(cpu, addr);
> }
>
> +static void tlb_flush_page_async_work(CPUState *cpu, void *opaque)
> +{
> + tlb_flush_page(cpu, GPOINTER_TO_UINT(opaque));
> +}
> +
> +void tlb_flush_page_all(target_ulong addr)
> +{
> + CPUState *cpu;
> +
> + CPU_FOREACH(cpu) {
> + async_run_on_cpu(cpu, tlb_flush_page_async_work,
> + GUINT_TO_POINTER(addr));
> + }
> +}
> +
> /* update the TLBs so that writes to code in the virtual page 'addr'
> can be detected */
> void tlb_protect_code(ram_addr_t ram_addr)
> diff --git a/include/exec/exec-all.h b/include/exec/exec-all.h
> index 9144ee0..f695577 100644
> --- a/include/exec/exec-all.h
> +++ b/include/exec/exec-all.h
> @@ -190,6 +190,7 @@ void tlb_set_page(CPUState *cpu, target_ulong vaddr,
> void tb_invalidate_phys_addr(AddressSpace *as, hwaddr addr);
> void probe_write(CPUArchState *env, target_ulong addr, int mmu_idx,
> uintptr_t retaddr);
> +void tlb_flush_page_all(target_ulong addr);
> #else
> static inline void tlb_flush_page(CPUState *cpu, target_ulong addr)
> {
tlb_flush_by_mmuidx() and tlb_flush_page_by_mmuidx() want to be safe as
well.
Kind regards,
Sergey
^ permalink raw reply [flat|nested] 41+ messages in thread
* Re: [Qemu-devel] [RFC v1 08/12] cputlb: introduce tlb_flush_* async work.
2016-06-05 16:39 ` Sergey Fedorov
@ 2016-06-06 8:54 ` Alex Bennée
2016-06-06 10:04 ` Sergey Fedorov
0 siblings, 1 reply; 41+ messages in thread
From: Alex Bennée @ 2016-06-06 8:54 UTC (permalink / raw)
To: Sergey Fedorov
Cc: mttcg, fred.konrad, a.rigo, cota, qemu-devel, mark.burton,
pbonzini, jan.kiszka, rth, peter.maydell, claudio.fontana,
Peter Crosthwaite
Sergey Fedorov <serge.fdrv@gmail.com> writes:
> On 15/04/16 17:23, Alex Bennée wrote:
>> diff --git a/cputlb.c b/cputlb.c
>> index 1412049..42a3b07 100644
>> --- a/cputlb.c
>> +++ b/cputlb.c
>> @@ -56,22 +56,14 @@
>> } \
>> } while (0)
>>
>> +/* We need a solution for stuffing 64 bit pointers in 32 bit ones if
>> + * we care about this combination */
>> +QEMU_BUILD_BUG_ON(sizeof(target_ulong) > sizeof(void *));
>> +
>> /* statistics */
>> int tlb_flush_count;
>>
>> -/* NOTE:
>> - * If flush_global is true (the usual case), flush all tlb entries.
>> - * If flush_global is false, flush (at least) all tlb entries not
>> - * marked global.
>> - *
>> - * Since QEMU doesn't currently implement a global/not-global flag
>> - * for tlb entries, at the moment tlb_flush() will also flush all
>> - * tlb entries in the flush_global == false case. This is OK because
>> - * CPU architectures generally permit an implementation to drop
>> - * entries from the TLB at any time, so flushing more entries than
>> - * required is only an efficiency issue, not a correctness issue.
>> - */
>> -void tlb_flush(CPUState *cpu, int flush_global)
>> +static void tlb_flush_nocheck(CPUState *cpu, int flush_global)
>> {
>> CPUArchState *env = cpu->env_ptr;
>>
>> @@ -89,6 +81,34 @@ void tlb_flush(CPUState *cpu, int flush_global)
>> env->tlb_flush_addr = -1;
>> env->tlb_flush_mask = 0;
>> tlb_flush_count++;
>> + /* atomic_mb_set(&cpu->pending_tlb_flush, 0); */
>> +}
>> +
>> +static void tlb_flush_global_async_work(CPUState *cpu, void *opaque)
>> +{
>> + tlb_flush_nocheck(cpu, GPOINTER_TO_INT(opaque));
>> +}
>> +
>> +/* NOTE:
>> + * If flush_global is true (the usual case), flush all tlb entries.
>> + * If flush_global is false, flush (at least) all tlb entries not
>> + * marked global.
>> + *
>> + * Since QEMU doesn't currently implement a global/not-global flag
>> + * for tlb entries, at the moment tlb_flush() will also flush all
>> + * tlb entries in the flush_global == false case. This is OK because
>> + * CPU architectures generally permit an implementation to drop
>> + * entries from the TLB at any time, so flushing more entries than
>> + * required is only an efficiency issue, not a correctness issue.
>> + */
>> +void tlb_flush(CPUState *cpu, int flush_global)
>> +{
>> + if (cpu->created) {
>
> Why do we check for 'cpu->created' here? And why don't we do that in
> tlb_flush_page_all()?
A bunch of random stuff gets kicked off at start-up which was getting in
the way (c.f. arm_cpu_reset and watch/breakpoints). tlb_flush() is
rather liberally sprinkled around the init code of various CPUs.
>
>> + async_run_on_cpu(cpu, tlb_flush_global_async_work,
>> + GINT_TO_POINTER(flush_global));
>> + } else {
>> + tlb_flush_nocheck(cpu, flush_global);
>> + }
>> }
>>
>> static inline void v_tlb_flush_by_mmuidx(CPUState *cpu, va_list argp)
>> @@ -222,6 +242,21 @@ void tlb_flush_page_by_mmuidx(CPUState *cpu, target_ulong addr, ...)
>> tb_flush_jmp_cache(cpu, addr);
>> }
>>
>> +static void tlb_flush_page_async_work(CPUState *cpu, void *opaque)
>> +{
>> + tlb_flush_page(cpu, GPOINTER_TO_UINT(opaque));
>> +}
>> +
>> +void tlb_flush_page_all(target_ulong addr)
>> +{
>> + CPUState *cpu;
>> +
>> + CPU_FOREACH(cpu) {
>> + async_run_on_cpu(cpu, tlb_flush_page_async_work,
>> + GUINT_TO_POINTER(addr));
>> + }
>> +}
>> +
>> /* update the TLBs so that writes to code in the virtual page 'addr'
>> can be detected */
>> void tlb_protect_code(ram_addr_t ram_addr)
>> diff --git a/include/exec/exec-all.h b/include/exec/exec-all.h
>> index 9144ee0..f695577 100644
>> --- a/include/exec/exec-all.h
>> +++ b/include/exec/exec-all.h
>> @@ -190,6 +190,7 @@ void tlb_set_page(CPUState *cpu, target_ulong vaddr,
>> void tb_invalidate_phys_addr(AddressSpace *as, hwaddr addr);
>> void probe_write(CPUArchState *env, target_ulong addr, int mmu_idx,
>> uintptr_t retaddr);
>> +void tlb_flush_page_all(target_ulong addr);
>> #else
>> static inline void tlb_flush_page(CPUState *cpu, target_ulong addr)
>> {
>
> tlb_flush_by_mmuidx() and tlb_flush_page_by_mmuidx() want to be safe as
> well.
>
> Kind regards,
> Sergey
--
Alex Bennée
^ permalink raw reply [flat|nested] 41+ messages in thread
* Re: [Qemu-devel] [RFC v1 08/12] cputlb: introduce tlb_flush_* async work.
2016-06-06 8:54 ` Alex Bennée
@ 2016-06-06 10:04 ` Sergey Fedorov
0 siblings, 0 replies; 41+ messages in thread
From: Sergey Fedorov @ 2016-06-06 10:04 UTC (permalink / raw)
To: Alex Bennée
Cc: mttcg, fred.konrad, a.rigo, cota, qemu-devel, mark.burton,
pbonzini, jan.kiszka, rth, peter.maydell, claudio.fontana,
Peter Crosthwaite
On 06/06/16 11:54, Alex Bennée wrote:
> Sergey Fedorov <serge.fdrv@gmail.com> writes:
>
>> On 15/04/16 17:23, Alex Bennée wrote:
>>> diff --git a/cputlb.c b/cputlb.c
>>> index 1412049..42a3b07 100644
>>> --- a/cputlb.c
>>> +++ b/cputlb.c
(snip)
>>> @@ -89,6 +81,34 @@ void tlb_flush(CPUState *cpu, int flush_global)
>>> env->tlb_flush_addr = -1;
>>> env->tlb_flush_mask = 0;
>>> tlb_flush_count++;
>>> + /* atomic_mb_set(&cpu->pending_tlb_flush, 0); */
>>> +}
>>> +
>>> +static void tlb_flush_global_async_work(CPUState *cpu, void *opaque)
>>> +{
>>> + tlb_flush_nocheck(cpu, GPOINTER_TO_INT(opaque));
>>> +}
>>> +
>>> +/* NOTE:
>>> + * If flush_global is true (the usual case), flush all tlb entries.
>>> + * If flush_global is false, flush (at least) all tlb entries not
>>> + * marked global.
>>> + *
>>> + * Since QEMU doesn't currently implement a global/not-global flag
>>> + * for tlb entries, at the moment tlb_flush() will also flush all
>>> + * tlb entries in the flush_global == false case. This is OK because
>>> + * CPU architectures generally permit an implementation to drop
>>> + * entries from the TLB at any time, so flushing more entries than
>>> + * required is only an efficiency issue, not a correctness issue.
>>> + */
>>> +void tlb_flush(CPUState *cpu, int flush_global)
>>> +{
>>> + if (cpu->created) {
>> Why do we check for 'cpu->created' here? And why don't we do that in
>> tlb_flush_page_all()?
> A bunch of random stuff gets kicked off at start-up which was getting in
> the way (c.f. arm_cpu_reset and watch/breakpoints). tlb_flush() is
> rather liberally sprinkled around the init code of various CPUs.
Wouldn't we race if we call tlb_flush() with no BQL held? We could just
queue an async job since we force each CPU thread to flush its work
queue before starting execution anyway.
Kind regards,
Sergey
>
>>> + async_run_on_cpu(cpu, tlb_flush_global_async_work,
>>> + GINT_TO_POINTER(flush_global));
>>> + } else {
>>> + tlb_flush_nocheck(cpu, flush_global);
>>> + }
>>> }
>>>
>>> static inline void v_tlb_flush_by_mmuidx(CPUState *cpu, va_list argp)
>>>
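A sketch of the always-queue variant suggested above (it drops the
cpu->created check and relies on each vCPU thread flushing its work queue
before it starts executing; whether that is sufficient for every init-time
caller is exactly the open question):

void tlb_flush(CPUState *cpu, int flush_global)
{
    /* Always defer: the per-CPU work queue is drained before the vCPU
     * thread enters its execution loop, so early callers are still
     * handled without the cpu->created special case. */
    async_run_on_cpu(cpu, tlb_flush_global_async_work,
                     GINT_TO_POINTER(flush_global));
}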
^ permalink raw reply [flat|nested] 41+ messages in thread
* [Qemu-devel] [RFC v1 09/12] translate-all: introduces tb_flush_safe.
2016-04-15 14:23 [Qemu-devel] [RFC v1 00/12] Enable MTTCG for 32 bit arm on x86 Alex Bennée
` (8 preceding siblings ...)
2016-04-15 14:23 ` [Qemu-devel] [RFC v1 08/12] cputlb: introduce tlb_flush_* async work Alex Bennée
@ 2016-04-15 14:23 ` Alex Bennée
2016-06-05 16:48 ` Sergey Fedorov
2016-04-15 14:23 ` [Qemu-devel] [RFC v1 10/12] arm: use tlb_flush_page_all for tlbimva[a] Alex Bennée
` (3 subsequent siblings)
13 siblings, 1 reply; 41+ messages in thread
From: Alex Bennée @ 2016-04-15 14:23 UTC (permalink / raw)
To: mttcg, fred.konrad, a.rigo, serge.fdrv, cota
Cc: qemu-devel, mark.burton, pbonzini, jan.kiszka, rth, peter.maydell,
claudio.fontana, Alex Bennée, Peter Crosthwaite
From: KONRAD Frederic <fred.konrad@greensocs.com>
tb_flush is not thread safe; we definitely need to exit all vCPUs to do it.
This introduces tb_flush_safe, which just schedules an async safe work item
that will do a tb_flush later. This is called when we run out of code buffer
space.
Signed-off-by: KONRAD Frederic <fred.konrad@greensocs.com>
[AJB: merge the various tb_flush commits]
Signed-off-by: Alex Bennée <alex.bennee@linaro.org>
---
v1:
- add more comments
- fix build for linux-user
- add log warning on linux-user
---
include/exec/exec-all.h | 1 +
translate-all.c | 34 +++++++++++++++++++++++++++-------
2 files changed, 28 insertions(+), 7 deletions(-)
diff --git a/include/exec/exec-all.h b/include/exec/exec-all.h
index f695577..858055b 100644
--- a/include/exec/exec-all.h
+++ b/include/exec/exec-all.h
@@ -307,6 +307,7 @@ struct TBContext {
void tb_free(TranslationBlock *tb);
void tb_flush(CPUState *cpu);
+void tb_flush_safe(CPUState *cpu);
void tb_phys_invalidate(TranslationBlock *tb, tb_page_addr_t page_addr);
#if defined(USE_DIRECT_JUMP)
diff --git a/translate-all.c b/translate-all.c
index 8e70583..874dc56 100644
--- a/translate-all.c
+++ b/translate-all.c
@@ -825,9 +825,30 @@ static void page_flush_tb(void)
}
}
-/* flush all the translation blocks */
-/* XXX: tb_flush is currently not thread safe. System emulation calls it only
- * with tb_lock taken or from safe_work, so no need to take tb_lock here.
+#ifdef CONFIG_SOFTMMU
+static void tb_flush_work(CPUState *cpu, void *opaque)
+{
+ tb_flush(cpu);
+}
+#endif
+
+void tb_flush_safe(CPUState *cpu)
+{
+#ifdef CONFIG_SOFTMMU
+ async_safe_run_on_cpu(cpu, tb_flush_work, NULL);
+#else
+ qemu_log("Safe flushing of TBs not implemented for linux-user\n");
+ tb_flush(cpu);
+#endif
+}
+
+/* Flush *all* translations
+ *
+ * This invalidates all translations, lookups and caches.
+ *
+ * This is not thread save for linux-user. For softmmu targets the
+ * flushing is done when all vCPUs are quiescent via the
+ * async_safe_work mechanism
*/
void tb_flush(CPUState *cpu)
{
@@ -1183,10 +1204,9 @@ TranslationBlock *tb_gen_code(CPUState *cpu,
if (unlikely(!tb)) {
buffer_overflow:
/* flush must be done */
- tb_flush(cpu);
- /* cannot fail at this point */
- tb = tb_alloc(pc);
- assert(tb != NULL);
+ tb_flush_safe(cpu);
+ tb_unlock();
+ cpu_loop_exit(cpu);
}
gen_code_buf = tcg_ctx.code_gen_ptr;
--
2.7.4
^ permalink raw reply related [flat|nested] 41+ messages in thread
* Re: [Qemu-devel] [RFC v1 09/12] translate-all: introduces tb_flush_safe.
2016-04-15 14:23 ` [Qemu-devel] [RFC v1 09/12] translate-all: introduces tb_flush_safe Alex Bennée
@ 2016-06-05 16:48 ` Sergey Fedorov
2016-06-06 8:54 ` Alex Bennée
0 siblings, 1 reply; 41+ messages in thread
From: Sergey Fedorov @ 2016-06-05 16:48 UTC (permalink / raw)
To: Alex Bennée, mttcg, fred.konrad, a.rigo, cota
Cc: qemu-devel, mark.burton, pbonzini, jan.kiszka, rth, peter.maydell,
claudio.fontana, Peter Crosthwaite
On 15/04/16 17:23, Alex Bennée wrote:
> diff --git a/include/exec/exec-all.h b/include/exec/exec-all.h
> index f695577..858055b 100644
> --- a/include/exec/exec-all.h
> +++ b/include/exec/exec-all.h
> @@ -307,6 +307,7 @@ struct TBContext {
>
> void tb_free(TranslationBlock *tb);
> void tb_flush(CPUState *cpu);
> +void tb_flush_safe(CPUState *cpu);
Do we really want to have both tb_flush_safe() and tb_flush()?
Kind regards,
Sergey
> void tb_phys_invalidate(TranslationBlock *tb, tb_page_addr_t page_addr);
>
> #if defined(USE_DIRECT_JUMP)
^ permalink raw reply [flat|nested] 41+ messages in thread
* Re: [Qemu-devel] [RFC v1 09/12] translate-all: introduces tb_flush_safe.
2016-06-05 16:48 ` Sergey Fedorov
@ 2016-06-06 8:54 ` Alex Bennée
2016-06-06 10:06 ` Sergey Fedorov
0 siblings, 1 reply; 41+ messages in thread
From: Alex Bennée @ 2016-06-06 8:54 UTC (permalink / raw)
To: Sergey Fedorov
Cc: mttcg, fred.konrad, a.rigo, cota, qemu-devel, mark.burton,
pbonzini, jan.kiszka, rth, peter.maydell, claudio.fontana,
Peter Crosthwaite
Sergey Fedorov <serge.fdrv@gmail.com> writes:
> On 15/04/16 17:23, Alex Bennée wrote:
>> diff --git a/include/exec/exec-all.h b/include/exec/exec-all.h
>> index f695577..858055b 100644
>> --- a/include/exec/exec-all.h
>> +++ b/include/exec/exec-all.h
>> @@ -307,6 +307,7 @@ struct TBContext {
>>
>> void tb_free(TranslationBlock *tb);
>> void tb_flush(CPUState *cpu);
>> +void tb_flush_safe(CPUState *cpu);
>
> Do we really want to have both tb_flush_safe() and tb_flush()?
I guess if we are going to include user-mode in the party ;-)
>
> Kind regards,
> Sergey
>
>> void tb_phys_invalidate(TranslationBlock *tb, tb_page_addr_t page_addr);
>>
>> #if defined(USE_DIRECT_JUMP)
--
Alex Bennée
^ permalink raw reply [flat|nested] 41+ messages in thread
* Re: [Qemu-devel] [RFC v1 09/12] translate-all: introduces tb_flush_safe.
2016-06-06 8:54 ` Alex Bennée
@ 2016-06-06 10:06 ` Sergey Fedorov
0 siblings, 0 replies; 41+ messages in thread
From: Sergey Fedorov @ 2016-06-06 10:06 UTC (permalink / raw)
To: Alex Bennée
Cc: mttcg, fred.konrad, a.rigo, cota, qemu-devel, mark.burton,
pbonzini, jan.kiszka, rth, peter.maydell, claudio.fontana,
Peter Crosthwaite
On 06/06/16 11:54, Alex Bennée wrote:
> Sergey Fedorov <serge.fdrv@gmail.com> writes:
>
>> On 15/04/16 17:23, Alex Bennée wrote:
>>> diff --git a/include/exec/exec-all.h b/include/exec/exec-all.h
>>> index f695577..858055b 100644
>>> --- a/include/exec/exec-all.h
>>> +++ b/include/exec/exec-all.h
>>> @@ -307,6 +307,7 @@ struct TBContext {
>>>
>>> void tb_free(TranslationBlock *tb);
>>> void tb_flush(CPUState *cpu);
>>> +void tb_flush_safe(CPUState *cpu);
>> Do we really want to have both tb_flush_safe() and tb_flush()?
> I guess if we are going to include user-mode in the party ;-)
>
>
User-mode should be fine; we take care of it:
+void tb_flush_safe(CPUState *cpu)
+{
+#ifdef CONFIG_SOFTMMU
+ async_safe_run_on_cpu(cpu, tb_flush_work, NULL);
+#else
+ qemu_log("Safe flushing of TBs not implemented for linux-user\n");
+ tb_flush(cpu);
+#endif
+}
+
Kind regards,
Sergey
^ permalink raw reply [flat|nested] 41+ messages in thread
* [Qemu-devel] [RFC v1 10/12] arm: use tlb_flush_page_all for tlbimva[a]
2016-04-15 14:23 [Qemu-devel] [RFC v1 00/12] Enable MTTCG for 32 bit arm on x86 Alex Bennée
` (9 preceding siblings ...)
2016-04-15 14:23 ` [Qemu-devel] [RFC v1 09/12] translate-all: introduces tb_flush_safe Alex Bennée
@ 2016-04-15 14:23 ` Alex Bennée
2016-06-05 16:54 ` Sergey Fedorov
2016-04-15 14:23 ` [Qemu-devel] [RFC v1 11/12] arm: atomically check the exclusive value in a STREX Alex Bennée
` (2 subsequent siblings)
13 siblings, 1 reply; 41+ messages in thread
From: Alex Bennée @ 2016-04-15 14:23 UTC (permalink / raw)
To: mttcg, fred.konrad, a.rigo, serge.fdrv, cota
Cc: qemu-devel, mark.burton, pbonzini, jan.kiszka, rth, peter.maydell,
claudio.fontana, Alex Bennée, Peter Crosthwaite,
open list:ARM
From: KONRAD Frederic <fred.konrad@greensocs.com>
Instead of flushing each individual vCPU, use the tlb_flush_page_all()
function, which is async-enabled for MTTCG.
Signed-off-by: KONRAD Frederic <fred.konrad@greensocs.com>
Signed-off-by: Alex Bennée <alex.bennee@linaro.org>
---
include/exec/exec-all.h | 3 +++
target-arm/helper.c | 12 ++----------
2 files changed, 5 insertions(+), 10 deletions(-)
diff --git a/include/exec/exec-all.h b/include/exec/exec-all.h
index 858055b..bc97683 100644
--- a/include/exec/exec-all.h
+++ b/include/exec/exec-all.h
@@ -208,6 +208,9 @@ static inline void tlb_flush_page_by_mmuidx(CPUState *cpu,
static inline void tlb_flush_by_mmuidx(CPUState *cpu, ...)
{
}
+static inline void tlb_flush_page_all(target_ulong addr)
+{
+}
#endif
#define CODE_GEN_ALIGN 16 /* must be >= of the size of a icache line */
diff --git a/target-arm/helper.c b/target-arm/helper.c
index 19d5d52..bc9fbda 100644
--- a/target-arm/helper.c
+++ b/target-arm/helper.c
@@ -554,21 +554,13 @@ static void tlbiasid_is_write(CPUARMState *env, const ARMCPRegInfo *ri,
static void tlbimva_is_write(CPUARMState *env, const ARMCPRegInfo *ri,
uint64_t value)
{
- CPUState *other_cs;
-
- CPU_FOREACH(other_cs) {
- tlb_flush_page(other_cs, value & TARGET_PAGE_MASK);
- }
+ tlb_flush_page_all(value & TARGET_PAGE_MASK);
}
static void tlbimvaa_is_write(CPUARMState *env, const ARMCPRegInfo *ri,
uint64_t value)
{
- CPUState *other_cs;
-
- CPU_FOREACH(other_cs) {
- tlb_flush_page(other_cs, value & TARGET_PAGE_MASK);
- }
+ tlb_flush_page_all(value & TARGET_PAGE_MASK);
}
static const ARMCPRegInfo cp_reginfo[] = {
--
2.7.4
^ permalink raw reply related [flat|nested] 41+ messages in thread
* Re: [Qemu-devel] [RFC v1 10/12] arm: use tlb_flush_page_all for tlbimva[a]
2016-04-15 14:23 ` [Qemu-devel] [RFC v1 10/12] arm: use tlb_flush_page_all for tlbimva[a] Alex Bennée
@ 2016-06-05 16:54 ` Sergey Fedorov
2016-06-06 8:55 ` Alex Bennée
0 siblings, 1 reply; 41+ messages in thread
From: Sergey Fedorov @ 2016-06-05 16:54 UTC (permalink / raw)
To: Alex Bennée, mttcg, fred.konrad, a.rigo, cota
Cc: qemu-devel, mark.burton, pbonzini, jan.kiszka, rth, peter.maydell,
claudio.fontana, Peter Crosthwaite, open list:ARM
On 15/04/16 17:23, Alex Bennée wrote:
> From: KONRAD Frederic <fred.konrad@greensocs.com>
>
> Instead of flushing each individual vCPU, use the tlb_flush_page_all()
> function, which is async-enabled for MTTCG.
>
> Signed-off-by: KONRAD Frederic <fred.konrad@greensocs.com>
> Signed-off-by: Alex Bennée <alex.bennee@linaro.org>
> ---
> include/exec/exec-all.h | 3 +++
> target-arm/helper.c | 12 ++----------
> 2 files changed, 5 insertions(+), 10 deletions(-)
>
> diff --git a/include/exec/exec-all.h b/include/exec/exec-all.h
> index 858055b..bc97683 100644
> --- a/include/exec/exec-all.h
> +++ b/include/exec/exec-all.h
> @@ -208,6 +208,9 @@ static inline void tlb_flush_page_by_mmuidx(CPUState *cpu,
> static inline void tlb_flush_by_mmuidx(CPUState *cpu, ...)
> {
> }
> +static inline void tlb_flush_page_all(target_ulong addr)
> +{
> +}
This change belongs to the patch which introduced the function.
Kind regards,
Sergey
> #endif
>
> #define CODE_GEN_ALIGN 16 /* must be >= of the size of a icache line */
> diff --git a/target-arm/helper.c b/target-arm/helper.c
> index 19d5d52..bc9fbda 100644
> --- a/target-arm/helper.c
> +++ b/target-arm/helper.c
> @@ -554,21 +554,13 @@ static void tlbiasid_is_write(CPUARMState *env, const ARMCPRegInfo *ri,
> static void tlbimva_is_write(CPUARMState *env, const ARMCPRegInfo *ri,
> uint64_t value)
> {
> - CPUState *other_cs;
> -
> - CPU_FOREACH(other_cs) {
> - tlb_flush_page(other_cs, value & TARGET_PAGE_MASK);
> - }
> + tlb_flush_page_all(value & TARGET_PAGE_MASK);
> }
>
> static void tlbimvaa_is_write(CPUARMState *env, const ARMCPRegInfo *ri,
> uint64_t value)
> {
> - CPUState *other_cs;
> -
> - CPU_FOREACH(other_cs) {
> - tlb_flush_page(other_cs, value & TARGET_PAGE_MASK);
> - }
> + tlb_flush_page_all(value & TARGET_PAGE_MASK);
> }
>
> static const ARMCPRegInfo cp_reginfo[] = {
^ permalink raw reply [flat|nested] 41+ messages in thread
* Re: [Qemu-devel] [RFC v1 10/12] arm: use tlb_flush_page_all for tlbimva[a]
2016-06-05 16:54 ` Sergey Fedorov
@ 2016-06-06 8:55 ` Alex Bennée
0 siblings, 0 replies; 41+ messages in thread
From: Alex Bennée @ 2016-06-06 8:55 UTC (permalink / raw)
To: Sergey Fedorov
Cc: mttcg, fred.konrad, a.rigo, cota, qemu-devel, mark.burton,
pbonzini, jan.kiszka, rth, peter.maydell, claudio.fontana,
Peter Crosthwaite, open list:ARM
Sergey Fedorov <serge.fdrv@gmail.com> writes:
> On 15/04/16 17:23, Alex Bennée wrote:
>> From: KONRAD Frederic <fred.konrad@greensocs.com>
>>
>> Instead of flushing each individual vCPU, use the tlb_flush_page_all()
>> function, which is async-enabled for MTTCG.
>>
>> Signed-off-by: KONRAD Frederic <fred.konrad@greensocs.com>
>> Signed-off-by: Alex Bennée <alex.bennee@linaro.org>
>> ---
>> include/exec/exec-all.h | 3 +++
>> target-arm/helper.c | 12 ++----------
>> 2 files changed, 5 insertions(+), 10 deletions(-)
>>
>> diff --git a/include/exec/exec-all.h b/include/exec/exec-all.h
>> index 858055b..bc97683 100644
>> --- a/include/exec/exec-all.h
>> +++ b/include/exec/exec-all.h
>> @@ -208,6 +208,9 @@ static inline void tlb_flush_page_by_mmuidx(CPUState *cpu,
>> static inline void tlb_flush_by_mmuidx(CPUState *cpu, ...)
>> {
>> }
>> +static inline void tlb_flush_page_all(target_ulong addr)
>> +{
>> +}
>
> This change belongs to the patch which introduced the function.
Thanks, will fix.
>
> Kind regards,
> Sergey
>
>> #endif
>>
>> #define CODE_GEN_ALIGN 16 /* must be >= of the size of a icache line */
>> diff --git a/target-arm/helper.c b/target-arm/helper.c
>> index 19d5d52..bc9fbda 100644
>> --- a/target-arm/helper.c
>> +++ b/target-arm/helper.c
>> @@ -554,21 +554,13 @@ static void tlbiasid_is_write(CPUARMState *env, const ARMCPRegInfo *ri,
>> static void tlbimva_is_write(CPUARMState *env, const ARMCPRegInfo *ri,
>> uint64_t value)
>> {
>> - CPUState *other_cs;
>> -
>> - CPU_FOREACH(other_cs) {
>> - tlb_flush_page(other_cs, value & TARGET_PAGE_MASK);
>> - }
>> + tlb_flush_page_all(value & TARGET_PAGE_MASK);
>> }
>>
>> static void tlbimvaa_is_write(CPUARMState *env, const ARMCPRegInfo *ri,
>> uint64_t value)
>> {
>> - CPUState *other_cs;
>> -
>> - CPU_FOREACH(other_cs) {
>> - tlb_flush_page(other_cs, value & TARGET_PAGE_MASK);
>> - }
>> + tlb_flush_page_all(value & TARGET_PAGE_MASK);
>> }
>>
>> static const ARMCPRegInfo cp_reginfo[] = {
--
Alex Bennée
^ permalink raw reply [flat|nested] 41+ messages in thread
* [Qemu-devel] [RFC v1 11/12] arm: atomically check the exclusive value in a STREX
2016-04-15 14:23 [Qemu-devel] [RFC v1 00/12] Enable MTTCG for 32 bit arm on x86 Alex Bennée
` (10 preceding siblings ...)
2016-04-15 14:23 ` [Qemu-devel] [RFC v1 10/12] arm: use tlb_flush_page_all for tlbimva[a] Alex Bennée
@ 2016-04-15 14:23 ` Alex Bennée
2016-04-15 14:23 ` [Qemu-devel] [RFC v1 12/12] cpus: default MTTCG to on for 32 bit ARM on x86 Alex Bennée
2016-04-15 19:12 ` [Qemu-devel] [RFC v1 00/12] Enable MTTCG for 32 bit arm " Alex Bennée
13 siblings, 0 replies; 41+ messages in thread
From: Alex Bennée @ 2016-04-15 14:23 UTC (permalink / raw)
To: mttcg, fred.konrad, a.rigo, serge.fdrv, cota
Cc: qemu-devel, mark.burton, pbonzini, jan.kiszka, rth, peter.maydell,
claudio.fontana, Alex Bennée, open list:ARM
From: KONRAD Frederic <fred.konrad@greensocs.com>
This mechanism replaces the existing load/store exclusive mechanism, which
seems to be broken for MTTCG. It follows the intention of the existing
mechanism: it stores the target address and data values during a load
operation and checks that they remain unchanged before a store.
In common with the older approach, this provides weaker semantics than
required, in that a different processor could write the same value with a
non-exclusive store and the STREX would still succeed; in practice this
seems to be irrelevant.
The old implementation didn't correctly store its values as globals, but
rather kept a local copy per CPU.
This new mechanism stores the values globally and also uses the atomic
cmpxchg macros to ensure atomicity; it is therefore very efficient and
thread-safe.
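Conceptually the check-before-store boils down to something like the sketch
below (illustrative only: guest_addr_to_host() is a made-up placeholder for
the address translation the patch performs via arm_get_phys_addr(), and
atomic_bool_cmpxchg() is the helper named in the changelog, whose exact
signature is assumed here):

static bool strex_succeeds(CPUARMState *env, uint64_t addr, uint32_t newval)
{
    /* Placeholder for translating the guest address to a host pointer. */
    uint32_t *host = guest_addr_to_host(env, addr);

    if (env->exclusive_addr != addr) {
        return false;                        /* exclusive monitor lost */
    }
    /* Succeed only if the location still holds the value read by LDREX,
     * updating it atomically in that case. */
    return atomic_bool_cmpxchg(host, (uint32_t)env->exclusive_val, newval);
}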
Signed-off-by: KONRAD Frederic <fred.konrad@greensocs.com>
[AJB: re-base, compile fixes, atomic_bool_cmpxchg]
Signed-off-by: Alex Bennée <alex.bennee@linaro.org>
---
v1 (enable-for-armv7-v1):
- re-base on top of base patches
- use atomic_bool_cmpxchg
- ws fixes
- protect with #ifdef CONFIG_SOFTMMU to fix linux-user compilation
---
target-arm/cpu.c | 21 ++++++++
target-arm/cpu.h | 6 +++
target-arm/helper.c | 16 ++++++
target-arm/helper.h | 6 +++
target-arm/op_helper.c | 130 ++++++++++++++++++++++++++++++++++++++++++++++++-
target-arm/translate.c | 96 ++++++------------------------------
6 files changed, 194 insertions(+), 81 deletions(-)
diff --git a/target-arm/cpu.c b/target-arm/cpu.c
index e48e83a..2c0e9a5 100644
--- a/target-arm/cpu.c
+++ b/target-arm/cpu.c
@@ -32,6 +32,26 @@
#include "sysemu/kvm.h"
#include "kvm_arm.h"
+/* Protect the cpu_exclusive_* variables. */
+__thread bool cpu_have_exclusive_lock;
+QemuSpin cpu_exclusive_lock;
+
+inline void arm_exclusive_lock(void)
+{
+ if (!cpu_have_exclusive_lock) {
+ qemu_spin_lock(&cpu_exclusive_lock);
+ cpu_have_exclusive_lock = true;
+ }
+}
+
+inline void arm_exclusive_unlock(void)
+{
+ if (cpu_have_exclusive_lock) {
+ cpu_have_exclusive_lock = false;
+ qemu_spin_unlock(&cpu_exclusive_lock);
+ }
+}
+
static void arm_cpu_set_pc(CPUState *cs, vaddr value)
{
ARMCPU *cpu = ARM_CPU(cs);
@@ -482,6 +502,7 @@ static void arm_cpu_initfn(Object *obj)
cpu->psci_version = 2; /* TCG implements PSCI 0.2 */
if (!inited) {
inited = true;
+ qemu_spin_init(&cpu_exclusive_lock);
arm_translate_init();
}
}
diff --git a/target-arm/cpu.h b/target-arm/cpu.h
index 066ff67..4055567 100644
--- a/target-arm/cpu.h
+++ b/target-arm/cpu.h
@@ -523,6 +523,9 @@ static inline bool is_a64(CPUARMState *env)
int cpu_arm_signal_handler(int host_signum, void *pinfo,
void *puc);
+bool arm_get_phys_addr(CPUARMState *env, target_ulong address, int access_type,
+ hwaddr *phys_ptr, int *prot, target_ulong *page_size);
+
/**
* pmccntr_sync
* @env: CPUARMState
@@ -2170,6 +2173,9 @@ static inline void cpu_get_tb_cpu_state(CPUARMState *env, target_ulong *pc,
#include "exec/exec-all.h"
+void arm_exclusive_lock(void);
+void arm_exclusive_unlock(void);
+
enum {
QEMU_PSCI_CONDUIT_DISABLED = 0,
QEMU_PSCI_CONDUIT_SMC = 1,
diff --git a/target-arm/helper.c b/target-arm/helper.c
index bc9fbda..560cbef 100644
--- a/target-arm/helper.c
+++ b/target-arm/helper.c
@@ -35,6 +35,18 @@ static bool get_phys_addr_lpae(CPUARMState *env, target_ulong address,
#define PMCRE 0x1
#endif
+#ifdef CONFIG_SOFTMMU
+bool arm_get_phys_addr(CPUARMState *env, target_ulong address, int access_type,
+ hwaddr *phys_ptr, int *prot, target_ulong *page_size)
+{
+ MemTxAttrs attrs = {};
+ ARMMMUFaultInfo fi = {};
+ uint32_t fsr;
+ return get_phys_addr(env, address, access_type, cpu_mmu_index(env, false),
+ phys_ptr, &attrs, prot, page_size, &fsr, &fi);
+}
+#endif
+
static int vfp_gdb_get_reg(CPUARMState *env, uint8_t *buf, int reg)
{
int nregs;
@@ -6104,6 +6116,10 @@ static void arm_cpu_do_interrupt_aarch32(CPUState *cs)
uint32_t offset;
uint32_t moe;
+ arm_exclusive_lock();
+ env->exclusive_addr = -1;
+ arm_exclusive_unlock();
+
/* If this is a debug exception we must update the DBGDSCR.MOE bits */
switch (env->exception.syndrome >> ARM_EL_EC_SHIFT) {
case EC_BREAKPOINT:
diff --git a/target-arm/helper.h b/target-arm/helper.h
index 84aa637..ba1fe4d 100644
--- a/target-arm/helper.h
+++ b/target-arm/helper.h
@@ -537,6 +537,12 @@ DEF_HELPER_2(dc_zva, void, env, i64)
DEF_HELPER_FLAGS_2(neon_pmull_64_lo, TCG_CALL_NO_RWG_SE, i64, i64, i64)
DEF_HELPER_FLAGS_2(neon_pmull_64_hi, TCG_CALL_NO_RWG_SE, i64, i64, i64)
+#ifdef CONFIG_SOFTMMU
+DEF_HELPER_4(atomic_cmpxchg64, i32, env, i32, i64, i32)
+#endif
+DEF_HELPER_1(atomic_clear, void, env)
+DEF_HELPER_3(atomic_claim, void, env, i32, i64)
+
#ifdef TARGET_AARCH64
#include "helper-a64.h"
#endif
diff --git a/target-arm/op_helper.c b/target-arm/op_helper.c
index d626ff1..669fb45 100644
--- a/target-arm/op_helper.c
+++ b/target-arm/op_helper.c
@@ -31,12 +31,141 @@ static void raise_exception(CPUARMState *env, uint32_t excp,
CPUState *cs = CPU(arm_env_get_cpu(env));
assert(!excp_is_internal(excp));
+ arm_exclusive_lock();
cs->exception_index = excp;
env->exception.syndrome = syndrome;
env->exception.target_el = target_el;
+ /*
+ * We MAY already have the lock - in which case we are exiting the
+ * instruction due to an exception. Otherwise we better make sure we are not
+ * about to enter a STREX anyway.
+ */
+ env->exclusive_addr = -1;
+ arm_exclusive_unlock();
cpu_loop_exit(cs);
}
+#ifdef CONFIG_SOFTMMU
+/* NB return 1 for fail, 0 for pass */
+uint32_t HELPER(atomic_cmpxchg64)(CPUARMState *env, uint32_t addr,
+ uint64_t newval, uint32_t size)
+{
+ ARMCPU *cpu = arm_env_get_cpu(env);
+ CPUState *cs = CPU(cpu);
+
+ uintptr_t retaddr = GETRA();
+ bool result = false;
+ hwaddr len = 1 << size;
+
+ hwaddr paddr;
+ target_ulong page_size;
+ int prot;
+
+ arm_exclusive_lock();
+
+ if (env->exclusive_addr != addr) {
+ arm_exclusive_unlock();
+ return 1;
+ }
+
+ if (arm_get_phys_addr(env, addr, 1, &paddr, &prot, &page_size)) {
+ tlb_fill(ENV_GET_CPU(env), addr, MMU_DATA_STORE,
+ cpu_mmu_index(env, false), retaddr);
+ if (arm_get_phys_addr(env, addr, 1, &paddr, &prot, &page_size)) {
+ arm_exclusive_unlock();
+ return 1;
+ }
+ }
+
+ switch (size) {
+ case 0:
+ {
+ uint8_t oldval, *p;
+ p = address_space_map(cs->as, paddr, &len, true);
+ if (len == 1 << size) {
+ oldval = (uint8_t)env->exclusive_val;
+ result = atomic_bool_cmpxchg(p, oldval, (uint8_t)newval);
+ }
+ address_space_unmap(cs->as, p, len, true, result ? len : 0);
+ }
+ break;
+ case 1:
+ {
+ uint16_t oldval, *p;
+ p = address_space_map(cs->as, paddr, &len, true);
+ if (len == 1 << size) {
+ oldval = (uint16_t)env->exclusive_val;
+ result = atomic_bool_cmpxchg(p, oldval, (uint16_t)newval);
+ }
+ address_space_unmap(cs->as, p, len, true, result ? len : 0);
+ }
+ break;
+ case 2:
+ {
+ uint32_t oldval, *p;
+ p = address_space_map(cs->as, paddr, &len, true);
+ if (len == 1 << size) {
+ oldval = (uint32_t)env->exclusive_val;
+ result = atomic_bool_cmpxchg(p, oldval, (uint32_t)newval);
+ }
+ address_space_unmap(cs->as, p, len, true, result ? len : 0);
+ }
+ break;
+ case 3:
+ {
+ uint64_t oldval, *p;
+ p = address_space_map(cs->as, paddr, &len, true);
+ if (len == 1 << size) {
+ oldval = (uint64_t)env->exclusive_val;
+ result = atomic_bool_cmpxchg(p, oldval, (uint64_t)newval);
+ }
+ address_space_unmap(cs->as, p, len, true, result ? len : 0);
+ }
+ break;
+ default:
+ abort();
+ break;
+ }
+
+ env->exclusive_addr = -1;
+ arm_exclusive_unlock();
+ if (result) {
+ return 0;
+ } else {
+ return 1;
+ }
+}
+#endif
+
+void HELPER(atomic_clear)(CPUARMState *env)
+{
+ /* make sure no STREX is about to start */
+ arm_exclusive_lock();
+ env->exclusive_addr = -1;
+ arm_exclusive_unlock();
+}
+
+void HELPER(atomic_claim)(CPUARMState *env, uint32_t addr, uint64_t val)
+{
+ CPUState *cpu;
+ CPUARMState *current_cpu;
+
+ /* ensure that there are no STREXs executing */
+ arm_exclusive_lock();
+
+ CPU_FOREACH(cpu) {
+ current_cpu = &ARM_CPU(cpu)->env;
+ if (current_cpu->exclusive_addr == addr) {
+ /* We steal this CPU's exclusive reservation. */
+ current_cpu->exclusive_addr = -1;
+ }
+ }
+
+ env->exclusive_val = val;
+ env->exclusive_addr = addr;
+ arm_exclusive_unlock();
+}
+
static int exception_target_el(CPUARMState *env)
{
int target_el = MAX(1, arm_current_el(env));
@@ -862,7 +991,6 @@ void HELPER(exception_return)(CPUARMState *env)
aarch64_save_sp(env, cur_el);
- env->exclusive_addr = -1;
/* We must squash the PSTATE.SS bit to zero unless both of the
* following hold:
diff --git a/target-arm/translate.c b/target-arm/translate.c
index 940ec8d..c946c0e 100644
--- a/target-arm/translate.c
+++ b/target-arm/translate.c
@@ -7691,6 +7691,7 @@ static void gen_load_exclusive(DisasContext *s, int rt, int rt2,
TCGv_i32 addr, int size)
{
TCGv_i32 tmp = tcg_temp_new_i32();
+ TCGv_i64 val = tcg_temp_new_i64();
s->is_ldex = true;
@@ -7715,20 +7716,20 @@ static void gen_load_exclusive(DisasContext *s, int rt, int rt2,
tcg_gen_addi_i32(tmp2, addr, 4);
gen_aa32_ld32u(s, tmp3, tmp2, get_mem_index(s));
+ tcg_gen_concat_i32_i64(val, tmp, tmp3);
tcg_temp_free_i32(tmp2);
- tcg_gen_concat_i32_i64(cpu_exclusive_val, tmp, tmp3);
store_reg(s, rt2, tmp3);
} else {
- tcg_gen_extu_i32_i64(cpu_exclusive_val, tmp);
+ tcg_gen_extu_i32_i64(val, tmp);
}
-
+ gen_helper_atomic_claim(cpu_env, addr, val);
+ tcg_temp_free_i64(val);
store_reg(s, rt, tmp);
- tcg_gen_extu_i32_i64(cpu_exclusive_addr, addr);
}
static void gen_clrex(DisasContext *s)
{
- tcg_gen_movi_i64(cpu_exclusive_addr, -1);
+ gen_helper_atomic_clear(cpu_env);
}
#ifdef CONFIG_USER_ONLY
@@ -7745,84 +7746,19 @@ static void gen_store_exclusive(DisasContext *s, int rd, int rt, int rt2,
TCGv_i32 addr, int size)
{
TCGv_i32 tmp;
- TCGv_i64 val64, extaddr;
- TCGLabel *done_label;
- TCGLabel *fail_label;
-
- /* if (env->exclusive_addr == addr && env->exclusive_val == [addr]) {
- [addr] = {Rt};
- {Rd} = 0;
- } else {
- {Rd} = 1;
- } */
- fail_label = gen_new_label();
- done_label = gen_new_label();
- extaddr = tcg_temp_new_i64();
- tcg_gen_extu_i32_i64(extaddr, addr);
- tcg_gen_brcond_i64(TCG_COND_NE, extaddr, cpu_exclusive_addr, fail_label);
- tcg_temp_free_i64(extaddr);
-
- tmp = tcg_temp_new_i32();
- switch (size) {
- case 0:
- gen_aa32_ld8u(s, tmp, addr, get_mem_index(s));
- break;
- case 1:
- gen_aa32_ld16u(s, tmp, addr, get_mem_index(s));
- break;
- case 2:
- case 3:
- gen_aa32_ld32u(s, tmp, addr, get_mem_index(s));
- break;
- default:
- abort();
- }
-
- val64 = tcg_temp_new_i64();
- if (size == 3) {
- TCGv_i32 tmp2 = tcg_temp_new_i32();
- TCGv_i32 tmp3 = tcg_temp_new_i32();
- tcg_gen_addi_i32(tmp2, addr, 4);
- gen_aa32_ld32u(s, tmp3, tmp2, get_mem_index(s));
- tcg_temp_free_i32(tmp2);
- tcg_gen_concat_i32_i64(val64, tmp, tmp3);
- tcg_temp_free_i32(tmp3);
- } else {
- tcg_gen_extu_i32_i64(val64, tmp);
- }
- tcg_temp_free_i32(tmp);
-
- tcg_gen_brcond_i64(TCG_COND_NE, val64, cpu_exclusive_val, fail_label);
- tcg_temp_free_i64(val64);
+ TCGv_i32 tmp2;
+ TCGv_i64 val = tcg_temp_new_i64();
+ TCGv_i32 tmp_size = tcg_const_i32(size);
tmp = load_reg(s, rt);
- switch (size) {
- case 0:
- gen_aa32_st8(s, tmp, addr, get_mem_index(s));
- break;
- case 1:
- gen_aa32_st16(s, tmp, addr, get_mem_index(s));
- break;
- case 2:
- case 3:
- gen_aa32_st32(s, tmp, addr, get_mem_index(s));
- break;
- default:
- abort();
- }
+ tmp2 = load_reg(s, rt2);
+ tcg_gen_concat_i32_i64(val, tmp, tmp2);
tcg_temp_free_i32(tmp);
- if (size == 3) {
- tcg_gen_addi_i32(addr, addr, 4);
- tmp = load_reg(s, rt2);
- gen_aa32_st32(s, tmp, addr, get_mem_index(s));
- tcg_temp_free_i32(tmp);
- }
- tcg_gen_movi_i32(cpu_R[rd], 0);
- tcg_gen_br(done_label);
- gen_set_label(fail_label);
- tcg_gen_movi_i32(cpu_R[rd], 1);
- gen_set_label(done_label);
- tcg_gen_movi_i64(cpu_exclusive_addr, -1);
+ tcg_temp_free_i32(tmp2);
+
+ gen_helper_atomic_cmpxchg64(cpu_R[rd], cpu_env, addr, val, tmp_size);
+ tcg_temp_free_i64(val);
+ tcg_temp_free_i32(tmp_size);
}
#endif
--
2.7.4
^ permalink raw reply related [flat|nested] 41+ messages in thread
* [Qemu-devel] [RFC v1 12/12] cpus: default MTTCG to on for 32 bit ARM on x86
2016-04-15 14:23 [Qemu-devel] [RFC v1 00/12] Enable MTTCG for 32 bit arm on x86 Alex Bennée
` (11 preceding siblings ...)
2016-04-15 14:23 ` [Qemu-devel] [RFC v1 11/12] arm: atomically check the exclusive value in a STREX Alex Bennée
@ 2016-04-15 14:23 ` Alex Bennée
2016-06-05 17:12 ` Sergey Fedorov
2016-06-06 10:26 ` Peter Maydell
2016-04-15 19:12 ` [Qemu-devel] [RFC v1 00/12] Enable MTTCG for 32 bit arm " Alex Bennée
13 siblings, 2 replies; 41+ messages in thread
From: Alex Bennée @ 2016-04-15 14:23 UTC (permalink / raw)
To: mttcg, fred.konrad, a.rigo, serge.fdrv, cota
Cc: qemu-devel, mark.burton, pbonzini, jan.kiszka, rth, peter.maydell,
claudio.fontana, Alex Bennée, Peter Crosthwaite
This makes multi-threading the default for 32 bit ARM on x86. It has
been tested with Debian Jessie as well as my extended KVM unit tests
which stress the SMC and TB invalidation code. Those tests can be found
at:
https://github.com/stsquad/kvm-unit-tests/tree/mttcg/current-tests-v5
Signed-off-by: Alex Bennée <alex.bennee@linaro.org>
---
cpus.c | 20 ++++++++++++++++----
1 file changed, 16 insertions(+), 4 deletions(-)
diff --git a/cpus.c b/cpus.c
index 860e2a9..daa92c7 100644
--- a/cpus.c
+++ b/cpus.c
@@ -171,12 +171,24 @@ opts_init(tcg_register_config);
static bool default_mttcg_enabled(void)
{
- /*
- * TODO: Check if we have a chance to have MTTCG working on this guest/host.
- * Basically is the atomic instruction implemented? Is there any
- * memory ordering issue?
+ /* Checklist for enabling MTTCG on a given frontend/backend combination
+ *
+ * - Are atomics correctly modelled for an MTTCG environment
+ * - If the backend is weakly ordered
+ * - has the front-end implemented explicit memory ordering ops
+ * - does the back-end generate code to ensure memory ordering
*/
+#if defined(__i386__) || defined(__x86_64__)
+ /* x86 backend is strongly ordered which helps a lot */
+ #if defined(TARGET_ARM)
+ return true;
+ #else
+ return false;
+ #endif
+#else
+ /* Until memory ordering is implemented things will likely break */
return false;
+#endif
}
void qemu_tcg_configure(QemuOpts *opts)
--
2.7.4
^ permalink raw reply related [flat|nested] 41+ messages in thread
* Re: [Qemu-devel] [RFC v1 12/12] cpus: default MTTCG to on for 32 bit ARM on x86
2016-04-15 14:23 ` [Qemu-devel] [RFC v1 12/12] cpus: default MTTCG to on for 32 bit ARM on x86 Alex Bennée
@ 2016-06-05 17:12 ` Sergey Fedorov
2016-06-06 8:58 ` Alex Bennée
2016-06-06 10:26 ` Peter Maydell
1 sibling, 1 reply; 41+ messages in thread
From: Sergey Fedorov @ 2016-06-05 17:12 UTC (permalink / raw)
To: Alex Bennée, mttcg, fred.konrad, a.rigo, cota
Cc: qemu-devel, mark.burton, pbonzini, jan.kiszka, rth, peter.maydell,
claudio.fontana, Peter Crosthwaite
On 15/04/16 17:23, Alex Bennée wrote:
> This makes multi-threading the default for 32 bit ARM on x86. It has
> been tested with Debian Jessie as well as my extended KVM unit tests
> which stress the SMC and TB invalidation code. Those tests can be found
> at:
>
> https://github.com/stsquad/kvm-unit-tests/tree/mttcg/current-tests-v5
>
> Signed-off-by: Alex Bennée <alex.bennee@linaro.org>
> ---
> cpus.c | 20 ++++++++++++++++----
> 1 file changed, 16 insertions(+), 4 deletions(-)
>
> diff --git a/cpus.c b/cpus.c
> index 860e2a9..daa92c7 100644
> --- a/cpus.c
> +++ b/cpus.c
> @@ -171,12 +171,24 @@ opts_init(tcg_register_config);
>
> static bool default_mttcg_enabled(void)
> {
> - /*
> - * TODO: Check if we have a chance to have MTTCG working on this guest/host.
> - * Basically is the atomic instruction implemented? Is there any
> - * memory ordering issue?
> + /* Checklist for enabling MTTCG on a given frontend/backend combination
> + *
> + * - Are atomics correctly modelled for an MTTCG environment
- Are cross-CPU manipulations safe (e.g. TLB invalidation/flush from
target helper)
- Are TCG context manipulations safe (e.g. TB invalidation from target
helper)
> + * - If the backend is weakly ordered
> + * - has the front-end implemented explicit memory ordering ops
> + * - does the back-end generate code to ensure memory ordering
> */
> +#if defined(__i386__) || defined(__x86_64__)
> + /* x86 backend is strongly ordered which helps a lot */
> + #if defined(TARGET_ARM)
> + return true;
> + #else
> + return false;
> + #endif
Is it okay to indent preprocessor lines this way? I think preprocessor
lines should stand out from regular code and could be indented
like this:
#if defined(__foo__)
# if defined(BAR)
/* ... */
# else
/* ... */
# endif
#else
/* ... */
#endif
Kind regards,
Sergey
> +#else
> + /* Until memory ordering is implemented things will likely break */
> return false;
> +#endif
> }
>
> void qemu_tcg_configure(QemuOpts *opts)
^ permalink raw reply [flat|nested] 41+ messages in thread
* Re: [Qemu-devel] [RFC v1 12/12] cpus: default MTTCG to on for 32 bit ARM on x86
2016-06-05 17:12 ` Sergey Fedorov
@ 2016-06-06 8:58 ` Alex Bennée
2016-06-06 10:19 ` Sergey Fedorov
0 siblings, 1 reply; 41+ messages in thread
From: Alex Bennée @ 2016-06-06 8:58 UTC (permalink / raw)
To: Sergey Fedorov
Cc: mttcg, fred.konrad, a.rigo, cota, qemu-devel, mark.burton,
pbonzini, jan.kiszka, rth, peter.maydell, claudio.fontana,
Peter Crosthwaite
Sergey Fedorov <serge.fdrv@gmail.com> writes:
> On 15/04/16 17:23, Alex Bennée wrote:
>> This makes multi-threading the default for 32 bit ARM on x86. It has
>> been tested with Debian Jessie as well as my extended KVM unit tests
>> which stress the SMC and TB invalidation code. Those tests can be found
>> at:
>>
>> https://github.com/stsquad/kvm-unit-tests/tree/mttcg/current-tests-v5
>>
>> Signed-off-by: Alex Bennée <alex.bennee@linaro.org>
>> ---
>> cpus.c | 20 ++++++++++++++++----
>> 1 file changed, 16 insertions(+), 4 deletions(-)
>>
>> diff --git a/cpus.c b/cpus.c
>> index 860e2a9..daa92c7 100644
>> --- a/cpus.c
>> +++ b/cpus.c
>> @@ -171,12 +171,24 @@ opts_init(tcg_register_config);
>>
>> static bool default_mttcg_enabled(void)
>> {
>> - /*
>> - * TODO: Check if we have a chance to have MTTCG working on this guest/host.
>> - * Basically is the atomic instruction implemented? Is there any
>> - * memory ordering issue?
>> + /* Checklist for enabling MTTCG on a given frontend/backend combination
>> + *
>> + * - Are atomics correctly modelled for an MTTCG environment
>
> - Are cross-CPU manipulations safe (e.g. TLB invalidation/flush from
> target helper)
> - Are TCG context manipulations safe (e.g. TB invalidation from target
> helper)
OK
>
>> + * - If the backend is weakly ordered
>> + * - has the front-end implemented explicit memory ordering ops
>> + * - does the back-end generate code to ensure memory ordering
>> */
>> +#if defined(__i386__) || defined(__x86_64__)
>> + /* x86 backend is strongly ordered which helps a lot */
>> + #if defined(TARGET_ARM)
>> + return true;
>> + #else
>> + return false;
>> + #endif
>
> Is it okay to indent preprocessor lines this way? I think preprocessor
> lines should stand out from regular code and could be indented
> like this:
>
> #if defined(__foo__)
> # if defined(BAR)
> /* ... */
> # else
> /* ... */
> # endif
> #else
> /* ... */
> #endif
To be honest I was expecting more push-back on this because it is such
an ugly way of solving the problem and expressing what "default on"
means.
>
> Kind regards,
> Sergey
>
>> +#else
>> + /* Until memory ordering is implemented things will likely break */
>> return false;
>> +#endif
>> }
>>
>> void qemu_tcg_configure(QemuOpts *opts)
--
Alex Bennée
^ permalink raw reply [flat|nested] 41+ messages in thread
* Re: [Qemu-devel] [RFC v1 12/12] cpus: default MTTCG to on for 32 bit ARM on x86
2016-06-06 8:58 ` Alex Bennée
@ 2016-06-06 10:19 ` Sergey Fedorov
0 siblings, 0 replies; 41+ messages in thread
From: Sergey Fedorov @ 2016-06-06 10:19 UTC (permalink / raw)
To: Alex Bennée
Cc: mttcg, fred.konrad, a.rigo, cota, qemu-devel, mark.burton,
pbonzini, jan.kiszka, rth, peter.maydell, claudio.fontana,
Peter Crosthwaite
On 06/06/16 11:58, Alex Bennée wrote:
> Sergey Fedorov <serge.fdrv@gmail.com> writes:
>
>> On 15/04/16 17:23, Alex Bennée wrote:
>>> diff --git a/cpus.c b/cpus.c
>>> index 860e2a9..daa92c7 100644
>>> --- a/cpus.c
>>> +++ b/cpus.c
>>> @@ -171,12 +171,24 @@ opts_init(tcg_register_config);
>>>
>>> static bool default_mttcg_enabled(void)
>>> {
>>> - /*
>>> - * TODO: Check if we have a chance to have MTTCG working on this guest/host.
>>> - * Basically is the atomic instruction implemented? Is there any
>>> - * memory ordering issue?
>>> + /* Checklist for enabling MTTCG on a given frontend/backend combination
>>> + *
>>> + * - Are atomics correctly modelled for an MTTCG environment
>> - Are cross-CPU manipulations safe (e.g. TLB invalidation/flush from
>> target helper)
>> - Are TCG context manipulations safe (e.g. TB invalidation from target
>> helper)
> OK
>
>>> + * - If the backend is weakly ordered
>>> + * - has the front-end implemented explicit memory ordering ops
>>> + * - does the back-end generate code to ensure memory ordering
>>> */
>>> +#if defined(__i386__) || defined(__x86_64__)
>>> + /* x86 backend is strongly ordered which helps a lot */
>>> + #if defined(TARGET_ARM)
>>> + return true;
>>> + #else
>>> + return false;
>>> + #endif
>> Is it okay to indent preprocessor lines this way? I think preprocessor
>> lines should stand out from regular code and could be indented
>> like this:
>>
>> #if defined(__foo__)
>> # if defined(BAR)
>> /* ... */
>> # else
>> /* ... */
>> # endif
>> #else
>> /* ... */
>> #endif
> To be honest I was expecting more push-back on this because it is such
> an ugly way of solving the problem and expressing what "default on"
> means.
I could be okay with this as long as there are only a few options. We
could also put some generic tests here, like strong/weak ordering checks,
and introduce target- and host-specific functions which can tell us
whether we should ever try enabling MTTCG for them.
Kind regards,
Sergey
^ permalink raw reply [flat|nested] 41+ messages in thread
* Re: [Qemu-devel] [RFC v1 12/12] cpus: default MTTCG to on for 32 bit ARM on x86
2016-04-15 14:23 ` [Qemu-devel] [RFC v1 12/12] cpus: default MTTCG to on for 32 bit ARM on x86 Alex Bennée
2016-06-05 17:12 ` Sergey Fedorov
@ 2016-06-06 10:26 ` Peter Maydell
2016-06-06 14:28 ` Alex Bennée
1 sibling, 1 reply; 41+ messages in thread
From: Peter Maydell @ 2016-06-06 10:26 UTC (permalink / raw)
To: Alex Bennée
Cc: MTTCG Devel, KONRAD Frédéric, Alvise Rigo,
Sergey Fedorov, Emilio G. Cota, QEMU Developers, Mark Burton,
Paolo Bonzini, J. Kiszka, Richard Henderson, Claudio Fontana,
Peter Crosthwaite
On 15 April 2016 at 15:23, Alex Bennée <alex.bennee@linaro.org> wrote:
> This makes multi-threading the default for 32 bit ARM on x86. It has
> been tested with Debian Jessie as well as my extended KVM unit tests
> which stress the SMC and TB invalidation code. Those tests can be found
> at:
>
> https://github.com/stsquad/kvm-unit-tests/tree/mttcg/current-tests-v5
>
> Signed-off-by: Alex Bennée <alex.bennee@linaro.org>
> ---
> cpus.c | 20 ++++++++++++++++----
> 1 file changed, 16 insertions(+), 4 deletions(-)
>
> diff --git a/cpus.c b/cpus.c
> index 860e2a9..daa92c7 100644
> --- a/cpus.c
> +++ b/cpus.c
> @@ -171,12 +171,24 @@ opts_init(tcg_register_config);
>
> static bool default_mttcg_enabled(void)
> {
> - /*
> - * TODO: Check if we have a chance to have MTTCG working on this guest/host.
> - * Basically is the atomic instruction implemented? Is there any
> - * memory ordering issue?
> + /* Checklist for enabling MTTCG on a given frontend/backend combination
> + *
> + * - Are atomics correctly modelled for an MTTCG environment
> + * - If the backend is weakly ordered
> + * - has the front-end implemented explicit memory ordering ops
> + * - does the back-end generate code to ensure memory ordering
> */
> +#if defined(__i386__) || defined(__x86_64__)
> + /* x86 backend is strongly ordered which helps a lot */
> + #if defined(TARGET_ARM)
> + return true;
> + #else
> + return false;
> + #endif
> +#else
> + /* Until memory ordering is implemented things will likely break */
> return false;
> +#endif
No new per-host ifdef ladders, please (or per-target ifdef ladders,
either). Have some #defines for "TCG backend supports MTTCG" and
"TCG frontend supports MTTCG" which get set in some suitable per-host
and per-target header, and only enable if they're both set.
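Something along these lines, say (the macro and header names here are only
illustrative, not a proposal for the actual spellings):

/* per-host TCG backend header, e.g. tcg/i386/tcg-target.h */
#define TCG_TARGET_SUPPORTS_MTTCG 1

/* per-target front end header, e.g. target-arm/cpu.h */
#define TARGET_SUPPORTS_MTTCG 1

/* cpus.c */
static bool default_mttcg_enabled(void)
{
#if defined(TCG_TARGET_SUPPORTS_MTTCG) && defined(TARGET_SUPPORTS_MTTCG)
    return true;
#else
    return false;
#endif
}

That way adding a new host or guest is a one-line change in its own
header rather than another arm of the ladder.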
thanks
-- PMM
^ permalink raw reply [flat|nested] 41+ messages in thread
* Re: [Qemu-devel] [RFC v1 12/12] cpus: default MTTCG to on for 32 bit ARM on x86
2016-06-06 10:26 ` Peter Maydell
@ 2016-06-06 14:28 ` Alex Bennée
2016-06-06 14:37 ` Peter Maydell
0 siblings, 1 reply; 41+ messages in thread
From: Alex Bennée @ 2016-06-06 14:28 UTC (permalink / raw)
To: Peter Maydell
Cc: MTTCG Devel, KONRAD Frédéric, Alvise Rigo,
Sergey Fedorov, Emilio G. Cota, QEMU Developers, Mark Burton,
Paolo Bonzini, J. Kiszka, Richard Henderson, Claudio Fontana,
Peter Crosthwaite
Peter Maydell <peter.maydell@linaro.org> writes:
> On 15 April 2016 at 15:23, Alex Bennée <alex.bennee@linaro.org> wrote:
>> This makes multi-threading the default for 32 bit ARM on x86. It has
>> been tested with Debian Jessie as well as my extended KVM unit tests
>> which stress the SMC and TB invalidation code. Those tests can be found
>> at:
>>
>> https://github.com/stsquad/kvm-unit-tests/tree/mttcg/current-tests-v5
>>
>> Signed-off-by: Alex Bennée <alex.bennee@linaro.org>
>> ---
>> cpus.c | 20 ++++++++++++++++----
>> 1 file changed, 16 insertions(+), 4 deletions(-)
>>
>> diff --git a/cpus.c b/cpus.c
>> index 860e2a9..daa92c7 100644
>> --- a/cpus.c
>> +++ b/cpus.c
>> @@ -171,12 +171,24 @@ opts_init(tcg_register_config);
>>
>> static bool default_mttcg_enabled(void)
>> {
>> - /*
>> - * TODO: Check if we have a chance to have MTTCG working on this guest/host.
>> - * Basically is the atomic instruction implemented? Is there any
>> - * memory ordering issue?
>> + /* Checklist for enabling MTTCG on a given frontend/backend combination
>> + *
>> + * - Are atomics correctly modelled for an MTTCG environment
>> + * - If the backend is weakly ordered
>> + * - has the front-end implemented explicit memory ordering ops
>> + * - does the back-end generate code to ensure memory ordering
>> */
>> +#if defined(__i386__) || defined(__x86_64__)
>> + /* x86 backend is strongly ordered which helps a lot */
>> + #if defined(TARGET_ARM)
>> + return true;
>> + #else
>> + return false;
>> + #endif
>> +#else
>> + /* Until memory ordering is implemented things will likely break */
>> return false;
>> +#endif
>
> No new per-host ifdef ladders, please (or per-target ifdef ladders,
> either). Have some #defines for "TCG backend supports MTTCG" and
> "TCG frontend supports MTTCG" which get set in some suitable per-host
> and per-target header, and only enable if they're both set.
Will do so. I guess the middling case, where the backend is strongly
ordered enough to get away with a partial barrier implementation at the
front end, should be skipped? We'll only turn on the frontend/backend
support flags when:
* All frontends fully express the ordering constraints to the TCG
(e.g. all barriers and annotations complete)
* The backend emits enough code to ensure any ordering constraint
expressed in TCG ops can be satisfied (rough sketch below).
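To make the second point concrete, roughly this shape (the op and helper
names below are hypothetical, nothing that exists in the tree today):

/* front end (hypothetical): translating a guest DMB emits an explicit
 * fence op into the TCG opcode stream instead of silently dropping it */
static void gen_dmb(DisasContext *s)
{
    tcg_gen_fence(FENCE_FULL);
}

/* back end (hypothetical, weakly ordered host): code generation lowers
 * the fence op to a real host barrier instruction */
case INDEX_op_fence:
    tcg_out_host_barrier(s);
    break;

A strongly ordered host like x86 can elide most of these, which is why it
is the easier starting point.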
Are you happy to keep the commentary here with the default function as
that is where people are likely to end up when searching?
>
> thanks
> -- PMM
--
Alex Bennée
^ permalink raw reply [flat|nested] 41+ messages in thread
* Re: [Qemu-devel] [RFC v1 12/12] cpus: default MTTCG to on for 32 bit ARM on x86
2016-06-06 14:28 ` Alex Bennée
@ 2016-06-06 14:37 ` Peter Maydell
0 siblings, 0 replies; 41+ messages in thread
From: Peter Maydell @ 2016-06-06 14:37 UTC (permalink / raw)
To: Alex Bennée
Cc: MTTCG Devel, KONRAD Frédéric, Alvise Rigo,
Sergey Fedorov, Emilio G. Cota, QEMU Developers, Mark Burton,
Paolo Bonzini, J. Kiszka, Richard Henderson, Claudio Fontana,
Peter Crosthwaite
On 6 June 2016 at 15:28, Alex Bennée <alex.bennee@linaro.org> wrote:
> Peter Maydell <peter.maydell@linaro.org> writes:
>> No new per-host ifdef ladders, please (or per-target ifdef ladders,
>> either). Have some #defines for "TCG backend supports MTTCG" and
>> "TCG frontend supports MTTCG" which get set in some suitable per-host
>> and per-target header, and only enable if they're both set.
>
> Will do so. I guess the middling case of backend is strongly ordered
> enough to get away with partial barrier implementation at the front end
> should be skipped?
I don't mind if you have multiple ifdefs for "backend fully supports
MTTCG" and "backend partially supports MTTCG" or whatever combination
makes sense -- I haven't looked enough at the implementation to
know what would be best. I just want to avoid ifdef ladders.
> Are you happy to keep the commentary here with the default function as
> that is where people are likely to end up when searching?
Yes, that makes sense. Consider also a section in tcg/README
documenting the requirements for a backend.
thanks
-- PMM
^ permalink raw reply [flat|nested] 41+ messages in thread
* Re: [Qemu-devel] [RFC v1 00/12] Enable MTTCG for 32 bit arm on x86
2016-04-15 14:23 [Qemu-devel] [RFC v1 00/12] Enable MTTCG for 32 bit arm on x86 Alex Bennée
` (12 preceding siblings ...)
2016-04-15 14:23 ` [Qemu-devel] [RFC v1 12/12] cpus: default MTTCG to on for 32 bit ARM on x86 Alex Bennée
@ 2016-04-15 19:12 ` Alex Bennée
13 siblings, 0 replies; 41+ messages in thread
From: Alex Bennée @ 2016-04-15 19:12 UTC (permalink / raw)
To: mttcg, fred.konrad, a.rigo, serge.fdrv, cota
Cc: qemu-devel, mark.burton, pbonzini, jan.kiszka, rth, peter.maydell,
claudio.fontana, Dr. David Alan Gilbert
Alex Bennée <alex.bennee@linaro.org> writes:
> Hi,
>
> This series finally completes the re-build of Fred's multi_tcg_v8 tree
> by enabling MTTCG for armv7 guests on x86 hosts. This applies on top
> of the previous series:
<snip>
>
> Benchmarks
> ==========
>
> The benchmark is a simple boot and build test which builds stress-ng
> with -j ${NR_CPUS} and shuts down to facilitate easy repetition.
>
> arm-softmmu/qemu-system-arm -machine type=virt -display none -m 4096 \
> -cpu cortex-a15 -serial telnet:127.0.0.1:4444 \
> -monitor stdio -netdev user,id=unet,hostfwd=tcp::2222-:22 \
> -device virtio-net -device,netdev=unet \
> -drive file=/home/alex/lsrc/qemu/images/jessie-arm32.qcow2,id=myblock,index=0,if=none \
> -device virtio-blk-device,drive=myblock
> -append "console=ttyAMA0 systemd.unit=benchmark-build.service root=/dev/vda1"
> -kernel /home/alex/lsrc/qemu/images/aarch32-current-linux-kernel-only.img
>
>
> | -smp 1 (mttcg=off) | -smp 4 (mttcg=off) | -smp 4 (mttcg=on) |
> |--------------------+--------------------+-------------------|
> | 301.60 (5 runs) | 312.27 (4 runs) | 573.26 (5 runs) |
>
> As the results show currently the performance for mttcg is worse than
> the single threaded version. However this tree doesn't have the
> lockless tb_find_fast which means every time there is a transition
> from one page to the next the lock needs to be taken. There is still
> work to be done for performance ;-)
>
> Alex Bennée (5):
> qemu-thread: add simple test-and-set spinlock
> atomic: introduce atomic_dec_fetch.
> atomic: introduce cmpxchg_bool
> cpus: pass CPUState to run_on_cpu helpers
> cpus: default MTTCG to on for 32 bit ARM on x86
>
> KONRAD Frederic (5):
> cpus: introduce async_safe_run_on_cpu.
> cputlb: introduce tlb_flush_* async work.
> translate-all: introduces tb_flush_safe.
> arm: use tlb_flush_page_all for tlbimva[a]
> arm: atomically check the exclusive value in a STREX
>
> Paolo Bonzini (1):
> include: move CPU-related definitions out of qemu-common.h
>
> Sergey Fedorov (1):
> tcg/i386: Make direct jump patching thread-safe
>
> cpu-exec-common.c | 1 +
> cpu-exec.c | 11 ++++
> cpus.c | 137 +++++++++++++++++++++++++++++++++++++++++-----
> cputlb.c | 61 ++++++++++++++++-----
> hw/i386/kvm/apic.c | 3 +-
> hw/i386/kvmvapic.c | 8 +--
> hw/ppc/ppce500_spin.c | 3 +-
> hw/ppc/spapr.c | 6 +-
> hw/ppc/spapr_hcall.c | 12 ++--
> include/exec/exec-all.h | 7 ++-
> include/qemu-common.h | 24 --------
> include/qemu/atomic.h | 15 +++++
> include/qemu/processor.h | 28 ++++++++++
> include/qemu/thread.h | 34 ++++++++++++
> include/qemu/timer.h | 1 +
> include/qom/cpu.h | 34 +++++++++++-
> include/sysemu/cpus.h | 13 +++++
As suggested by treblig, I also ran a more purely CPU-heavy task (pigz
compression of a kernel tarball):
command is ['/home/alex/lsrc/qemu/qemu.git/arm-softmmu/qemu-system-arm', '-machine', 'type=virt', '-display', 'none', '-m', '4096', '-cpu', 'cortex-a15', '-serial', 'telnet:127.0.0.1:4444', '-monitor', 'stdio', '-netdev', 'user,id=unet,hostfwd=tcp::2222-:22', '-device', 'virtio-net-device,netdev=unet', '-drive', 'file=/home/alex/lsrc/qemu/images/jessie-arm32.qcow2,id=myblock,index=0,if=none', '-device', 'virtio-blk-device,drive=myblock', '-append', 'console=ttyAMA0 root=/dev/vda1 systemd.unit=benchmark-pigz.service', '-kernel', '/home/alex/lsrc/qemu/images/aarch32-current-linux-kernel-only.img', '-smp', '1', '-tcg', 'mttcg=off']
run 1: ret=0 (PASS), time=136.379699 (1/1)
run 2: ret=0 (PASS), time=135.358848 (2/2)
run 3: ret=0 (PASS), time=135.708094 (3/3)
run 4: ret=0 (PASS), time=136.076002 (4/4)
run 5: ret=0 (PASS), time=137.863306 (5/5)
command is ['/home/alex/lsrc/qemu/qemu.git/arm-softmmu/qemu-system-arm', '-machine', 'type=virt', '-display', 'none', '-m', '4096', '-cpu', 'cortex-a15', '-serial', 'telnet:127.0.0.1:4444', '-monitor', 'stdio', '-netdev', 'user,id=unet,hostfwd=tcp::2222-:22', '-device', 'virtio-net-device,netdev=unet', '-drive', 'file=/home/alex/lsrc/qemu/images/jessie-arm32.qcow2,id=myblock,index=0,if=none', '-device', 'virtio-blk-device,drive=myblock', '-append', 'console=ttyAMA0 root=/dev/vda1 systemd.unit=benchmark-pigz.service', '-kernel', '/home/alex/lsrc/qemu/images/aarch32-current-linux-kernel-only.img', '-smp', '4', '-tcg', 'mttcg=on']
run 1: ret=0 (PASS), time=142.524636 (1/1)
run 2: ret=0 (PASS), time=139.960601 (2/2)
run 3: ret=0 (PASS), time=137.956633 (3/3)
run 4: ret=0 (PASS), time=139.699225 (4/4)
run 5: ret=0 (PASS), time=143.365373 (5/5)
More parity, but of course we'd actually want it to be faster.
--
Alex Bennée
^ permalink raw reply [flat|nested] 41+ messages in thread