* [Qemu-devel] [RFC v8 00/14] Slow-path for atomic instruction translation
From: Alvise Rigo @ 2016-04-19 13:39 UTC
To: qemu-devel, mttcg
Cc: jani.kokkonen, claudio.fontana, tech, alex.bennee, pbonzini, rth,
serge.fdrv, Alvise Rigo
This is the eighth iteration of the patch series which applies to the
upstream branch of QEMU (v2.5.0+).
Changes versus previous versions are at the bottom of this cover letter.
The code is also available at the following repository:
https://git.virtualopensystems.com/dev/qemu-mt.git
branch:
slowpath-for-atomic-v8-no-mttcg
This patch series provides an infrastructure for implementing atomic
instructions in QEMU, thus offering a 'legacy' solution for
translating guest atomic instructions. Moreover, it can be considered
a first step toward a multi-threaded TCG.
The underlying idea is to provide new TCG helpers (a sort of softmmu
helpers) that guarantee atomicity for certain memory accesses or, more
generally, offer a way to define memory transactions.
More specifically, the new softmmu helpers behave as LoadLink and
StoreConditional instructions, and are called from TCG code by means of
target specific helpers. This work includes the implementation for all
the ARM atomic instructions, see target-arm/op_helper.c.
The implementation heavily uses the software TLB together with a new
bitmap added to the ram_list structure, which flags, on a per-CPU
basis, all the memory pages that are in the middle of a LoadLink
(LL)/StoreConditional (SC) operation. Since all these pages can be
accessed directly through the fast path and thus alter a vCPU's linked
value, the new bitmap has been coupled with a new TLB flag on the TLB
virtual address which forces slow-path execution for all accesses to a
page containing a linked address.
The new slow-path is implemented such that:
- the LL behaves as a normal load slow-path, except for clearing the
dirty flag in the bitmap. While generating a TLB entry, the cputlb.c
code checks whether at least one vCPU has the bit cleared in the
exclusive bitmap; in that case the TLB entry will have the EXCL flag
set, thus forcing the slow-path. In order to ensure that all the
vCPUs will follow the slow-path for that page, we flush the TLB cache
of all the other vCPUs.
The LL will also set the linked address and the size of the access in a
vCPU's private variable. After the corresponding SC, this address will
be set back to a reset value.
- the SC can fail, returning 1, or succeed, returning 0. It must always
come after a LL and must access the same address 'linked' by the
previous LL, otherwise it will fail. If, in the time window delimited
by a legitimate pair of LL/SC operations, another write access hits
the linked address, the SC will fail.
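As a plain-C mental model of these semantics (illustrative only; every
name below is made up for the example, while the real state lives in each
vCPU's CPUState and the real checks in the new softmmu helpers):

/* Illustrative model of the LL/SC semantics described above; not QEMU code. */
#include <stdint.h>
#include <stddef.h>

#define LINK_RESET ((uintptr_t)-1)          /* mirrors EXCLUSIVE_RESET_ADDR */

typedef struct {
    uintptr_t linked_addr;                  /* address set by the last LL */
    size_t linked_size;
} VCPUModel;

static uint32_t model_load_link(VCPUModel *cpu, const uint32_t *addr)
{
    cpu->linked_addr = (uintptr_t)addr;     /* remember the protected range */
    cpu->linked_size = sizeof(*addr);
    return *addr;                           /* otherwise a normal load */
}

/* Every store that reaches a protected page goes through the slow path,
 * which gets a chance to break the other vCPUs' reservations. */
static void model_notify_store(VCPUModel *cpus, int ncpus, const uint32_t *addr)
{
    for (int i = 0; i < ncpus; i++) {
        if (cpus[i].linked_addr == (uintptr_t)addr) {
            cpus[i].linked_addr = LINK_RESET;
        }
    }
}

/* Returns 0 on success and 1 on failure, like the helpers in this series. */
static int model_store_cond(VCPUModel *cpu, uint32_t *addr, uint32_t val)
{
    if (cpu->linked_addr != (uintptr_t)addr) {
        return 1;                           /* no valid link: fail */
    }
    *addr = val;
    cpu->linked_addr = LINK_RESET;          /* reservation is consumed */
    return 0;
}

The patches below map this model onto the software TLB and the new
exclusive bitmap.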
In theory, the provided implementation of TCG LoadLink/StoreConditional
can be used to properly handle atomic instructions on any architecture.
The code has been tested with bare-metal test cases and by booting Linux.
* Performance considerations
The new slow-path adds some overhead to the translation of the ARM
atomic instructions, since their emulation no longer happens purely in
the guest (by means of TCG-generated code) but requires the execution
of two helper functions. Despite this, the additional time required to
boot an ARM Linux kernel on an i7 clocked at 2.5GHz is negligible.
However, on an LL/SC-bound test scenario - like:
https://git.virtualopensystems.com/dev/tcg_baremetal_tests.git - this
solution requires 30% (1 million iterations) and 70% (10 million
iterations) of additional time for the test to complete.
Changes from v7:
- softmmu_template.h refactoring for a more consistent reduction of
code duplication
- Simplified the patches introducing MMIO support for exclusive accesses
- Addressed all comments from Alex
Changes from v6:
- Included aligned variants of the exclusive helpers
- Reverted to single bit per page design in DIRTY_MEMORY_EXCLUSIVE
bitmap. The new way we restore the pages as non-exclusive (PATCH 13)
made the per-VCPU design unnecessary.
- arm32 now uses aligned exclusive accesses
- aarch64 exclusive instructions implemented [PATCH 15-16]
- Addressed comments from Alex
Changes from v5:
- The exclusive memory region is now set through a CPUClass hook,
allowing any architecture to decide the memory area that will be
protected during a LL/SC operation [PATCH 3]
- The runtime helpers dropped any target dependency and are now in a
common file [PATCH 5]
- Improved the way we restore a guest page as non-exclusive [PATCH 9]
- Included MMIO memory as a possible target of LL/SC
instructions. This also required somewhat simplifying the
helper_*_st_name helpers in softmmu_template.h [PATCH 8-14]
Changes from v4:
- Reworked the exclusive bitmap to be of fixed size (8 bits per address)
- The slow-path is now TCG backend independent, no need to touch
tcg/* anymore as suggested by Aurelien Jarno.
Changes from v3:
- based on upstream QEMU
- addressed comments from Alex Bennée
- the slow path can be enabled by the user with:
./configure --enable-tcg-ldst-excl only if the backend supports it
- all the ARM ldex/stex instructions now make use of the slow path
- added aarch64 TCG backend support
- part of the code has been rewritten
Changes from v2:
- the bitmap accessors are now atomic
- a rendezvous between vCPUs and simple callback support before executing
a TB have been added to handle TLB flushes
- softmmu_template and softmmu_llsc_template have been adapted to work
with real multi-threading
Changes from v1:
- The RAM bitmap is no longer reversed: 1 = dirty, 0 = exclusive
- The way the offset used to access the bitmap is calculated has
been improved and fixed
- A page to be set as dirty requires a vCPU to target the protected address
and not just an address in the page
- Addressed comments from Richard Henderson to improve the logic in
softmmu_template.h and to simplify the methods generation through
softmmu_llsc_template.h
- Added initial implementation of qemu_{ldlink,stcond}_i32 for tcg/i386
This work has been sponsored by Huawei Technologies Duesseldorf GmbH.
Alvise Rigo (14):
exec.c: Add new exclusive bitmap to ram_list
softmmu: Simplify helper_*_st_name, wrap unaligned code
softmmu: Simplify helper_*_st_name, wrap MMIO code
softmmu: Simplify helper_*_st_name, wrap RAM code
softmmu: Add new TLB_EXCL flag
qom: cpu: Add CPUClass hooks for exclusive range
softmmu: Add helpers for a new slowpath
softmmu: Add history of excl accesses
softmmu: Honor the new exclusive bitmap
softmmu: Support MMIO exclusive accesses
tcg: Create new runtime helpers for excl accesses
target-arm: translate: Use ld/st excl for atomic insns
target-arm: cpu64: use custom set_excl hook
target-arm: aarch64: Use ls/st exclusive for atomic insns
Makefile.target | 2 +-
cputlb.c | 64 ++++++++++-
exec.c | 21 +++-
include/exec/cpu-all.h | 8 ++
include/exec/helper-gen.h | 3 +
include/exec/helper-proto.h | 1 +
include/exec/helper-tcg.h | 3 +
include/exec/memory.h | 4 +-
include/exec/ram_addr.h | 31 +++++
include/qom/cpu.h | 33 ++++++
qom/cpu.c | 29 +++++
softmmu_llsc_template.h | 136 ++++++++++++++++++++++
softmmu_template.h | 274 ++++++++++++++++++++++++++++++--------------
target-arm/cpu.h | 3 +
target-arm/cpu64.c | 8 ++
target-arm/helper-a64.c | 55 +++++++++
target-arm/helper-a64.h | 2 +
target-arm/helper.h | 2 +
target-arm/machine.c | 7 ++
target-arm/op_helper.c | 14 ++-
target-arm/translate-a64.c | 168 +++++++++++++++------------
target-arm/translate.c | 263 +++++++++++++++++++++++-------------------
tcg-llsc-helper.c | 104 +++++++++++++++++
tcg-llsc-helper.h | 61 ++++++++++
tcg/tcg-llsc-gen-helper.h | 67 +++++++++++
tcg/tcg.h | 31 +++++
vl.c | 3 +
27 files changed, 1110 insertions(+), 287 deletions(-)
create mode 100644 softmmu_llsc_template.h
create mode 100644 tcg-llsc-helper.c
create mode 100644 tcg-llsc-helper.h
create mode 100644 tcg/tcg-llsc-gen-helper.h
--
2.8.0
* [Qemu-devel] [RFC v8 01/14] exec.c: Add new exclusive bitmap to ram_list
From: Alvise Rigo @ 2016-04-19 13:39 UTC
To: qemu-devel, mttcg
Cc: jani.kokkonen, claudio.fontana, tech, alex.bennee, pbonzini, rth,
serge.fdrv, Alvise Rigo, Peter Crosthwaite
The purpose of this new bitmap is to flag the memory pages that are in
the middle of LL/SC operations (after a LL, before a SC). For all these
pages, the corresponding TLB entries will be generated in such a way to
force the slow-path for all the VCPUs (see the following patches).
When the system starts, the whole memory is set to dirty.
Suggested-by: Jani Kokkonen <jani.kokkonen@huawei.com>
Suggested-by: Claudio Fontana <claudio.fontana@huawei.com>
Signed-off-by: Alvise Rigo <a.rigo@virtualopensystems.com>
---
exec.c | 2 +-
include/exec/memory.h | 3 ++-
include/exec/ram_addr.h | 31 +++++++++++++++++++++++++++++++
3 files changed, 34 insertions(+), 2 deletions(-)
diff --git a/exec.c b/exec.c
index 7115403..cefee1b 100644
--- a/exec.c
+++ b/exec.c
@@ -1579,7 +1579,7 @@ static ram_addr_t ram_block_add(RAMBlock *new_block, Error **errp)
ram_list.dirty_memory[i] =
bitmap_zero_extend(ram_list.dirty_memory[i],
old_ram_size, new_ram_size);
- }
+ }
}
cpu_physical_memory_set_dirty_range(new_block->offset,
new_block->used_length,
diff --git a/include/exec/memory.h b/include/exec/memory.h
index c92734a..71e0480 100644
--- a/include/exec/memory.h
+++ b/include/exec/memory.h
@@ -19,7 +19,8 @@
#define DIRTY_MEMORY_VGA 0
#define DIRTY_MEMORY_CODE 1
#define DIRTY_MEMORY_MIGRATION 2
-#define DIRTY_MEMORY_NUM 3 /* num of dirty bits */
+#define DIRTY_MEMORY_EXCLUSIVE 3
+#define DIRTY_MEMORY_NUM 4 /* num of dirty bits */
#include <stdint.h>
#include <stdbool.h>
diff --git a/include/exec/ram_addr.h b/include/exec/ram_addr.h
index ef1489d..19789fc 100644
--- a/include/exec/ram_addr.h
+++ b/include/exec/ram_addr.h
@@ -21,6 +21,7 @@
#ifndef CONFIG_USER_ONLY
#include "hw/xen/xen.h"
+#include "sysemu/sysemu.h"
struct RAMBlock {
struct rcu_head rcu;
@@ -172,6 +173,9 @@ static inline void cpu_physical_memory_set_dirty_range(ram_addr_t start,
if (unlikely(mask & (1 << DIRTY_MEMORY_CODE))) {
bitmap_set_atomic(d[DIRTY_MEMORY_CODE], page, end - page);
}
+ if (unlikely(mask & (1 << DIRTY_MEMORY_EXCLUSIVE))) {
+ bitmap_set_atomic(d[DIRTY_MEMORY_EXCLUSIVE], page, end - page);
+ }
xen_modified_memory(start, length);
}
@@ -287,5 +291,32 @@ uint64_t cpu_physical_memory_sync_dirty_bitmap(unsigned long *dest,
}
void migration_bitmap_extend(ram_addr_t old, ram_addr_t new);
+
+/* Exclusive bitmap support. */
+#define EXCL_BITMAP_GET_OFFSET(addr) (addr >> TARGET_PAGE_BITS)
+
+/* Make the page of @addr not exclusive. */
+static inline void cpu_physical_memory_unset_excl(ram_addr_t addr)
+{
+ set_bit_atomic(EXCL_BITMAP_GET_OFFSET(addr),
+ ram_list.dirty_memory[DIRTY_MEMORY_EXCLUSIVE]);
+}
+
+/* Return true if the page of @addr is exclusive, i.e. the EXCL bit is set. */
+static inline int cpu_physical_memory_is_excl(ram_addr_t addr)
+{
+ return !test_bit(EXCL_BITMAP_GET_OFFSET(addr),
+ ram_list.dirty_memory[DIRTY_MEMORY_EXCLUSIVE]);
+}
+
+/* Set the page of @addr as exclusive clearing its EXCL bit and return the
+ * previous bit's state. */
+static inline int cpu_physical_memory_set_excl(ram_addr_t addr)
+{
+ return bitmap_test_and_clear_atomic(
+ ram_list.dirty_memory[DIRTY_MEMORY_EXCLUSIVE],
+ EXCL_BITMAP_GET_OFFSET(addr), 1);
+}
+
#endif
#endif
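As a rough usage sketch (not part of the patch; the real call sites are
added by the later cputlb/softmmu patches), the inverted polarity of the
bitmap means the accessors are meant to be combined along these lines:

/* Sketch only: bit set = dirty/non-exclusive, bit clear = exclusive. */
static void example_start_ll(ram_addr_t ram_addr)
{
    if (!cpu_physical_memory_is_excl(ram_addr)) {
        /* First LL on this page: clear its bit, so that TLB refills on the
         * other vCPUs will pick up the exclusive state (their TLBs can then
         * be flushed to make the new state visible). */
        cpu_physical_memory_set_excl(ram_addr);
    }
}

static void example_stop_tracking(ram_addr_t ram_addr)
{
    /* Mark the page dirty again once it no longer needs protection. */
    cpu_physical_memory_unset_excl(ram_addr);
}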
--
2.8.0
* [Qemu-devel] [RFC v8 02/14] softmmu: Simplify helper_*_st_name, wrap unaligned code
From: Alvise Rigo @ 2016-04-19 13:39 UTC
To: qemu-devel, mttcg
Cc: jani.kokkonen, claudio.fontana, tech, alex.bennee, pbonzini, rth,
serge.fdrv, Alvise Rigo, Peter Crosthwaite
Attempting to simplify the helper_*_st_name, wrap the
do_unaligned_access code into a shared inline function. As this also
removes the goto statement, the inline code is expanded twice in each
helper.
From Message-id 1452268394-31252-2-git-send-email-alex.bennee@linaro.org:
There is a minor wrinkle that we need to use a unique name for each
inline fragment as the template is included multiple times. For this the
smmu_helper macro does the appropriate glue magic.
I've tested the result with no change to functionality. Comparing the
objdump of cputlb.o shows minimal changes in probe_write and
everything else is identical.
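For reference, this is what the glue magic produces for one
instantiation. The snippet below is a standalone illustration (not part
of the patch) that assumes the 32-bit case, i.e. SUFFIX = l and
MMUSUFFIX = _mmu as set in cputlb.c:

/* Standalone demo of the name generated for the 32-bit instantiation. */
#include <stdio.h>

#define glue_(a, b) a##b
#define glue(a, b) glue_(a, b)
#define SUFFIX l
#define MMUSUFFIX _mmu
#define smmu_helper(name) glue(glue(glue(smmu_helper_, SUFFIX), \
                                    MMUSUFFIX), _##name)
#define str_(x) #x
#define str(x) str_(x)

int main(void)
{
    /* Prints: smmu_helper_l_mmu_do_unl_store */
    printf("%s\n", str(smmu_helper(do_unl_store)));
    return 0;
}

Each of the four DATA_SIZE instantiations therefore gets its own uniquely
named static inline helper, so they can all coexist in cputlb.o.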
Suggested-by: Jani Kokkonen <jani.kokkonen@huawei.com>
Suggested-by: Claudio Fontana <claudio.fontana@huawei.com>
CC: Alvise Rigo <a.rigo@virtualopensystems.com>
Signed-off-by: Alex Bennée <alex.bennee@linaro.org>
[Alex Bennée: define smmu_helper and unified logic between be/le]
Signed-off-by: Alvise Rigo <a.rigo@virtualopensystems.com>
---
softmmu_template.h | 82 ++++++++++++++++++++++++++++++------------------------
1 file changed, 46 insertions(+), 36 deletions(-)
diff --git a/softmmu_template.h b/softmmu_template.h
index 208f808..3eb54f8 100644
--- a/softmmu_template.h
+++ b/softmmu_template.h
@@ -370,6 +370,46 @@ static inline void glue(io_write, SUFFIX)(CPUArchState *env,
iotlbentry->attrs);
}
+/* Inline helper functions for SoftMMU
+ *
+ * These functions help reduce code duplication in the various main
+ * helper functions. Constant arguments (like endian state) will allow
+ * the compiler to skip code which is never called in a given inline.
+ */
+#define smmu_helper(name) glue(glue(glue(smmu_helper_, SUFFIX), \
+ MMUSUFFIX), _##name)
+static inline void smmu_helper(do_unl_store)(CPUArchState *env,
+ bool little_endian,
+ DATA_TYPE val,
+ target_ulong addr,
+ TCGMemOpIdx oi,
+ unsigned mmu_idx,
+ uintptr_t retaddr)
+{
+ int i;
+
+ if ((get_memop(oi) & MO_AMASK) == MO_ALIGN) {
+ cpu_unaligned_access(ENV_GET_CPU(env), addr, MMU_DATA_STORE,
+ mmu_idx, retaddr);
+ }
+ /* Note: relies on the fact that tlb_fill() does not remove the
+ * previous page from the TLB cache. */
+ for (i = DATA_SIZE - 1; i >= 0; i--) {
+ uint8_t val8;
+ if (little_endian) {
+ /* Little-endian extract. */
+ val8 = val >> (i * 8);
+ } else {
+ /* Big-endian extract. */
+ val8 = val >> (((DATA_SIZE - 1) * 8) - (i * 8));
+ }
+ /* Note the adjustment at the beginning of the function.
+ Undo that for the recursion. */
+ glue(helper_ret_stb, MMUSUFFIX)(env, addr + i, val8,
+ oi, retaddr + GETPC_ADJ);
+ }
+}
+
void helper_le_st_name(CPUArchState *env, target_ulong addr, DATA_TYPE val,
TCGMemOpIdx oi, uintptr_t retaddr)
{
@@ -399,7 +439,8 @@ void helper_le_st_name(CPUArchState *env, target_ulong addr, DATA_TYPE val,
if (unlikely(tlb_addr & ~TARGET_PAGE_MASK)) {
CPUIOTLBEntry *iotlbentry;
if ((addr & (DATA_SIZE - 1)) != 0) {
- goto do_unaligned_access;
+ smmu_helper(do_unl_store)(env, false, val, addr, oi, mmu_idx, retaddr);
+ return;
}
iotlbentry = &env->iotlb[mmu_idx][index];
@@ -414,23 +455,7 @@ void helper_le_st_name(CPUArchState *env, target_ulong addr, DATA_TYPE val,
if (DATA_SIZE > 1
&& unlikely((addr & ~TARGET_PAGE_MASK) + DATA_SIZE - 1
>= TARGET_PAGE_SIZE)) {
- int i;
- do_unaligned_access:
- if ((get_memop(oi) & MO_AMASK) == MO_ALIGN) {
- cpu_unaligned_access(ENV_GET_CPU(env), addr, MMU_DATA_STORE,
- mmu_idx, retaddr);
- }
- /* XXX: not efficient, but simple */
- /* Note: relies on the fact that tlb_fill() does not remove the
- * previous page from the TLB cache. */
- for (i = DATA_SIZE - 1; i >= 0; i--) {
- /* Little-endian extract. */
- uint8_t val8 = val >> (i * 8);
- /* Note the adjustment at the beginning of the function.
- Undo that for the recursion. */
- glue(helper_ret_stb, MMUSUFFIX)(env, addr + i, val8,
- oi, retaddr + GETPC_ADJ);
- }
+ smmu_helper(do_unl_store)(env, true, val, addr, oi, mmu_idx, retaddr);
return;
}
@@ -479,7 +504,8 @@ void helper_be_st_name(CPUArchState *env, target_ulong addr, DATA_TYPE val,
if (unlikely(tlb_addr & ~TARGET_PAGE_MASK)) {
CPUIOTLBEntry *iotlbentry;
if ((addr & (DATA_SIZE - 1)) != 0) {
- goto do_unaligned_access;
+ smmu_helper(do_unl_store)(env, true, val, addr, oi, mmu_idx, retaddr);
+ return;
}
iotlbentry = &env->iotlb[mmu_idx][index];
@@ -494,23 +520,7 @@ void helper_be_st_name(CPUArchState *env, target_ulong addr, DATA_TYPE val,
if (DATA_SIZE > 1
&& unlikely((addr & ~TARGET_PAGE_MASK) + DATA_SIZE - 1
>= TARGET_PAGE_SIZE)) {
- int i;
- do_unaligned_access:
- if ((get_memop(oi) & MO_AMASK) == MO_ALIGN) {
- cpu_unaligned_access(ENV_GET_CPU(env), addr, MMU_DATA_STORE,
- mmu_idx, retaddr);
- }
- /* XXX: not efficient, but simple */
- /* Note: relies on the fact that tlb_fill() does not remove the
- * previous page from the TLB cache. */
- for (i = DATA_SIZE - 1; i >= 0; i--) {
- /* Big-endian extract. */
- uint8_t val8 = val >> (((DATA_SIZE - 1) * 8) - (i * 8));
- /* Note the adjustment at the beginning of the function.
- Undo that for the recursion. */
- glue(helper_ret_stb, MMUSUFFIX)(env, addr + i, val8,
- oi, retaddr + GETPC_ADJ);
- }
+ smmu_helper(do_unl_store)(env, false, val, addr, oi, mmu_idx, retaddr);
return;
}
--
2.8.0
* [Qemu-devel] [RFC v8 03/14] softmmu: Simplify helper_*_st_name, wrap MMIO code
From: Alvise Rigo @ 2016-04-19 13:39 UTC
To: qemu-devel, mttcg
Cc: jani.kokkonen, claudio.fontana, tech, alex.bennee, pbonzini, rth,
serge.fdrv, Alvise Rigo, Peter Crosthwaite
Attempting to simplify the helper_*_st_name, wrap the MMIO code into an
inline function. The function covers both the BE and LE cases and is
expanded twice in each helper (TODO: check this last statement).
Suggested-by: Jani Kokkonen <jani.kokkonen@huawei.com>
Suggested-by: Claudio Fontana <claudio.fontana@huawei.com>
CC: Alex Bennée <alex.bennee@linaro.org>
Signed-off-by: Alvise Rigo <a.rigo@virtualopensystems.com>
---
softmmu_template.h | 49 +++++++++++++++++++++++++++----------------------
1 file changed, 27 insertions(+), 22 deletions(-)
diff --git a/softmmu_template.h b/softmmu_template.h
index 3eb54f8..9185486 100644
--- a/softmmu_template.h
+++ b/softmmu_template.h
@@ -410,6 +410,29 @@ static inline void smmu_helper(do_unl_store)(CPUArchState *env,
}
}
+static inline void smmu_helper(do_mmio_store)(CPUArchState *env,
+ bool little_endian,
+ DATA_TYPE val,
+ target_ulong addr,
+ TCGMemOpIdx oi, unsigned mmu_idx,
+ int index, uintptr_t retaddr)
+{
+ CPUIOTLBEntry *iotlbentry = &env->iotlb[mmu_idx][index];
+
+ if ((addr & (DATA_SIZE - 1)) != 0) {
+ smmu_helper(do_unl_store)(env, little_endian, val, addr, mmu_idx, oi,
+ retaddr);
+ }
+ /* ??? Note that the io helpers always read data in the target
+ byte ordering. We should push the LE/BE request down into io. */
+ if (little_endian) {
+ val = TGT_LE(val);
+ } else {
+ val = TGT_BE(val);
+ }
+ glue(io_write, SUFFIX)(env, iotlbentry, val, addr, retaddr);
+}
+
void helper_le_st_name(CPUArchState *env, target_ulong addr, DATA_TYPE val,
TCGMemOpIdx oi, uintptr_t retaddr)
{
@@ -437,17 +460,8 @@ void helper_le_st_name(CPUArchState *env, target_ulong addr, DATA_TYPE val,
/* Handle an IO access. */
if (unlikely(tlb_addr & ~TARGET_PAGE_MASK)) {
- CPUIOTLBEntry *iotlbentry;
- if ((addr & (DATA_SIZE - 1)) != 0) {
- smmu_helper(do_unl_store)(env, false, val, addr, oi, mmu_idx, retaddr);
- return;
- }
- iotlbentry = &env->iotlb[mmu_idx][index];
-
- /* ??? Note that the io helpers always read data in the target
- byte ordering. We should push the LE/BE request down into io. */
- val = TGT_LE(val);
- glue(io_write, SUFFIX)(env, iotlbentry, val, addr, retaddr);
+ smmu_helper(do_mmio_store)(env, true, val, addr, oi, mmu_idx, index,
+ retaddr);
return;
}
@@ -502,17 +516,8 @@ void helper_be_st_name(CPUArchState *env, target_ulong addr, DATA_TYPE val,
/* Handle an IO access. */
if (unlikely(tlb_addr & ~TARGET_PAGE_MASK)) {
- CPUIOTLBEntry *iotlbentry;
- if ((addr & (DATA_SIZE - 1)) != 0) {
- smmu_helper(do_unl_store)(env, true, val, addr, oi, mmu_idx, retaddr);
- return;
- }
- iotlbentry = &env->iotlb[mmu_idx][index];
-
- /* ??? Note that the io helpers always read data in the target
- byte ordering. We should push the LE/BE request down into io. */
- val = TGT_BE(val);
- glue(io_write, SUFFIX)(env, iotlbentry, val, addr, retaddr);
+ smmu_helper(do_mmio_store)(env, false, val, addr, oi, mmu_idx, index,
+ retaddr);
return;
}
--
2.8.0
* [Qemu-devel] [RFC v8 04/14] softmmu: Simplify helper_*_st_name, wrap RAM code
From: Alvise Rigo @ 2016-04-19 13:39 UTC
To: qemu-devel, mttcg
Cc: jani.kokkonen, claudio.fontana, tech, alex.bennee, pbonzini, rth,
serge.fdrv, Alvise Rigo, Peter Crosthwaite
Attempting to simplify the helper_*_st_name, wrap the code relative to a
RAM access into an inline function. The function covers both the BE and LE
cases and is expanded twice in each helper (TODO: check this last statement).
Suggested-by: Jani Kokkonen <jani.kokkonen@huawei.com>
Suggested-by: Claudio Fontana <claudio.fontana@huawei.com>
CC: Alex Bennée <alex.bennee@linaro.org>
Signed-off-by: Alvise Rigo <a.rigo@virtualopensystems.com>
---
softmmu_template.h | 80 +++++++++++++++++++++++++++---------------------------
1 file changed, 40 insertions(+), 40 deletions(-)
diff --git a/softmmu_template.h b/softmmu_template.h
index 9185486..ea6a0fb 100644
--- a/softmmu_template.h
+++ b/softmmu_template.h
@@ -433,13 +433,48 @@ static inline void smmu_helper(do_mmio_store)(CPUArchState *env,
glue(io_write, SUFFIX)(env, iotlbentry, val, addr, retaddr);
}
+static inline void smmu_helper(do_ram_store)(CPUArchState *env,
+ bool little_endian, DATA_TYPE val,
+ target_ulong addr, TCGMemOpIdx oi,
+ unsigned mmu_idx, int index,
+ uintptr_t retaddr)
+{
+ uintptr_t haddr;
+
+ /* Handle slow unaligned access (it spans two pages or IO). */
+ if (DATA_SIZE > 1
+ && unlikely((addr & ~TARGET_PAGE_MASK) + DATA_SIZE - 1
+ >= TARGET_PAGE_SIZE)) {
+ smmu_helper(do_unl_store)(env, little_endian, val, addr, oi, mmu_idx,
+ retaddr);
+ return;
+ }
+
+ /* Handle aligned access or unaligned access in the same page. */
+ if ((addr & (DATA_SIZE - 1)) != 0
+ && (get_memop(oi) & MO_AMASK) == MO_ALIGN) {
+ cpu_unaligned_access(ENV_GET_CPU(env), addr, MMU_DATA_STORE,
+ mmu_idx, retaddr);
+ }
+
+ haddr = addr + env->tlb_table[mmu_idx][index].addend;
+#if DATA_SIZE == 1
+ glue(glue(st, SUFFIX), _p)((uint8_t *)haddr, val);
+#else
+ if (little_endian) {
+ glue(glue(st, SUFFIX), _le_p)((uint8_t *)haddr, val);
+ } else {
+ glue(glue(st, SUFFIX), _be_p)((uint8_t *)haddr, val);
+ }
+#endif
+}
+
void helper_le_st_name(CPUArchState *env, target_ulong addr, DATA_TYPE val,
TCGMemOpIdx oi, uintptr_t retaddr)
{
unsigned mmu_idx = get_mmuidx(oi);
int index = (addr >> TARGET_PAGE_BITS) & (CPU_TLB_SIZE - 1);
target_ulong tlb_addr = env->tlb_table[mmu_idx][index].addr_write;
- uintptr_t haddr;
/* Adjust the given return address. */
retaddr -= GETPC_ADJ;
@@ -465,27 +500,8 @@ void helper_le_st_name(CPUArchState *env, target_ulong addr, DATA_TYPE val,
return;
}
- /* Handle slow unaligned access (it spans two pages or IO). */
- if (DATA_SIZE > 1
- && unlikely((addr & ~TARGET_PAGE_MASK) + DATA_SIZE - 1
- >= TARGET_PAGE_SIZE)) {
- smmu_helper(do_unl_store)(env, true, val, addr, oi, mmu_idx, retaddr);
- return;
- }
-
- /* Handle aligned access or unaligned access in the same page. */
- if ((addr & (DATA_SIZE - 1)) != 0
- && (get_memop(oi) & MO_AMASK) == MO_ALIGN) {
- cpu_unaligned_access(ENV_GET_CPU(env), addr, MMU_DATA_STORE,
- mmu_idx, retaddr);
- }
-
- haddr = addr + env->tlb_table[mmu_idx][index].addend;
-#if DATA_SIZE == 1
- glue(glue(st, SUFFIX), _p)((uint8_t *)haddr, val);
-#else
- glue(glue(st, SUFFIX), _le_p)((uint8_t *)haddr, val);
-#endif
+ smmu_helper(do_ram_store)(env, true, val, addr, oi, mmu_idx, index,
+ retaddr);
}
#if DATA_SIZE > 1
@@ -495,7 +511,6 @@ void helper_be_st_name(CPUArchState *env, target_ulong addr, DATA_TYPE val,
unsigned mmu_idx = get_mmuidx(oi);
int index = (addr >> TARGET_PAGE_BITS) & (CPU_TLB_SIZE - 1);
target_ulong tlb_addr = env->tlb_table[mmu_idx][index].addr_write;
- uintptr_t haddr;
/* Adjust the given return address. */
retaddr -= GETPC_ADJ;
@@ -521,23 +536,8 @@ void helper_be_st_name(CPUArchState *env, target_ulong addr, DATA_TYPE val,
return;
}
- /* Handle slow unaligned access (it spans two pages or IO). */
- if (DATA_SIZE > 1
- && unlikely((addr & ~TARGET_PAGE_MASK) + DATA_SIZE - 1
- >= TARGET_PAGE_SIZE)) {
- smmu_helper(do_unl_store)(env, false, val, addr, oi, mmu_idx, retaddr);
- return;
- }
-
- /* Handle aligned access or unaligned access in the same page. */
- if ((addr & (DATA_SIZE - 1)) != 0
- && (get_memop(oi) & MO_AMASK) == MO_ALIGN) {
- cpu_unaligned_access(ENV_GET_CPU(env), addr, MMU_DATA_STORE,
- mmu_idx, retaddr);
- }
-
- haddr = addr + env->tlb_table[mmu_idx][index].addend;
- glue(glue(st, SUFFIX), _be_p)((uint8_t *)haddr, val);
+ smmu_helper(do_ram_store)(env, false, val, addr, oi, mmu_idx, index,
+ retaddr);
}
#endif /* DATA_SIZE > 1 */
--
2.8.0
* [Qemu-devel] [RFC v8 05/14] softmmu: Add new TLB_EXCL flag
From: Alvise Rigo @ 2016-04-19 13:39 UTC
To: qemu-devel, mttcg
Cc: jani.kokkonen, claudio.fontana, tech, alex.bennee, pbonzini, rth,
serge.fdrv, Alvise Rigo, Peter Crosthwaite
Add a new TLB flag to force all the accesses made to a page to follow
the slow-path.
The TLB entries referring to guest pages whose DIRTY_MEMORY_EXCLUSIVE
bit is clear will have this flag set.
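A standalone illustration of the constraint enforced by the #error guard
below; TARGET_PAGE_BITS = 12 and the example virtual address are
assumptions for the demo, while TLB_EXCL is the value added by this patch:

/* New TLB flags must fit in the sub-page bits of addr_write, next to
 * TLB_NOTDIRTY/TLB_MMIO, so they never overlap the page-aligned address. */
#include <assert.h>
#include <stdint.h>

#define TARGET_PAGE_BITS 12
#define TARGET_PAGE_SIZE (1 << TARGET_PAGE_BITS)
#define TARGET_PAGE_MASK (~(TARGET_PAGE_SIZE - 1))
#define TLB_EXCL         (1 << 6)     /* as added by this patch */

int main(void)
{
    uint64_t vaddr = 0x40001000;              /* page-aligned address */
    uint64_t tlb_addr = vaddr | TLB_EXCL;     /* TLB entry with the flag */

    assert(TLB_EXCL < TARGET_PAGE_SIZE);      /* mirrors the #error guard */
    assert((tlb_addr & TARGET_PAGE_MASK) == vaddr);  /* address intact */
    assert(tlb_addr & ~TARGET_PAGE_MASK);     /* slow path is triggered */
    return 0;
}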
Suggested-by: Jani Kokkonen <jani.kokkonen@huawei.com>
Suggested-by: Claudio Fontana <claudio.fontana@huawei.com>
Signed-off-by: Alvise Rigo <a.rigo@virtualopensystems.com>
---
include/exec/cpu-all.h | 8 ++++++++
1 file changed, 8 insertions(+)
diff --git a/include/exec/cpu-all.h b/include/exec/cpu-all.h
index 83b1781..f8d8feb 100644
--- a/include/exec/cpu-all.h
+++ b/include/exec/cpu-all.h
@@ -277,6 +277,14 @@ CPUArchState *cpu_copy(CPUArchState *env);
#define TLB_NOTDIRTY (1 << 4)
/* Set if TLB entry is an IO callback. */
#define TLB_MMIO (1 << 5)
+/* Set if TLB entry references a page that requires exclusive access. */
+#define TLB_EXCL (1 << 6)
+
+/* Do not allow a TARGET_PAGE_MASK which covers one or more bits defined
+ * above. */
+#if TLB_EXCL >= TARGET_PAGE_SIZE
+#error TARGET_PAGE_MASK covering the low bits of the TLB virtual address
+#endif
void dump_exec_info(FILE *f, fprintf_function cpu_fprintf);
void dump_opcount_info(FILE *f, fprintf_function cpu_fprintf);
--
2.8.0
* [Qemu-devel] [RFC v8 06/14] qom: cpu: Add CPUClass hooks for exclusive range
From: Alvise Rigo @ 2016-04-19 13:39 UTC
To: qemu-devel, mttcg
Cc: jani.kokkonen, claudio.fontana, tech, alex.bennee, pbonzini, rth,
serge.fdrv, Alvise Rigo, Andreas Färber
The excl_protected_range is a hwaddr range set by the vCPU when executing
a LoadLink instruction. If a normal access writes to this range, the
corresponding StoreCond will fail.
Each architecture can set the exclusive range when issuing the LoadLink
operation through a CPUClass hook. This comes in handy to emulate, for
instance, the exclusive monitor implemented in some ARM architectures
(more precisely, the Exclusive Reservation Granule).
In addition, add another CPUClass hook that is called to decide whether a
StoreCond has to fail or not.
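As an illustration of how a target might use the new hook (a sketch only,
not the code from the later target-arm patches; the foo_* names and the
64-byte granule are invented for the example):

/* Hypothetical target-side override: widen the protected range to a
 * power-of-two reservation granule, as an ARM ERG would. Assumes the
 * access size does not exceed the granule. */
static void foo_cpu_set_excl_range(CPUState *cpu, hwaddr addr, hwaddr size)
{
    const hwaddr granule = 64;                    /* assumed ERG */

    cpu->excl_protected_range.begin = addr & ~(granule - 1);
    cpu->excl_protected_range.end = (addr & ~(granule - 1)) + granule;
}

/* Installed from the target's class_init, alongside the other hooks. */
static void foo_cpu_class_init(ObjectClass *klass, void *data)
{
    CPUClass *cc = CPU_CLASS(klass);

    cc->cpu_set_excl_protected_range = foo_cpu_set_excl_range;
}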
Suggested-by: Jani Kokkonen <jani.kokkonen@huawei.com>
Suggested-by: Claudio Fontana <claudio.fontana@huawei.com>
Signed-off-by: Alvise Rigo <a.rigo@virtualopensystems.com>
---
include/qom/cpu.h | 20 ++++++++++++++++++++
qom/cpu.c | 27 +++++++++++++++++++++++++++
2 files changed, 47 insertions(+)
diff --git a/include/qom/cpu.h b/include/qom/cpu.h
index 2e5229d..21f10eb 100644
--- a/include/qom/cpu.h
+++ b/include/qom/cpu.h
@@ -29,6 +29,7 @@
#include "qemu/queue.h"
#include "qemu/thread.h"
#include "qemu/typedefs.h"
+#include "qemu/range.h"
typedef int (*WriteCoreDumpFunction)(const void *buf, size_t size,
void *opaque);
@@ -123,6 +124,10 @@ struct TranslationBlock;
* @cpu_exec_enter: Callback for cpu_exec preparation.
* @cpu_exec_exit: Callback for cpu_exec cleanup.
* @cpu_exec_interrupt: Callback for processing interrupts in cpu_exec.
+ * @cpu_set_excl_protected_range: Callback used by LL operation for setting the
+ * exclusive range.
+ * @cpu_valid_excl_access: Callback for checking the validity of a SC operation.
+ * @cpu_reset_excl_context: Callback for resetting the exclusive context.
* @disas_set_info: Setup architecture specific components of disassembly info
*
* Represents a CPU family or model.
@@ -183,6 +188,13 @@ typedef struct CPUClass {
void (*cpu_exec_exit)(CPUState *cpu);
bool (*cpu_exec_interrupt)(CPUState *cpu, int interrupt_request);
+ /* Atomic instruction handling */
+ void (*cpu_set_excl_protected_range)(CPUState *cpu, hwaddr addr,
+ hwaddr size);
+ bool (*cpu_valid_excl_access)(CPUState *cpu, hwaddr addr,
+ hwaddr size);
+ void (*cpu_reset_excl_context)(CPUState *cpu);
+
void (*disas_set_info)(CPUState *cpu, disassemble_info *info);
} CPUClass;
@@ -219,6 +231,9 @@ struct kvm_run;
#define TB_JMP_CACHE_BITS 12
#define TB_JMP_CACHE_SIZE (1 << TB_JMP_CACHE_BITS)
+/* Atomic insn translation TLB support. */
+#define EXCLUSIVE_RESET_ADDR ULLONG_MAX
+
/**
* CPUState:
* @cpu_index: CPU index (informative).
@@ -341,6 +356,11 @@ struct CPUState {
*/
bool throttle_thread_scheduled;
+ /* vCPU's exclusive addresses range.
+ * The address is set to EXCLUSIVE_RESET_ADDR if the vCPU is not
+ * in the middle of a LL/SC. */
+ struct Range excl_protected_range;
+
/* Note that this is accessed at the start of every TB via a negative
offset from AREG0. Leave this field at the end so as to make the
(absolute value) offset as small as possible. This reduces code
diff --git a/qom/cpu.c b/qom/cpu.c
index 8f537a4..309d487 100644
--- a/qom/cpu.c
+++ b/qom/cpu.c
@@ -203,6 +203,29 @@ static bool cpu_common_exec_interrupt(CPUState *cpu, int int_req)
return false;
}
+static void cpu_common_set_excl_range(CPUState *cpu, hwaddr addr, hwaddr size)
+{
+ cpu->excl_protected_range.begin = addr;
+ cpu->excl_protected_range.end = addr + size;
+}
+
+static bool cpu_common_valid_excl_access(CPUState *cpu, hwaddr addr, hwaddr size)
+{
+ /* Check if the excl range completely covers the access */
+ if (cpu->excl_protected_range.begin <= addr &&
+ cpu->excl_protected_range.end >= addr + size) {
+
+ return true;
+ }
+
+ return false;
+}
+
+static void cpu_common_reset_excl_context(CPUState *cpu)
+{
+ cpu->excl_protected_range.begin = EXCLUSIVE_RESET_ADDR;
+}
+
void cpu_dump_state(CPUState *cpu, FILE *f, fprintf_function cpu_fprintf,
int flags)
{
@@ -252,6 +275,7 @@ static void cpu_common_reset(CPUState *cpu)
cpu->can_do_io = 1;
cpu->exception_index = -1;
cpu->crash_occurred = false;
+ cpu_common_reset_excl_context(cpu);
memset(cpu->tb_jmp_cache, 0, TB_JMP_CACHE_SIZE * sizeof(void *));
}
@@ -355,6 +379,9 @@ static void cpu_class_init(ObjectClass *klass, void *data)
k->cpu_exec_enter = cpu_common_noop;
k->cpu_exec_exit = cpu_common_noop;
k->cpu_exec_interrupt = cpu_common_exec_interrupt;
+ k->cpu_set_excl_protected_range = cpu_common_set_excl_range;
+ k->cpu_valid_excl_access = cpu_common_valid_excl_access;
+ k->cpu_reset_excl_context = cpu_common_reset_excl_context;
dc->realize = cpu_common_realizefn;
/*
* Reason: CPUs still need special care by board code: wiring up
--
2.8.0
* [Qemu-devel] [RFC v8 07/14] softmmu: Add helpers for a new slowpath
From: Alvise Rigo @ 2016-04-19 13:39 UTC
To: qemu-devel, mttcg
Cc: jani.kokkonen, claudio.fontana, tech, alex.bennee, pbonzini, rth,
serge.fdrv, Alvise Rigo, Peter Crosthwaite, Andreas Färber
The new helpers rely on the legacy ones to perform the actual read/write.
The LoadLink helper (helper_ldlink_name) prepares the way for the
following StoreCond operation. It sets the linked address and the size
of the access. The LoadLink helper also updates, for all vCPUs, the TLB
entries of the page involved in the LL/SC by forcing a TLB flush, so that
the following accesses made by all the vCPUs will follow the slow path.
The StoreConditional helper (helper_stcond_name) returns 1 if the
store has to fail due to a concurrent access to the same page by
another vCPU. A 'concurrent access' can be a store made by *any* vCPU
(although some implementations allow stores made by the CPU that issued
the LoadLink).
For the time being we do not support exclusive accesses to MMIO memory.
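To make the contract concrete, here is a sketch (not part of the patch) of
how an atomic 32-bit add could be built on top of two of the generated
helpers; real front ends reach them through TCG-emitted calls (patch 11)
rather than direct C calls:

/* Sketch only: retry the LL/SC pair until the store-conditional succeeds. */
static uint32_t atomic_add_sketch(CPUArchState *env, target_ulong addr,
                                  uint32_t inc, TCGMemOpIdx oi,
                                  uintptr_t retaddr)
{
    uint32_t old;

    do {
        /* LL: load the value and start protecting the range of 'addr'. */
        old = helper_le_ldlinkul_mmu(env, addr, oi, retaddr);
        /* SC: returns 0 on success, 1 if the reservation was lost. */
    } while (helper_le_stcondl_mmu(env, addr, old + inc, oi, retaddr));

    return old;
}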
Suggested-by: Jani Kokkonen <jani.kokkonen@huawei.com>
Suggested-by: Claudio Fontana <claudio.fontana@huawei.com>
Signed-off-by: Alvise Rigo <a.rigo@virtualopensystems.com>
---
cputlb.c | 4 ++
include/qom/cpu.h | 5 ++
qom/cpu.c | 2 +
softmmu_llsc_template.h | 132 ++++++++++++++++++++++++++++++++++++++++++++++++
softmmu_template.h | 12 +++++
tcg/tcg.h | 31 ++++++++++++
6 files changed, 186 insertions(+)
create mode 100644 softmmu_llsc_template.h
diff --git a/cputlb.c b/cputlb.c
index f6fb161..58d6f03 100644
--- a/cputlb.c
+++ b/cputlb.c
@@ -29,6 +29,7 @@
#include "exec/memory-internal.h"
#include "exec/ram_addr.h"
#include "tcg/tcg.h"
+#include "hw/hw.h"
//#define DEBUG_TLB
//#define DEBUG_TLB_CHECK
@@ -476,6 +477,8 @@ tb_page_addr_t get_page_addr_code(CPUArchState *env1, target_ulong addr)
#define MMUSUFFIX _mmu
+/* Generates LoadLink/StoreConditional helpers in softmmu_template.h */
+#define GEN_EXCLUSIVE_HELPERS
#define SHIFT 0
#include "softmmu_template.h"
@@ -488,6 +491,7 @@ tb_page_addr_t get_page_addr_code(CPUArchState *env1, target_ulong addr)
#define SHIFT 3
#include "softmmu_template.h"
#undef MMUSUFFIX
+#undef GEN_EXCLUSIVE_HELPERS
#define MMUSUFFIX _cmmu
#undef GETPC_ADJ
diff --git a/include/qom/cpu.h b/include/qom/cpu.h
index 21f10eb..014851e 100644
--- a/include/qom/cpu.h
+++ b/include/qom/cpu.h
@@ -356,10 +356,15 @@ struct CPUState {
*/
bool throttle_thread_scheduled;
+ /* Used by the atomic insn translation backend. */
+ bool ll_sc_context;
/* vCPU's exclusive addresses range.
* The address is set to EXCLUSIVE_RESET_ADDR if the vCPU is not
* in the middle of a LL/SC. */
struct Range excl_protected_range;
+ /* Used to carry the SC result but also to flag a normal store access made
+ * by a stcond (see softmmu_template.h). */
+ bool excl_succeeded;
/* Note that this is accessed at the start of every TB via a negative
offset from AREG0. Leave this field at the end so as to make the
diff --git a/qom/cpu.c b/qom/cpu.c
index 309d487..3280735 100644
--- a/qom/cpu.c
+++ b/qom/cpu.c
@@ -224,6 +224,8 @@ static bool cpu_common_valid_excl_access(CPUState *cpu, hwaddr addr, hwaddr size
static void cpu_common_reset_excl_context(CPUState *cpu)
{
cpu->excl_protected_range.begin = EXCLUSIVE_RESET_ADDR;
+ cpu->ll_sc_context = false;
+ cpu->excl_succeeded = false;
}
void cpu_dump_state(CPUState *cpu, FILE *f, fprintf_function cpu_fprintf,
diff --git a/softmmu_llsc_template.h b/softmmu_llsc_template.h
new file mode 100644
index 0000000..ca2ac95
--- /dev/null
+++ b/softmmu_llsc_template.h
@@ -0,0 +1,132 @@
+/*
+ * Software MMU support (exclusive load/store operations)
+ *
+ * Generate helpers used by TCG for qemu_ldlink/stcond ops.
+ *
+ * Included from softmmu_template.h only.
+ *
+ * Copyright (c) 2015 Virtual Open Systems
+ *
+ * Authors:
+ * Alvise Rigo <a.rigo@virtualopensystems.com>
+ *
+ * This library is free software; you can redistribute it and/or
+ * modify it under the terms of the GNU Lesser General Public
+ * License as published by the Free Software Foundation; either
+ * version 2 of the License, or (at your option) any later version.
+ *
+ * This library is distributed in the hope that it will be useful,
+ * but WITHOUT ANY WARRANTY; without even the implied warranty of
+ * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU
+ * Lesser General Public License for more details.
+ *
+ * You should have received a copy of the GNU Lesser General Public
+ * License along with this library; if not, see <http://www.gnu.org/licenses/>.
+ */
+
+/* This template does not generate together the le and be version, but only one
+ * of the two depending on whether BIGENDIAN_EXCLUSIVE_HELPERS has been set.
+ * The same nomenclature as softmmu_template.h is used for the exclusive
+ * helpers. */
+
+#ifdef BIGENDIAN_EXCLUSIVE_HELPERS
+
+#define helper_ldlink_name glue(glue(helper_be_ldlink, USUFFIX), MMUSUFFIX)
+#define helper_stcond_name glue(glue(helper_be_stcond, SUFFIX), MMUSUFFIX)
+#define helper_ld glue(glue(helper_be_ld, USUFFIX), MMUSUFFIX)
+#define helper_st glue(glue(helper_be_st, SUFFIX), MMUSUFFIX)
+
+#else /* LE helpers + 8bit helpers (generated only once for both LE and BE) */
+
+#if DATA_SIZE > 1
+#define helper_ldlink_name glue(glue(helper_le_ldlink, USUFFIX), MMUSUFFIX)
+#define helper_stcond_name glue(glue(helper_le_stcond, SUFFIX), MMUSUFFIX)
+#define helper_ld glue(glue(helper_le_ld, USUFFIX), MMUSUFFIX)
+#define helper_st glue(glue(helper_le_st, SUFFIX), MMUSUFFIX)
+#else /* DATA_SIZE <= 1 */
+#define helper_ldlink_name glue(glue(helper_ret_ldlink, USUFFIX), MMUSUFFIX)
+#define helper_stcond_name glue(glue(helper_ret_stcond, SUFFIX), MMUSUFFIX)
+#define helper_ld glue(glue(helper_ret_ld, USUFFIX), MMUSUFFIX)
+#define helper_st glue(glue(helper_ret_st, SUFFIX), MMUSUFFIX)
+#endif
+
+#endif
+
+WORD_TYPE helper_ldlink_name(CPUArchState *env, target_ulong addr,
+ TCGMemOpIdx oi, uintptr_t retaddr)
+{
+ WORD_TYPE ret;
+ int index;
+ CPUState *this_cpu = ENV_GET_CPU(env);
+ CPUClass *cc = CPU_GET_CLASS(this_cpu);
+ hwaddr hw_addr;
+ unsigned mmu_idx = get_mmuidx(oi);
+
+ /* Use the proper load helper from cpu_ldst.h */
+ ret = helper_ld(env, addr, oi, retaddr);
+
+ index = (addr >> TARGET_PAGE_BITS) & (CPU_TLB_SIZE - 1);
+
+ /* hw_addr = hwaddr of the page (i.e. section->mr->ram_addr + xlat)
+ * plus the offset (i.e. addr & ~TARGET_PAGE_MASK) */
+ hw_addr = (env->iotlb[mmu_idx][index].addr & TARGET_PAGE_MASK) + addr;
+ if (likely(!(env->tlb_table[mmu_idx][index].addr_read & TLB_MMIO))) {
+ /* If all the vCPUs have the EXCL bit set for this page there is no need
+ * to request any flush. */
+ if (!cpu_physical_memory_is_excl(hw_addr)) {
+ CPUState *cpu;
+
+ cpu_physical_memory_set_excl(hw_addr);
+ CPU_FOREACH(cpu) {
+ if (this_cpu != cpu) {
+ tlb_flush(cpu, 1);
+ }
+ }
+ }
+ } else {
+ hw_error("EXCL accesses to MMIO regions not supported yet.");
+ }
+
+ cc->cpu_set_excl_protected_range(this_cpu, hw_addr, DATA_SIZE);
+
+ /* For this vCPU, just update the TLB entry, no need to flush. */
+ env->tlb_table[mmu_idx][index].addr_write |= TLB_EXCL;
+
+ /* From now on we are in LL/SC context */
+ this_cpu->ll_sc_context = true;
+
+ return ret;
+}
+
+WORD_TYPE helper_stcond_name(CPUArchState *env, target_ulong addr,
+ DATA_TYPE val, TCGMemOpIdx oi,
+ uintptr_t retaddr)
+{
+ WORD_TYPE ret = 0;
+ CPUState *cpu = ENV_GET_CPU(env);
+ CPUClass *cc = CPU_GET_CLASS(cpu);
+
+ if (!cpu->ll_sc_context) {
+ ret = 1;
+ } else {
+ /* We set it preventively to true to distinguish the following legacy
+ * access as one made by the store conditional wrapper. If the store
+ * conditional does not succeed, the value will be set to 0.*/
+ cpu->excl_succeeded = true;
+ helper_st(env, addr, val, oi, retaddr);
+
+ if (!cpu->excl_succeeded) {
+ ret = 1;
+ }
+ }
+
+ /* Unset LL/SC context */
+ cc->cpu_reset_excl_context(cpu);
+
+ return ret;
+}
+
+#undef helper_ldlink_name
+#undef helper_stcond_name
+#undef helper_ld
+#undef helper_st
diff --git a/softmmu_template.h b/softmmu_template.h
index ea6a0fb..ede1240 100644
--- a/softmmu_template.h
+++ b/softmmu_template.h
@@ -565,6 +565,18 @@ void probe_write(CPUArchState *env, target_ulong addr, int mmu_idx,
#endif
#endif /* !defined(SOFTMMU_CODE_ACCESS) */
+#ifdef GEN_EXCLUSIVE_HELPERS
+
+#if DATA_SIZE > 1 /* The 8-bit helpers are generated along with the LE helpers */
+#define BIGENDIAN_EXCLUSIVE_HELPERS
+#include "softmmu_llsc_template.h"
+#undef BIGENDIAN_EXCLUSIVE_HELPERS
+#endif
+
+#include "softmmu_llsc_template.h"
+
+#endif /* !defined(GEN_EXCLUSIVE_HELPERS) */
+
#undef READ_ACCESS_TYPE
#undef SHIFT
#undef DATA_TYPE
diff --git a/tcg/tcg.h b/tcg/tcg.h
index a696922..3e050a4 100644
--- a/tcg/tcg.h
+++ b/tcg/tcg.h
@@ -968,6 +968,21 @@ tcg_target_ulong helper_be_ldul_mmu(CPUArchState *env, target_ulong addr,
TCGMemOpIdx oi, uintptr_t retaddr);
uint64_t helper_be_ldq_mmu(CPUArchState *env, target_ulong addr,
TCGMemOpIdx oi, uintptr_t retaddr);
+/* Exclusive variants */
+tcg_target_ulong helper_ret_ldlinkub_mmu(CPUArchState *env, target_ulong addr,
+ TCGMemOpIdx oi, uintptr_t retaddr);
+tcg_target_ulong helper_le_ldlinkuw_mmu(CPUArchState *env, target_ulong addr,
+ TCGMemOpIdx oi, uintptr_t retaddr);
+tcg_target_ulong helper_le_ldlinkul_mmu(CPUArchState *env, target_ulong addr,
+ TCGMemOpIdx oi, uintptr_t retaddr);
+uint64_t helper_le_ldlinkq_mmu(CPUArchState *env, target_ulong addr,
+ TCGMemOpIdx oi, uintptr_t retaddr);
+tcg_target_ulong helper_be_ldlinkuw_mmu(CPUArchState *env, target_ulong addr,
+ TCGMemOpIdx oi, uintptr_t retaddr);
+tcg_target_ulong helper_be_ldlinkul_mmu(CPUArchState *env, target_ulong addr,
+ TCGMemOpIdx oi, uintptr_t retaddr);
+uint64_t helper_be_ldlinkq_mmu(CPUArchState *env, target_ulong addr,
+ TCGMemOpIdx oi, uintptr_t retaddr);
/* Value sign-extended to tcg register size. */
tcg_target_ulong helper_ret_ldsb_mmu(CPUArchState *env, target_ulong addr,
@@ -1010,6 +1025,22 @@ uint32_t helper_be_ldl_cmmu(CPUArchState *env, target_ulong addr,
TCGMemOpIdx oi, uintptr_t retaddr);
uint64_t helper_be_ldq_cmmu(CPUArchState *env, target_ulong addr,
TCGMemOpIdx oi, uintptr_t retaddr);
+/* Exclusive variants */
+tcg_target_ulong helper_ret_stcondb_mmu(CPUArchState *env, target_ulong addr,
+ uint8_t val, TCGMemOpIdx oi, uintptr_t retaddr);
+tcg_target_ulong helper_le_stcondw_mmu(CPUArchState *env, target_ulong addr,
+ uint16_t val, TCGMemOpIdx oi, uintptr_t retaddr);
+tcg_target_ulong helper_le_stcondl_mmu(CPUArchState *env, target_ulong addr,
+ uint32_t val, TCGMemOpIdx oi, uintptr_t retaddr);
+uint64_t helper_le_stcondq_mmu(CPUArchState *env, target_ulong addr,
+ uint64_t val, TCGMemOpIdx oi, uintptr_t retaddr);
+tcg_target_ulong helper_be_stcondw_mmu(CPUArchState *env, target_ulong addr,
+ uint16_t val, TCGMemOpIdx oi, uintptr_t retaddr);
+tcg_target_ulong helper_be_stcondl_mmu(CPUArchState *env, target_ulong addr,
+ uint32_t val, TCGMemOpIdx oi, uintptr_t retaddr);
+uint64_t helper_be_stcondq_mmu(CPUArchState *env, target_ulong addr,
+ uint64_t val, TCGMemOpIdx oi, uintptr_t retaddr);
+
/* Temporary aliases until backends are converted. */
#ifdef TARGET_WORDS_BIGENDIAN
--
2.8.0
* [Qemu-devel] [RFC v8 08/14] softmmu: Add history of excl accesses
From: Alvise Rigo @ 2016-04-19 13:39 UTC
To: qemu-devel, mttcg
Cc: jani.kokkonen, claudio.fontana, tech, alex.bennee, pbonzini, rth,
serge.fdrv, Alvise Rigo, Peter Crosthwaite, Andreas Färber
Add a circular buffer to store the hw addresses used in the last
EXCLUSIVE_HISTORY_LEN exclusive accesses.
When an address is popped from the buffer, its page is set as not
exclusive. In this way we avoid frequently setting and unsetting a page
(which would also cause frequent TLB flushes).
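The eviction policy can be illustrated with a standalone toy version of
the buffer (the real one lives in exec.c/cputlb.c, stores hwaddr pages and
is sized as 256 * max_cpus; the 4-entry size here is only for the example):

/* Toy model: a page stops being exclusive only after HISTORY_LEN further
 * LL operations have landed on other pages, bounding the flush rate. */
#include <stdio.h>

#define HISTORY_LEN 4
#define RESET_ADDR  (-1L)

static long history[HISTORY_LEN] = { RESET_ADDR, RESET_ADDR,
                                     RESET_ADDR, RESET_ADDR };
static int last_idx;

static void history_put(long page)
{
    last_idx = (last_idx + 1) % HISTORY_LEN;
    if (history[last_idx] != RESET_ADDR) {
        printf("page %ld leaves the history -> mark it non-exclusive\n",
               history[last_idx]);
    }
    history[last_idx] = page;        /* overwrite the oldest entry */
}

int main(void)
{
    for (long p = 0; p < 6; p++) {
        history_put(p);              /* pages 0 and 1 get evicted */
    }
    return 0;
}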
Suggested-by: Jani Kokkonen <jani.kokkonen@huawei.com>
Suggested-by: Claudio Fontana <claudio.fontana@huawei.com>
Signed-off-by: Alvise Rigo <a.rigo@virtualopensystems.com>
---
cputlb.c | 21 +++++++++++++++++++++
exec.c | 19 +++++++++++++++++++
include/qom/cpu.h | 8 ++++++++
softmmu_llsc_template.h | 1 +
vl.c | 3 +++
5 files changed, 52 insertions(+)
diff --git a/cputlb.c b/cputlb.c
index 58d6f03..02b0d14 100644
--- a/cputlb.c
+++ b/cputlb.c
@@ -475,6 +475,27 @@ tb_page_addr_t get_page_addr_code(CPUArchState *env1, target_ulong addr)
return qemu_ram_addr_from_host_nofail(p);
}
+/* Keep a circular array with the last excl_history.length addresses used for
+ * exclusive accesses. The exiting addresses are marked as non-exclusive. */
+extern CPUExclusiveHistory excl_history;
+static inline void excl_history_put_addr(hwaddr addr)
+{
+ hwaddr last;
+
+ /* Calculate the index of the next exclusive address */
+ excl_history.last_idx = (excl_history.last_idx + 1) % excl_history.length;
+
+ last = excl_history.c_array[excl_history.last_idx];
+
+ /* Unset EXCL bit of the oldest entry */
+ if (last != EXCLUSIVE_RESET_ADDR) {
+ cpu_physical_memory_unset_excl(last);
+ }
+
+ /* Add a new address, overwriting the oldest one */
+ excl_history.c_array[excl_history.last_idx] = addr & TARGET_PAGE_MASK;
+}
+
#define MMUSUFFIX _mmu
/* Generates LoadLink/StoreConditional helpers in softmmu_template.h */
diff --git a/exec.c b/exec.c
index cefee1b..3c54b92 100644
--- a/exec.c
+++ b/exec.c
@@ -177,6 +177,25 @@ struct CPUAddressSpace {
MemoryListener tcg_as_listener;
};
+/* Exclusive memory support */
+CPUExclusiveHistory excl_history;
+void cpu_exclusive_history_init(void)
+{
+ /* Initialize exclusive history for atomic instruction handling. */
+ if (tcg_enabled()) {
+ g_assert(EXCLUSIVE_HISTORY_CPU_LEN * max_cpus <= UINT16_MAX);
+ excl_history.length = EXCLUSIVE_HISTORY_CPU_LEN * max_cpus;
+ excl_history.c_array = g_malloc(excl_history.length * sizeof(hwaddr));
+ memset(excl_history.c_array, -1, excl_history.length * sizeof(hwaddr));
+ }
+}
+
+void cpu_exclusive_history_free(void)
+{
+ if (tcg_enabled()) {
+ g_free(excl_history.c_array);
+ }
+}
#endif
#if !defined(CONFIG_USER_ONLY)
diff --git a/include/qom/cpu.h b/include/qom/cpu.h
index 014851e..de144f6 100644
--- a/include/qom/cpu.h
+++ b/include/qom/cpu.h
@@ -232,7 +232,15 @@ struct kvm_run;
#define TB_JMP_CACHE_SIZE (1 << TB_JMP_CACHE_BITS)
/* Atomic insn translation TLB support. */
+typedef struct CPUExclusiveHistory {
+ uint16_t last_idx; /* index of last insertion */
+ uint16_t length; /* history's length, it depends on smp_cpus */
+ hwaddr *c_array; /* history's circular array */
+} CPUExclusiveHistory;
#define EXCLUSIVE_RESET_ADDR ULLONG_MAX
+#define EXCLUSIVE_HISTORY_CPU_LEN 256
+void cpu_exclusive_history_init(void);
+void cpu_exclusive_history_free(void);
/**
* CPUState:
diff --git a/softmmu_llsc_template.h b/softmmu_llsc_template.h
index ca2ac95..1e24fec 100644
--- a/softmmu_llsc_template.h
+++ b/softmmu_llsc_template.h
@@ -77,6 +77,7 @@ WORD_TYPE helper_ldlink_name(CPUArchState *env, target_ulong addr,
CPUState *cpu;
cpu_physical_memory_set_excl(hw_addr);
+ excl_history_put_addr(hw_addr);
CPU_FOREACH(cpu) {
if (this_cpu != cpu) {
tlb_flush(cpu, 1);
diff --git a/vl.c b/vl.c
index f043009..b22d99b 100644
--- a/vl.c
+++ b/vl.c
@@ -547,6 +547,7 @@ static void res_free(void)
{
g_free(boot_splash_filedata);
boot_splash_filedata = NULL;
+ cpu_exclusive_history_free();
}
static int default_driver_check(void *opaque, QemuOpts *opts, Error **errp)
@@ -4322,6 +4323,8 @@ int main(int argc, char **argv, char **envp)
configure_accelerator(current_machine);
+ cpu_exclusive_history_init();
+
if (qtest_chrdev) {
qtest_init(qtest_chrdev, qtest_log, &error_fatal);
}
--
2.8.0
* [Qemu-devel] [RFC v8 09/14] softmmu: Honor the new exclusive bitmap
From: Alvise Rigo @ 2016-04-19 13:39 UTC
To: qemu-devel, mttcg
Cc: jani.kokkonen, claudio.fontana, tech, alex.bennee, pbonzini, rth,
serge.fdrv, Alvise Rigo, Peter Crosthwaite
The pages set as exclusive (clean) in the DIRTY_MEMORY_EXCLUSIVE bitmap
have to have their TLB entries flagged with TLB_EXCL. The accesses to
pages with the TLB_EXCL flag set have to be properly handled, in that
they can potentially invalidate an open LL/SC transaction.
Modify the TLB entry generation to honor the new bitmap and extend
the softmmu_template to handle the accesses made to guest pages marked
as exclusive. The TLB_EXCL flag is used only for normal RAM memory.
Exclusive accesses to MMIO memory are still not supported, but they will
be with the next patch.
Suggested-by: Jani Kokkonen <jani.kokkonen@huawei.com>
Suggested-by: Claudio Fontana <claudio.fontana@huawei.com>
Signed-off-by: Alvise Rigo <a.rigo@virtualopensystems.com>
---
cputlb.c | 36 ++++++++++++++++++++++++++----
softmmu_template.h | 65 +++++++++++++++++++++++++++++++++++++++++++++++-------
2 files changed, 89 insertions(+), 12 deletions(-)
diff --git a/cputlb.c b/cputlb.c
index 02b0d14..e5df3a5 100644
--- a/cputlb.c
+++ b/cputlb.c
@@ -416,11 +416,20 @@ void tlb_set_page_with_attrs(CPUState *cpu, target_ulong vaddr,
|| memory_region_is_romd(section->mr)) {
/* Write access calls the I/O callback. */
te->addr_write = address | TLB_MMIO;
- } else if (memory_region_is_ram(section->mr)
- && cpu_physical_memory_is_clean(section->mr->ram_addr
- + xlat)) {
- te->addr_write = address | TLB_NOTDIRTY;
} else {
+ if (memory_region_is_ram(section->mr)
+ && cpu_physical_memory_is_clean(section->mr->ram_addr
+ + xlat)) {
+ address |= TLB_NOTDIRTY;
+ }
+ /* Only normal RAM accesses need the TLB_EXCL flag to handle
+ * exclusive store operations. */
+ if (!(address & TLB_MMIO) &&
+ cpu_physical_memory_is_excl(section->mr->ram_addr + xlat)) {
+ /* There is at least one vCPU that has flagged the address as
+ * exclusive. */
+ address |= TLB_EXCL;
+ }
te->addr_write = address;
}
} else {
@@ -496,6 +505,25 @@ static inline void excl_history_put_addr(hwaddr addr)
excl_history.c_array[excl_history.last_idx] = addr & TARGET_PAGE_MASK;
}
+/* For every vCPU compare the exclusive address and reset it in case of a
+ * match. Since only one vCPU is running at once, no lock has to be held to
+ * guard this operation. */
+static inline void reset_other_cpus_colliding_ll_addr(hwaddr addr, hwaddr size)
+{
+ CPUState *cpu;
+
+ CPU_FOREACH(cpu) {
+ if (current_cpu != cpu &&
+ cpu->excl_protected_range.begin != EXCLUSIVE_RESET_ADDR &&
+ ranges_overlap(cpu->excl_protected_range.begin,
+ cpu->excl_protected_range.end -
+ cpu->excl_protected_range.begin,
+ addr, size)) {
+ cpu->excl_protected_range.begin = EXCLUSIVE_RESET_ADDR;
+ }
+ }
+}
+
#define MMUSUFFIX _mmu
/* Generates LoadLink/StoreConditional helpers in softmmu_template.h */
diff --git a/softmmu_template.h b/softmmu_template.h
index ede1240..2934a0c 100644
--- a/softmmu_template.h
+++ b/softmmu_template.h
@@ -469,6 +469,43 @@ static inline void smmu_helper(do_ram_store)(CPUArchState *env,
#endif
}
+static inline void smmu_helper(do_excl_store)(CPUArchState *env,
+ bool little_endian,
+ DATA_TYPE val, target_ulong addr,
+ TCGMemOpIdx oi, int index,
+ uintptr_t retaddr)
+{
+ CPUIOTLBEntry *iotlbentry = &env->iotlb[get_mmuidx(oi)][index];
+ CPUState *cpu = ENV_GET_CPU(env);
+ CPUClass *cc = CPU_GET_CLASS(cpu);
+ /* The slow-path has been forced since we are writing to
+ * exclusive-protected memory. */
+ hwaddr hw_addr = (iotlbentry->addr & TARGET_PAGE_MASK) + addr;
+
+ /* The function reset_other_cpus_colliding_ll_addr could have reset
+ * the exclusive address. Fail the SC in this case.
+ * N.B.: here excl_succeed == true means that the caller is
+ * helper_stcond_name in softmmu_llsc_template.
+ * On the contrary, excl_succeeded == false occurs when a VCPU is
+ * writing through normal store to a page with TLB_EXCL bit set. */
+ if (cpu->excl_succeeded) {
+ if (!cc->cpu_valid_excl_access(cpu, hw_addr, DATA_SIZE)) {
+ /* The vCPU is SC-ing to an unprotected address. */
+ cpu->excl_protected_range.begin = EXCLUSIVE_RESET_ADDR;
+ cpu->excl_succeeded = false;
+
+ return;
+ }
+ }
+
+ smmu_helper(do_ram_store)(env, little_endian, val, addr, oi,
+ get_mmuidx(oi), index, retaddr);
+
+ reset_other_cpus_colliding_ll_addr(hw_addr, DATA_SIZE);
+
+ return;
+}
+
void helper_le_st_name(CPUArchState *env, target_ulong addr, DATA_TYPE val,
TCGMemOpIdx oi, uintptr_t retaddr)
{
@@ -493,11 +530,17 @@ void helper_le_st_name(CPUArchState *env, target_ulong addr, DATA_TYPE val,
tlb_addr = env->tlb_table[mmu_idx][index].addr_write;
}
- /* Handle an IO access. */
+ /* Handle an IO access or exclusive access. */
if (unlikely(tlb_addr & ~TARGET_PAGE_MASK)) {
- smmu_helper(do_mmio_store)(env, true, val, addr, oi, mmu_idx, index,
- retaddr);
- return;
+ if (tlb_addr & TLB_EXCL) {
+ smmu_helper(do_excl_store)(env, true, val, addr, oi, index,
+ retaddr);
+ return;
+ } else {
+ smmu_helper(do_mmio_store)(env, true, val, addr, oi, mmu_idx,
+ index, retaddr);
+ return;
+ }
}
smmu_helper(do_ram_store)(env, true, val, addr, oi, mmu_idx, index,
@@ -529,11 +572,17 @@ void helper_be_st_name(CPUArchState *env, target_ulong addr, DATA_TYPE val,
tlb_addr = env->tlb_table[mmu_idx][index].addr_write;
}
- /* Handle an IO access. */
+ /* Handle an IO access or exclusive access. */
if (unlikely(tlb_addr & ~TARGET_PAGE_MASK)) {
- smmu_helper(do_mmio_store)(env, false, val, addr, oi, mmu_idx, index,
- retaddr);
- return;
+ if (tlb_addr & TLB_EXCL) {
+ smmu_helper(do_excl_store)(env, false, val, addr, oi, index,
+ retaddr);
+ return;
+ } else {
+ smmu_helper(do_mmio_store)(env, false, val, addr, oi, mmu_idx,
+ index, retaddr);
+ return;
+ }
}
smmu_helper(do_ram_store)(env, false, val, addr, oi, mmu_idx, index,
--
2.8.0
* [Qemu-devel] [RFC v8 10/14] softmmu: Support MMIO exclusive accesses
2016-04-19 13:39 [Qemu-devel] [RFC v8 00/14] Slow-path for atomic instruction translation Alvise Rigo
` (8 preceding siblings ...)
2016-04-19 13:39 ` [Qemu-devel] [RFC v8 09/14] softmmu: Honor the new exclusive bitmap Alvise Rigo
@ 2016-04-19 13:39 ` Alvise Rigo
2016-04-19 13:39 ` [Qemu-devel] [RFC v8 11/14] tcg: Create new runtime helpers for excl accesses Alvise Rigo
` (4 subsequent siblings)
14 siblings, 0 replies; 18+ messages in thread
From: Alvise Rigo @ 2016-04-19 13:39 UTC (permalink / raw)
To: qemu-devel, mttcg
Cc: jani.kokkonen, claudio.fontana, tech, alex.bennee, pbonzini, rth,
serge.fdrv, Alvise Rigo, Peter Crosthwaite
Enable exclusive accesses when the MMIO flag is set in the TLB entry.
When an LL access is made to MMIO memory, we treat it differently from
a RAM access in that we do not rely on the EXCL bitmap to flag the page
as exclusive. In fact, we don't even need the TLB_EXCL flag to force the
slow path, since for MMIO it is always forced anyway.
As in the RAM case, the MMIO exclusive ranges also have to be protected
from other CPUs' accesses. To do that, we flag the accessed MemoryRegion
to mark that an exclusive access has been performed and has not yet
concluded. This flag forces the other CPUs to invalidate the exclusive
range in case of collision: basically, it serves the same purpose that
TLB_EXCL serves for the TLB entries referring to exclusive memory.
Suggested-by: Jani Kokkonen <jani.kokkonen@huawei.com>
Suggested-by: Claudio Fontana <claudio.fontana@huawei.com>
Signed-off-by: Alvise Rigo <a.rigo@virtualopensystems.com>
---
cputlb.c | 7 +++++--
include/exec/memory.h | 1 +
softmmu_llsc_template.h | 11 +++++++----
softmmu_template.h | 22 ++++++++++++++++++++++
4 files changed, 35 insertions(+), 6 deletions(-)
diff --git a/cputlb.c b/cputlb.c
index e5df3a5..3cf40a3 100644
--- a/cputlb.c
+++ b/cputlb.c
@@ -29,7 +29,6 @@
#include "exec/memory-internal.h"
#include "exec/ram_addr.h"
#include "tcg/tcg.h"
-#include "hw/hw.h"
//#define DEBUG_TLB
//#define DEBUG_TLB_CHECK
@@ -508,9 +507,10 @@ static inline void excl_history_put_addr(hwaddr addr)
/* For every vCPU compare the exclusive address and reset it in case of a
* match. Since only one vCPU is running at once, no lock has to be held to
* guard this operation. */
-static inline void reset_other_cpus_colliding_ll_addr(hwaddr addr, hwaddr size)
+static inline bool reset_other_cpus_colliding_ll_addr(hwaddr addr, hwaddr size)
{
CPUState *cpu;
+ bool ret = false;
CPU_FOREACH(cpu) {
if (current_cpu != cpu &&
@@ -520,8 +520,11 @@ static inline void reset_other_cpus_colliding_ll_addr(hwaddr addr, hwaddr size)
cpu->excl_protected_range.begin,
addr, size)) {
cpu->excl_protected_range.begin = EXCLUSIVE_RESET_ADDR;
+ ret = true;
}
}
+
+ return ret;
}
#define MMUSUFFIX _mmu
diff --git a/include/exec/memory.h b/include/exec/memory.h
index 71e0480..bacb3ad 100644
--- a/include/exec/memory.h
+++ b/include/exec/memory.h
@@ -171,6 +171,7 @@ struct MemoryRegion {
bool rom_device;
bool flush_coalesced_mmio;
bool global_locking;
+ bool pending_excl_access; /* A vCPU issued an exclusive access */
uint8_t dirty_log_mask;
ram_addr_t ram_addr;
Object *owner;
diff --git a/softmmu_llsc_template.h b/softmmu_llsc_template.h
index 1e24fec..ca55502 100644
--- a/softmmu_llsc_template.h
+++ b/softmmu_llsc_template.h
@@ -84,15 +84,18 @@ WORD_TYPE helper_ldlink_name(CPUArchState *env, target_ulong addr,
}
}
}
+ /* For this vCPU, just update the TLB entry, no need to flush. */
+ env->tlb_table[mmu_idx][index].addr_write |= TLB_EXCL;
} else {
- hw_error("EXCL accesses to MMIO regions not supported yet.");
+ /* Set a pending exclusive access in the MemoryRegion */
+ MemoryRegion *mr = iotlb_to_region(this_cpu,
+ env->iotlb[mmu_idx][index].addr,
+ env->iotlb[mmu_idx][index].attrs);
+ mr->pending_excl_access = true;
}
cc->cpu_set_excl_protected_range(this_cpu, hw_addr, DATA_SIZE);
- /* For this vCPU, just update the TLB entry, no need to flush. */
- env->tlb_table[mmu_idx][index].addr_write |= TLB_EXCL;
-
/* From now on we are in LL/SC context */
this_cpu->ll_sc_context = true;
diff --git a/softmmu_template.h b/softmmu_template.h
index 2934a0c..2dc5e01 100644
--- a/softmmu_template.h
+++ b/softmmu_template.h
@@ -360,6 +360,28 @@ static inline void glue(io_write, SUFFIX)(CPUArchState *env,
MemoryRegion *mr = iotlb_to_region(cpu, physaddr, iotlbentry->attrs);
physaddr = (physaddr & TARGET_PAGE_MASK) + addr;
+
+ /* While for normal RAM accesses we define exclusive memory at TLBEntry
+ * granularity, for MMIO memory we use a MemoryRegion granularity.
+ * The pending_excl_access flag is the analogue of TLB_EXCL. */
+ if (unlikely(mr->pending_excl_access)) {
+ if (cpu->excl_succeeded) {
+ /* This SC access finalizes the LL/SC pair, thus the MemoryRegion
+ * has no pending exclusive access anymore.
+ * N.B.: Here excl_succeeded == true means that this access
+ * comes from an exclusive instruction. */
+ MemoryRegion *mr = iotlb_to_region(cpu, iotlbentry->addr,
+ iotlbentry->attrs);
+ mr->pending_excl_access = false;
+ } else {
+ /* This is a normal MMIO write access. Check if it collides
+ * with an existing exclusive range. */
+ if (reset_other_cpus_colliding_ll_addr(physaddr, 1 << SHIFT)) {
+ mr->pending_excl_access = false;
+ }
+ }
+ }
+
if (mr != &io_mem_rom && mr != &io_mem_notdirty && !cpu->can_do_io) {
cpu_io_recompile(cpu, retaddr);
}
--
2.8.0
* [Qemu-devel] [RFC v8 11/14] tcg: Create new runtime helpers for excl accesses
2016-04-19 13:39 [Qemu-devel] [RFC v8 00/14] Slow-path for atomic instruction translation Alvise Rigo
` (9 preceding siblings ...)
2016-04-19 13:39 ` [Qemu-devel] [RFC v8 10/14] softmmu: Support MMIO exclusive accesses Alvise Rigo
@ 2016-04-19 13:39 ` Alvise Rigo
2016-04-19 13:39 ` [Qemu-devel] [RFC v8 12/14] target-arm: translate: Use ld/st excl for atomic insns Alvise Rigo
` (3 subsequent siblings)
14 siblings, 0 replies; 18+ messages in thread
From: Alvise Rigo @ 2016-04-19 13:39 UTC (permalink / raw)
To: qemu-devel, mttcg
Cc: jani.kokkonen, claudio.fontana, tech, alex.bennee, pbonzini, rth,
serge.fdrv, Alvise Rigo, Peter Crosthwaite
Introduce a set of new runtime helpers to handle exclusive instructions.
These helpers are used as hooks to call the respective LL/SC helpers in
softmmu_llsc_template.h from TCG code.
The helpers whose name ends with an "a" also perform an alignment check.
Suggested-by: Jani Kokkonen <jani.kokkonen@huawei.com>
Suggested-by: Claudio Fontana <claudio.fontana@huawei.com>
Signed-off-by: Alvise Rigo <a.rigo@virtualopensystems.com>
---
Makefile.target | 2 +-
include/exec/helper-gen.h | 3 ++
include/exec/helper-proto.h | 1 +
include/exec/helper-tcg.h | 3 ++
tcg-llsc-helper.c | 104 ++++++++++++++++++++++++++++++++++++++++++++
tcg-llsc-helper.h | 61 ++++++++++++++++++++++++++
tcg/tcg-llsc-gen-helper.h | 67 ++++++++++++++++++++++++++++
7 files changed, 240 insertions(+), 1 deletion(-)
create mode 100644 tcg-llsc-helper.c
create mode 100644 tcg-llsc-helper.h
create mode 100644 tcg/tcg-llsc-gen-helper.h
diff --git a/Makefile.target b/Makefile.target
index 34ddb7e..faf32a2 100644
--- a/Makefile.target
+++ b/Makefile.target
@@ -135,7 +135,7 @@ obj-y += arch_init.o cpus.o monitor.o gdbstub.o balloon.o ioport.o numa.o
obj-y += qtest.o bootdevice.o
obj-y += hw/
obj-$(CONFIG_KVM) += kvm-all.o
-obj-y += memory.o cputlb.o
+obj-y += memory.o cputlb.o tcg-llsc-helper.o
obj-y += memory_mapping.o
obj-y += dump.o
obj-y += migration/ram.o migration/savevm.o
diff --git a/include/exec/helper-gen.h b/include/exec/helper-gen.h
index 0d0da3a..f8483a9 100644
--- a/include/exec/helper-gen.h
+++ b/include/exec/helper-gen.h
@@ -60,6 +60,9 @@ static inline void glue(gen_helper_, name)(dh_retvar_decl(ret) \
#include "trace/generated-helpers.h"
#include "trace/generated-helpers-wrappers.h"
#include "tcg-runtime.h"
+#if defined(CONFIG_SOFTMMU)
+#include "tcg-llsc-gen-helper.h"
+#endif
#undef DEF_HELPER_FLAGS_0
#undef DEF_HELPER_FLAGS_1
diff --git a/include/exec/helper-proto.h b/include/exec/helper-proto.h
index effdd43..90be2fd 100644
--- a/include/exec/helper-proto.h
+++ b/include/exec/helper-proto.h
@@ -29,6 +29,7 @@ dh_ctype(ret) HELPER(name) (dh_ctype(t1), dh_ctype(t2), dh_ctype(t3), \
#include "helper.h"
#include "trace/generated-helpers.h"
#include "tcg-runtime.h"
+#include "tcg/tcg-llsc-gen-helper.h"
#undef DEF_HELPER_FLAGS_0
#undef DEF_HELPER_FLAGS_1
diff --git a/include/exec/helper-tcg.h b/include/exec/helper-tcg.h
index 79fa3c8..6228a7f 100644
--- a/include/exec/helper-tcg.h
+++ b/include/exec/helper-tcg.h
@@ -38,6 +38,9 @@
#include "helper.h"
#include "trace/generated-helpers.h"
#include "tcg-runtime.h"
+#ifdef CONFIG_SOFTMMU
+#include "tcg-llsc-gen-helper.h"
+#endif
#undef DEF_HELPER_FLAGS_0
#undef DEF_HELPER_FLAGS_1
diff --git a/tcg-llsc-helper.c b/tcg-llsc-helper.c
new file mode 100644
index 0000000..646b4ba
--- /dev/null
+++ b/tcg-llsc-helper.c
@@ -0,0 +1,104 @@
+/*
+ * Runtime helpers for atomic instruction emulation
+ *
+ * Copyright (c) 2015 Virtual Open Systems
+ *
+ * Authors:
+ * Alvise Rigo <a.rigo@virtualopensystems.com>
+ *
+ * This library is free software; you can redistribute it and/or
+ * modify it under the terms of the GNU Lesser General Public
+ * License as published by the Free Software Foundation; either
+ * version 2 of the License, or (at your option) any later version.
+ *
+ * This library is distributed in the hope that it will be useful,
+ * but WITHOUT ANY WARRANTY; without even the implied warranty of
+ * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU
+ * Lesser General Public License for more details.
+ *
+ * You should have received a copy of the GNU Lesser General Public
+ * License along with this library; if not, see <http://www.gnu.org/licenses/>.
+ */
+
+#include "exec/cpu_ldst.h"
+#include "exec/helper-head.h"
+#include "tcg-llsc-helper.h"
+
+#define LDEX_HELPER(SUFF, OPC, FUNC) \
+uint32_t HELPER(ldlink_i##SUFF)(CPUArchState *env, target_ulong addr, \
+ uint32_t index) \
+{ \
+ CPUArchState *state = env; \
+ TCGMemOpIdx op; \
+ \
+ op = make_memop_idx((OPC), index); \
+ \
+ return (uint32_t)FUNC(state, addr, op, GETRA()); \
+}
+
+#define STEX_HELPER(SUFF, DATA_TYPE, OPC, FUNC) \
+target_ulong HELPER(stcond_i##SUFF)(CPUArchState *env, target_ulong addr, \
+ uint32_t val, uint32_t index) \
+{ \
+ CPUArchState *state = env; \
+ TCGMemOpIdx op; \
+ \
+ op = make_memop_idx((OPC), index); \
+ \
+ return (target_ulong)FUNC(state, addr, val, op, GETRA()); \
+}
+
+
+LDEX_HELPER(8, MO_UB, helper_ret_ldlinkub_mmu)
+LDEX_HELPER(16_be, MO_BEUW, helper_be_ldlinkuw_mmu)
+LDEX_HELPER(16_bea, MO_BEUW | MO_ALIGN, helper_be_ldlinkuw_mmu)
+LDEX_HELPER(32_be, MO_BEUL, helper_be_ldlinkul_mmu)
+LDEX_HELPER(32_bea, MO_BEUL | MO_ALIGN, helper_be_ldlinkul_mmu)
+LDEX_HELPER(16_le, MO_LEUW, helper_le_ldlinkuw_mmu)
+LDEX_HELPER(16_lea, MO_LEUW | MO_ALIGN, helper_le_ldlinkuw_mmu)
+LDEX_HELPER(32_le, MO_LEUL, helper_le_ldlinkul_mmu)
+LDEX_HELPER(32_lea, MO_LEUL | MO_ALIGN, helper_le_ldlinkul_mmu)
+
+STEX_HELPER(8, uint8_t, MO_UB, helper_ret_stcondb_mmu)
+STEX_HELPER(16_be, uint16_t, MO_BEUW, helper_be_stcondw_mmu)
+STEX_HELPER(16_bea, uint16_t, MO_BEUW | MO_ALIGN, helper_be_stcondw_mmu)
+STEX_HELPER(32_be, uint32_t, MO_BEUL, helper_be_stcondl_mmu)
+STEX_HELPER(32_bea, uint32_t, MO_BEUL | MO_ALIGN, helper_be_stcondl_mmu)
+STEX_HELPER(16_le, uint16_t, MO_LEUW, helper_le_stcondw_mmu)
+STEX_HELPER(16_lea, uint16_t, MO_LEUW | MO_ALIGN, helper_le_stcondw_mmu)
+STEX_HELPER(32_le, uint32_t, MO_LEUL, helper_le_stcondl_mmu)
+STEX_HELPER(32_lea, uint32_t, MO_LEUL | MO_ALIGN, helper_le_stcondl_mmu)
+
+#define LDEX_HELPER_64(SUFF, OPC, FUNC) \
+uint64_t HELPER(ldlink_i##SUFF)(CPUArchState *env, target_ulong addr, \
+ uint32_t index) \
+{ \
+ CPUArchState *state = env; \
+ TCGMemOpIdx op; \
+ \
+ op = make_memop_idx((OPC), index); \
+ \
+ return FUNC(state, addr, op, GETRA()); \
+}
+
+#define STEX_HELPER_64(SUFF, OPC, FUNC) \
+target_ulong HELPER(stcond_i##SUFF)(CPUArchState *env, target_ulong addr, \
+ uint64_t val, uint32_t index) \
+{ \
+ CPUArchState *state = env; \
+ TCGMemOpIdx op; \
+ \
+ op = make_memop_idx((OPC), index); \
+ \
+ return (target_ulong)FUNC(state, addr, val, op, GETRA()); \
+}
+
+LDEX_HELPER_64(64_be, MO_BEQ, helper_be_ldlinkq_mmu)
+LDEX_HELPER_64(64_bea, MO_BEQ | MO_ALIGN, helper_be_ldlinkq_mmu)
+LDEX_HELPER_64(64_le, MO_LEQ, helper_le_ldlinkq_mmu)
+LDEX_HELPER_64(64_lea, MO_LEQ | MO_ALIGN, helper_le_ldlinkq_mmu)
+
+STEX_HELPER_64(64_be, MO_BEQ, helper_be_stcondq_mmu)
+STEX_HELPER_64(64_bea, MO_BEQ | MO_ALIGN, helper_be_stcondq_mmu)
+STEX_HELPER_64(64_le, MO_LEQ, helper_le_stcondq_mmu)
+STEX_HELPER_64(64_lea, MO_LEQ | MO_ALIGN, helper_le_stcondq_mmu)
diff --git a/tcg-llsc-helper.h b/tcg-llsc-helper.h
new file mode 100644
index 0000000..8f7adf0
--- /dev/null
+++ b/tcg-llsc-helper.h
@@ -0,0 +1,61 @@
+#ifndef HELPER_LLSC_HEAD_H
+#define HELPER_LLSC_HEAD_H 1
+
+uint32_t HELPER(ldlink_i8)(CPUArchState *env, target_ulong addr,
+ uint32_t index);
+uint32_t HELPER(ldlink_i16_be)(CPUArchState *env, target_ulong addr,
+ uint32_t index);
+uint32_t HELPER(ldlink_i32_be)(CPUArchState *env, target_ulong addr,
+ uint32_t index);
+uint64_t HELPER(ldlink_i64_be)(CPUArchState *env, target_ulong addr,
+ uint32_t index);
+uint32_t HELPER(ldlink_i16_le)(CPUArchState *env, target_ulong addr,
+ uint32_t index);
+uint32_t HELPER(ldlink_i32_le)(CPUArchState *env, target_ulong addr,
+ uint32_t index);
+uint64_t HELPER(ldlink_i64_le)(CPUArchState *env, target_ulong addr,
+ uint32_t index);
+
+target_ulong HELPER(stcond_i8)(CPUArchState *env, target_ulong addr,
+ uint32_t val, uint32_t index);
+target_ulong HELPER(stcond_i16_be)(CPUArchState *env, target_ulong addr,
+ uint32_t val, uint32_t index);
+target_ulong HELPER(stcond_i32_be)(CPUArchState *env, target_ulong addr,
+ uint32_t val, uint32_t index);
+target_ulong HELPER(stcond_i64_be)(CPUArchState *env, target_ulong addr,
+ uint64_t val, uint32_t index);
+target_ulong HELPER(stcond_i16_le)(CPUArchState *env, target_ulong addr,
+ uint32_t val, uint32_t index);
+target_ulong HELPER(stcond_i32_le)(CPUArchState *env, target_ulong addr,
+ uint32_t val, uint32_t index);
+target_ulong HELPER(stcond_i64_le)(CPUArchState *env, target_ulong addr,
+ uint64_t val, uint32_t index);
+
+/* Aligned versions */
+uint32_t HELPER(ldlink_i16_bea)(CPUArchState *env, target_ulong addr,
+ uint32_t index);
+uint32_t HELPER(ldlink_i32_bea)(CPUArchState *env, target_ulong addr,
+ uint32_t index);
+uint64_t HELPER(ldlink_i64_bea)(CPUArchState *env, target_ulong addr,
+ uint32_t index);
+uint32_t HELPER(ldlink_i16_lea)(CPUArchState *env, target_ulong addr,
+ uint32_t index);
+uint32_t HELPER(ldlink_i32_lea)(CPUArchState *env, target_ulong addr,
+ uint32_t index);
+uint64_t HELPER(ldlink_i64_lea)(CPUArchState *env, target_ulong addr,
+ uint32_t index);
+
+target_ulong HELPER(stcond_i16_bea)(CPUArchState *env, target_ulong addr,
+ uint32_t val, uint32_t index);
+target_ulong HELPER(stcond_i32_bea)(CPUArchState *env, target_ulong addr,
+ uint32_t val, uint32_t index);
+target_ulong HELPER(stcond_i64_bea)(CPUArchState *env, target_ulong addr,
+ uint64_t val, uint32_t index);
+target_ulong HELPER(stcond_i16_lea)(CPUArchState *env, target_ulong addr,
+ uint32_t val, uint32_t index);
+target_ulong HELPER(stcond_i32_lea)(CPUArchState *env, target_ulong addr,
+ uint32_t val, uint32_t index);
+target_ulong HELPER(stcond_i64_lea)(CPUArchState *env, target_ulong addr,
+ uint64_t val, uint32_t index);
+
+#endif
diff --git a/tcg/tcg-llsc-gen-helper.h b/tcg/tcg-llsc-gen-helper.h
new file mode 100644
index 0000000..01c0a67
--- /dev/null
+++ b/tcg/tcg-llsc-gen-helper.h
@@ -0,0 +1,67 @@
+#if TARGET_LONG_BITS == 32
+#define TYPE i32
+#else
+#define TYPE i64
+#endif
+
+DEF_HELPER_3(ldlink_i8, i32, env, TYPE, i32)
+DEF_HELPER_3(ldlink_i16_be, i32, env, TYPE, i32)
+DEF_HELPER_3(ldlink_i32_be, i32, env, TYPE, i32)
+DEF_HELPER_3(ldlink_i64_be, i64, env, TYPE, i32)
+DEF_HELPER_3(ldlink_i16_le, i32, env, TYPE, i32)
+DEF_HELPER_3(ldlink_i32_le, i32, env, TYPE, i32)
+DEF_HELPER_3(ldlink_i64_le, i64, env, TYPE, i32)
+
+DEF_HELPER_4(stcond_i8, TYPE, env, TYPE, i32, i32)
+DEF_HELPER_4(stcond_i16_be, TYPE, env, TYPE, i32, i32)
+DEF_HELPER_4(stcond_i32_be, TYPE, env, TYPE, i32, i32)
+DEF_HELPER_4(stcond_i64_be, TYPE, env, TYPE, i64, i32)
+DEF_HELPER_4(stcond_i16_le, TYPE, env, TYPE, i32, i32)
+DEF_HELPER_4(stcond_i32_le, TYPE, env, TYPE, i32, i32)
+DEF_HELPER_4(stcond_i64_le, TYPE, env, TYPE, i64, i32)
+
+/* Aligned versions */
+DEF_HELPER_3(ldlink_i16_bea, i32, env, TYPE, i32)
+DEF_HELPER_3(ldlink_i32_bea, i32, env, TYPE, i32)
+DEF_HELPER_3(ldlink_i64_bea, i64, env, TYPE, i32)
+DEF_HELPER_3(ldlink_i16_lea, i32, env, TYPE, i32)
+DEF_HELPER_3(ldlink_i32_lea, i32, env, TYPE, i32)
+DEF_HELPER_3(ldlink_i64_lea, i64, env, TYPE, i32)
+
+DEF_HELPER_4(stcond_i16_bea, TYPE, env, TYPE, i32, i32)
+DEF_HELPER_4(stcond_i32_bea, TYPE, env, TYPE, i32, i32)
+DEF_HELPER_4(stcond_i64_bea, TYPE, env, TYPE, i64, i32)
+DEF_HELPER_4(stcond_i16_lea, TYPE, env, TYPE, i32, i32)
+DEF_HELPER_4(stcond_i32_lea, TYPE, env, TYPE, i32, i32)
+DEF_HELPER_4(stcond_i64_lea, TYPE, env, TYPE, i64, i32)
+
+/* Convenient aliases */
+#ifdef TARGET_WORDS_BIGENDIAN
+#define gen_helper_stcond_i16 gen_helper_stcond_i16_be
+#define gen_helper_stcond_i32 gen_helper_stcond_i32_be
+#define gen_helper_stcond_i64 gen_helper_stcond_i64_be
+#define gen_helper_ldlink_i16 gen_helper_ldlink_i16_be
+#define gen_helper_ldlink_i32 gen_helper_ldlink_i32_be
+#define gen_helper_ldlink_i64 gen_helper_ldlink_i64_be
+#define gen_helper_stcond_i16a gen_helper_stcond_i16_bea
+#define gen_helper_stcond_i32a gen_helper_stcond_i32_bea
+#define gen_helper_stcond_i64a gen_helper_stcond_i64_bea
+#define gen_helper_ldlink_i16a gen_helper_ldlink_i16_bea
+#define gen_helper_ldlink_i32a gen_helper_ldlink_i32_bea
+#define gen_helper_ldlink_i64a gen_helper_ldlink_i64_bea
+#else
+#define gen_helper_stcond_i16 gen_helper_stcond_i16_le
+#define gen_helper_stcond_i32 gen_helper_stcond_i32_le
+#define gen_helper_stcond_i64 gen_helper_stcond_i64_le
+#define gen_helper_ldlink_i16 gen_helper_ldlink_i16_le
+#define gen_helper_ldlink_i32 gen_helper_ldlink_i32_le
+#define gen_helper_ldlink_i64 gen_helper_ldlink_i64_le
+#define gen_helper_stcond_i16a gen_helper_stcond_i16_lea
+#define gen_helper_stcond_i32a gen_helper_stcond_i32_lea
+#define gen_helper_stcond_i64a gen_helper_stcond_i64_lea
+#define gen_helper_ldlink_i16a gen_helper_ldlink_i16_lea
+#define gen_helper_ldlink_i32a gen_helper_ldlink_i32_lea
+#define gen_helper_ldlink_i64a gen_helper_ldlink_i64_lea
+#endif
+
+#undef TYPE
--
2.8.0
* [Qemu-devel] [RFC v8 12/14] target-arm: translate: Use ld/st excl for atomic insns
2016-04-19 13:39 [Qemu-devel] [RFC v8 00/14] Slow-path for atomic instruction translation Alvise Rigo
` (10 preceding siblings ...)
2016-04-19 13:39 ` [Qemu-devel] [RFC v8 11/14] tcg: Create new runtime helpers for excl accesses Alvise Rigo
@ 2016-04-19 13:39 ` Alvise Rigo
2016-04-19 13:39 ` [Qemu-devel] [RFC v8 13/14] target-arm: cpu64: use custom set_excl hook Alvise Rigo
` (2 subsequent siblings)
14 siblings, 0 replies; 18+ messages in thread
From: Alvise Rigo @ 2016-04-19 13:39 UTC (permalink / raw)
To: qemu-devel, mttcg
Cc: jani.kokkonen, claudio.fontana, tech, alex.bennee, pbonzini, rth,
serge.fdrv, Alvise Rigo, Peter Maydell, open list:ARM
Use the new LL/SC runtime helpers, backed by softmmu_llsc_template.h, to
handle the ARM atomic instructions.
In general, each helper generator
gen_{ldrex,strex}_{8,16a,32a,64a}() calls the corresponding
helper_{le,be}_{ldlink,stcond}{ub,uw,ul,q}_mmu() function implemented in
softmmu_llsc_template.h, which performs an alignment check.
In addition, add a simple helper function to emulate the CLREX instruction.
Suggested-by: Jani Kokkonen <jani.kokkonen@huawei.com>
Suggested-by: Claudio Fontana <claudio.fontana@huawei.com>
Signed-off-by: Alvise Rigo <a.rigo@virtualopensystems.com>
---
target-arm/cpu.h | 3 +
target-arm/helper.h | 2 +
target-arm/machine.c | 7 ++
target-arm/op_helper.c | 14 ++-
target-arm/translate.c | 258 ++++++++++++++++++++++++++++---------------------
5 files changed, 174 insertions(+), 110 deletions(-)
diff --git a/target-arm/cpu.h b/target-arm/cpu.h
index b8b3364..46ab87f 100644
--- a/target-arm/cpu.h
+++ b/target-arm/cpu.h
@@ -462,6 +462,9 @@ typedef struct CPUARMState {
float_status fp_status;
float_status standard_fp_status;
} vfp;
+ /* Even if we don't use these values anymore, we still keep them for
+ * backward compatibility in case of migration to QEMU versions without
+ * the LoadLink/StoreExclusive backend. */
uint64_t exclusive_addr;
uint64_t exclusive_val;
uint64_t exclusive_high;
diff --git a/target-arm/helper.h b/target-arm/helper.h
index c2a85c7..37cec49 100644
--- a/target-arm/helper.h
+++ b/target-arm/helper.h
@@ -532,6 +532,8 @@ DEF_HELPER_2(dc_zva, void, env, i64)
DEF_HELPER_FLAGS_2(neon_pmull_64_lo, TCG_CALL_NO_RWG_SE, i64, i64, i64)
DEF_HELPER_FLAGS_2(neon_pmull_64_hi, TCG_CALL_NO_RWG_SE, i64, i64, i64)
+DEF_HELPER_1(atomic_clear, void, env)
+
#ifdef TARGET_AARCH64
#include "helper-a64.h"
#endif
diff --git a/target-arm/machine.c b/target-arm/machine.c
index ed1925a..9660163 100644
--- a/target-arm/machine.c
+++ b/target-arm/machine.c
@@ -203,6 +203,7 @@ static const VMStateInfo vmstate_cpsr = {
static void cpu_pre_save(void *opaque)
{
ARMCPU *cpu = opaque;
+ CPUARMState *env = &cpu->env;
if (kvm_enabled()) {
if (!write_kvmstate_to_list(cpu)) {
@@ -221,6 +222,12 @@ static void cpu_pre_save(void *opaque)
cpu->cpreg_array_len * sizeof(uint64_t));
memcpy(cpu->cpreg_vmstate_values, cpu->cpreg_values,
cpu->cpreg_array_len * sizeof(uint64_t));
+
+ /* Ensure that the next STREX fails on versions of QEMU with the
+ * old backend. */
+ env->exclusive_addr = -1;
+ env->exclusive_val = -1;
+ env->exclusive_high = -1;
}
static int cpu_post_load(void *opaque, int version_id)
diff --git a/target-arm/op_helper.c b/target-arm/op_helper.c
index a5ee65f..3ae0b6a 100644
--- a/target-arm/op_helper.c
+++ b/target-arm/op_helper.c
@@ -29,11 +29,13 @@ static void raise_exception(CPUARMState *env, uint32_t excp,
uint32_t syndrome, uint32_t target_el)
{
CPUState *cs = CPU(arm_env_get_cpu(env));
+ CPUClass *cc = CPU_GET_CLASS(cs);
assert(!excp_is_internal(excp));
cs->exception_index = excp;
env->exception.syndrome = syndrome;
env->exception.target_el = target_el;
+ cc->cpu_reset_excl_context(cs);
cpu_loop_exit(cs);
}
@@ -51,6 +53,14 @@ static int exception_target_el(CPUARMState *env)
return target_el;
}
+void HELPER(atomic_clear)(CPUARMState *env)
+{
+ CPUState *cs = ENV_GET_CPU(env);
+ CPUClass *cc = CPU_GET_CLASS(cs);
+
+ cc->cpu_reset_excl_context(cs);
+}
+
uint32_t HELPER(neon_tbl)(CPUARMState *env, uint32_t ireg, uint32_t def,
uint32_t rn, uint32_t maxindex)
{
@@ -681,6 +691,8 @@ static int el_from_spsr(uint32_t spsr)
void HELPER(exception_return)(CPUARMState *env)
{
+ CPUState *cs = ENV_GET_CPU(env);
+ CPUClass *cc = CPU_GET_CLASS(cs);
int cur_el = arm_current_el(env);
unsigned int spsr_idx = aarch64_banked_spsr_index(cur_el);
uint32_t spsr = env->banked_spsr[spsr_idx];
@@ -689,7 +701,7 @@ void HELPER(exception_return)(CPUARMState *env)
aarch64_save_sp(env, cur_el);
- env->exclusive_addr = -1;
+ cc->cpu_reset_excl_context(cs);
/* We must squash the PSTATE.SS bit to zero unless both of the
* following hold:
diff --git a/target-arm/translate.c b/target-arm/translate.c
index cff511b..9c2b197 100644
--- a/target-arm/translate.c
+++ b/target-arm/translate.c
@@ -60,6 +60,7 @@ TCGv_ptr cpu_env;
static TCGv_i64 cpu_V0, cpu_V1, cpu_M0;
static TCGv_i32 cpu_R[16];
TCGv_i32 cpu_CF, cpu_NF, cpu_VF, cpu_ZF;
+/* The following two variables are still used by the aarch64 front-end */
TCGv_i64 cpu_exclusive_addr;
TCGv_i64 cpu_exclusive_val;
#ifdef CONFIG_USER_ONLY
@@ -7413,57 +7414,139 @@ static void gen_logicq_cc(TCGv_i32 lo, TCGv_i32 hi)
tcg_gen_or_i32(cpu_ZF, lo, hi);
}
-/* Load/Store exclusive instructions are implemented by remembering
- the value/address loaded, and seeing if these are the same
- when the store is performed. This should be sufficient to implement
- the architecturally mandated semantics, and avoids having to monitor
- regular stores.
+/* If the softmmu is enabled, the translation of Load/Store exclusive
+ instructions will rely on the gen_helper_{ldlink,stcond} helpers,
+ offloading most of the work to the softmmu_llsc_template.h functions.
+ All the accesses made by the exclusive instructions include an
+ alignment check.
+
+ In user emulation mode we throw an exception and handle the atomic
+ operation elsewhere. */
+
+#if TARGET_LONG_BITS == 32
+#define DO_GEN_LDREX(SUFF) \
+static inline void gen_ldrex_##SUFF(TCGv_i32 dst, TCGv_i32 addr, \
+ TCGv_i32 index) \
+{ \
+ gen_helper_ldlink_##SUFF(dst, cpu_env, addr, index); \
+}
+
+#define DO_GEN_STREX(SUFF) \
+static inline void gen_strex_##SUFF(TCGv_i32 dst, TCGv_i32 addr, \
+ TCGv_i32 val, TCGv_i32 index) \
+{ \
+ gen_helper_stcond_##SUFF(dst, cpu_env, addr, val, index); \
+}
+
+static inline void gen_ldrex_i64a(TCGv_i64 dst, TCGv_i32 addr, TCGv_i32 index)
+{
+ gen_helper_ldlink_i64a(dst, cpu_env, addr, index);
+}
+
+static inline void gen_strex_i64a(TCGv_i32 dst, TCGv_i32 addr, TCGv_i64 val,
+ TCGv_i32 index)
+{
+
+ gen_helper_stcond_i64a(dst, cpu_env, addr, val, index);
+}
+#else
+#define DO_GEN_LDREX(SUFF) \
+static inline void gen_ldrex_##SUFF(TCGv_i32 dst, TCGv_i32 addr, \
+ TCGv_i32 index) \
+{ \
+ TCGv addr64 = tcg_temp_new(); \
+ tcg_gen_extu_i32_i64(addr64, addr); \
+ gen_helper_ldlink_##SUFF(dst, cpu_env, addr64, index); \
+ tcg_temp_free(addr64); \
+}
+
+#define DO_GEN_STREX(SUFF) \
+static inline void gen_strex_##SUFF(TCGv_i32 dst, TCGv_i32 addr, \
+ TCGv_i32 val, TCGv_i32 index) \
+{ \
+ TCGv addr64 = tcg_temp_new(); \
+ TCGv dst64 = tcg_temp_new(); \
+ tcg_gen_extu_i32_i64(addr64, addr); \
+ gen_helper_stcond_##SUFF(dst64, cpu_env, addr64, val, index); \
+ tcg_gen_extrl_i64_i32(dst, dst64); \
+ tcg_temp_free(dst64); \
+ tcg_temp_free(addr64); \
+}
+
+static inline void gen_ldrex_i64a(TCGv_i64 dst, TCGv_i32 addr, TCGv_i32 index)
+{
+ TCGv addr64 = tcg_temp_new();
+ tcg_gen_extu_i32_i64(addr64, addr);
+ gen_helper_ldlink_i64a(dst, cpu_env, addr64, index);
+ tcg_temp_free(addr64);
+}
+
+static inline void gen_strex_i64a(TCGv_i32 dst, TCGv_i32 addr, TCGv_i64 val,
+ TCGv_i32 index)
+{
+ TCGv addr64 = tcg_temp_new();
+ TCGv dst64 = tcg_temp_new();
+
+ tcg_gen_extu_i32_i64(addr64, addr);
+ gen_helper_stcond_i64a(dst64, cpu_env, addr64, val, index);
+ tcg_gen_extrl_i64_i32(dst, dst64);
+
+ tcg_temp_free(dst64);
+ tcg_temp_free(addr64);
+}
+#endif
+
+DO_GEN_LDREX(i8)
+DO_GEN_LDREX(i16a)
+DO_GEN_LDREX(i32a)
+
+DO_GEN_STREX(i8)
+DO_GEN_STREX(i16a)
+DO_GEN_STREX(i32a)
- In system emulation mode only one CPU will be running at once, so
- this sequence is effectively atomic. In user emulation mode we
- throw an exception and handle the atomic operation elsewhere. */
static void gen_load_exclusive(DisasContext *s, int rt, int rt2,
TCGv_i32 addr, int size)
-{
+ {
TCGv_i32 tmp = tcg_temp_new_i32();
+ TCGv_i32 mem_idx = tcg_temp_new_i32();
- s->is_ldex = true;
-
- switch (size) {
- case 0:
- gen_aa32_ld8u(tmp, addr, get_mem_index(s));
- break;
- case 1:
- gen_aa32_ld16ua(tmp, addr, get_mem_index(s));
- break;
- case 2:
- case 3:
- gen_aa32_ld32ua(tmp, addr, get_mem_index(s));
- break;
- default:
- abort();
- }
+ tcg_gen_movi_i32(mem_idx, get_mem_index(s));
- if (size == 3) {
- TCGv_i32 tmp2 = tcg_temp_new_i32();
- TCGv_i32 tmp3 = tcg_temp_new_i32();
+ if (size != 3) {
+ switch (size) {
+ case 0:
+ gen_ldrex_i8(tmp, addr, mem_idx);
+ break;
+ case 1:
+ gen_ldrex_i16a(tmp, addr, mem_idx);
+ break;
+ case 2:
+ gen_ldrex_i32a(tmp, addr, mem_idx);
+ break;
+ default:
+ abort();
+ }
- tcg_gen_addi_i32(tmp2, addr, 4);
- gen_aa32_ld32u(tmp3, tmp2, get_mem_index(s));
- tcg_temp_free_i32(tmp2);
- tcg_gen_concat_i32_i64(cpu_exclusive_val, tmp, tmp3);
- store_reg(s, rt2, tmp3);
+ store_reg(s, rt, tmp);
} else {
- tcg_gen_extu_i32_i64(cpu_exclusive_val, tmp);
+ TCGv_i64 tmp64 = tcg_temp_new_i64();
+ TCGv_i32 tmph = tcg_temp_new_i32();
+
+ gen_ldrex_i64a(tmp64, addr, mem_idx);
+ tcg_gen_extr_i64_i32(tmp, tmph, tmp64);
+
+ store_reg(s, rt, tmp);
+ store_reg(s, rt2, tmph);
+
+ tcg_temp_free_i64(tmp64);
}
- store_reg(s, rt, tmp);
- tcg_gen_extu_i32_i64(cpu_exclusive_addr, addr);
+ tcg_temp_free_i32(mem_idx);
}
static void gen_clrex(DisasContext *s)
{
- tcg_gen_movi_i64(cpu_exclusive_addr, -1);
+ gen_helper_atomic_clear(cpu_env);
}
#ifdef CONFIG_USER_ONLY
@@ -7479,85 +7562,42 @@ static void gen_store_exclusive(DisasContext *s, int rd, int rt, int rt2,
static void gen_store_exclusive(DisasContext *s, int rd, int rt, int rt2,
TCGv_i32 addr, int size)
{
- TCGv_i32 tmp;
- TCGv_i64 val64, extaddr;
- TCGLabel *done_label;
- TCGLabel *fail_label;
-
- /* if (env->exclusive_addr == addr && env->exclusive_val == [addr]) {
- [addr] = {Rt};
- {Rd} = 0;
- } else {
- {Rd} = 1;
- } */
- fail_label = gen_new_label();
- done_label = gen_new_label();
- extaddr = tcg_temp_new_i64();
- tcg_gen_extu_i32_i64(extaddr, addr);
- tcg_gen_brcond_i64(TCG_COND_NE, extaddr, cpu_exclusive_addr, fail_label);
- tcg_temp_free_i64(extaddr);
+ TCGv_i32 tmp, mem_idx;
- tmp = tcg_temp_new_i32();
- switch (size) {
- case 0:
- gen_aa32_ld8u(tmp, addr, get_mem_index(s));
- break;
- case 1:
- gen_aa32_ld16u(tmp, addr, get_mem_index(s));
- break;
- case 2:
- case 3:
- gen_aa32_ld32u(tmp, addr, get_mem_index(s));
- break;
- default:
- abort();
- }
+ mem_idx = tcg_temp_new_i32();
- val64 = tcg_temp_new_i64();
- if (size == 3) {
- TCGv_i32 tmp2 = tcg_temp_new_i32();
- TCGv_i32 tmp3 = tcg_temp_new_i32();
- tcg_gen_addi_i32(tmp2, addr, 4);
- gen_aa32_ld32u(tmp3, tmp2, get_mem_index(s));
- tcg_temp_free_i32(tmp2);
- tcg_gen_concat_i32_i64(val64, tmp, tmp3);
- tcg_temp_free_i32(tmp3);
+ tcg_gen_movi_i32(mem_idx, get_mem_index(s));
+ tmp = load_reg(s, rt);
+
+ if (size != 3) {
+ switch (size) {
+ case 0:
+ gen_strex_i8(cpu_R[rd], addr, tmp, mem_idx);
+ break;
+ case 1:
+ gen_strex_i16a(cpu_R[rd], addr, tmp, mem_idx);
+ break;
+ case 2:
+ gen_strex_i32a(cpu_R[rd], addr, tmp, mem_idx);
+ break;
+ default:
+ abort();
+ }
} else {
- tcg_gen_extu_i32_i64(val64, tmp);
- }
- tcg_temp_free_i32(tmp);
+ TCGv_i64 tmp64;
+ TCGv_i32 tmp2;
- tcg_gen_brcond_i64(TCG_COND_NE, val64, cpu_exclusive_val, fail_label);
- tcg_temp_free_i64(val64);
+ tmp64 = tcg_temp_new_i64();
+ tmp2 = load_reg(s, rt2);
+ tcg_gen_concat_i32_i64(tmp64, tmp, tmp2);
+ gen_strex_i64a(cpu_R[rd], addr, tmp64, mem_idx);
- tmp = load_reg(s, rt);
- switch (size) {
- case 0:
- gen_aa32_st8(tmp, addr, get_mem_index(s));
- break;
- case 1:
- gen_aa32_st16(tmp, addr, get_mem_index(s));
- break;
- case 2:
- case 3:
- gen_aa32_st32(tmp, addr, get_mem_index(s));
- break;
- default:
- abort();
+ tcg_temp_free_i32(tmp2);
+ tcg_temp_free_i64(tmp64);
}
+
tcg_temp_free_i32(tmp);
- if (size == 3) {
- tcg_gen_addi_i32(addr, addr, 4);
- tmp = load_reg(s, rt2);
- gen_aa32_st32(tmp, addr, get_mem_index(s));
- tcg_temp_free_i32(tmp);
- }
- tcg_gen_movi_i32(cpu_R[rd], 0);
- tcg_gen_br(done_label);
- gen_set_label(fail_label);
- tcg_gen_movi_i32(cpu_R[rd], 1);
- gen_set_label(done_label);
- tcg_gen_movi_i64(cpu_exclusive_addr, -1);
+ tcg_temp_free_i32(mem_idx);
}
#endif
--
2.8.0
* [Qemu-devel] [RFC v8 13/14] target-arm: cpu64: use custom set_excl hook
2016-04-19 13:39 [Qemu-devel] [RFC v8 00/14] Slow-path for atomic instruction translation Alvise Rigo
` (11 preceding siblings ...)
2016-04-19 13:39 ` [Qemu-devel] [RFC v8 12/14] target-arm: translate: Use ld/st excl for atomic insns Alvise Rigo
@ 2016-04-19 13:39 ` Alvise Rigo
2016-04-19 13:39 ` [Qemu-devel] [RFC v8 14/14] target-arm: aarch64: Use ls/st exclusive for atomic insns Alvise Rigo
2016-06-09 11:42 ` [Qemu-devel] [RFC v8 00/14] Slow-path for atomic instruction translation Sergey Fedorov
14 siblings, 0 replies; 18+ messages in thread
From: Alvise Rigo @ 2016-04-19 13:39 UTC (permalink / raw)
To: qemu-devel, mttcg
Cc: jani.kokkonen, claudio.fontana, tech, alex.bennee, pbonzini, rth,
serge.fdrv, Alvise Rigo, Peter Maydell, open list:ARM
In aarch64 the LDXP/STXP instructions allow exclusive accesses of up to
128 bits. However, due to a softmmu limitation, such wide accesses are
not supported.
To work around this limitation, we need to support LoadLink instructions
that protect at least 128 consecutive bits (see the next patch for more
details).
Suggested-by: Jani Kokkonen <jani.kokkonen@huawei.com>
Suggested-by: Claudio Fontana <claudio.fontana@huawei.com>
Signed-off-by: Alvise Rigo <a.rigo@virtualopensystems.com>
---
target-arm/cpu64.c | 8 ++++++++
1 file changed, 8 insertions(+)
diff --git a/target-arm/cpu64.c b/target-arm/cpu64.c
index cc177bb..1d45e66 100644
--- a/target-arm/cpu64.c
+++ b/target-arm/cpu64.c
@@ -287,6 +287,13 @@ static void aarch64_cpu_set_pc(CPUState *cs, vaddr value)
}
}
+static void aarch64_set_excl_range(CPUState *cpu, hwaddr addr, hwaddr size)
+{
+ cpu->excl_protected_range.begin = addr;
+ /* At least cover 128 bits for a STXP access (two paired doublewords case) */
+ cpu->excl_protected_range.end = addr + 16;
+}
+
static void aarch64_cpu_class_init(ObjectClass *oc, void *data)
{
CPUClass *cc = CPU_CLASS(oc);
@@ -297,6 +304,7 @@ static void aarch64_cpu_class_init(ObjectClass *oc, void *data)
cc->gdb_write_register = aarch64_cpu_gdb_write_register;
cc->gdb_num_core_regs = 34;
cc->gdb_core_xml_file = "aarch64-core.xml";
+ cc->cpu_set_excl_protected_range = aarch64_set_excl_range;
}
static void aarch64_cpu_register(const ARMCPUInfo *info)
--
2.8.0
* [Qemu-devel] [RFC v8 14/14] target-arm: aarch64: Use ls/st exclusive for atomic insns
2016-04-19 13:39 [Qemu-devel] [RFC v8 00/14] Slow-path for atomic instruction translation Alvise Rigo
` (12 preceding siblings ...)
2016-04-19 13:39 ` [Qemu-devel] [RFC v8 13/14] target-arm: cpu64: use custom set_excl hook Alvise Rigo
@ 2016-04-19 13:39 ` Alvise Rigo
2016-06-09 11:42 ` [Qemu-devel] [RFC v8 00/14] Slow-path for atomic instruction translation Sergey Fedorov
14 siblings, 0 replies; 18+ messages in thread
From: Alvise Rigo @ 2016-04-19 13:39 UTC (permalink / raw)
To: qemu-devel, mttcg
Cc: jani.kokkonen, claudio.fontana, tech, alex.bennee, pbonzini, rth,
serge.fdrv, Alvise Rigo, Peter Maydell, open list:ARM
Use the new LL/SC runtime helpers, backed by softmmu_llsc_template.h, to
handle the aarch64 atomic instructions.
The STXP emulation required a dedicated helper to handle the paired
doubleword case.
Suggested-by: Jani Kokkonen <jani.kokkonen@huawei.com>
Suggested-by: Claudio Fontana <claudio.fontana@huawei.com>
Signed-off-by: Alvise Rigo <a.rigo@virtualopensystems.com>
---
target-arm/helper-a64.c | 55 +++++++++++++++
target-arm/helper-a64.h | 2 +
target-arm/translate-a64.c | 168 +++++++++++++++++++++++++--------------------
target-arm/translate.c | 7 --
4 files changed, 149 insertions(+), 83 deletions(-)
diff --git a/target-arm/helper-a64.c b/target-arm/helper-a64.c
index c7bfb4d..170c59b 100644
--- a/target-arm/helper-a64.c
+++ b/target-arm/helper-a64.c
@@ -26,6 +26,7 @@
#include "qemu/bitops.h"
#include "internals.h"
#include "qemu/crc32c.h"
+#include "tcg/tcg.h"
#include <zlib.h> /* For crc32 */
/* C2.4.7 Multiply and divide */
@@ -443,3 +444,57 @@ uint64_t HELPER(crc32c_64)(uint64_t acc, uint64_t val, uint32_t bytes)
/* Linux crc32c converts the output to one's complement. */
return crc32c(acc, buf, bytes) ^ 0xffffffff;
}
+
+/* STXP emulation for two 64-bit doublewords. We can't directly use two
+ * stcond_i64 accesses, otherwise the first would conclude the LL/SC pair.
+ * Instead, two normal 64-bit accesses are used and the CPUState is
+ * updated accordingly.
+ *
+ * We do not support paired STXPs to MMIO memory; this will become trivial
+ * once the softmmu supports 128-bit memory accesses.
+ */
+target_ulong HELPER(stxp_i128)(CPUArchState *env, target_ulong addr,
+ uint64_t vall, uint64_t valh,
+ uint32_t mmu_idx)
+{
+ CPUState *cpu = ENV_GET_CPU(env);
+ CPUClass *cc = CPU_GET_CLASS(cpu);
+ TCGMemOpIdx op;
+ target_ulong ret = 0;
+
+ if (!cpu->ll_sc_context) {
+ cpu->excl_succeeded = false;
+ ret = 1;
+ goto out;
+ }
+
+ op = make_memop_idx(MO_BEQ, mmu_idx);
+
+ /* According to section C6.6.191 of ARM ARM DDI 0487A.h, the access has
+ * to be quadword aligned. */
+ if (addr & 0xf) {
+ /* TODO: Do unaligned access */
+ qemu_log_mask(LOG_UNIMP, "aarch64: silently executing STXP quadword"
+ "unaligned, exception not implemented yet.\n");
+ }
+
+ /* Setting excl_succeeded to true will make the store exclusive. */
+ cpu->excl_succeeded = true;
+ helper_ret_stq_mmu(env, addr, vall, op, GETRA());
+
+ if (!cpu->excl_succeeded) {
+ ret = 1;
+ goto out;
+ }
+
+ helper_ret_stq_mmu(env, addr + 8, valh, op, GETRA());
+ if (!cpu->excl_succeeded) {
+ ret = 1;
+ }
+
+out:
+ /* Unset LL/SC context */
+ cc->cpu_reset_excl_context(cpu);
+
+ return ret;
+}
diff --git a/target-arm/helper-a64.h b/target-arm/helper-a64.h
index 1d3d10f..4ecb118 100644
--- a/target-arm/helper-a64.h
+++ b/target-arm/helper-a64.h
@@ -46,3 +46,5 @@ DEF_HELPER_FLAGS_2(frecpx_f32, TCG_CALL_NO_RWG, f32, f32, ptr)
DEF_HELPER_FLAGS_2(fcvtx_f64_to_f32, TCG_CALL_NO_RWG, f32, f64, env)
DEF_HELPER_FLAGS_3(crc32_64, TCG_CALL_NO_RWG_SE, i64, i64, i64, i32)
DEF_HELPER_FLAGS_3(crc32c_64, TCG_CALL_NO_RWG_SE, i64, i64, i64, i32)
+/* STXP helper */
+DEF_HELPER_5(stxp_i128, i64, env, i64, i64, i64, i32)
diff --git a/target-arm/translate-a64.c b/target-arm/translate-a64.c
index 80f6c20..d5f613e 100644
--- a/target-arm/translate-a64.c
+++ b/target-arm/translate-a64.c
@@ -37,9 +37,6 @@
static TCGv_i64 cpu_X[32];
static TCGv_i64 cpu_pc;
-/* Load/store exclusive handling */
-static TCGv_i64 cpu_exclusive_high;
-
static const char *regnames[] = {
"x0", "x1", "x2", "x3", "x4", "x5", "x6", "x7",
"x8", "x9", "x10", "x11", "x12", "x13", "x14", "x15",
@@ -93,9 +90,6 @@ void a64_translate_init(void)
offsetof(CPUARMState, xregs[i]),
regnames[i]);
}
-
- cpu_exclusive_high = tcg_global_mem_new_i64(TCG_AREG0,
- offsetof(CPUARMState, exclusive_high), "exclusive_high");
}
static inline ARMMMUIdx get_a64_user_mem_index(DisasContext *s)
@@ -1219,7 +1213,7 @@ static void handle_hint(DisasContext *s, uint32_t insn,
static void gen_clrex(DisasContext *s, uint32_t insn)
{
- tcg_gen_movi_i64(cpu_exclusive_addr, -1);
+ gen_helper_atomic_clear(cpu_env);
}
/* CLREX, DSB, DMB, ISB */
@@ -1685,11 +1679,9 @@ static void disas_b_exc_sys(DisasContext *s, uint32_t insn)
}
/*
- * Load/Store exclusive instructions are implemented by remembering
- * the value/address loaded, and seeing if these are the same
- * when the store is performed. This is not actually the architecturally
- * mandated semantics, but it works for typical guest code sequences
- * and avoids having to monitor regular stores.
+ * If the softmmu is enabled, the translation of Load/Store exclusive
+ * instructions will rely on the gen_helper_{ldlink,stcond} helpers,
+ * offloading most of the work to the softmmu_llsc_template.h functions.
*
* In system emulation mode only one CPU will be running at once, so
* this sequence is effectively atomic. In user emulation mode we
@@ -1698,13 +1690,48 @@ static void disas_b_exc_sys(DisasContext *s, uint32_t insn)
static void gen_load_exclusive(DisasContext *s, int rt, int rt2,
TCGv_i64 addr, int size, bool is_pair)
{
- TCGv_i64 tmp = tcg_temp_new_i64();
- TCGMemOp memop = MO_TE + size;
+ /* In case @is_pair is set, we have to guarantee that at least the 128 bits
+ * accessed by a Load Exclusive Pair (64-bit variant) are protected. Since
+ * we do not have 128-bit helpers, we split the access into two halves; the
+ * first of them sets the exclusive region to cover at least 128 bits
+ * (this is why aarch64 has a custom cc->cpu_set_excl_protected_range which
+ * covers 128 bits).
+ */
+ TCGv_i32 mem_idx = tcg_temp_new_i32();
+
+ tcg_gen_movi_i32(mem_idx, get_mem_index(s));
g_assert(size <= 3);
- tcg_gen_qemu_ld_i64(tmp, addr, get_mem_index(s), memop);
+
+ if (size < 3) {
+ TCGv_i32 tmp = tcg_temp_new_i32();
+
+ switch (size) {
+ case 0:
+ gen_helper_ldlink_i8(tmp, cpu_env, addr, mem_idx);
+ break;
+ case 1:
+ gen_helper_ldlink_i16(tmp, cpu_env, addr, mem_idx);
+ break;
+ case 2:
+ gen_helper_ldlink_i32(tmp, cpu_env, addr, mem_idx);
+ break;
+ default:
+ abort();
+ }
+
+ TCGv_i64 tmp64 = tcg_temp_new_i64();
+ tcg_gen_ext_i32_i64(tmp64, tmp);
+ tcg_gen_mov_i64(cpu_reg(s, rt), tmp64);
+
+ tcg_temp_free_i32(tmp);
+ tcg_temp_free_i64(tmp64);
+ } else {
+ gen_helper_ldlink_i64(cpu_reg(s, rt), cpu_env, addr, mem_idx);
+ }
if (is_pair) {
+ TCGMemOp memop = MO_TE + size;
TCGv_i64 addr2 = tcg_temp_new_i64();
TCGv_i64 hitmp = tcg_temp_new_i64();
@@ -1712,16 +1739,11 @@ static void gen_load_exclusive(DisasContext *s, int rt, int rt2,
tcg_gen_addi_i64(addr2, addr, 1 << size);
tcg_gen_qemu_ld_i64(hitmp, addr2, get_mem_index(s), memop);
tcg_temp_free_i64(addr2);
- tcg_gen_mov_i64(cpu_exclusive_high, hitmp);
tcg_gen_mov_i64(cpu_reg(s, rt2), hitmp);
tcg_temp_free_i64(hitmp);
}
- tcg_gen_mov_i64(cpu_exclusive_val, tmp);
- tcg_gen_mov_i64(cpu_reg(s, rt), tmp);
-
- tcg_temp_free_i64(tmp);
- tcg_gen_mov_i64(cpu_exclusive_addr, addr);
+ tcg_temp_free_i32(mem_idx);
}
#ifdef CONFIG_USER_ONLY
@@ -1735,68 +1757,62 @@ static void gen_store_exclusive(DisasContext *s, int rd, int rt, int rt2,
}
#else
static void gen_store_exclusive(DisasContext *s, int rd, int rt, int rt2,
- TCGv_i64 inaddr, int size, int is_pair)
-{
- /* if (env->exclusive_addr == addr && env->exclusive_val == [addr]
- * && (!is_pair || env->exclusive_high == [addr + datasize])) {
- * [addr] = {Rt};
- * if (is_pair) {
- * [addr + datasize] = {Rt2};
- * }
- * {Rd} = 0;
- * } else {
- * {Rd} = 1;
- * }
- * env->exclusive_addr = -1;
- */
- TCGLabel *fail_label = gen_new_label();
- TCGLabel *done_label = gen_new_label();
- TCGv_i64 addr = tcg_temp_local_new_i64();
- TCGv_i64 tmp;
-
- /* Copy input into a local temp so it is not trashed when the
- * basic block ends at the branch insn.
- */
- tcg_gen_mov_i64(addr, inaddr);
- tcg_gen_brcond_i64(TCG_COND_NE, addr, cpu_exclusive_addr, fail_label);
+ TCGv_i64 addr, int size, int is_pair)
+{
+ /* Don't bother to check whether we are actually in an exclusive context:
+ * the helpers take care of it. */
+ TCGv_i32 mem_idx = tcg_temp_new_i32();
- tmp = tcg_temp_new_i64();
- tcg_gen_qemu_ld_i64(tmp, addr, get_mem_index(s), MO_TE + size);
- tcg_gen_brcond_i64(TCG_COND_NE, tmp, cpu_exclusive_val, fail_label);
- tcg_temp_free_i64(tmp);
+ tcg_gen_movi_i32(mem_idx, get_mem_index(s));
+ g_assert(size <= 3);
if (is_pair) {
- TCGv_i64 addrhi = tcg_temp_new_i64();
- TCGv_i64 tmphi = tcg_temp_new_i64();
-
- tcg_gen_addi_i64(addrhi, addr, 1 << size);
- tcg_gen_qemu_ld_i64(tmphi, addrhi, get_mem_index(s), MO_TE + size);
- tcg_gen_brcond_i64(TCG_COND_NE, tmphi, cpu_exclusive_high, fail_label);
-
- tcg_temp_free_i64(tmphi);
- tcg_temp_free_i64(addrhi);
- }
+ if (size == 3) {
+ gen_helper_stxp_i128(cpu_reg(s, rd), cpu_env, addr, cpu_reg(s, rt),
+ cpu_reg(s, rt2), mem_idx);
+ } else if (size == 2) {
+ /* Paired single word case. After merging the two registers into
+ * one, we use one stcond_i64 to store the value to memory. */
+ TCGv_i64 val = tcg_temp_new_i64();
+ TCGv_i64 valh = tcg_temp_new_i64();
+ tcg_gen_shli_i64(valh, cpu_reg(s, rt2), 32);
+ tcg_gen_and_i64(val, valh, cpu_reg(s, rt));
+ gen_helper_stcond_i64(cpu_reg(s, rd), cpu_env, addr, val, mem_idx);
+ tcg_temp_free_i64(valh);
+ tcg_temp_free_i64(val);
+ } else {
+ abort();
+ }
+ } else {
+ if (size < 3) {
+ TCGv_i32 val = tcg_temp_new_i32();
- /* We seem to still have the exclusive monitor, so do the store */
- tcg_gen_qemu_st_i64(cpu_reg(s, rt), addr, get_mem_index(s), MO_TE + size);
- if (is_pair) {
- TCGv_i64 addrhi = tcg_temp_new_i64();
+ tcg_gen_extrl_i64_i32(val, cpu_reg(s, rt));
- tcg_gen_addi_i64(addrhi, addr, 1 << size);
- tcg_gen_qemu_st_i64(cpu_reg(s, rt2), addrhi,
- get_mem_index(s), MO_TE + size);
- tcg_temp_free_i64(addrhi);
+ switch (size) {
+ case 0:
+ gen_helper_stcond_i8(cpu_reg(s, rd), cpu_env, addr, val,
+ mem_idx);
+ break;
+ case 1:
+ gen_helper_stcond_i16(cpu_reg(s, rd), cpu_env, addr, val,
+ mem_idx);
+ break;
+ case 2:
+ gen_helper_stcond_i32(cpu_reg(s, rd), cpu_env, addr, val,
+ mem_idx);
+ break;
+ default:
+ abort();
+ }
+ tcg_temp_free_i32(val);
+ } else {
+ gen_helper_stcond_i64(cpu_reg(s, rd), cpu_env, addr, cpu_reg(s, rt),
+ mem_idx);
+ }
}
- tcg_temp_free_i64(addr);
-
- tcg_gen_movi_i64(cpu_reg(s, rd), 0);
- tcg_gen_br(done_label);
- gen_set_label(fail_label);
- tcg_gen_movi_i64(cpu_reg(s, rd), 1);
- gen_set_label(done_label);
- tcg_gen_movi_i64(cpu_exclusive_addr, -1);
-
+ tcg_temp_free_i32(mem_idx);
}
#endif
diff --git a/target-arm/translate.c b/target-arm/translate.c
index 9c2b197..6f930ef 100644
--- a/target-arm/translate.c
+++ b/target-arm/translate.c
@@ -60,9 +60,6 @@ TCGv_ptr cpu_env;
static TCGv_i64 cpu_V0, cpu_V1, cpu_M0;
static TCGv_i32 cpu_R[16];
TCGv_i32 cpu_CF, cpu_NF, cpu_VF, cpu_ZF;
-/* The following two variables are still used by the aarch64 front-end */
-TCGv_i64 cpu_exclusive_addr;
-TCGv_i64 cpu_exclusive_val;
#ifdef CONFIG_USER_ONLY
TCGv_i64 cpu_exclusive_test;
TCGv_i32 cpu_exclusive_info;
@@ -95,10 +92,6 @@ void arm_translate_init(void)
cpu_VF = tcg_global_mem_new_i32(TCG_AREG0, offsetof(CPUARMState, VF), "VF");
cpu_ZF = tcg_global_mem_new_i32(TCG_AREG0, offsetof(CPUARMState, ZF), "ZF");
- cpu_exclusive_addr = tcg_global_mem_new_i64(TCG_AREG0,
- offsetof(CPUARMState, exclusive_addr), "exclusive_addr");
- cpu_exclusive_val = tcg_global_mem_new_i64(TCG_AREG0,
- offsetof(CPUARMState, exclusive_val), "exclusive_val");
#ifdef CONFIG_USER_ONLY
cpu_exclusive_test = tcg_global_mem_new_i64(TCG_AREG0,
offsetof(CPUARMState, exclusive_test), "exclusive_test");
--
2.8.0
* Re: [Qemu-devel] [RFC v8 00/14] Slow-path for atomic instruction translation
2016-04-19 13:39 [Qemu-devel] [RFC v8 00/14] Slow-path for atomic instruction translation Alvise Rigo
` (13 preceding siblings ...)
2016-04-19 13:39 ` [Qemu-devel] [RFC v8 14/14] target-arm: aarch64: Use ls/st exclusive for atomic insns Alvise Rigo
@ 2016-06-09 11:42 ` Sergey Fedorov
2016-06-09 12:35 ` alvise rigo
14 siblings, 1 reply; 18+ messages in thread
From: Sergey Fedorov @ 2016-06-09 11:42 UTC (permalink / raw)
To: Alvise Rigo, qemu-devel, mttcg
Cc: jani.kokkonen, claudio.fontana, tech, alex.bennee, pbonzini, rth
Hi,
On 19/04/16 16:39, Alvise Rigo wrote:
> This patch series provides an infrastructure for atomic instruction
> implementation in QEMU, thus offering a 'legacy' solution for
> translating guest atomic instructions. Moreover, it can be considered as
> a first step toward a multi-thread TCG.
>
> The underlying idea is to provide new TCG helpers (sort of softmmu
> helpers) that guarantee atomicity to some memory accesses or in general
> a way to define memory transactions.
>
> More specifically, the new softmmu helpers behave as LoadLink and
> StoreConditional instructions, and are called from TCG code by means of
> target specific helpers. This work includes the implementation for all
> the ARM atomic instructions, see target-arm/op_helper.c.
I think it is a generally good idea to provide LL/SC TCG operations
for emulating guest atomic instruction behaviour, as those operations
make it easy to implement other atomic primitives such as
compare-and-swap and atomic arithmetic. Another advantage of these
operations is that they are free from the ABA problem.
> The implementation heavily uses the software TLB together with a new
> bitmap that has been added to the ram_list structure which flags, on a
> per-CPU basis, all the memory pages that are in the middle of a LoadLink
> (LL), StoreConditional (SC) operation. Since all these pages can be
> accessed directly through the fast-path and alter a vCPU's linked value,
> the new bitmap has been coupled with a new TLB flag for the TLB virtual
> address which forces the slow-path execution for all the accesses to a
> page containing a linked address.
But I'm afraid we've got a scalability problem from using the software
TLB engine this heavily. This approach relies on flushing the TLB of all
CPUs, which is not a cheap operation. That is going to be even more
expensive in the case of MTTCG, as you need to exit the CPU execution
loop in order to avoid deadlocks.
I see you try to mitigate this issue by introducing a history of the N
last pages touched by an exclusive access. That would work fine at
avoiding excessive TLB flushes as long as the current working set of
exclusively accessed pages does not go beyond N. Once we exceed this
limit we'll get a global TLB flush on most LL operations. I'm afraid we
can get a dramatic performance decrease as guest code implements a
finer-grained locking scheme. I would like to emphasise that performance
can degrade sharply and dramatically as soon as the limit gets exceeded.
How could we tackle this problem?
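Just to make the failure mode concrete, here is a rough sketch of the
history check as I understand it (the names and the fixed length are
purely illustrative, not the actual code from the series):

#include <stdbool.h>
#include <stdint.h>

#define EXCL_HISTORY_LEN 8              /* the fixed N */

typedef uint64_t hwaddr;                /* stand-in for QEMU's hwaddr */

static hwaddr excl_history[EXCL_HISTORY_LEN];
static unsigned excl_history_idx;

/* Return true when @page is not in the history, i.e. the caller has to
 * pay for a global TLB flush before the LL can proceed. */
static bool excl_history_put(hwaddr page)
{
    unsigned i;

    for (i = 0; i < EXCL_HISTORY_LEN; i++) {
        if (excl_history[i] == page) {
            return false;               /* recently seen: no flush */
        }
    }
    /* Miss: evict the oldest entry and request a flush. */
    excl_history[excl_history_idx] = page;
    excl_history_idx = (excl_history_idx + 1) % EXCL_HISTORY_LEN;
    return true;
}

With a working set of EXCL_HISTORY_LEN + 1 pages accessed round-robin,
every call misses, so every LL ends up paying for a global flush.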
Kind regards,
Sergey
* Re: [Qemu-devel] [RFC v8 00/14] Slow-path for atomic instruction translation
2016-06-09 11:42 ` [Qemu-devel] [RFC v8 00/14] Slow-path for atomic instruction translation Sergey Fedorov
@ 2016-06-09 12:35 ` alvise rigo
2016-06-09 12:52 ` Sergey Fedorov
0 siblings, 1 reply; 18+ messages in thread
From: alvise rigo @ 2016-06-09 12:35 UTC (permalink / raw)
To: Sergey Fedorov
Cc: QEMU Developers, MTTCG Devel, Jani Kokkonen, Claudio Fontana,
VirtualOpenSystems Technical Team, Alex Bennée,
Paolo Bonzini, Richard Henderson
Hi Sergey,
Thank you for this precise summary.
On Thu, Jun 9, 2016 at 1:42 PM, Sergey Fedorov <serge.fdrv@gmail.com> wrote:
> Hi,
>
> On 19/04/16 16:39, Alvise Rigo wrote:
>> This patch series provides an infrastructure for atomic instruction
>> implementation in QEMU, thus offering a 'legacy' solution for
>> translating guest atomic instructions. Moreover, it can be considered as
>> a first step toward a multi-thread TCG.
>>
>> The underlying idea is to provide new TCG helpers (sort of softmmu
>> helpers) that guarantee atomicity to some memory accesses or in general
>> a way to define memory transactions.
>>
>> More specifically, the new softmmu helpers behave as LoadLink and
>> StoreConditional instructions, and are called from TCG code by means of
>> target specific helpers. This work includes the implementation for all
>> the ARM atomic instructions, see target-arm/op_helper.c.
>
> I think it is a generally good idea to provide LL/SC TCG operations
> for emulating guest atomic instruction behaviour, as those operations
> make it easy to implement other atomic primitives such as
> compare-and-swap and atomic arithmetic. Another advantage of these
> operations is that they are free from the ABA problem.
>
>> The implementation heavily uses the software TLB together with a new
>> bitmap that has been added to the ram_list structure which flags, on a
>> per-CPU basis, all the memory pages that are in the middle of a LoadLink
>> (LL), StoreConditional (SC) operation. Since all these pages can be
>> accessed directly through the fast-path and alter a vCPU's linked value,
>> the new bitmap has been coupled with a new TLB flag for the TLB virtual
>> address which forces the slow-path execution for all the accesses to a
>> page containing a linked address.
>
> But I'm afraid we've got a scalability problem from using the software
> TLB engine this heavily. This approach relies on flushing the TLB of all
> CPUs, which is not a cheap operation. That is going to be even more
> expensive in the case of MTTCG, as you need to exit the CPU execution
> loop in order to avoid deadlocks.
>
> I see you try to mitigate this issue by introducing a history of the N
> last pages touched by an exclusive access. That would work fine at
> avoiding excessive TLB flushes as long as the current working set of
> exclusively accessed pages does not go beyond N. Once we exceed this
> limit we'll get a global TLB flush on most LL operations.
Indeed, if the guest does a loop of N+1 atomic operations, at each
iteration we will have N flushes.
> I'm afraid we can get a dramatic performance decrease as guest code
> implements a finer-grained locking scheme. I would like to emphasise
> that performance can degrade sharply and dramatically as soon as the
> limit gets exceeded. How could we tackle this problem?
In my opinion, the length of the history should not be fixed, to avoid
the drawback described above. We can make the history's length dynamic
(up to a given threshold) according to the pressure of atomic
instructions. What should remain roughly constant is the time it takes
to complete one cycle of the history's array. We could, for instance,
store in the lower bits of the addresses in the history a sort of
timestamp, use it to calculate the cycle period, and adjust the length
of the history accordingly. What do you think?
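Just to make the idea a bit more concrete, a very rough sketch (all
names and numbers are made up; for clarity the timestamp lives in a
separate variable instead of being packed into the low bits of the
stored addresses):

#include <stdint.h>

#define HISTORY_MIN_LEN 8
#define HISTORY_MAX_LEN 128
#define CYCLE_MIN_NS    1000000         /* grow if a cycle takes < 1 ms */

typedef uint64_t hwaddr;                /* stand-in for QEMU's hwaddr */

static hwaddr history[HISTORY_MAX_LEN];
static unsigned history_len = HISTORY_MIN_LEN;
static unsigned history_idx;
static int64_t cycle_start_ns;

/* Record a page and, every time the array wraps around, check how long
 * the cycle took: a short cycle means heavy atomic pressure, so we
 * remember more pages (up to a threshold) instead of flushing more. */
static void history_put(hwaddr page, int64_t now_ns)
{
    history[history_idx++] = page;      /* page is TARGET_PAGE_MASK aligned */

    if (history_idx == history_len) {
        if (now_ns - cycle_start_ns < CYCLE_MIN_NS &&
            history_len < HISTORY_MAX_LEN) {
            history_len *= 2;
        }
        history_idx = 0;
        cycle_start_ns = now_ns;
    }
}

The shrink path and the exact threshold would of course need more
thought.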
I will try to explore other ways to tackle the problem.
Best regards,
alvise
>
> Kind regards,
> Sergey
* Re: [Qemu-devel] [RFC v8 00/14] Slow-path for atomic instruction translation
2016-06-09 12:35 ` alvise rigo
@ 2016-06-09 12:52 ` Sergey Fedorov
0 siblings, 0 replies; 18+ messages in thread
From: Sergey Fedorov @ 2016-06-09 12:52 UTC (permalink / raw)
To: alvise rigo
Cc: QEMU Developers, MTTCG Devel, Jani Kokkonen, Claudio Fontana,
VirtualOpenSystems Technical Team, Alex Bennée,
Paolo Bonzini, Richard Henderson
On 09/06/16 15:35, alvise rigo wrote:
> On Thu, Jun 9, 2016 at 1:42 PM, Sergey Fedorov <serge.fdrv@gmail.com> wrote:
>> On 19/04/16 16:39, Alvise Rigo wrote:
>>> The implementation heavily uses the software TLB together with a new
>>> bitmap that has been added to the ram_list structure which flags, on a
>>> per-CPU basis, all the memory pages that are in the middle of a LoadLink
>>> (LL), StoreConditional (SC) operation. Since all these pages can be
>>> accessed directly through the fast-path and alter a vCPU's linked value,
>>> the new bitmap has been coupled with a new TLB flag for the TLB virtual
>>> address which forces the slow-path execution for all the accesses to a
>>> page containing a linked address.
>> But I'm afraid we've got a scalability problem from using the software
>> TLB engine this heavily. This approach relies on flushing the TLB of all
>> CPUs, which is not a cheap operation. That is going to be even more
>> expensive in the case of MTTCG, as you need to exit the CPU execution
>> loop in order to avoid deadlocks.
>>
>> I see you try to mitigate this issue by introducing a history of the N
>> last pages touched by an exclusive access. That would work fine at
>> avoiding excessive TLB flushes as long as the current working set of
>> exclusively accessed pages does not go beyond N. Once we exceed this
>> limit we'll get a global TLB flush on most LL operations.
> Indeed, if the guest does a loop of N+1 atomic operations, at each
> iteration we will have N flushes.
>
>> I'm afraid we can get a dramatic performance decrease as guest code
>> implements a finer-grained locking scheme. I would like to emphasise
>> that performance can degrade sharply and dramatically as soon as the
>> limit gets exceeded. How could we tackle this problem?
> In my opinion, the length of the history should not be fixed, to avoid
> the drawback described above. We can make the history's length dynamic
> (up to a given threshold) according to the pressure of atomic
> instructions. What should remain roughly constant is the time it takes
> to complete one cycle of the history's array. We could, for instance,
> store in the lower bits of the addresses in the history a sort of
> timestamp, use it to calculate the cycle period, and adjust the length
> of the history accordingly. What do you think?
It really depends on what algorithm we'll introduce for dynamic history
length. I'm afraid it could complicate things and introduce its own
overhead. I'm also going to look at Emilio's approach
http://thread.gmane.org/gmane.comp.emulators.qemu/335297.
Kind regards,
Sergey