* [Qemu-devel] [RFC v4 0/9] Slow-path for atomic instruction translation
From: Alvise Rigo @ 2015-08-07 17:03 UTC
To: qemu-devel, mttcg
Cc: alex.bennee, jani.kokkonen, tech, claudio.fontana, pbonzini
This is the fourth iteration of the patch series which applies to the
upstream branch of QEMU (v2.4.0-rc0).
Changes versus previous versions are at the bottom of this cover letter.
The code is also available at following repository:
https://git.virtualopensystems.com/dev/qemu-mt.git
branch:
slowpath-for-atomic-v4-no-mttcg
(To enable the new slow path configure QEMU with:
./configure --enable-tcg-ldst-excl ...)
This patch series provides an infrastructure for implementing atomic
instructions in QEMU, offering a 'legacy' solution for translating guest
atomic instructions. It can also be considered a first step toward a
multi-threaded TCG.
The underlying idea is to provide new TCG instructions that guarantee
atomicity for certain memory accesses or, more generally, a way to define
memory transactions. More specifically, a new version of the TCG
qemu_{ld,st}_{i32,i64} instructions has been implemented that behaves as
a LoadLink/StoreConditional pair of atomic instructions. The
implementation relies heavily on the software TLB together with a new
bitmap added to the ram_list structure, which flags, on a per-vCPU basis,
all the memory pages that are in the middle of a LoadLink
(LL)/StoreConditional (SC) operation.
Since all these pages could otherwise be accessed directly through the
fast-path and alter a vCPU's linked value, the new bitmap has been
coupled with a new TLB flag on the TLB virtual address which forces
slow-path execution for all the accesses to a page containing a linked
address.
The new slow-path is implemented such that:
- the LL behaves as a normal load slow-path, except that it also clears
the dirty flag in the bitmap. While generating a TLB entry, the
cputlb.c code checks whether at least one vCPU has the bit cleared in
the exclusive bitmap; in that case the TLB entry will have the EXCL
flag set, thus forcing the slow-path. In order to ensure that all the
vCPUs follow the slow-path for that page, we flush the TLB cache of
all the other vCPUs.
The LL also records the linked address and the size of the access in a
vCPU's private variable. After the corresponding SC, this address is
set back to a reset value.
- the SC can fail, returning 1, or succeed, returning 0. It always has
to come after an LL and has to access the same address 'linked' by the
previous LL, otherwise it fails. If, in the time window delimited by a
valid pair of LL/SC operations, another write access hits the linked
address, the SC will fail. A standalone toy model of this bookkeeping
is sketched right below.
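For illustration, here is a minimal standalone toy model of this LL/SC
bookkeeping (plain C, not QEMU code; every name, size and the two-vCPU
setup below are invented for the example). It keeps a per-page, per-vCPU
dirty flag and a per-vCPU linked address, mirroring the roles of the
exclusive bitmap and of excl_protected_range in the patches; the TLB
machinery that forces real guest stores onto the slow path is left out.

/* llsc_toy.c - standalone sketch of the LL/SC bookkeeping described above.
 * Not QEMU code: every name and constant here is made up for illustration. */
#include <stdint.h>
#include <stdio.h>

#define NCPUS      2
#define NPAGES     4
#define PAGE_SHIFT 4                          /* tiny 16-byte pages */
#define RESET_ADDR UINT32_MAX                 /* role of EXCLUSIVE_RESET_ADDR */

static uint32_t mem[(NPAGES << PAGE_SHIFT) / 4];
static int excl_dirty[NPAGES][NCPUS];         /* 1 = dirty, 0 = exclusive-clean */
static uint32_t linked[NCPUS];                /* per-vCPU protected address */

static uint32_t load_link(int cpu, uint32_t addr)
{
    /* Page becomes exclusive-clean for this vCPU; in QEMU this is what
     * makes cputlb set TLB_EXCL for the page. */
    excl_dirty[addr >> PAGE_SHIFT][cpu] = 0;
    linked[cpu] = addr;                       /* remember the protected address */
    return mem[addr / 4];
}

static void store(int cpu, uint32_t addr, uint32_t val)
{
    int c;

    mem[addr / 4] = val;
    for (c = 0; c < NCPUS; c++) {             /* any store breaks matching links */
        if (linked[c] == addr) {
            linked[c] = RESET_ADDR;
        }
    }
    excl_dirty[addr >> PAGE_SHIFT][cpu] = 1;  /* page is dirty again for this vCPU */
}

static int store_cond(int cpu, uint32_t addr, uint32_t val)
{
    if (linked[cpu] != addr) {
        return 1;                             /* link lost or never taken: fail */
    }
    store(cpu, addr, val);                    /* the store also clears the links */
    return 0;                                 /* success */
}

int main(void)
{
    int p, c;
    uint32_t v;

    for (p = 0; p < NPAGES; p++) {            /* whole memory starts dirty */
        for (c = 0; c < NCPUS; c++) {
            excl_dirty[p][c] = 1;
        }
    }
    for (c = 0; c < NCPUS; c++) {
        linked[c] = RESET_ADDR;
    }

    v = load_link(0, 0x10);                   /* vCPU 0 links address 0x10 */
    store(1, 0x10, 7);                        /* vCPU 1 writes the same address */
    printf("SC result: %d (1 means failed)\n", store_cond(0, 0x10, v + 1));
    return 0;
}

In the real series the SC failure is detected inside the store slow-path
helper itself (patches 2 and 3), because the TLB_EXCL flag guarantees that
every store to a linked page goes through that helper.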
In theory, the provided implementation of TCG LoadLink/StoreConditional
can be used to properly handle atomic instructions on any architecture.
In this series the ARM ldrex/strex instructions (all flavours) are
implemented for arm, aarch64 and i386 hosts.
The code has been tested with bare-metal test cases and by booting Linux.
* Performance considerations
The new slow-path adds some overhead to the translation of the ARM
atomic instructions, since their emulation no longer happens purely in
guest-side TCG-generated code but requires the execution of two helper
functions. Despite this, the additional time required to boot an ARM
Linux kernel on an i7 clocked at 2.5 GHz is negligible.
On an LL/SC-bound test scenario, however - like
https://git.virtualopensystems.com/dev/tcg_baremetal_tests.git - this
solution requires 30% (1 million iterations) to 70% (10 million
iterations) additional time for the test to complete.
Changes from v3:
- based on upstream QEMU
- addressed comments from Alex Bennée
- the slow path can be enabled by the user with:
./configure --enable-tcg-ldst-excl only if the backend supports it
- all the ARM ldrex/strex instructions now make use of the slow path
- added aarch64 TCG backend support
- part of the code has been rewritten
Changes from v2:
- the bitmap accessors are now atomic
- a rendezvous between vCPUs and simple callback support before executing
a TB have been added to handle the TLB flushes
- softmmu_template and softmmu_llsc_template have been adapted to work
with real multi-threading
Changes from v1:
- The ram bitmap is not reversed anymore, 1 = dirty, 0 = exclusive
- The way the offset used to access the bitmap is calculated has been
improved and fixed
- Setting a page as dirty now requires a vCPU to target the protected
address itself, not just any address in the page
- Addressed comments from Richard Henderson to improve the logic in
softmmu_template.h and to simplify the methods generation through
softmmu_llsc_template.h
- Added initial implementation of qemu_{ldlink,stcond}_i32 for tcg/i386
This work has been sponsored by Huawei Technologies Duesseldorf GmbH.
Alvise Rigo (9):
exec.c: Add new exclusive bitmap to ram_list
softmmu: Add new TLB_EXCL flag
softmmu: Add helpers for a new slowpath
tcg-op: create new TCG qemu_{ld,st} excl variants
configure: Enable/disable new qemu_{ld,st} excl insns
tcg-i386: Implement excl variants of qemu_{ld,st}
tcg-arm: Implement excl variants of qemu_{ld,st}
tcg-aarch64: Implement excl variants of qemu_{ld,st}
target-arm: translate: Use ld/st excl for atomic insns
configure | 21 ++++++
cputlb.c | 42 ++++++++++-
exec.c | 7 +-
include/exec/cpu-all.h | 8 ++
include/exec/cpu-defs.h | 12 +++
include/exec/memory.h | 3 +-
include/exec/ram_addr.h | 68 +++++++++++++++++
softmmu_llsc_template.h | 124 ++++++++++++++++++++++++++++++
softmmu_template.h | 124 ++++++++++++++++++++++++------
target-arm/translate.c | 191 +++++++++++++++++++++++++++++++++++++++++++++--
tcg/aarch64/tcg-target.c | 99 ++++++++++++++++++++++--
tcg/arm/tcg-target.c | 152 +++++++++++++++++++++++++++++--------
tcg/i386/tcg-target.c | 148 +++++++++++++++++++++++++++++++-----
tcg/tcg-be-ldst.h | 1 +
tcg/tcg-op.c | 65 ++++++++++++++++
tcg/tcg-op.h | 4 +
tcg/tcg-opc.h | 8 ++
tcg/tcg.c | 2 +
tcg/tcg.h | 32 ++++++++
19 files changed, 1018 insertions(+), 93 deletions(-)
create mode 100644 softmmu_llsc_template.h
--
2.5.0
* [Qemu-devel] [RFC v4 1/9] exec.c: Add new exclusive bitmap to ram_list
From: Alvise Rigo @ 2015-08-07 17:03 UTC
To: qemu-devel, mttcg
Cc: alex.bennee, jani.kokkonen, tech, claudio.fontana, pbonzini
The purpose of this new bitmap is to flag, on a per-vCPU basis, the
memory pages that are in the middle of LL/SC operations (after an LL,
before an SC).
For all these pages, the corresponding TLB entries will be generated
in such a way as to force the slow-path if at least one vCPU has the
bit not set.
When the system starts, the whole memory is dirty (the entire bitmap is
set). After being marked as exclusive-clean, a page is restored to
dirty after the SC.
The accessors to this bitmap are currently not atomic, but they will
have to be in a real multi-threaded TCG.
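As a purely illustrative aside (not part of the patch): the bitmap keeps
smp_cpus bits per RAM page, grouped by page, so the bit for a given
(page, vCPU) pair lives at index smp_cpus * page_index + cpu_index, which
is what the EXCL_BITMAP_GET_OFFSET() accessor below computes before the
callers add cpu_index. The tiny standalone snippet here just prints that
index for made-up values.

/* Standalone illustration of the exclusive-bitmap layout: smp_cpus bits
 * per page, grouped by page. All numbers below are invented. */
#include <stdio.h>

int main(void)
{
    unsigned long smp_cpus = 4;
    unsigned long target_page_bits = 12;               /* 4 KiB pages */
    unsigned long ram_addr = 0x5000;                   /* offset in guest RAM */
    unsigned long cpu_index = 2;

    unsigned long page = ram_addr >> target_page_bits; /* page 5 */
    unsigned long bit = smp_cpus * page + cpu_index;   /* bit 22 */

    printf("page %lu, vCPU %lu -> bit %lu of the exclusive bitmap\n",
           page, cpu_index, bit);
    return 0;
}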
Suggested-by: Jani Kokkonen <jani.kokkonen@huawei.com>
Suggested-by: Claudio Fontana <claudio.fontana@huawei.com>
Signed-off-by: Alvise Rigo <a.rigo@virtualopensystems.com>
---
exec.c | 7 +++--
include/exec/memory.h | 3 ++-
include/exec/ram_addr.h | 68 +++++++++++++++++++++++++++++++++++++++++++++++++
3 files changed, 75 insertions(+), 3 deletions(-)
diff --git a/exec.c b/exec.c
index 7d60e15..f113076 100644
--- a/exec.c
+++ b/exec.c
@@ -1493,11 +1493,14 @@ static ram_addr_t ram_block_add(RAMBlock *new_block, Error **errp)
int i;
/* ram_list.dirty_memory[] is protected by the iothread lock. */
- for (i = 0; i < DIRTY_MEMORY_NUM; i++) {
+ for (i = 0; i < DIRTY_MEMORY_EXCLUSIVE; i++) {
ram_list.dirty_memory[i] =
bitmap_zero_extend(ram_list.dirty_memory[i],
old_ram_size, new_ram_size);
- }
+ }
+ ram_list.dirty_memory[DIRTY_MEMORY_EXCLUSIVE] = bitmap_zero_extend(
+ ram_list.dirty_memory[DIRTY_MEMORY_EXCLUSIVE],
+ old_ram_size * smp_cpus, new_ram_size * smp_cpus);
}
cpu_physical_memory_set_dirty_range(new_block->offset,
new_block->used_length,
diff --git a/include/exec/memory.h b/include/exec/memory.h
index 1394715..a525259 100644
--- a/include/exec/memory.h
+++ b/include/exec/memory.h
@@ -19,7 +19,8 @@
#define DIRTY_MEMORY_VGA 0
#define DIRTY_MEMORY_CODE 1
#define DIRTY_MEMORY_MIGRATION 2
-#define DIRTY_MEMORY_NUM 3 /* num of dirty bits */
+#define DIRTY_MEMORY_EXCLUSIVE 3
+#define DIRTY_MEMORY_NUM 4 /* num of dirty bits */
#include <stdint.h>
#include <stdbool.h>
diff --git a/include/exec/ram_addr.h b/include/exec/ram_addr.h
index c113f21..6b678d6 100644
--- a/include/exec/ram_addr.h
+++ b/include/exec/ram_addr.h
@@ -21,6 +21,7 @@
#ifndef CONFIG_USER_ONLY
#include "hw/xen/xen.h"
+#include "sysemu/sysemu.h"
ram_addr_t qemu_ram_alloc_from_file(ram_addr_t size, MemoryRegion *mr,
bool share, const char *mem_path,
@@ -135,6 +136,10 @@ static inline void cpu_physical_memory_set_dirty_range(ram_addr_t start,
if (unlikely(mask & (1 << DIRTY_MEMORY_CODE))) {
bitmap_set_atomic(d[DIRTY_MEMORY_CODE], page, end - page);
}
+ if (unlikely(mask & (1 << DIRTY_MEMORY_EXCLUSIVE))) {
+ bitmap_set_atomic(d[DIRTY_MEMORY_EXCLUSIVE],
+ page * smp_cpus, (end - page) * smp_cpus);
+ }
xen_modified_memory(start, length);
}
@@ -249,5 +254,68 @@ uint64_t cpu_physical_memory_sync_dirty_bitmap(unsigned long *dest,
return num_dirty;
}
+/* Exclusive bitmap support. */
+#define EXCL_BITMAP_GET_OFFSET(addr) (smp_cpus * (addr >> TARGET_PAGE_BITS))
+static inline void cpu_physical_memory_set_excl_dirty(ram_addr_t addr,
+ uint32_t cpu_index)
+{
+ set_bit_atomic(EXCL_BITMAP_GET_OFFSET(addr) + cpu_index,
+ ram_list.dirty_memory[DIRTY_MEMORY_EXCLUSIVE]);
+}
+
+static inline int cpu_physical_memory_excl_atleast_one_clean(ram_addr_t addr)
+{
+ unsigned long *bitmap = ram_list.dirty_memory[DIRTY_MEMORY_EXCLUSIVE];
+ unsigned long next, end;
+
+ if (likely(smp_cpus <= BITS_PER_LONG)) {
+ unsigned long mask = (1 << smp_cpus) - 1;
+
+ return
+ (mask & (bitmap[BIT_WORD(EXCL_BITMAP_GET_OFFSET(addr))] >>
+ (EXCL_BITMAP_GET_OFFSET(addr) & (BITS_PER_LONG-1)))) != mask;
+ }
+
+ end = BIT_WORD(EXCL_BITMAP_GET_OFFSET(addr)) + smp_cpus;
+ next = find_next_zero_bit(bitmap, end,
+ BIT_WORD(EXCL_BITMAP_GET_OFFSET(addr)));
+
+ return next < end;
+}
+
+static inline int cpu_physical_memory_excl_is_dirty(ram_addr_t addr,
+ unsigned long cpu)
+{
+ unsigned long *bitmap = ram_list.dirty_memory[DIRTY_MEMORY_EXCLUSIVE];
+ unsigned long end, next;
+ uint32_t add;
+
+ assert(cpu <= smp_cpus);
+
+ if (likely(smp_cpus <= BITS_PER_LONG)) {
+ cpu = (cpu == smp_cpus) ? (1 << cpu) - 1 : (1 << cpu);
+
+ return cpu & (bitmap[BIT_WORD(EXCL_BITMAP_GET_OFFSET(addr))] >>
+ (EXCL_BITMAP_GET_OFFSET(addr) & (BITS_PER_LONG-1)));
+ }
+
+ add = (cpu == smp_cpus) ? 0 : 1;
+ end = BIT_WORD(EXCL_BITMAP_GET_OFFSET(addr)) + cpu + add;
+ next = find_next_bit(bitmap, end, BIT_WORD(EXCL_BITMAP_GET_OFFSET(addr))
+ + (cpu % smp_cpus));
+
+ return next < end;
+}
+
+static inline bool cpu_physical_memory_clear_excl_dirty(ram_addr_t addr,
+ uint32_t cpu_index)
+{
+ return bitmap_test_and_clear_atomic(
+ ram_list.dirty_memory[DIRTY_MEMORY_EXCLUSIVE],
+ EXCL_BITMAP_GET_OFFSET(addr) + cpu_index, 1);
+}
+
+
+
#endif
#endif
--
2.5.0
* [Qemu-devel] [RFC v4 2/9] softmmu: Add new TLB_EXCL flag
From: Alvise Rigo @ 2015-08-07 17:03 UTC
To: qemu-devel, mttcg
Cc: alex.bennee, jani.kokkonen, tech, claudio.fontana, pbonzini
Add a new TLB flag to force all the accesses made to a page to follow
the slow-path.
When we remove a TLB entry marked as EXCL, we unset the corresponding
exclusive bit in the bitmap (i.e. the page becomes dirty again).
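For context (an illustrative aside, not part of the patch): TLB flags
such as TLB_EXCL are encoded in the otherwise-unused low bits of the TLB
entry's virtual address, which is why the patch adds a build-time check
that TLB_EXCL stays below TARGET_PAGE_SIZE. The standalone snippet below
(made-up address, simplified check) shows how a set flag diverts a write
to the slow path, mirroring the softmmu_template.h test further down.

/* Standalone illustration of TLB_EXCL riding in the low bits of the TLB
 * entry's address and tripping the fast-path check. Values are invented. */
#include <stdio.h>

#define TARGET_PAGE_BITS 12
#define TARGET_PAGE_MASK (~((1u << TARGET_PAGE_BITS) - 1))
#define TLB_EXCL (1u << 6)                    /* same encoding as the patch */

int main(void)
{
    unsigned int addr_write = 0x0badb000;     /* page-aligned entry, no flags */

    addr_write |= TLB_EXCL;                   /* page holds a linked address */

    if (addr_write & ~TARGET_PAGE_MASK) {     /* the softmmu fast-path test */
        printf("0x%08x -> take the store slow path\n", addr_write);
    }
    return 0;
}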
Suggested-by: Jani Kokkonen <jani.kokkonen@huawei.com>
Suggested-by: Claudio Fontana <claudio.fontana@huawei.com>
Signed-off-by: Alvise Rigo <a.rigo@virtualopensystems.com>
---
cputlb.c | 39 ++++++++++++++++-
include/exec/cpu-all.h | 8 ++++
include/exec/cpu-defs.h | 12 ++++++
softmmu_template.h | 112 ++++++++++++++++++++++++++++++++++++++----------
4 files changed, 148 insertions(+), 23 deletions(-)
diff --git a/cputlb.c b/cputlb.c
index a506086..251eec8 100644
--- a/cputlb.c
+++ b/cputlb.c
@@ -299,6 +299,13 @@ void tlb_set_page_with_attrs(CPUState *cpu, target_ulong vaddr,
env->tlb_v_table[mmu_idx][vidx] = *te;
env->iotlb_v[mmu_idx][vidx] = env->iotlb[mmu_idx][index];
+ if (unlikely(!(te->addr_write & TLB_MMIO) && (te->addr_write & TLB_EXCL))) {
+ /* We are removing an exclusive entry, set the page to dirty. */
+ hwaddr hw_addr = (env->iotlb[mmu_idx][index].addr & TARGET_PAGE_MASK) +
+ (te->addr_write & TARGET_PAGE_MASK);
+ cpu_physical_memory_set_excl_dirty(hw_addr, cpu->cpu_index);
+ }
+
/* refill the tlb */
env->iotlb[mmu_idx][index].addr = iotlb - vaddr;
env->iotlb[mmu_idx][index].attrs = attrs;
@@ -324,7 +331,15 @@ void tlb_set_page_with_attrs(CPUState *cpu, target_ulong vaddr,
+ xlat)) {
te->addr_write = address | TLB_NOTDIRTY;
} else {
- te->addr_write = address;
+ if (!(address & TLB_MMIO) &&
+ cpu_physical_memory_excl_atleast_one_clean(section->mr->ram_addr
+ + xlat)) {
+ /* There is at least one vCPU that has flagged the address as
+ * exclusive. */
+ te->addr_write = address | TLB_EXCL;
+ } else {
+ te->addr_write = address;
+ }
}
} else {
te->addr_write = -1;
@@ -376,6 +391,28 @@ tb_page_addr_t get_page_addr_code(CPUArchState *env1, target_ulong addr)
return qemu_ram_addr_from_host_nofail(p);
}
+/* Atomic insn translation TLB support. */
+#define EXCLUSIVE_RESET_ADDR ULLONG_MAX
+/* For every vCPU compare the exclusive address and reset it in case of a
+ * match. Since only one vCPU is running at once, no lock has to be held to
+ * guard this operation. */
+static inline void lookup_and_reset_cpus_ll_addr(hwaddr addr, hwaddr size)
+{
+ CPUState *cpu;
+ CPUArchState *acpu;
+
+ CPU_FOREACH(cpu) {
+ acpu = (CPUArchState *)cpu->env_ptr;
+
+ if (acpu->excl_protected_range.begin != EXCLUSIVE_RESET_ADDR &&
+ ranges_overlap(acpu->excl_protected_range.begin,
+ acpu->excl_protected_range.end - acpu->excl_protected_range.begin,
+ addr, size)) {
+ acpu->excl_protected_range.begin = EXCLUSIVE_RESET_ADDR;
+ }
+ }
+}
+
#define MMUSUFFIX _mmu
#define SHIFT 0
diff --git a/include/exec/cpu-all.h b/include/exec/cpu-all.h
index ea6a9a6..ad6afcb 100644
--- a/include/exec/cpu-all.h
+++ b/include/exec/cpu-all.h
@@ -320,6 +320,14 @@ extern RAMList ram_list;
#define TLB_NOTDIRTY (1 << 4)
/* Set if TLB entry is an IO callback. */
#define TLB_MMIO (1 << 5)
+/* Set if TLB entry references a page that requires exclusive access. */
+#define TLB_EXCL (1 << 6)
+
+/* Do not allow a TARGET_PAGE_MASK which covers one or more bits defined
+ * above. */
+#if TLB_EXCL >= TARGET_PAGE_SIZE
+#error TARGET_PAGE_MASK covering the low bits of the TLB virtual address
+#endif
void dump_exec_info(FILE *f, fprintf_function cpu_fprintf);
void dump_opcount_info(FILE *f, fprintf_function cpu_fprintf);
diff --git a/include/exec/cpu-defs.h b/include/exec/cpu-defs.h
index 98b9cff..a67f295 100644
--- a/include/exec/cpu-defs.h
+++ b/include/exec/cpu-defs.h
@@ -27,6 +27,7 @@
#include <inttypes.h>
#include "qemu/osdep.h"
#include "qemu/queue.h"
+#include "qemu/range.h"
#include "tcg-target.h"
#ifndef CONFIG_USER_ONLY
#include "exec/hwaddr.h"
@@ -150,5 +151,16 @@ typedef struct CPUIOTLBEntry {
#define CPU_COMMON \
/* soft mmu support */ \
CPU_COMMON_TLB \
+ \
+ /* Used by the atomic insn translation backend. */ \
+ int ll_sc_context; \
+ /* vCPU current exclusive addresses range.
* The address is set to EXCLUSIVE_RESET_ADDR if the vCPU is not
+ * in the middle of a LL/SC. */ \
+ struct Range excl_protected_range; \
+ /* Used to carry the SC result but also to flag a normal (legacy)
+ * store access made by a stcond (see softmmu_template.h). */ \
+ int excl_succeeded; \
+
#endif
diff --git a/softmmu_template.h b/softmmu_template.h
index d42d89d..e4431e8 100644
--- a/softmmu_template.h
+++ b/softmmu_template.h
@@ -409,19 +409,53 @@ void helper_le_st_name(CPUArchState *env, target_ulong addr, DATA_TYPE val,
tlb_addr = env->tlb_table[mmu_idx][index].addr_write;
}
- /* Handle an IO access. */
+ /* Handle an IO access or exclusive access. */
if (unlikely(tlb_addr & ~TARGET_PAGE_MASK)) {
- CPUIOTLBEntry *iotlbentry;
- if ((addr & (DATA_SIZE - 1)) != 0) {
- goto do_unaligned_access;
+ CPUIOTLBEntry *iotlbentry = &env->iotlb[mmu_idx][index];
+
+ if ((tlb_addr & ~TARGET_PAGE_MASK) == TLB_EXCL) {
+ /* The slow-path has been forced since we are writing to
+ * exclusive-protected memory. */
+ hwaddr hw_addr = (iotlbentry->addr & TARGET_PAGE_MASK) + addr;
+
+ /* The function lookup_and_reset_cpus_ll_addr could have reset the
+ * exclusive address. Fail the SC in this case.
+ * N.B.: Here excl_succeeded == 0 means that helper_le_st_name has
+ * not been called by a softmmu_llsc_template.h. */
+ if(env->excl_succeeded) {
+ if (env->excl_protected_range.begin != hw_addr) {
+ /* The vCPU is SC-ing to an unprotected address. */
+ env->excl_protected_range.begin = EXCLUSIVE_RESET_ADDR;
+ env->excl_succeeded = 0;
+
+ return;
+ }
+
+ cpu_physical_memory_set_excl_dirty(hw_addr, ENV_GET_CPU(env)->cpu_index);
+ }
+
+ haddr = addr + env->tlb_table[mmu_idx][index].addend;
+ #if DATA_SIZE == 1
+ glue(glue(st, SUFFIX), _p)((uint8_t *)haddr, val);
+ #else
+ glue(glue(st, SUFFIX), _le_p)((uint8_t *)haddr, val);
+ #endif
+
+ lookup_and_reset_cpus_ll_addr(hw_addr, DATA_SIZE);
+
+ return;
+ } else {
+ if ((addr & (DATA_SIZE - 1)) != 0) {
+ goto do_unaligned_access;
+ }
+ iotlbentry = &env->iotlb[mmu_idx][index];
+
+ /* ??? Note that the io helpers always read data in the target
+ byte ordering. We should push the LE/BE request down into io. */
+ val = TGT_LE(val);
+ glue(io_write, SUFFIX)(env, iotlbentry, val, addr, retaddr);
+ return;
}
- iotlbentry = &env->iotlb[mmu_idx][index];
-
- /* ??? Note that the io helpers always read data in the target
- byte ordering. We should push the LE/BE request down into io. */
- val = TGT_LE(val);
- glue(io_write, SUFFIX)(env, iotlbentry, val, addr, retaddr);
- return;
}
/* Handle slow unaligned access (it spans two pages or IO). */
@@ -489,19 +523,53 @@ void helper_be_st_name(CPUArchState *env, target_ulong addr, DATA_TYPE val,
tlb_addr = env->tlb_table[mmu_idx][index].addr_write;
}
- /* Handle an IO access. */
+ /* Handle an IO access or exclusive access. */
if (unlikely(tlb_addr & ~TARGET_PAGE_MASK)) {
- CPUIOTLBEntry *iotlbentry;
- if ((addr & (DATA_SIZE - 1)) != 0) {
- goto do_unaligned_access;
+ CPUIOTLBEntry *iotlbentry = &env->iotlb[mmu_idx][index];
+
+ if ((tlb_addr & ~TARGET_PAGE_MASK) == TLB_EXCL) {
+ /* The slow-path has been forced since we are writing to
+ * exclusive-protected memory. */
+ hwaddr hw_addr = (iotlbentry->addr & TARGET_PAGE_MASK) + addr;
+
+ /* The function lookup_and_reset_cpus_ll_addr could have reset the
+ * exclusive address. Fail the SC in this case.
+ * N.B.: Here excl_succeeded == 0 means that helper_le_st_name has
+ * not been called by a softmmu_llsc_template.h. */
+ if(env->excl_succeeded) {
+ if (env->excl_protected_range.begin != hw_addr) {
+ /* The vCPU is SC-ing to an unprotected address. */
+ env->excl_protected_range.begin = EXCLUSIVE_RESET_ADDR;
+ env->excl_succeeded = 0;
+
+ return;
+ }
+
+ cpu_physical_memory_set_excl_dirty(hw_addr, ENV_GET_CPU(env)->cpu_index);
+ }
+
+ haddr = addr + env->tlb_table[mmu_idx][index].addend;
+ #if DATA_SIZE == 1
+ glue(glue(st, SUFFIX), _p)((uint8_t *)haddr, val);
+ #else
+ glue(glue(st, SUFFIX), _le_p)((uint8_t *)haddr, val);
+ #endif
+
+ lookup_and_reset_cpus_ll_addr(hw_addr, DATA_SIZE);
+
+ return;
+ } else {
+ if ((addr & (DATA_SIZE - 1)) != 0) {
+ goto do_unaligned_access;
+ }
+ iotlbentry = &env->iotlb[mmu_idx][index];
+
+ /* ??? Note that the io helpers always read data in the target
+ byte ordering. We should push the LE/BE request down into io. */
+ val = TGT_BE(val);
+ glue(io_write, SUFFIX)(env, iotlbentry, val, addr, retaddr);
+ return;
}
- iotlbentry = &env->iotlb[mmu_idx][index];
-
- /* ??? Note that the io helpers always read data in the target
- byte ordering. We should push the LE/BE request down into io. */
- val = TGT_BE(val);
- glue(io_write, SUFFIX)(env, iotlbentry, val, addr, retaddr);
- return;
}
/* Handle slow unaligned access (it spans two pages or IO). */
--
2.5.0
* [Qemu-devel] [RFC v4 3/9] softmmu: Add helpers for a new slowpath
From: Alvise Rigo @ 2015-08-07 17:03 UTC
To: qemu-devel, mttcg
Cc: alex.bennee, jani.kokkonen, tech, claudio.fontana, pbonzini
The new helpers rely on the legacy ones to perform the actual read/write.
The LoadLink helper (helper_ldlink_name) prepares the way for the
following SC operation. It sets the linked address and the size of the
access. The helper also updates the TLB entry of the page involved in
the LL/SC for those vCPUs that have the bit set (dirty), so that the
following accesses made by all the vCPUs will follow the slow path.
The StoreConditional helper (helper_stcond_name) returns 1 if the
store has to fail due to a concurrent access to the same page by
another vCPU. A 'concurrent access' can be a store made by *any* vCPU
(although some implementations allow stores made by the CPU that issued
the LoadLink).
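To make the contract of the generated helpers concrete, here is a hedged
sketch from the point of view of a hypothetical caller. The helper names
and signatures are the ones declared in tcg.h later in this series, but
atomic_add_one() and its retry loop are invented for illustration: in the
series the retry is performed by the guest's own LL/SC loop, emitted by
the target front end, not inside a helper.

/* Illustrative only: how an LL/SC pair built from the generated helpers
 * behaves. The wrapper is not part of the series; the two helper calls
 * and the 0 (success) / 1 (failure) SC result are. */
static uint32_t atomic_add_one(CPUArchState *env, target_ulong addr,
                               TCGMemOpIdx oi, uintptr_t ra)
{
    uint32_t old;
    tcg_target_ulong failed;

    do {
        old = helper_le_ldlinkul_mmu(env, addr, oi, ra);            /* LL */
        failed = helper_le_stcondl_mmu(env, addr, old + 1, oi, ra); /* SC */
    } while (failed);                  /* 1 means another write intervened */

    return old;
}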
Suggested-by: Jani Kokkonen <jani.kokkonen@huawei.com>
Suggested-by: Claudio Fontana <claudio.fontana@huawei.com>
Signed-off-by: Alvise Rigo <a.rigo@virtualopensystems.com>
---
cputlb.c | 3 ++
softmmu_llsc_template.h | 124 ++++++++++++++++++++++++++++++++++++++++++++++++
softmmu_template.h | 12 +++++
3 files changed, 139 insertions(+)
create mode 100644 softmmu_llsc_template.h
diff --git a/cputlb.c b/cputlb.c
index 251eec8..f45afc6 100644
--- a/cputlb.c
+++ b/cputlb.c
@@ -415,6 +415,8 @@ static inline void lookup_and_reset_cpus_ll_addr(hwaddr addr, hwaddr size)
#define MMUSUFFIX _mmu
+/* Generates LoadLink/StoreConditional helpers in softmmu_template.h */
+#define GEN_EXCLUSIVE_HELPERS
#define SHIFT 0
#include "softmmu_template.h"
@@ -427,6 +429,7 @@ static inline void lookup_and_reset_cpus_ll_addr(hwaddr addr, hwaddr size)
#define SHIFT 3
#include "softmmu_template.h"
#undef MMUSUFFIX
+#undef GEN_EXCLUSIVE_HELPERS
#define MMUSUFFIX _cmmu
#undef GETPC_ADJ
diff --git a/softmmu_llsc_template.h b/softmmu_llsc_template.h
new file mode 100644
index 0000000..d2e92b4
--- /dev/null
+++ b/softmmu_llsc_template.h
@@ -0,0 +1,124 @@
+/*
+ * Software MMU support (exclusive load/store operations)
+ *
+ * Generate helpers used by TCG for qemu_ldlink/stcond ops.
+ *
+ * Included from softmmu_template.h only.
+ *
+ * Copyright (c) 2015 Virtual Open Systems
+ *
+ * Authors:
+ * Alvise Rigo <a.rigo@virtualopensystems.com>
+ *
+ * This library is free software; you can redistribute it and/or
+ * modify it under the terms of the GNU Lesser General Public
+ * License as published by the Free Software Foundation; either
+ * version 2 of the License, or (at your option) any later version.
+ *
+ * This library is distributed in the hope that it will be useful,
+ * but WITHOUT ANY WARRANTY; without even the implied warranty of
+ * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU
+ * Lesser General Public License for more details.
+ *
+ * You should have received a copy of the GNU Lesser General Public
+ * License along with this library; if not, see <http://www.gnu.org/licenses/>.
+ */
+
+/* This template does not generate together the le and be version, but only one
+ * of the two depending on whether BIGENDIAN_EXCLUSIVE_HELPERS has been set.
+ * The same nomenclature as softmmu_template.h is used for the exclusive
+ * helpers. */
+
+#ifdef BIGENDIAN_EXCLUSIVE_HELPERS
+
+#define helper_ldlink_name glue(glue(helper_be_ldlink, USUFFIX), MMUSUFFIX)
+#define helper_stcond_name glue(glue(helper_be_stcond, SUFFIX), MMUSUFFIX)
+#define helper_ld_legacy glue(glue(helper_be_ld, USUFFIX), MMUSUFFIX)
+#define helper_st_legacy glue(glue(helper_be_st, SUFFIX), MMUSUFFIX)
+
+#else /* LE helpers + 8bit helpers (generated only once for both LE and BE) */
+
+#if DATA_SIZE > 1
+#define helper_ldlink_name glue(glue(helper_le_ldlink, USUFFIX), MMUSUFFIX)
+#define helper_stcond_name glue(glue(helper_le_stcond, SUFFIX), MMUSUFFIX)
+#define helper_ld_legacy glue(glue(helper_le_ld, USUFFIX), MMUSUFFIX)
+#define helper_st_legacy glue(glue(helper_le_st, SUFFIX), MMUSUFFIX)
+#else /* DATA_SIZE <= 1 */
+#define helper_ldlink_name glue(glue(helper_ret_ldlink, USUFFIX), MMUSUFFIX)
+#define helper_stcond_name glue(glue(helper_ret_stcond, SUFFIX), MMUSUFFIX)
+#define helper_ld_legacy glue(glue(helper_ret_ld, USUFFIX), MMUSUFFIX)
+#define helper_st_legacy glue(glue(helper_ret_st, SUFFIX), MMUSUFFIX)
+#endif
+
+#endif
+
+WORD_TYPE helper_ldlink_name(CPUArchState *env, target_ulong addr,
+ TCGMemOpIdx oi, uintptr_t retaddr)
+{
+ WORD_TYPE ret;
+ int index;
+ CPUState *cpu;
+ hwaddr hw_addr;
+ unsigned mmu_idx = get_mmuidx(oi);
+
+ /* Use the proper load helper from cpu_ldst.h */
+ ret = helper_ld_legacy(env, addr, mmu_idx, retaddr);
+
+ index = (addr >> TARGET_PAGE_BITS) & (CPU_TLB_SIZE - 1);
+
+ /* hw_addr = hwaddr of the page (i.e. section->mr->ram_addr + xlat)
+ * plus the offset (i.e. addr & ~TARGET_PAGE_MASK) */
+ hw_addr = (env->iotlb[mmu_idx][index].addr & TARGET_PAGE_MASK) + addr;
+
+ cpu_physical_memory_clear_excl_dirty(hw_addr, ENV_GET_CPU(env)->cpu_index);
+ /* If all the vCPUs have the EXCL bit set for this page there is no need
+ * to request any flush. */
+ if (cpu_physical_memory_excl_is_dirty(hw_addr, smp_cpus)) {
+ CPU_FOREACH(cpu) {
+ if (current_cpu != cpu) {
+ if (cpu_physical_memory_excl_is_dirty(hw_addr,
+ cpu->cpu_index)) {
+ cpu_physical_memory_clear_excl_dirty(hw_addr,
+ cpu->cpu_index);
+ tlb_flush(cpu, 1);
+ }
+ }
+ }
+ }
+
+ env->excl_protected_range.begin = hw_addr;
+ env->excl_protected_range.end = hw_addr + DATA_SIZE;
+
+ /* For this vCPU, just update the TLB entry, no need to flush. */
+ env->tlb_table[mmu_idx][index].addr_write |= TLB_EXCL;
+
+ return ret;
+}
+
+WORD_TYPE helper_stcond_name(CPUArchState *env, target_ulong addr,
+ DATA_TYPE val, TCGMemOpIdx oi,
+ uintptr_t retaddr)
+{
+ WORD_TYPE ret;
+ unsigned mmu_idx = get_mmuidx(oi);
+
+ /* We set it preventively to true to distinguish the following legacy
+ * access as one made by the store conditional wrapper. If the store
+ * conditional does not succeed, the value will be set to 0.*/
+ env->excl_succeeded = 1;
+ helper_st_legacy(env, addr, val, mmu_idx, retaddr);
+
+ if (env->excl_succeeded) {
+ env->excl_succeeded = 0;
+ ret = 0;
+ } else {
+ ret = 1;
+ }
+
+ return ret;
+}
+
+#undef helper_ldlink_name
+#undef helper_stcond_name
+#undef helper_ld_legacy
+#undef helper_st_legacy
diff --git a/softmmu_template.h b/softmmu_template.h
index e4431e8..ad65d20 100644
--- a/softmmu_template.h
+++ b/softmmu_template.h
@@ -640,6 +640,18 @@ void probe_write(CPUArchState *env, target_ulong addr, int mmu_idx,
#endif
#endif /* !defined(SOFTMMU_CODE_ACCESS) */
+#ifdef GEN_EXCLUSIVE_HELPERS
+
+#if DATA_SIZE > 1 /* The 8-bit helpers are generated along with LE helpers */
+#define BIGENDIAN_EXCLUSIVE_HELPERS
+#include "softmmu_llsc_template.h"
+#undef BIGENDIAN_EXCLUSIVE_HELPERS
+#endif
+
+#include "softmmu_llsc_template.h"
+
+#endif /* !defined(GEN_EXCLUSIVE_HELPERS) */
+
#undef READ_ACCESS_TYPE
#undef SHIFT
#undef DATA_TYPE
--
2.5.0
* [Qemu-devel] [RFC v4 4/9] tcg-op: create new TCG qemu_{ld, st} excl variants
From: Alvise Rigo @ 2015-08-07 17:03 UTC
To: qemu-devel, mttcg
Cc: alex.bennee, jani.kokkonen, tech, claudio.fontana, pbonzini
Introduce the MO_EXCL TCGMemOp flag that marks a load access as LL and
a store access as SC.
While the LL variant has been implemented without the need for any extra
TCG instruction, for the SC case this was not possible, since the
instruction needs a return value reporting the outcome of the store
operation.
It would be possible to add a return value to the plain qemu_st, but
this would complicate the code, since it would require changing the
operands of all the qemu_st operations even for those backends that
don't support the new slow-path.
Once all the backends support the atomic slow path, folding the
qemu_stcond into the plain qemu_st will be much easier.
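As an illustration of the resulting front-end interface, here is a hedged
sketch of what a translator could emit for an LL/SC pair.
tcg_gen_qemu_stcond_i32() is declared by this patch; requesting the LL by
OR-ing MO_EXCL into an ordinary load is an assumption based on the cover
letter's "no extra TCG instruction" remark (patch 9, which actually uses
this for target-arm, is not quoted here), and addr and mem_idx are
presumed to be set up by the surrounding translation code.

/* Hypothetical translator fragment; see the note above for the assumptions. */
TCGv_i32 val = tcg_temp_new_i32();
TCGv_i32 is_dirty = tcg_temp_new_i32();

/* LoadLink: assumed to be requested via MO_EXCL on a normal load. */
tcg_gen_qemu_ld_i32(val, addr, mem_idx, MO_TEUL | MO_EXCL);

/* ... compute the value to store back into val ... */

/* StoreConditional: MO_EXCL is added internally by tcg_gen_qemu_stcond_i32;
 * is_dirty receives 0 on success, 1 on failure. */
tcg_gen_qemu_stcond_i32(is_dirty, val, addr, mem_idx, MO_TEUL);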
Suggested-by: Jani Kokkonen <jani.kokkonen@huawei.com>
Suggested-by: Claudio Fontana <claudio.fontana@huawei.com>
Signed-off-by: Alvise Rigo <a.rigo@virtualopensystems.com>
---
tcg/tcg-be-ldst.h | 1 +
tcg/tcg-op.c | 65 +++++++++++++++++++++++++++++++++++++++++++++++++++++++
tcg/tcg-op.h | 4 ++++
tcg/tcg-opc.h | 8 +++++++
tcg/tcg.c | 2 ++
tcg/tcg.h | 32 +++++++++++++++++++++++++++
6 files changed, 112 insertions(+)
diff --git a/tcg/tcg-be-ldst.h b/tcg/tcg-be-ldst.h
index 40a2369..b3f9c51 100644
--- a/tcg/tcg-be-ldst.h
+++ b/tcg/tcg-be-ldst.h
@@ -24,6 +24,7 @@
typedef struct TCGLabelQemuLdst {
bool is_ld; /* qemu_ld: true, qemu_st: false */
+ TCGReg llsc_success; /* reg index for qemu_stcond outcome */
TCGMemOpIdx oi;
TCGType type; /* result type of a load */
TCGReg addrlo_reg; /* reg index for low word of guest virtual addr */
diff --git a/tcg/tcg-op.c b/tcg/tcg-op.c
index 45098c3..55b6090 100644
--- a/tcg/tcg-op.c
+++ b/tcg/tcg-op.c
@@ -1885,6 +1885,25 @@ static void gen_ldst_i32(TCGOpcode opc, TCGv_i32 val, TCGv addr,
#endif
}
+/* An output operand to return the StoreConditional result */
+static void gen_stcond_i32(TCGOpcode opc, TCGv_i32 is_dirty, TCGv_i32 val,
+ TCGv addr, TCGMemOp memop, TCGArg idx)
+{
+ TCGMemOpIdx oi = make_memop_idx(memop, idx);
+
+#if TARGET_LONG_BITS == 32
+ tcg_gen_op4i_i32(opc, is_dirty, val, addr, oi);
+#else
+ if (TCG_TARGET_REG_BITS == 32) {
+ tcg_gen_op5i_i32(opc, is_dirty, val, TCGV_LOW(addr), TCGV_HIGH(addr),
+ oi);
+ } else {
+ tcg_gen_op4(&tcg_ctx, opc, GET_TCGV_I32(is_dirty), GET_TCGV_I32(val),
+ GET_TCGV_I64(addr), oi);
+ }
+#endif
+}
+
static void gen_ldst_i64(TCGOpcode opc, TCGv_i64 val, TCGv addr,
TCGMemOp memop, TCGArg idx)
{
@@ -1905,6 +1924,32 @@ static void gen_ldst_i64(TCGOpcode opc, TCGv_i64 val, TCGv addr,
#endif
}
+static void gen_stcond_i64(TCGOpcode opc, TCGv_i32 is_dirty, TCGv_i64 val,
+ TCGv addr, TCGMemOp memop, TCGArg idx)
+{
+ TCGMemOpIdx oi = make_memop_idx(memop, idx);
+#if TARGET_LONG_BITS == 32
+ if (TCG_TARGET_REG_BITS == 32) {
+ tcg_gen_op5i_i32(opc, is_dirty, TCGV_LOW(val), TCGV_HIGH(val),
+ addr, oi);
+ } else {
+ tcg_gen_op4(&tcg_ctx, opc, GET_TCGV_I32(is_dirty), GET_TCGV_I64(val),
+ GET_TCGV_I32(addr), oi);
+ }
+#else
+ if (TCG_TARGET_REG_BITS == 32) {
+ tcg_gen_op6i_i32(opc, is_dirty, TCGV_LOW(val), TCGV_HIGH(val),
+ TCGV_LOW(addr), TCGV_HIGH(addr), oi);
+ } else {
+ TCGv_i64 is_dirty64 = tcg_temp_new_i64();
+
+ tcg_gen_extu_i32_i64(is_dirty64, is_dirty);
+ tcg_gen_op4i_i64(opc, is_dirty64, val, addr, oi);
+ tcg_temp_free_i64(is_dirty64);
+ }
+#endif
+}
+
void tcg_gen_qemu_ld_i32(TCGv_i32 val, TCGv addr, TCGArg idx, TCGMemOp memop)
{
memop = tcg_canonicalize_memop(memop, 0, 0);
@@ -1917,6 +1962,13 @@ void tcg_gen_qemu_st_i32(TCGv_i32 val, TCGv addr, TCGArg idx, TCGMemOp memop)
gen_ldst_i32(INDEX_op_qemu_st_i32, val, addr, memop, idx);
}
+void tcg_gen_qemu_stcond_i32(TCGv_i32 is_dirty, TCGv_i32 val, TCGv addr,
+ TCGArg idx, TCGMemOp memop)
+{
+ memop = tcg_canonicalize_memop(memop, 0, 1) | MO_EXCL;
+ gen_stcond_i32(INDEX_op_qemu_stcond_i32, is_dirty, val, addr, memop, idx);
+}
+
void tcg_gen_qemu_ld_i64(TCGv_i64 val, TCGv addr, TCGArg idx, TCGMemOp memop)
{
if (TCG_TARGET_REG_BITS == 32 && (memop & MO_SIZE) < MO_64) {
@@ -1943,3 +1995,16 @@ void tcg_gen_qemu_st_i64(TCGv_i64 val, TCGv addr, TCGArg idx, TCGMemOp memop)
memop = tcg_canonicalize_memop(memop, 1, 1);
gen_ldst_i64(INDEX_op_qemu_st_i64, val, addr, memop, idx);
}
+
+void tcg_gen_qemu_stcond_i64(TCGv_i32 is_dirty, TCGv_i64 val, TCGv addr,
+ TCGArg idx, TCGMemOp memop)
+{
+ if (TCG_TARGET_REG_BITS == 32 && (memop & MO_SIZE) < MO_64) {
+ tcg_gen_qemu_stcond_i32(is_dirty, TCGV_LOW(val), addr, idx,
+ memop | MO_EXCL);
+ return;
+ }
+
+ memop = tcg_canonicalize_memop(memop, 1, 1) | MO_EXCL;
+ gen_stcond_i64(INDEX_op_qemu_stcond_i64, is_dirty, val, addr, memop, idx);
+}
diff --git a/tcg/tcg-op.h b/tcg/tcg-op.h
index d1d763f..23ef7a8 100644
--- a/tcg/tcg-op.h
+++ b/tcg/tcg-op.h
@@ -754,6 +754,10 @@ void tcg_gen_qemu_st_i32(TCGv_i32, TCGv, TCGArg, TCGMemOp);
void tcg_gen_qemu_ld_i64(TCGv_i64, TCGv, TCGArg, TCGMemOp);
void tcg_gen_qemu_st_i64(TCGv_i64, TCGv, TCGArg, TCGMemOp);
+/* Exclusive variants */
+void tcg_gen_qemu_stcond_i32(TCGv_i32, TCGv_i32, TCGv, TCGArg, TCGMemOp);
+void tcg_gen_qemu_stcond_i64(TCGv_i32, TCGv_i64, TCGv, TCGArg, TCGMemOp);
+
static inline void tcg_gen_qemu_ld8u(TCGv ret, TCGv addr, int mem_index)
{
tcg_gen_qemu_ld_tl(ret, addr, mem_index, MO_UB);
diff --git a/tcg/tcg-opc.h b/tcg/tcg-opc.h
index 13ccb60..4bd0783 100644
--- a/tcg/tcg-opc.h
+++ b/tcg/tcg-opc.h
@@ -183,10 +183,18 @@ DEF(qemu_ld_i32, 1, TLADDR_ARGS, 1,
TCG_OPF_CALL_CLOBBER | TCG_OPF_SIDE_EFFECTS)
DEF(qemu_st_i32, 0, TLADDR_ARGS + 1, 1,
TCG_OPF_CALL_CLOBBER | TCG_OPF_SIDE_EFFECTS)
+DEF(qemu_ldlink_i32, 1, TLADDR_ARGS, 2,
+ TCG_OPF_CALL_CLOBBER | TCG_OPF_SIDE_EFFECTS)
+DEF(qemu_stcond_i32, 1, TLADDR_ARGS + 1, 2,
+ TCG_OPF_CALL_CLOBBER | TCG_OPF_SIDE_EFFECTS)
DEF(qemu_ld_i64, DATA64_ARGS, TLADDR_ARGS, 1,
TCG_OPF_CALL_CLOBBER | TCG_OPF_SIDE_EFFECTS | TCG_OPF_64BIT)
+DEF(qemu_ldlink_i64, DATA64_ARGS, TLADDR_ARGS, 1,
+ TCG_OPF_CALL_CLOBBER | TCG_OPF_SIDE_EFFECTS | TCG_OPF_64BIT)
DEF(qemu_st_i64, 0, TLADDR_ARGS + DATA64_ARGS, 1,
TCG_OPF_CALL_CLOBBER | TCG_OPF_SIDE_EFFECTS | TCG_OPF_64BIT)
+DEF(qemu_stcond_i64, 1, TLADDR_ARGS + DATA64_ARGS, 1,
+ TCG_OPF_CALL_CLOBBER | TCG_OPF_SIDE_EFFECTS | TCG_OPF_64BIT)
#undef TLADDR_ARGS
#undef DATA64_ARGS
diff --git a/tcg/tcg.c b/tcg/tcg.c
index 7e088b1..e4f9db9 100644
--- a/tcg/tcg.c
+++ b/tcg/tcg.c
@@ -1068,9 +1068,11 @@ void tcg_dump_ops(TCGContext *s)
i = 1;
break;
case INDEX_op_qemu_ld_i32:
+ case INDEX_op_qemu_stcond_i32:
case INDEX_op_qemu_st_i32:
case INDEX_op_qemu_ld_i64:
case INDEX_op_qemu_st_i64:
+ case INDEX_op_qemu_stcond_i64:
{
TCGMemOpIdx oi = args[k++];
TCGMemOp op = get_memop(oi);
diff --git a/tcg/tcg.h b/tcg/tcg.h
index 231a781..d9da2f9 100644
--- a/tcg/tcg.h
+++ b/tcg/tcg.h
@@ -284,6 +284,8 @@ typedef enum TCGMemOp {
MO_TEQ = MO_TE | MO_Q,
MO_SSIZE = MO_SIZE | MO_SIGN,
+
+ MO_EXCL = 32, /* Set for exclusive memory access */
} TCGMemOp;
typedef tcg_target_ulong TCGArg;
@@ -957,6 +959,21 @@ tcg_target_ulong helper_be_ldul_mmu(CPUArchState *env, target_ulong addr,
TCGMemOpIdx oi, uintptr_t retaddr);
uint64_t helper_be_ldq_mmu(CPUArchState *env, target_ulong addr,
TCGMemOpIdx oi, uintptr_t retaddr);
+/* Exclusive variants */
+tcg_target_ulong helper_ret_ldlinkub_mmu(CPUArchState *env, target_ulong addr,
+ TCGMemOpIdx oi, uintptr_t retaddr);
+tcg_target_ulong helper_le_ldlinkuw_mmu(CPUArchState *env, target_ulong addr,
+ TCGMemOpIdx oi, uintptr_t retaddr);
+tcg_target_ulong helper_le_ldlinkul_mmu(CPUArchState *env, target_ulong addr,
+ TCGMemOpIdx oi, uintptr_t retaddr);
+uint64_t helper_le_ldlinkq_mmu(CPUArchState *env, target_ulong addr,
+ TCGMemOpIdx oi, uintptr_t retaddr);
+tcg_target_ulong helper_be_ldlinkuw_mmu(CPUArchState *env, target_ulong addr,
+ TCGMemOpIdx oi, uintptr_t retaddr);
+tcg_target_ulong helper_be_ldlinkul_mmu(CPUArchState *env, target_ulong addr,
+ TCGMemOpIdx oi, uintptr_t retaddr);
+uint64_t helper_be_ldlinkq_mmu(CPUArchState *env, target_ulong addr,
+ TCGMemOpIdx oi, uintptr_t retaddr);
/* Value sign-extended to tcg register size. */
tcg_target_ulong helper_ret_ldsb_mmu(CPUArchState *env, target_ulong addr,
@@ -984,6 +1001,21 @@ void helper_be_stl_mmu(CPUArchState *env, target_ulong addr, uint32_t val,
TCGMemOpIdx oi, uintptr_t retaddr);
void helper_be_stq_mmu(CPUArchState *env, target_ulong addr, uint64_t val,
TCGMemOpIdx oi, uintptr_t retaddr);
+/* Exclusive variants */
+tcg_target_ulong helper_ret_stcondb_mmu(CPUArchState *env, target_ulong addr,
+ uint8_t val, TCGMemOpIdx oi, uintptr_t retaddr);
+tcg_target_ulong helper_le_stcondw_mmu(CPUArchState *env, target_ulong addr,
+ uint16_t val, TCGMemOpIdx oi, uintptr_t retaddr);
+tcg_target_ulong helper_le_stcondl_mmu(CPUArchState *env, target_ulong addr,
+ uint32_t val, TCGMemOpIdx oi, uintptr_t retaddr);
+uint64_t helper_le_stcondq_mmu(CPUArchState *env, target_ulong addr,
+ uint64_t val, TCGMemOpIdx oi, uintptr_t retaddr);
+tcg_target_ulong helper_be_stcondw_mmu(CPUArchState *env, target_ulong addr,
+ uint16_t val, TCGMemOpIdx oi, uintptr_t retaddr);
+tcg_target_ulong helper_be_stcondl_mmu(CPUArchState *env, target_ulong addr,
+ uint32_t val, TCGMemOpIdx oi, uintptr_t retaddr);
+uint64_t helper_be_stcondq_mmu(CPUArchState *env, target_ulong addr,
+ uint64_t val, TCGMemOpIdx oi, uintptr_t retaddr);
/* Temporary aliases until backends are converted. */
#ifdef TARGET_WORDS_BIGENDIAN
--
2.5.0
* [Qemu-devel] [RFC v4 5/9] configure: Enable/disable new qemu_{ld, st} excl insns
From: Alvise Rigo @ 2015-08-07 17:03 UTC
To: qemu-devel, mttcg
Cc: alex.bennee, jani.kokkonen, tech, claudio.fontana, pbonzini
Introduce the new --enable-tcg-ldst-excl configure option to enable the
LL/SC operations only for those backends that support them.
Suggested-by: Jani Kokkonen <jani.kokkonen@huawei.com>
Suggested-by: Claudio Fontana <claudio.fontana@huawei.com>
Signed-off-by: Alvise Rigo <a.rigo@virtualopensystems.com>
---
configure | 21 +++++++++++++++++++++
1 file changed, 21 insertions(+)
diff --git a/configure b/configure
index 33b9455..c3ea3db 100755
--- a/configure
+++ b/configure
@@ -264,6 +264,7 @@ debug_tcg="no"
debug="no"
strip_opt="yes"
tcg_interpreter="no"
+tcg_use_ldst_excl="no"
bigendian="no"
mingw32="no"
gcov="no"
@@ -934,6 +935,10 @@ for opt do
;;
--enable-tcg-interpreter) tcg_interpreter="yes"
;;
+ --disable-tcg-ldst-excl) tcg_use_ldst_excl="no"
+ ;;
+ --enable-tcg-ldst-excl) tcg_use_ldst_excl="yes"
+ ;;
--disable-cap-ng) cap_ng="no"
;;
--enable-cap-ng) cap_ng="yes"
@@ -1392,6 +1397,18 @@ if test "$ARCH" = "unknown"; then
fi
fi
+# Check if the slow-path for atomic instructions is supported by the
+# TCG backend.
+case $cpu in
+ i386|x86_64|arm|aarch64)
+ ;;
+ *)
+ if test "$tcg_use_ldst_excl" = "yes"; then
+ error_exit "Load and store exclusive not supported for $cpu hosts"
+ fi
+ ;;
+esac
+
# Consult white-list to determine whether to enable werror
# by default. Only enable by default for git builds
z_version=`cut -f3 -d. $source_path/VERSION`
@@ -4526,6 +4543,7 @@ echo "Install blobs $blobs"
echo "KVM support $kvm"
echo "RDMA support $rdma"
echo "TCG interpreter $tcg_interpreter"
+echo "use ld/st excl $tcg_use_ldst_excl"
echo "fdt support $fdt"
echo "preadv support $preadv"
echo "fdatasync $fdatasync"
@@ -4903,6 +4921,9 @@ fi
if test "$tcg_interpreter" = "yes" ; then
echo "CONFIG_TCG_INTERPRETER=y" >> $config_host_mak
fi
+if test "$tcg_use_ldst_excl" = "yes" ; then
+ echo "CONFIG_TCG_USE_LDST_EXCL=y" >> $config_host_mak
+fi
if test "$fdatasync" = "yes" ; then
echo "CONFIG_FDATASYNC=y" >> $config_host_mak
fi
--
2.5.0
* [Qemu-devel] [RFC v4 6/9] tcg-i386: Implement excl variants of qemu_{ld, st}
From: Alvise Rigo @ 2015-08-07 17:03 UTC
To: qemu-devel, mttcg
Cc: alex.bennee, jani.kokkonen, tech, claudio.fontana, pbonzini
Implement exclusive variants of qemu_{ld,st}_{i32,i64} for tcg-i386.
The lookup for the proper memory helper has been rewritten to take
into account the new exclusive helpers.
Suggested-by: Jani Kokkonen <jani.kokkonen@huawei.com>
Suggested-by: Claudio Fontana <claudio.fontana@huawei.com>
Signed-off-by: Alvise Rigo <a.rigo@virtualopensystems.com>
---
tcg/i386/tcg-target.c | 148 +++++++++++++++++++++++++++++++++++++++++++-------
1 file changed, 129 insertions(+), 19 deletions(-)
diff --git a/tcg/i386/tcg-target.c b/tcg/i386/tcg-target.c
index ff4d9cf..011907c 100644
--- a/tcg/i386/tcg-target.c
+++ b/tcg/i386/tcg-target.c
@@ -1137,6 +1137,28 @@ static void * const qemu_ld_helpers[16] = {
[MO_BEQ] = helper_be_ldq_mmu,
};
+/* LoadLink helpers, only unsigned. Use the macro below to access them. */
+static void * const qemu_ldex_helpers[16] = {
+ [MO_UB] = helper_ret_ldlinkub_mmu,
+
+ [MO_LEUW] = helper_le_ldlinkuw_mmu,
+ [MO_LEUL] = helper_le_ldlinkul_mmu,
+ [MO_LEQ] = helper_le_ldlinkq_mmu,
+
+ [MO_BEUW] = helper_be_ldlinkuw_mmu,
+ [MO_BEUL] = helper_be_ldlinkul_mmu,
+ [MO_BEQ] = helper_be_ldlinkq_mmu,
+};
+
+static inline tcg_insn_unit *ld_helper(TCGMemOp opc)
+{
+ if (opc & MO_EXCL) {
+ return qemu_ldex_helpers[((int)opc - MO_EXCL) & (MO_BSWAP | MO_SSIZE)];
+ }
+
+ return qemu_ld_helpers[opc & ~MO_SIGN];
+}
+
/* helper signature: helper_ret_st_mmu(CPUState *env, target_ulong addr,
* uintxx_t val, int mmu_idx, uintptr_t ra)
*/
@@ -1150,6 +1172,26 @@ static void * const qemu_st_helpers[16] = {
[MO_BEQ] = helper_be_stq_mmu,
};
+/* StoreConditional helpers. Use the macro below to access them. */
+static void * const qemu_stex_helpers[16] = {
+ [MO_UB] = helper_ret_stcondb_mmu,
+ [MO_LEUW] = helper_le_stcondw_mmu,
+ [MO_LEUL] = helper_le_stcondl_mmu,
+ [MO_LEQ] = helper_le_stcondq_mmu,
+ [MO_BEUW] = helper_be_stcondw_mmu,
+ [MO_BEUL] = helper_be_stcondl_mmu,
+ [MO_BEQ] = helper_be_stcondq_mmu,
+};
+
+static inline tcg_insn_unit *st_helper(TCGMemOp opc)
+{
+ if (opc & MO_EXCL) {
+ return qemu_stex_helpers[((int)opc - MO_EXCL) & (MO_BSWAP | MO_SSIZE)];
+ }
+
+ return qemu_st_helpers[opc];
+}
+
/* Perform the TLB load and compare.
Inputs:
@@ -1245,6 +1287,7 @@ static inline void tcg_out_tlb_load(TCGContext *s, TCGReg addrlo, TCGReg addrhi,
* for a load or store, so that we can later generate the correct helper code
*/
static void add_qemu_ldst_label(TCGContext *s, bool is_ld, TCGMemOpIdx oi,
+ TCGReg llsc_success,
TCGReg datalo, TCGReg datahi,
TCGReg addrlo, TCGReg addrhi,
tcg_insn_unit *raddr,
@@ -1253,6 +1296,7 @@ static void add_qemu_ldst_label(TCGContext *s, bool is_ld, TCGMemOpIdx oi,
TCGLabelQemuLdst *label = new_ldst_label(s);
label->is_ld = is_ld;
+ label->llsc_success = llsc_success;
label->oi = oi;
label->datalo_reg = datalo;
label->datahi_reg = datahi;
@@ -1307,7 +1351,7 @@ static void tcg_out_qemu_ld_slow_path(TCGContext *s, TCGLabelQemuLdst *l)
(uintptr_t)l->raddr);
}
- tcg_out_call(s, qemu_ld_helpers[opc & (MO_BSWAP | MO_SIZE)]);
+ tcg_out_call(s, ld_helper(opc));
data_reg = l->datalo_reg;
switch (opc & MO_SSIZE) {
@@ -1411,9 +1455,16 @@ static void tcg_out_qemu_st_slow_path(TCGContext *s, TCGLabelQemuLdst *l)
}
}
- /* "Tail call" to the helper, with the return address back inline. */
- tcg_out_push(s, retaddr);
- tcg_out_jmp(s, qemu_st_helpers[opc & (MO_BSWAP | MO_SIZE)]);
+ if (opc & MO_EXCL) {
+ tcg_out_call(s, st_helper(opc));
+ /* Save the output of the StoreConditional */
+ tcg_out_mov(s, TCG_TYPE_I32, l->llsc_success, TCG_REG_EAX);
+ tcg_out_jmp(s, l->raddr);
+ } else {
+ /* "Tail call" to the helper, with the return address back inline. */
+ tcg_out_push(s, retaddr);
+ tcg_out_jmp(s, st_helper(opc));
+ }
}
#elif defined(__x86_64__) && defined(__linux__)
# include <asm/prctl.h>
@@ -1549,14 +1600,34 @@ static void tcg_out_qemu_ld(TCGContext *s, const TCGArg *args, bool is64)
mem_index = get_mmuidx(oi);
s_bits = opc & MO_SIZE;
- tcg_out_tlb_load(s, addrlo, addrhi, mem_index, s_bits,
- label_ptr, offsetof(CPUTLBEntry, addr_read));
+ if (opc & MO_EXCL) {
+ TCGType t = ((TCG_TARGET_REG_BITS == 64) && (TARGET_LONG_BITS == 64)) ?
+ TCG_TYPE_I64 : TCG_TYPE_I32;
+ /* The JMP address will be patched afterwards,
+ * in tcg_out_qemu_ld_slow_path (two times when
+ * TARGET_LONG_BITS > TCG_TARGET_REG_BITS). */
+ tcg_out_mov(s, t, TCG_REG_L1, addrlo);
+
+ if (TARGET_LONG_BITS > TCG_TARGET_REG_BITS) {
+ /* Store the second part of the address. */
+ tcg_out_mov(s, t, TCG_REG_L0, addrhi);
+ /* We add 4 to include the jmp that follows. */
+ label_ptr[1] = s->code_ptr + 4;
+ }
- /* TLB Hit. */
- tcg_out_qemu_ld_direct(s, datalo, datahi, TCG_REG_L1, 0, 0, opc);
+ tcg_out_opc(s, OPC_JMP_long, 0, 0, 0);
+ label_ptr[0] = s->code_ptr;
+ s->code_ptr += 4;
+ } else {
+ tcg_out_tlb_load(s, addrlo, addrhi, mem_index, s_bits,
+ label_ptr, offsetof(CPUTLBEntry, addr_read));
+
+ /* TLB Hit. */
+ tcg_out_qemu_ld_direct(s, datalo, datahi, TCG_REG_L1, 0, 0, opc);
+ }
/* Record the current context of a load into ldst label */
- add_qemu_ldst_label(s, true, oi, datalo, datahi, addrlo, addrhi,
+ add_qemu_ldst_label(s, true, oi, 0, datalo, datahi, addrlo, addrhi,
s->code_ptr, label_ptr);
#else
{
@@ -1659,9 +1730,10 @@ static void tcg_out_qemu_st_direct(TCGContext *s, TCGReg datalo, TCGReg datahi,
}
}
-static void tcg_out_qemu_st(TCGContext *s, const TCGArg *args, bool is64)
+static void tcg_out_qemu_st(TCGContext *s, const TCGArg *args, bool is64,
+ bool isStoreCond)
{
- TCGReg datalo, datahi, addrlo;
+ TCGReg datalo, datahi, addrlo, llsc_success;
TCGReg addrhi __attribute__((unused));
TCGMemOpIdx oi;
TCGMemOp opc;
@@ -1671,6 +1743,9 @@ static void tcg_out_qemu_st(TCGContext *s, const TCGArg *args, bool is64)
tcg_insn_unit *label_ptr[2];
#endif
+ /* The stcond variant has one more param */
+ llsc_success = (isStoreCond ? *args++ : 0);
+
datalo = *args++;
datahi = (TCG_TARGET_REG_BITS == 32 && is64 ? *args++ : 0);
addrlo = *args++;
@@ -1682,15 +1757,35 @@ static void tcg_out_qemu_st(TCGContext *s, const TCGArg *args, bool is64)
mem_index = get_mmuidx(oi);
s_bits = opc & MO_SIZE;
- tcg_out_tlb_load(s, addrlo, addrhi, mem_index, s_bits,
- label_ptr, offsetof(CPUTLBEntry, addr_write));
+ if (opc & MO_EXCL) {
+ TCGType t = ((TCG_TARGET_REG_BITS == 64) && (TARGET_LONG_BITS == 64)) ?
+ TCG_TYPE_I64 : TCG_TYPE_I32;
+ /* The JMP address will be filled afterwards,
+ * in tcg_out_qemu_ld_slow_path (two times when
+ * TARGET_LONG_BITS > TCG_TARGET_REG_BITS). */
+ tcg_out_mov(s, t, TCG_REG_L1, addrlo);
+
+ if (TARGET_LONG_BITS > TCG_TARGET_REG_BITS) {
+ /* Store the second part of the address. */
+ tcg_out_mov(s, t, TCG_REG_L0, addrhi);
+ /* We add 4 to include the jmp that follows. */
+ label_ptr[1] = s->code_ptr + 4;
+ }
- /* TLB Hit. */
- tcg_out_qemu_st_direct(s, datalo, datahi, TCG_REG_L1, 0, 0, opc);
+ tcg_out_opc(s, OPC_JMP_long, 0, 0, 0);
+ label_ptr[0] = s->code_ptr;
+ s->code_ptr += 4;
+ } else {
+ tcg_out_tlb_load(s, addrlo, addrhi, mem_index, s_bits,
+ label_ptr, offsetof(CPUTLBEntry, addr_write));
+
+ /* TLB Hit. */
+ tcg_out_qemu_st_direct(s, datalo, datahi, TCG_REG_L1, 0, 0, opc);
+ }
/* Record the current context of a store into ldst label */
- add_qemu_ldst_label(s, false, oi, datalo, datahi, addrlo, addrhi,
- s->code_ptr, label_ptr);
+ add_qemu_ldst_label(s, false, oi, llsc_success, datalo, datahi, addrlo,
+ addrhi, s->code_ptr, label_ptr);
#else
{
int32_t offset = GUEST_BASE;
@@ -1957,10 +2052,16 @@ static inline void tcg_out_op(TCGContext *s, TCGOpcode opc,
tcg_out_qemu_ld(s, args, 1);
break;
case INDEX_op_qemu_st_i32:
- tcg_out_qemu_st(s, args, 0);
+ tcg_out_qemu_st(s, args, 0, 0);
+ break;
+ case INDEX_op_qemu_stcond_i32:
+ tcg_out_qemu_st(s, args, 0, 1);
break;
case INDEX_op_qemu_st_i64:
- tcg_out_qemu_st(s, args, 1);
+ tcg_out_qemu_st(s, args, 1, 0);
+ break;
+ case INDEX_op_qemu_stcond_i64:
+ tcg_out_qemu_st(s, args, 1, 1);
break;
OP_32_64(mulu2):
@@ -2182,19 +2283,28 @@ static const TCGTargetOpDef x86_op_defs[] = {
#if TCG_TARGET_REG_BITS == 64
{ INDEX_op_qemu_ld_i32, { "r", "L" } },
+ { INDEX_op_qemu_ldlink_i32, { "r", "L" } },
{ INDEX_op_qemu_st_i32, { "L", "L" } },
+ { INDEX_op_qemu_stcond_i32, { "r", "L", "L" } },
{ INDEX_op_qemu_ld_i64, { "r", "L" } },
{ INDEX_op_qemu_st_i64, { "L", "L" } },
+ { INDEX_op_qemu_stcond_i64, { "r", "L", "L" } },
#elif TARGET_LONG_BITS <= TCG_TARGET_REG_BITS
{ INDEX_op_qemu_ld_i32, { "r", "L" } },
+ { INDEX_op_qemu_ldlink_i32, { "r", "L" } },
{ INDEX_op_qemu_st_i32, { "L", "L" } },
+ { INDEX_op_qemu_stcond_i32, { "r", "L", "L" } },
{ INDEX_op_qemu_ld_i64, { "r", "r", "L" } },
{ INDEX_op_qemu_st_i64, { "L", "L", "L" } },
+ { INDEX_op_qemu_stcond_i64, { "r", "L", "L", "L" } },
#else
{ INDEX_op_qemu_ld_i32, { "r", "L", "L" } },
+ { INDEX_op_qemu_ldlink_i32, { "r", "L", "L" } },
{ INDEX_op_qemu_st_i32, { "L", "L", "L" } },
+ { INDEX_op_qemu_stcond_i32, { "r", "L", "L", "L" } },
{ INDEX_op_qemu_ld_i64, { "r", "r", "L", "L" } },
{ INDEX_op_qemu_st_i64, { "L", "L", "L", "L" } },
+ { INDEX_op_qemu_stcond_i64, { "r", "L", "L", "L", "L" } },
#endif
{ -1 },
};
--
2.5.0
* [Qemu-devel] [RFC v4 7/9] tcg-arm: Implement excl variants of qemu_{ld, st}
From: Alvise Rigo @ 2015-08-07 17:03 UTC
To: qemu-devel, mttcg
Cc: alex.bennee, jani.kokkonen, tech, claudio.fontana, pbonzini
Implement the exclusive variants of qemu_{ld,st}_{i32,i64} for tcg-arm.
The lookup for the proper memory helper has been rewritten to take
into account the new exclusive helpers.
Suggested-by: Jani Kokkonen <jani.kokkonen@huawei.com>
Suggested-by: Claudio Fontana <claudio.fontana@huawei.com>
Signed-off-by: Alvise Rigo <a.rigo@virtualopensystems.com>
---
tcg/arm/tcg-target.c | 152 ++++++++++++++++++++++++++++++++++++++++-----------
1 file changed, 119 insertions(+), 33 deletions(-)
diff --git a/tcg/arm/tcg-target.c b/tcg/arm/tcg-target.c
index ae2ec7a..c2fae5e 100644
--- a/tcg/arm/tcg-target.c
+++ b/tcg/arm/tcg-target.c
@@ -1069,6 +1069,42 @@ static void * const qemu_ld_helpers[16] = {
[MO_BESL] = helper_be_ldul_mmu,
};
+/* LoadLink helpers, only unsigned. Use the macro below to access them. */
+static void * const qemu_ldex_helpers[16] = {
+ [MO_UB] = helper_ret_ldlinkub_mmu,
+
+ [MO_LEUW] = helper_le_ldlinkuw_mmu,
+ [MO_LEUL] = helper_le_ldlinkul_mmu,
+ [MO_LEQ] = helper_le_ldlinkq_mmu,
+
+ [MO_BEUW] = helper_be_ldlinkuw_mmu,
+ [MO_BEUL] = helper_be_ldlinkul_mmu,
+ [MO_BEQ] = helper_be_ldlinkq_mmu,
+};
+
+static inline tcg_insn_unit *ld_helper(TCGMemOp *opc)
+{
+ if (*opc & MO_EXCL) {
+ /* No sign-extended exclusive variants for ARM. */
+ assert(!(*opc & MO_SIGN));
+
+ return qemu_ldex_helpers[((int)*opc - MO_EXCL) & (MO_BSWAP | MO_SSIZE)];
+ } else {
+ /* For armv6 we can use the canonical unsigned helpers and minimize
+ icache usage. For pre-armv6, use the signed helpers since we do
+ not have a single insn sign-extend. */
+ if (!use_armv6_instructions) {
+ if (*opc & MO_SIGN) {
+ *opc = MO_UL;
+ }
+
+ return qemu_ld_helpers[*opc & (MO_BSWAP | MO_SSIZE)];
+ }
+ }
+
+ return qemu_ld_helpers[*opc & (MO_BSWAP | MO_SIZE)];
+}
+
/* helper signature: helper_ret_st_mmu(CPUState *env, target_ulong addr,
* uintxx_t val, int mmu_idx, uintptr_t ra)
*/
@@ -1082,6 +1118,26 @@ static void * const qemu_st_helpers[16] = {
[MO_BEQ] = helper_be_stq_mmu,
};
+/* StoreConditional helpers. Use the macro below to access them. */
+static void * const qemu_stex_helpers[16] = {
+ [MO_UB] = helper_ret_stcondb_mmu,
+ [MO_LEUW] = helper_le_stcondw_mmu,
+ [MO_LEUL] = helper_le_stcondl_mmu,
+ [MO_LEQ] = helper_le_stcondq_mmu,
+ [MO_BEUW] = helper_be_stcondw_mmu,
+ [MO_BEUL] = helper_be_stcondl_mmu,
+ [MO_BEQ] = helper_be_stcondq_mmu,
+};
+
+static inline tcg_insn_unit *st_helper(TCGMemOp *opc)
+{
+ if (*opc & MO_EXCL) {
+ return qemu_stex_helpers[((int)*opc - MO_EXCL) & (MO_BSWAP | MO_SSIZE)];
+ }
+
+ return qemu_st_helpers[*opc & (MO_BSWAP | MO_SIZE)];
+}
+
/* Helper routines for marshalling helper function arguments into
* the correct registers and stack.
* argreg is where we want to put this argument, arg is the argument itself.
@@ -1222,13 +1278,14 @@ static TCGReg tcg_out_tlb_read(TCGContext *s, TCGReg addrlo, TCGReg addrhi,
path for a load or store, so that we can later generate the correct
helper code. */
static void add_qemu_ldst_label(TCGContext *s, bool is_ld, TCGMemOpIdx oi,
- TCGReg datalo, TCGReg datahi, TCGReg addrlo,
- TCGReg addrhi, tcg_insn_unit *raddr,
- tcg_insn_unit *label_ptr)
+ TCGReg llsc_success, TCGReg datalo,
+ TCGReg datahi, TCGReg addrlo, TCGReg addrhi,
+ tcg_insn_unit *raddr, tcg_insn_unit *label_ptr)
{
TCGLabelQemuLdst *label = new_ldst_label(s);
label->is_ld = is_ld;
+ label->llsc_success = llsc_success;
label->oi = oi;
label->datalo_reg = datalo;
label->datahi_reg = datahi;
@@ -1256,17 +1313,7 @@ static void tcg_out_qemu_ld_slow_path(TCGContext *s, TCGLabelQemuLdst *lb)
argreg = tcg_out_arg_imm32(s, argreg, oi);
argreg = tcg_out_arg_reg32(s, argreg, TCG_REG_R14);
- /* For armv6 we can use the canonical unsigned helpers and minimize
- icache usage. For pre-armv6, use the signed helpers since we do
- not have a single insn sign-extend. */
- if (use_armv6_instructions) {
- func = qemu_ld_helpers[opc & (MO_BSWAP | MO_SIZE)];
- } else {
- func = qemu_ld_helpers[opc & (MO_BSWAP | MO_SSIZE)];
- if (opc & MO_SIGN) {
- opc = MO_UL;
- }
- }
+ func = ld_helper(&opc);
tcg_out_call(s, func);
datalo = lb->datalo_reg;
@@ -1336,8 +1383,15 @@ static void tcg_out_qemu_st_slow_path(TCGContext *s, TCGLabelQemuLdst *lb)
argreg = tcg_out_arg_imm32(s, argreg, oi);
argreg = tcg_out_arg_reg32(s, argreg, TCG_REG_R14);
- /* Tail-call to the helper, which will return to the fast path. */
- tcg_out_goto(s, COND_AL, qemu_st_helpers[opc & (MO_BSWAP | MO_SIZE)]);
+ if (opc & MO_EXCL) {
+ tcg_out_call(s, st_helper(&opc));
+ /* Save the output of the StoreConditional */
+ tcg_out_mov_reg(s, COND_AL, lb->llsc_success, TCG_REG_R0);
+ tcg_out_goto(s, COND_AL, lb->raddr);
+ } else {
+ /* Tail-call to the helper, which will return to the fast path. */
+ tcg_out_goto(s, COND_AL, st_helper(&opc));
+ }
}
#endif /* SOFTMMU */
@@ -1468,7 +1522,6 @@ static void tcg_out_qemu_ld(TCGContext *s, const TCGArg *args, bool is64)
TCGMemOp opc;
#ifdef CONFIG_SOFTMMU
int mem_index;
- TCGReg addend;
tcg_insn_unit *label_ptr;
#endif
@@ -1481,16 +1534,26 @@ static void tcg_out_qemu_ld(TCGContext *s, const TCGArg *args, bool is64)
#ifdef CONFIG_SOFTMMU
mem_index = get_mmuidx(oi);
- addend = tcg_out_tlb_read(s, addrlo, addrhi, opc & MO_SIZE, mem_index, 1);
/* This a conditional BL only to load a pointer within this opcode into LR
- for the slow path. We will not be using the value for a tail call. */
- label_ptr = s->code_ptr;
- tcg_out_bl_noaddr(s, COND_NE);
+ for the slow path. We will not be using the value for a tail call.
+ In the context of a LoadLink instruction, we don't check the TLB but we
+ always follow the slow path. */
+ if (opc & MO_EXCL) {
+ label_ptr = s->code_ptr;
+ tcg_out_bl_noaddr(s, COND_AL);
+ } else {
+ TCGReg addend;
- tcg_out_qemu_ld_index(s, opc, datalo, datahi, addrlo, addend);
+ addend = tcg_out_tlb_read(s, addrlo, addrhi, opc & MO_SIZE,
+ mem_index, 1);
+ label_ptr = s->code_ptr;
+ tcg_out_bl_noaddr(s, COND_NE);
- add_qemu_ldst_label(s, true, oi, datalo, datahi, addrlo, addrhi,
+ tcg_out_qemu_ld_index(s, opc, datalo, datahi, addrlo, addend);
+ }
+
+ add_qemu_ldst_label(s, true, oi, 0, datalo, datahi, addrlo, addrhi,
s->code_ptr, label_ptr);
#else /* !CONFIG_SOFTMMU */
if (GUEST_BASE) {
@@ -1592,15 +1655,20 @@ static inline void tcg_out_qemu_st_direct(TCGContext *s, TCGMemOp opc,
}
}
-static void tcg_out_qemu_st(TCGContext *s, const TCGArg *args, bool is64)
+static void tcg_out_qemu_st(TCGContext *s, const TCGArg *args, bool is64,
+ bool isStoreCond)
{
TCGReg addrlo, datalo, datahi, addrhi __attribute__((unused));
TCGMemOpIdx oi;
TCGMemOp opc;
#ifdef CONFIG_SOFTMMU
+ TCGReg llsc_success;
int mem_index;
TCGReg addend;
tcg_insn_unit *label_ptr;
+
+ /* The stcond variant has one more param */
+ llsc_success = (isStoreCond ? *args++ : 0);
#endif
datalo = *args++;
@@ -1612,16 +1680,24 @@ static void tcg_out_qemu_st(TCGContext *s, const TCGArg *args, bool is64)
#ifdef CONFIG_SOFTMMU
mem_index = get_mmuidx(oi);
- addend = tcg_out_tlb_read(s, addrlo, addrhi, opc & MO_SIZE, mem_index, 0);
- tcg_out_qemu_st_index(s, COND_EQ, opc, datalo, datahi, addrlo, addend);
+ if (isStoreCond) {
+ /* Always follow the slow-path for an exclusive access */
+ label_ptr = s->code_ptr;
+ tcg_out_bl_noaddr(s, COND_AL);
+ } else {
+ addend = tcg_out_tlb_read(s, addrlo, addrhi, opc & MO_SIZE,
+ mem_index, 0);
- /* The conditional call must come last, as we're going to return here. */
- label_ptr = s->code_ptr;
- tcg_out_bl_noaddr(s, COND_NE);
+ tcg_out_qemu_st_index(s, COND_EQ, opc, datalo, datahi, addrlo, addend);
- add_qemu_ldst_label(s, false, oi, datalo, datahi, addrlo, addrhi,
- s->code_ptr, label_ptr);
+ /* The conditional call must come last, as we're going to return here.*/
+ label_ptr = s->code_ptr;
+ tcg_out_bl_noaddr(s, COND_NE);
+ }
+
+ add_qemu_ldst_label(s, false, oi, llsc_success, datalo, datahi, addrlo,
+ addrhi, s->code_ptr, label_ptr);
#else /* !CONFIG_SOFTMMU */
if (GUEST_BASE) {
tcg_out_movi(s, TCG_TYPE_PTR, TCG_REG_TMP, GUEST_BASE);
@@ -1870,10 +1946,16 @@ static inline void tcg_out_op(TCGContext *s, TCGOpcode opc,
tcg_out_qemu_ld(s, args, 1);
break;
case INDEX_op_qemu_st_i32:
- tcg_out_qemu_st(s, args, 0);
+ tcg_out_qemu_st(s, args, 0, 0);
+ break;
+ case INDEX_op_qemu_stcond_i32:
+ tcg_out_qemu_st(s, args, 0, 1); /* StoreConditional */
break;
case INDEX_op_qemu_st_i64:
- tcg_out_qemu_st(s, args, 1);
+ tcg_out_qemu_st(s, args, 1, 0);
+ break;
+ case INDEX_op_qemu_stcond_i64:
+ tcg_out_qemu_st(s, args, 1, 1); /* StoreConditional */
break;
case INDEX_op_bswap16_i32:
@@ -1959,12 +2041,16 @@ static const TCGTargetOpDef arm_op_defs[] = {
{ INDEX_op_qemu_ld_i32, { "r", "l" } },
{ INDEX_op_qemu_ld_i64, { "r", "r", "l" } },
{ INDEX_op_qemu_st_i32, { "s", "s" } },
+ { INDEX_op_qemu_stcond_i32, { "r", "s", "s" } },
{ INDEX_op_qemu_st_i64, { "s", "s", "s" } },
+ { INDEX_op_qemu_stcond_i64, { "r", "s", "s", "s" } },
#else
{ INDEX_op_qemu_ld_i32, { "r", "l", "l" } },
{ INDEX_op_qemu_ld_i64, { "r", "r", "l", "l" } },
{ INDEX_op_qemu_st_i32, { "s", "s", "s" } },
+ { INDEX_op_qemu_stcond_i32, { "r", "s", "s", "s" } },
{ INDEX_op_qemu_st_i64, { "s", "s", "s", "s" } },
+ { INDEX_op_qemu_stcond_i64, { "r", "s", "s", "s", "s" } },
#endif
{ INDEX_op_bswap16_i32, { "r", "r" } },
--
2.5.0
* [Qemu-devel] [RFC v4 8/9] tcg-aarch64: Implement excl variants of qemu_{ld, st}
2015-08-07 17:03 [Qemu-devel] [RFC v4 0/9] Slow-path for atomic instruction translation Alvise Rigo
` (6 preceding siblings ...)
2015-08-07 17:03 ` [Qemu-devel] [RFC v4 7/9] tcg-arm: " Alvise Rigo
@ 2015-08-07 17:03 ` Alvise Rigo
2015-08-07 17:03 ` [Qemu-devel] [RFC v4 9/9] target-arm: translate: Use ld/st excl for atomic insns Alvise Rigo
8 siblings, 0 replies; 42+ messages in thread
From: Alvise Rigo @ 2015-08-07 17:03 UTC (permalink / raw)
To: qemu-devel, mttcg
Cc: alex.bennee, jani.kokkonen, tech, claudio.fontana, pbonzini
Implement the exclusive variants of qemu_{ld,st}_{i32,i64} for
tcg-aarch64.
The lookup for the proper memory helper has been rewritten to take
into account the new exclusive helpers.
Suggested-by: Jani Kokkonen <jani.kokkonen@huawei.com>
Suggested-by: Claudio Fontana <claudio.fontana@huawei.com>
Signed-off-by: Alvise Rigo <a.rigo@virtualopensystems.com>
---
tcg/aarch64/tcg-target.c | 99 ++++++++++++++++++++++++++++++++++++++++++++----
1 file changed, 92 insertions(+), 7 deletions(-)
diff --git a/tcg/aarch64/tcg-target.c b/tcg/aarch64/tcg-target.c
index fe44ad7..790ae90 100644
--- a/tcg/aarch64/tcg-target.c
+++ b/tcg/aarch64/tcg-target.c
@@ -971,6 +971,32 @@ static void * const qemu_ld_helpers[16] = {
[MO_BEQ] = helper_be_ldq_mmu,
};
+/* LoadLink helpers, only unsigned. Use the macro below to access them. */
+static void * const qemu_ldex_helpers[16] = {
+ [MO_UB] = helper_ret_ldlinkub_mmu,
+
+ [MO_LEUW] = helper_le_ldlinkuw_mmu,
+ [MO_LEUL] = helper_le_ldlinkul_mmu,
+ [MO_LEQ] = helper_le_ldlinkq_mmu,
+
+ [MO_BEUW] = helper_be_ldlinkuw_mmu,
+ [MO_BEUL] = helper_be_ldlinkul_mmu,
+ [MO_BEQ] = helper_be_ldlinkq_mmu,
+};
+
+static inline tcg_insn_unit *ld_helper(TCGMemOp opc)
+{
+ if (opc & MO_EXCL) {
+ /* No sign-extended exclusive variants for ARM. */
+ assert(!(opc & MO_SIGN));
+
+ return qemu_ldex_helpers[((int)opc - MO_EXCL) & (MO_BSWAP | MO_SIZE)];
+ }
+
+ return qemu_ld_helpers[opc & (MO_BSWAP | MO_SIZE)];
+}
+
+
/* helper signature: helper_ret_st_mmu(CPUState *env, target_ulong addr,
* uintxx_t val, TCGMemOpIdx oi,
* uintptr_t ra)
@@ -985,6 +1011,26 @@ static void * const qemu_st_helpers[16] = {
[MO_BEQ] = helper_be_stq_mmu,
};
+/* StoreConditional helpers. Use the macro below to access them. */
+static void * const qemu_stex_helpers[16] = {
+ [MO_UB] = helper_ret_stcondb_mmu,
+ [MO_LEUW] = helper_le_stcondw_mmu,
+ [MO_LEUL] = helper_le_stcondl_mmu,
+ [MO_LEQ] = helper_le_stcondq_mmu,
+ [MO_BEUW] = helper_be_stcondw_mmu,
+ [MO_BEUL] = helper_be_stcondl_mmu,
+ [MO_BEQ] = helper_be_stcondq_mmu,
+};
+
+static inline tcg_insn_unit *st_helper(TCGMemOp opc)
+{
+ if (opc & MO_EXCL) {
+ return qemu_stex_helpers[((int)opc - MO_EXCL) & (MO_BSWAP | MO_SSIZE)];
+ }
+
+ return qemu_st_helpers[opc & (MO_BSWAP | MO_SIZE)];
+}
+
static inline void tcg_out_adr(TCGContext *s, TCGReg rd, void *target)
{
ptrdiff_t offset = tcg_pcrel_diff(s, target);
@@ -1004,7 +1050,7 @@ static void tcg_out_qemu_ld_slow_path(TCGContext *s, TCGLabelQemuLdst *lb)
tcg_out_mov(s, TARGET_LONG_BITS == 64, TCG_REG_X1, lb->addrlo_reg);
tcg_out_movi(s, TCG_TYPE_I32, TCG_REG_X2, oi);
tcg_out_adr(s, TCG_REG_X3, lb->raddr);
- tcg_out_call(s, qemu_ld_helpers[opc & (MO_BSWAP | MO_SIZE)]);
+ tcg_out_call(s, ld_helper(opc));
if (opc & MO_SIGN) {
tcg_out_sxt(s, lb->type, size, lb->datalo_reg, TCG_REG_X0);
} else {
@@ -1027,17 +1073,23 @@ static void tcg_out_qemu_st_slow_path(TCGContext *s, TCGLabelQemuLdst *lb)
tcg_out_mov(s, size == MO_64, TCG_REG_X2, lb->datalo_reg);
tcg_out_movi(s, TCG_TYPE_I32, TCG_REG_X3, oi);
tcg_out_adr(s, TCG_REG_X4, lb->raddr);
- tcg_out_call(s, qemu_st_helpers[opc & (MO_BSWAP | MO_SIZE)]);
+
+ tcg_out_call(s, st_helper(opc));
+ if (opc & MO_EXCL) {
+ tcg_out_mov(s, TCG_TYPE_I32, lb->llsc_success, TCG_REG_X0);
+ }
tcg_out_goto(s, lb->raddr);
}
static void add_qemu_ldst_label(TCGContext *s, bool is_ld, TCGMemOpIdx oi,
TCGType ext, TCGReg data_reg, TCGReg addr_reg,
- tcg_insn_unit *raddr, tcg_insn_unit *label_ptr)
+ tcg_insn_unit *raddr, tcg_insn_unit *label_ptr,
+ TCGReg llsc_success)
{
TCGLabelQemuLdst *label = new_ldst_label(s);
label->is_ld = is_ld;
+ label->llsc_success = llsc_success;
label->oi = oi;
label->type = ext;
label->datalo_reg = data_reg;
@@ -1206,10 +1258,18 @@ static void tcg_out_qemu_ld(TCGContext *s, TCGReg data_reg, TCGReg addr_reg,
TCGMemOp s_bits = memop & MO_SIZE;
tcg_insn_unit *label_ptr;
- tcg_out_tlb_read(s, addr_reg, s_bits, &label_ptr, mem_index, 1);
- tcg_out_qemu_ld_direct(s, memop, ext, data_reg, addr_reg, TCG_REG_X1);
+ if (memop & MO_EXCL) {
+ /* If this is a LL access, we don't read the TLB but we always follow
+ * the slow path. */
+ label_ptr = s->code_ptr;
+ tcg_out_goto_cond_noaddr(s, TCG_COND_ALWAYS);
+ } else {
+ tcg_out_tlb_read(s, addr_reg, s_bits, &label_ptr, mem_index, 1);
+ tcg_out_qemu_ld_direct(s, memop, ext, data_reg, addr_reg, TCG_REG_X1);
+ }
+
add_qemu_ldst_label(s, true, oi, ext, data_reg, addr_reg,
- s->code_ptr, label_ptr);
+ s->code_ptr, label_ptr, 0);
#else /* !CONFIG_SOFTMMU */
tcg_out_qemu_ld_direct(s, memop, ext, data_reg, addr_reg,
GUEST_BASE ? TCG_REG_GUEST_BASE : TCG_REG_XZR);
@@ -1225,16 +1285,32 @@ static void tcg_out_qemu_st(TCGContext *s, TCGReg data_reg, TCGReg addr_reg,
TCGMemOp s_bits = memop & MO_SIZE;
tcg_insn_unit *label_ptr;
+
tcg_out_tlb_read(s, addr_reg, s_bits, &label_ptr, mem_index, 0);
tcg_out_qemu_st_direct(s, memop, data_reg, addr_reg, TCG_REG_X1);
add_qemu_ldst_label(s, false, oi, s_bits == MO_64, data_reg, addr_reg,
- s->code_ptr, label_ptr);
+ s->code_ptr, label_ptr, 0);
#else /* !CONFIG_SOFTMMU */
tcg_out_qemu_st_direct(s, memop, data_reg, addr_reg,
GUEST_BASE ? TCG_REG_GUEST_BASE : TCG_REG_XZR);
#endif /* CONFIG_SOFTMMU */
}
+static void tcg_out_qemu_stcond(TCGContext *s, TCGReg llsc_success,
+ TCGReg data_reg, TCGReg addr_reg,
+ TCGMemOpIdx oi)
+{
+ TCGMemOp memop = get_memop(oi);
+ TCGMemOp s_bits = memop & MO_SIZE;
+ tcg_insn_unit *label_ptr;
+
+ label_ptr = s->code_ptr;
+ tcg_out_goto_cond_noaddr(s, TCG_COND_ALWAYS);
+
+ add_qemu_ldst_label(s, false, oi, s_bits == MO_64, data_reg, addr_reg,
+ s->code_ptr, label_ptr, llsc_success);
+}
+
static tcg_insn_unit *tb_ret_addr;
static void tcg_out_op(TCGContext *s, TCGOpcode opc,
@@ -1526,6 +1602,13 @@ static void tcg_out_op(TCGContext *s, TCGOpcode opc,
case INDEX_op_qemu_st_i64:
tcg_out_qemu_st(s, REG0(0), a1, a2);
break;
+ case INDEX_op_qemu_stcond_i32:
+ case INDEX_op_qemu_stcond_i64:
+ tcg_out_qemu_stcond(s, REG0(0), a1, a2, args[3]);
+ break;
case INDEX_op_bswap64_i64:
tcg_out_rev64(s, a0, a1);
@@ -1684,7 +1767,9 @@ static const TCGTargetOpDef aarch64_op_defs[] = {
{ INDEX_op_qemu_ld_i32, { "r", "l" } },
{ INDEX_op_qemu_ld_i64, { "r", "l" } },
{ INDEX_op_qemu_st_i32, { "lZ", "l" } },
+ { INDEX_op_qemu_stcond_i32, { "r", "lZ", "l" } },
{ INDEX_op_qemu_st_i64, { "lZ", "l" } },
+ { INDEX_op_qemu_stcond_i64, { "r", "lZ", "l" } },
{ INDEX_op_bswap16_i32, { "r", "r" } },
{ INDEX_op_bswap32_i32, { "r", "r" } },
--
2.5.0
* [Qemu-devel] [RFC v4 9/9] target-arm: translate: Use ld/st excl for atomic insns
2015-08-07 17:03 [Qemu-devel] [RFC v4 0/9] Slow-path for atomic instruction translation Alvise Rigo
` (7 preceding siblings ...)
2015-08-07 17:03 ` [Qemu-devel] [RFC v4 8/9] tcg-aarch64: " Alvise Rigo
@ 2015-08-07 17:03 ` Alvise Rigo
8 siblings, 0 replies; 42+ messages in thread
From: Alvise Rigo @ 2015-08-07 17:03 UTC (permalink / raw)
To: qemu-devel, mttcg
Cc: alex.bennee, jani.kokkonen, tech, claudio.fontana, pbonzini
Use the TCG load and store exclusive operations if QEMU has been
configured to do so.
Suggested-by: Jani Kokkonen <jani.kokkonen@huawei.com>
Suggested-by: Claudio Fontana <claudio.fontana@huawei.com>
Signed-off-by: Alvise Rigo <a.rigo@virtualopensystems.com>
---
target-arm/translate.c | 191 ++++++++++++++++++++++++++++++++++++++++++++++---
1 file changed, 183 insertions(+), 8 deletions(-)
diff --git a/target-arm/translate.c b/target-arm/translate.c
index 69ac18c..791263a 100644
--- a/target-arm/translate.c
+++ b/target-arm/translate.c
@@ -65,8 +65,12 @@ TCGv_ptr cpu_env;
static TCGv_i64 cpu_V0, cpu_V1, cpu_M0;
static TCGv_i32 cpu_R[16];
static TCGv_i32 cpu_CF, cpu_NF, cpu_VF, cpu_ZF;
+#ifndef CONFIG_TCG_USE_LDST_EXCL
static TCGv_i64 cpu_exclusive_addr;
static TCGv_i64 cpu_exclusive_val;
+#else
+static TCGv_i32 cpu_ll_sc_context;
+#endif
#ifdef CONFIG_USER_ONLY
static TCGv_i64 cpu_exclusive_test;
static TCGv_i32 cpu_exclusive_info;
@@ -99,10 +103,15 @@ void arm_translate_init(void)
cpu_VF = tcg_global_mem_new_i32(TCG_AREG0, offsetof(CPUARMState, VF), "VF");
cpu_ZF = tcg_global_mem_new_i32(TCG_AREG0, offsetof(CPUARMState, ZF), "ZF");
+#ifdef CONFIG_TCG_USE_LDST_EXCL
+ cpu_ll_sc_context = tcg_global_mem_new_i32(TCG_AREG0,
+ offsetof(CPUARMState, ll_sc_context), "ll_sc_context");
+#else
cpu_exclusive_addr = tcg_global_mem_new_i64(TCG_AREG0,
offsetof(CPUARMState, exclusive_addr), "exclusive_addr");
cpu_exclusive_val = tcg_global_mem_new_i64(TCG_AREG0,
offsetof(CPUARMState, exclusive_val), "exclusive_val");
+#endif
#ifdef CONFIG_USER_ONLY
cpu_exclusive_test = tcg_global_mem_new_i64(TCG_AREG0,
offsetof(CPUARMState, exclusive_test), "exclusive_test");
@@ -904,16 +913,34 @@ static inline void gen_aa32_st##SUFF(TCGv_i32 val, TCGv_i32 addr, int index) \
tcg_gen_qemu_st_i32(val, addr, index, OPC); \
}
+#define DO_GEN_STEX(SUFF, OPC) \
+static inline void gen_aa32_stex##SUFF(TCGv_i32 is_dirty, TCGv_i32 val, \
+ TCGv_i32 addr, int index) \
+{ \
+ tcg_gen_qemu_stcond_i32(is_dirty, val, addr, index, OPC); \
+}
+
static inline void gen_aa32_ld64(TCGv_i64 val, TCGv_i32 addr, int index)
{
tcg_gen_qemu_ld_i64(val, addr, index, MO_TEQ);
}
+static inline void gen_aa32_ld64ex(TCGv_i64 val, TCGv_i32 addr, int index)
+{
+ tcg_gen_qemu_ld_i64(val, addr, index, MO_TEQ | MO_EXCL);
+}
+
static inline void gen_aa32_st64(TCGv_i64 val, TCGv_i32 addr, int index)
{
tcg_gen_qemu_st_i64(val, addr, index, MO_TEQ);
}
+static inline void gen_aa32_stex64(TCGv_i32 is_dirty, TCGv_i64 val,
+ TCGv_i32 addr, int index)
+{
+ tcg_gen_qemu_stcond_i64(is_dirty, val, addr, index, MO_TEQ | MO_EXCL);
+}
+
#else
#define DO_GEN_LD(SUFF, OPC) \
@@ -934,6 +961,16 @@ static inline void gen_aa32_st##SUFF(TCGv_i32 val, TCGv_i32 addr, int index) \
tcg_temp_free(addr64); \
}
+#define DO_GEN_STEX(SUFF, OPC) \
+static inline void gen_aa32_stex##SUFF(TCGv_i32 is_dirty, TCGv_i32 val, \
+ TCGv_i32 addr, int index) \
+{ \
+ TCGv addr64 = tcg_temp_new(); \
+ tcg_gen_extu_i32_i64(addr64, addr); \
+ tcg_gen_qemu_stcond_i32(is_dirty, val, addr64, index, OPC); \
+ tcg_temp_free(addr64); \
+}
+
static inline void gen_aa32_ld64(TCGv_i64 val, TCGv_i32 addr, int index)
{
TCGv addr64 = tcg_temp_new();
@@ -942,6 +979,14 @@ static inline void gen_aa32_ld64(TCGv_i64 val, TCGv_i32 addr, int index)
tcg_temp_free(addr64);
}
+static inline void gen_aa32_ld64ex(TCGv_i64 val, TCGv_i32 addr, int index)
+{
+ TCGv addr64 = tcg_temp_new();
+ tcg_gen_extu_i32_i64(addr64, addr);
+ tcg_gen_qemu_ld_i64(val, addr64, index, MO_TEQ | MO_EXCL);
+ tcg_temp_free(addr64);
+}
+
static inline void gen_aa32_st64(TCGv_i64 val, TCGv_i32 addr, int index)
{
TCGv addr64 = tcg_temp_new();
@@ -950,17 +995,34 @@ static inline void gen_aa32_st64(TCGv_i64 val, TCGv_i32 addr, int index)
tcg_temp_free(addr64);
}
+static inline void gen_aa32_stex64(TCGv_i32 is_dirty, TCGv_i64 val,
+ TCGv_i32 addr, int index)
+{
+ TCGv addr64 = tcg_temp_new();
+ tcg_gen_extu_i32_i64(addr64, addr);
+ tcg_gen_qemu_stcond_i64(is_dirty, val, addr64, index, MO_TEQ | MO_EXCL);
+ tcg_temp_free(addr64);
+}
+
#endif
DO_GEN_LD(8s, MO_SB)
DO_GEN_LD(8u, MO_UB)
+DO_GEN_LD(8uex, MO_UB | MO_EXCL)
DO_GEN_LD(16s, MO_TESW)
DO_GEN_LD(16u, MO_TEUW)
+DO_GEN_LD(16uex, MO_TEUW | MO_EXCL)
DO_GEN_LD(32u, MO_TEUL)
+DO_GEN_LD(32uex, MO_TEUL | MO_EXCL)
DO_GEN_ST(8, MO_UB)
DO_GEN_ST(16, MO_TEUW)
DO_GEN_ST(32, MO_TEUL)
+/* Load/Store exclusive generators (always unsigned) */
+DO_GEN_STEX(8, MO_UB)
+DO_GEN_STEX(16, MO_TEUW)
+DO_GEN_STEX(32, MO_TEUL)
+
static inline void gen_set_pc_im(DisasContext *s, target_ulong val)
{
tcg_gen_movi_i32(cpu_R[15], val);
@@ -7382,15 +7444,60 @@ static void gen_logicq_cc(TCGv_i32 lo, TCGv_i32 hi)
tcg_gen_or_i32(cpu_ZF, lo, hi);
}
-/* Load/Store exclusive instructions are implemented by remembering
+/* If the TCG backend supports the slowpath for atomic instructions,
+ then the Load/Store exclusive instructions will use it, offloading
+ most of the work to the softmmu_llsc_template.h functions.
+
+ Otherwise, these instructions are implemented by remembering
the value/address loaded, and seeing if these are the same
when the store is performed. This should be sufficient to implement
the architecturally mandated semantics, and avoids having to monitor
regular stores.
- In system emulation mode only one CPU will be running at once, so
- this sequence is effectively atomic. In user emulation mode we
- throw an exception and handle the atomic operation elsewhere. */
+ In user emulation mode we throw an exception and handle the atomic
+ operation elsewhere. */
+#ifdef CONFIG_TCG_USE_LDST_EXCL
+static void gen_load_exclusive(DisasContext *s, int rt, int rt2,
+ TCGv_i32 addr, int size)
+{
+ TCGv_i32 tmp = tcg_temp_new_i32();
+
+ if (size != 3) {
+ switch (size) {
+ case 0:
+ gen_aa32_ld8uex(tmp, addr, get_mem_index(s));
+ break;
+ case 1:
+ gen_aa32_ld16uex(tmp, addr, get_mem_index(s));
+ break;
+ case 2:
+ gen_aa32_ld32uex(tmp, addr, get_mem_index(s));
+ break;
+ default:
+ abort();
+ }
+
+ store_reg(s, rt, tmp);
+ } else {
+ TCGv_i64 tmp64 = tcg_temp_new_i64();
+ TCGv_i32 tmph = tcg_temp_new_i32();
+
+ gen_aa32_ld64ex(tmp64, addr, get_mem_index(s));
+ tcg_gen_extr_i64_i32(tmp, tmph, tmp64);
+
+ store_reg(s, rt, tmp);
+ store_reg(s, rt2, tmph);
+
+ tcg_temp_free_i32(tmph);
+ tcg_temp_free_i64(tmp64);
+ }
+
+ tcg_temp_free_i32(tmp);
+
+ /* From now on we are in LL/SC context. */
+ tcg_gen_movi_i32(cpu_ll_sc_context, 1);
+}
+#else
static void gen_load_exclusive(DisasContext *s, int rt, int rt2,
TCGv_i32 addr, int size)
{
@@ -7429,10 +7536,14 @@ static void gen_load_exclusive(DisasContext *s, int rt, int rt2,
store_reg(s, rt, tmp);
tcg_gen_extu_i32_i64(cpu_exclusive_addr, addr);
}
+#endif
static void gen_clrex(DisasContext *s)
{
+#ifdef CONFIG_TCG_USE_LDST_EXCL
+ /* Leave the LL/SC context so that a following SC will fail. */
+ tcg_gen_movi_i32(cpu_ll_sc_context, 0);
+#else
tcg_gen_movi_i64(cpu_exclusive_addr, -1);
+#endif
}
#ifdef CONFIG_USER_ONLY
@@ -7444,6 +7555,66 @@ static void gen_store_exclusive(DisasContext *s, int rd, int rt, int rt2,
size | (rd << 4) | (rt << 8) | (rt2 << 12));
gen_exception_internal_insn(s, 4, EXCP_STREX);
}
+#elif defined CONFIG_TCG_USE_LDST_EXCL
+static void gen_store_exclusive(DisasContext *s, int rd, int rt, int rt2,
+ TCGv_i32 addr, int size)
+{
+ TCGv_i32 is_dirty;
+ TCGLabel *done_label;
+ TCGLabel *fail_label;
+
+ fail_label = gen_new_label();
+ done_label = gen_new_label();
+
+ is_dirty = tcg_temp_new_i32();
+
+ /* Fail if we are not in LL/SC context. */
+ tcg_gen_brcondi_i32(TCG_COND_NE, cpu_ll_sc_context, 1, fail_label);
+ tcg_gen_movi_i32(cpu_ll_sc_context, 0);
+
+ TCGv_i32 tmp;
+
+ tmp = load_reg(s, rt);
+
+ if (size != 3) {
+ switch (size) {
+ case 0:
+ gen_aa32_stex8(is_dirty, tmp, addr, get_mem_index(s));
+ break;
+ case 1:
+ gen_aa32_stex16(is_dirty, tmp, addr, get_mem_index(s));
+ break;
+ case 2:
+ gen_aa32_stex32(is_dirty, tmp, addr, get_mem_index(s));
+ break;
+ default:
+ abort();
+ }
+ } else {
+ TCGv_i64 tmp64;
+ TCGv_i32 tmp2;
+
+ tmp64 = tcg_temp_new_i64();
+ tmp2 = load_reg(s, rt2);
+ tcg_gen_concat_i32_i64(tmp64, tmp, tmp2);
+
+ gen_aa32_stex64(is_dirty, tmp64, addr, get_mem_index(s));
+ tcg_temp_free_i32(tmp2);
+ tcg_temp_free_i64(tmp64);
+ }
+ tcg_temp_free_i32(tmp);
+
+ /* Check if the store conditional has to fail. */
+ tcg_gen_brcondi_i32(TCG_COND_EQ, is_dirty, 1, fail_label);
+ tcg_temp_free_i32(is_dirty);
+
+ tcg_gen_movi_i32(cpu_R[rd], 0); /* is_dirty = 0 */
+ tcg_gen_br(done_label);
+ gen_set_label(fail_label);
+ tcg_gen_movi_i32(cpu_R[rd], 1); /* is_dirty = 1 */
+ gen_set_label(done_label);
+}
#else
static void gen_store_exclusive(DisasContext *s, int rd, int rt, int rt2,
TCGv_i32 addr, int size)
@@ -8394,16 +8565,20 @@ static void disas_arm_insn(DisasContext *s, unsigned int insn)
rm = insn & 0xf;
switch (op1) {
case 0: /* strex */
- gen_store_exclusive(s, rd, rm, 15, addr, 2);
+ gen_store_exclusive(s, rd, rm, 15,
+ addr, 2);
break;
case 1: /* strexd */
- gen_store_exclusive(s, rd, rm, rm + 1, addr, 3);
+ gen_store_exclusive(s, rd, rm, rm + 1,
+ addr, 3);
break;
case 2: /* strexb */
- gen_store_exclusive(s, rd, rm, 15, addr, 0);
+ gen_store_exclusive(s, rd, rm, 15,
+ addr, 0);
break;
case 3: /* strexh */
- gen_store_exclusive(s, rd, rm, 15, addr, 1);
+ gen_store_exclusive(s, rd, rm, 15,
+ addr, 1);
break;
default:
abort();
--
2.5.0
* Re: [Qemu-devel] [RFC v4 5/9] configure: Enable/disable new qemu_{ld, st} excl insns
2015-08-07 17:03 ` [Qemu-devel] [RFC v4 5/9] configure: Enable/disable new qemu_{ld, st} excl insns Alvise Rigo
@ 2015-08-08 12:44 ` Aurelien Jarno
2015-08-08 13:57 ` Peter Maydell
2015-08-09 8:11 ` Alex Bennée
0 siblings, 2 replies; 42+ messages in thread
From: Aurelien Jarno @ 2015-08-08 12:44 UTC (permalink / raw)
To: Alvise Rigo
Cc: mttcg, claudio.fontana, qemu-devel, pbonzini, jani.kokkonen, tech,
alex.bennee
On 2015-08-07 19:03, Alvise Rigo wrote:
> Introduce the new --enable-tcg-ldst-excl configure option to enable the
> LL/SC operations only for those backends that support them.
>
> Suggested-by: Jani Kokkonen <jani.kokkonen@huawei.com>
> Suggested-by: Claudio Fontana <claudio.fontana@huawei.com>
> Signed-off-by: Alvise Rigo <a.rigo@virtualopensystems.com>
> ---
> configure | 21 +++++++++++++++++++++
> 1 file changed, 21 insertions(+)
We have seen that for this kind of patch, it's better to add support in
all backends, otherwise it takes ages to get all the backends converted.
I think you should involve the backend maintainers. I can try to provide
the corresponding patches for mips and ia64.
--
Aurelien Jarno GPG: 4096R/1DDD8C9B
aurelien@aurel32.net http://www.aurel32.net
* Re: [Qemu-devel] [RFC v4 6/9] tcg-i386: Implement excl variants of qemu_{ld, st}
2015-08-07 17:03 ` [Qemu-devel] [RFC v4 6/9] tcg-i386: Implement excl variants of qemu_{ld, st} Alvise Rigo
@ 2015-08-08 13:00 ` Aurelien Jarno
2015-08-10 7:50 ` alvise rigo
0 siblings, 1 reply; 42+ messages in thread
From: Aurelien Jarno @ 2015-08-08 13:00 UTC (permalink / raw)
To: Alvise Rigo
Cc: mttcg, claudio.fontana, qemu-devel, pbonzini, jani.kokkonen, tech,
alex.bennee, Richard Henderson
On 2015-08-07 19:03, Alvise Rigo wrote:
> Implement exclusive variants of qemu_{ld,st}_{i32,i64} for tcg-i386.
> The lookup for the proper memory helper has been rewritten to take
> into account the new exclusive helpers.
>
> Suggested-by: Jani Kokkonen <jani.kokkonen@huawei.com>
> Suggested-by: Claudio Fontana <claudio.fontana@huawei.com>
> Signed-off-by: Alvise Rigo <a.rigo@virtualopensystems.com>
> ---
> tcg/i386/tcg-target.c | 148 +++++++++++++++++++++++++++++++++++++++++++-------
> 1 file changed, 129 insertions(+), 19 deletions(-)
>
> diff --git a/tcg/i386/tcg-target.c b/tcg/i386/tcg-target.c
> index ff4d9cf..011907c 100644
> --- a/tcg/i386/tcg-target.c
> +++ b/tcg/i386/tcg-target.c
> @@ -1137,6 +1137,28 @@ static void * const qemu_ld_helpers[16] = {
> [MO_BEQ] = helper_be_ldq_mmu,
> };
>
> +/* LoadLink helpers, only unsigned. Use the macro below to access them. */
> +static void * const qemu_ldex_helpers[16] = {
> + [MO_UB] = helper_ret_ldlinkub_mmu,
> +
> + [MO_LEUW] = helper_le_ldlinkuw_mmu,
> + [MO_LEUL] = helper_le_ldlinkul_mmu,
> + [MO_LEQ] = helper_le_ldlinkq_mmu,
> +
> + [MO_BEUW] = helper_be_ldlinkuw_mmu,
> + [MO_BEUL] = helper_be_ldlinkul_mmu,
> + [MO_BEQ] = helper_be_ldlinkq_mmu,
> +};
> +
> +static inline tcg_insn_unit *ld_helper(TCGMemOp opc)
> +{
> + if (opc & MO_EXCL) {
> + return qemu_ldex_helpers[((int)opc - MO_EXCL) & (MO_BSWAP | MO_SSIZE)];
> + }
> +
> + return qemu_ld_helpers[opc & ~MO_SIGN];
> +}
> +
> /* helper signature: helper_ret_st_mmu(CPUState *env, target_ulong addr,
> * uintxx_t val, int mmu_idx, uintptr_t ra)
> */
> @@ -1150,6 +1172,26 @@ static void * const qemu_st_helpers[16] = {
> [MO_BEQ] = helper_be_stq_mmu,
> };
>
> +/* StoreConditional helpers. Use the macro below to access them. */
> +static void * const qemu_stex_helpers[16] = {
> + [MO_UB] = helper_ret_stcondb_mmu,
> + [MO_LEUW] = helper_le_stcondw_mmu,
> + [MO_LEUL] = helper_le_stcondl_mmu,
> + [MO_LEQ] = helper_le_stcondq_mmu,
> + [MO_BEUW] = helper_be_stcondw_mmu,
> + [MO_BEUL] = helper_be_stcondl_mmu,
> + [MO_BEQ] = helper_be_stcondq_mmu,
> +};
> +
> +static inline tcg_insn_unit *st_helper(TCGMemOp opc)
> +{
> + if (opc & MO_EXCL) {
> + return qemu_stex_helpers[((int)opc - MO_EXCL) & (MO_BSWAP | MO_SSIZE)];
> + }
> +
> + return qemu_st_helpers[opc];
> +}
> +
> /* Perform the TLB load and compare.
>
> Inputs:
> @@ -1245,6 +1287,7 @@ static inline void tcg_out_tlb_load(TCGContext *s, TCGReg addrlo, TCGReg addrhi,
> * for a load or store, so that we can later generate the correct helper code
> */
> static void add_qemu_ldst_label(TCGContext *s, bool is_ld, TCGMemOpIdx oi,
> + TCGReg llsc_success,
> TCGReg datalo, TCGReg datahi,
> TCGReg addrlo, TCGReg addrhi,
> tcg_insn_unit *raddr,
> @@ -1253,6 +1296,7 @@ static void add_qemu_ldst_label(TCGContext *s, bool is_ld, TCGMemOpIdx oi,
> TCGLabelQemuLdst *label = new_ldst_label(s);
>
> label->is_ld = is_ld;
> + label->llsc_success = llsc_success;
> label->oi = oi;
> label->datalo_reg = datalo;
> label->datahi_reg = datahi;
> @@ -1307,7 +1351,7 @@ static void tcg_out_qemu_ld_slow_path(TCGContext *s, TCGLabelQemuLdst *l)
> (uintptr_t)l->raddr);
> }
>
> - tcg_out_call(s, qemu_ld_helpers[opc & (MO_BSWAP | MO_SIZE)]);
> + tcg_out_call(s, ld_helper(opc));
>
> data_reg = l->datalo_reg;
> switch (opc & MO_SSIZE) {
> @@ -1411,9 +1455,16 @@ static void tcg_out_qemu_st_slow_path(TCGContext *s, TCGLabelQemuLdst *l)
> }
> }
>
> - /* "Tail call" to the helper, with the return address back inline. */
> - tcg_out_push(s, retaddr);
> - tcg_out_jmp(s, qemu_st_helpers[opc & (MO_BSWAP | MO_SIZE)]);
> + if (opc & MO_EXCL) {
> + tcg_out_call(s, st_helper(opc));
> + /* Save the output of the StoreConditional */
> + tcg_out_mov(s, TCG_TYPE_I32, l->llsc_success, TCG_REG_EAX);
> + tcg_out_jmp(s, l->raddr);
> + } else {
> + /* "Tail call" to the helper, with the return address back inline. */
> + tcg_out_push(s, retaddr);
> + tcg_out_jmp(s, st_helper(opc));
> + }
> }
I am not sure it's a good idea to try to reuse the existing code. For
LL/SC ops there is no separate slow path and fast path, as we always
call the helpers. It's probably better to use dedicated code for calling
the helper.
Also, given that TCG is already able to handle helper calls, and that the
LL/SC ops basically always call a helper, I do wonder if we can handle
LL/SC ops just like we do for some arithmetic ops, see tcg-runtime.c
and tcg/tcg-runtime.h. That way the op will be implemented automatically
for all backends. Of course we might still want a wrapper so that the
tcg_gen_* functions transparently call the right helper.
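Roughly, the front-end side of that idea could look like the sketch below.
The helper name, its DEF_HELPER_* declaration and the way the result is
returned are invented here for illustration; only the memop/mmu_idx
packing and the tcg_const/tcg_temp calls are assumed to be the existing
interfaces:

/* Hypothetical wrapper: emit a LoadLink as an ordinary runtime helper
 * call, so that no per-backend qemu_ldlink op is needed.  The helper
 * itself would be declared with DEF_HELPER_* in tcg/tcg-runtime.h. */
static inline void tcg_gen_qemu_ldlink_i32(TCGv_i32 ret, TCGv addr,
                                           TCGArg idx, TCGMemOp memop)
{
    TCGv_i32 oi = tcg_const_i32(make_memop_idx(memop, idx));

    gen_helper_ldlink_i32(ret, cpu_env, addr, oi);
    tcg_temp_free_i32(oi);
}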
--
Aurelien Jarno GPG: 4096R/1DDD8C9B
aurelien@aurel32.net http://www.aurel32.net
* Re: [Qemu-devel] [RFC v4 5/9] configure: Enable/disable new qemu_{ld, st} excl insns
2015-08-08 12:44 ` Aurelien Jarno
@ 2015-08-08 13:57 ` Peter Maydell
2015-08-09 8:11 ` Alex Bennée
1 sibling, 0 replies; 42+ messages in thread
From: Peter Maydell @ 2015-08-08 13:57 UTC (permalink / raw)
To: Alvise Rigo, QEMU Developers, mttcg, Alex Bennée,
Jani Kokkonen, tech@virtualopensystems.com, Claudio Fontana,
Paolo Bonzini
On 8 August 2015 at 13:44, Aurelien Jarno <aurelien@aurel32.net> wrote:
> On 2015-08-07 19:03, Alvise Rigo wrote:
>> Introduce the new --enable-tcg-ldst-excl configure option to enable the
>> LL/SC operations only for those backends that support them.
>>
>> Suggested-by: Jani Kokkonen <jani.kokkonen@huawei.com>
>> Suggested-by: Claudio Fontana <claudio.fontana@huawei.com>
>> Signed-off-by: Alvise Rigo <a.rigo@virtualopensystems.com>
>> ---
>> configure | 21 +++++++++++++++++++++
>> 1 file changed, 21 insertions(+)
>
> We have seen that for this kind of patch, it's better to add support in
> all backends, otherwise it takes ages to get all the backends converted.
> I think you should involve the backend maintainers. I can try to provide
> the corresponding patches for mips and ia64.
...and if we do need to do it one backend at a time we should do
this automatically, not by requiring users to give configure
arguments...
-- PMM
* Re: [Qemu-devel] [RFC v4 5/9] configure: Enable/disable new qemu_{ld, st} excl insns
2015-08-08 12:44 ` Aurelien Jarno
2015-08-08 13:57 ` Peter Maydell
@ 2015-08-09 8:11 ` Alex Bennée
2015-08-09 8:40 ` Aurelien Jarno
1 sibling, 1 reply; 42+ messages in thread
From: Alex Bennée @ 2015-08-09 8:11 UTC (permalink / raw)
To: Aurelien Jarno
Cc: mttcg, claudio.fontana, Alvise Rigo, qemu-devel, pbonzini,
jani.kokkonen, tech
Aurelien Jarno <aurelien@aurel32.net> writes:
> On 2015-08-07 19:03, Alvise Rigo wrote:
>> Introduce the new --enable-tcg-ldst-excl configure option to enable the
>> LL/SC operations only for those backends that support them.
>>
>> Suggested-by: Jani Kokkonen <jani.kokkonen@huawei.com>
>> Suggested-by: Claudio Fontana <claudio.fontana@huawei.com>
>> Signed-off-by: Alvise Rigo <a.rigo@virtualopensystems.com>
>> ---
>> configure | 21 +++++++++++++++++++++
>> 1 file changed, 21 insertions(+)
>
> We have seen that for this kind of patch, it's better to add support in
> all backends, otherwise it takes ages to get all the backends converted.
> I think you should involve the backend maintainers. I can try to provide
> the corresponding patches for mips and ia64.
We discussed this on the last MTTCG call and agree. However we will need
help from the other TCG maintainers for the backends. The changes should
be fairly mechanical though.
However in the spirit of keeping trees building in the meantime should
we change this from a configure option to just a static option for each
given backend as it is converted?
--
Alex Bennée
* Re: [Qemu-devel] [RFC v4 5/9] configure: Enable/disable new qemu_{ld, st} excl insns
2015-08-09 8:11 ` Alex Bennée
@ 2015-08-09 8:40 ` Aurelien Jarno
2015-08-09 9:51 ` Alex Bennée
0 siblings, 1 reply; 42+ messages in thread
From: Aurelien Jarno @ 2015-08-09 8:40 UTC (permalink / raw)
To: Alex Bennée
Cc: mttcg, claudio.fontana, Alvise Rigo, qemu-devel, pbonzini,
jani.kokkonen, tech
On 2015-08-09 09:11, Alex Bennée wrote:
>
> Aurelien Jarno <aurelien@aurel32.net> writes:
>
> > On 2015-08-07 19:03, Alvise Rigo wrote:
> >> Introduce the new --enable-tcg-ldst-excl configure option to enable the
> >> LL/SC operations only for those backends that support them.
> >>
> >> Suggested-by: Jani Kokkonen <jani.kokkonen@huawei.com>
> >> Suggested-by: Claudio Fontana <claudio.fontana@huawei.com>
> >> Signed-off-by: Alvise Rigo <a.rigo@virtualopensystems.com>
> >> ---
> >> configure | 21 +++++++++++++++++++++
> >> 1 file changed, 21 insertions(+)
> >
> > We have seen that for this kind of patch, it's better to add support in
> > all backends, otherwise it takes ages to get all the backends converted.
> > I think you should involve the backend maintainers. I can try to provide
> > the corresponding patches for mips and ia64.
>
> We discussed this on the last MTTCG call and agree. However we will need
> help from the other TCG maintainers for the backends. The changes should
> be fairly mechanical though.
>
> However in the spirit of keeping trees building in the meantime should
> we change this from a configure option to just a static option for each
> given backend as it is converted?
I am not even sure we need a static option. I guess provided we do
that early enough in the 2.5 cycle, we can just add the new ops
and start using them. Of course we should put the backend maintainers
in the loop so they can fix that quickly and don't get a surprise weeks
after.
That said, please see my other email, I am not sure we actually need to
modify backends, I think we can implement these new "ops" through
tcg-runtime.
Aurelien
--
Aurelien Jarno GPG: 4096R/1DDD8C9B
aurelien@aurel32.net http://www.aurel32.net
* Re: [Qemu-devel] [RFC v4 5/9] configure: Enable/disable new qemu_{ld, st} excl insns
2015-08-09 8:40 ` Aurelien Jarno
@ 2015-08-09 9:51 ` Alex Bennée
2015-08-09 10:13 ` Aurelien Jarno
0 siblings, 1 reply; 42+ messages in thread
From: Alex Bennée @ 2015-08-09 9:51 UTC (permalink / raw)
To: Aurelien Jarno
Cc: mttcg, claudio.fontana, Alvise Rigo, qemu-devel, pbonzini,
jani.kokkonen, tech
Aurelien Jarno <aurelien@aurel32.net> writes:
> On 2015-08-09 09:11, Alex Bennée wrote:
>>
>> Aurelien Jarno <aurelien@aurel32.net> writes:
>>
>> > On 2015-08-07 19:03, Alvise Rigo wrote:
>> >> Introduce the new --enable-tcg-ldst-excl configure option to enable the
>> >> LL/SC operations only for those backends that support them.
>> >>
>> >> Suggested-by: Jani Kokkonen <jani.kokkonen@huawei.com>
>> >> Suggested-by: Claudio Fontana <claudio.fontana@huawei.com>
>> >> Signed-off-by: Alvise Rigo <a.rigo@virtualopensystems.com>
>> >> ---
>> >> configure | 21 +++++++++++++++++++++
>> >> 1 file changed, 21 insertions(+)
>> >
>> > We have seen that for this kind of patch, it's better to add support in
>> > all backends, otherwise it takes ages to get all the backends converted.
>> > I think you should involve the backend maintainers. I can try to provide
>> > the corresponding patches for mips and ia64.
>>
>> We discussed this on the last MTTCG call and agree. However we will need
>> help from the other TCG maintainers for the backends. The changes should
>> be fairly mechanical though.
>>
>> However in the spirit of keeping trees building in the meantime should
>> we change this from a configure option to just a static option for each
>> given backend as it is converted?
>
> I am not even sure we need a static option. I guess providing we are
> doing that early enough in the 2.5 cycle, we can just add the new ops
> and start using them. Of course we should put the backends maintainers
> in the loop so they can fix that quickly and don't get a surprise weeks
> afters.
>
> That said, please see my other email, I am not sure we actually need to
> modify backends, I think we can implement these new "ops" through
> tcg-runtime.
We still need to ensure "normal" ld/st operations trip the exclusive bit
though.
>
> Aurelien
--
Alex Bennée
* Re: [Qemu-devel] [RFC v4 5/9] configure: Enable/disable new qemu_{ld, st} excl insns
2015-08-09 9:51 ` Alex Bennée
@ 2015-08-09 10:13 ` Aurelien Jarno
2015-08-09 16:27 ` Alex Bennée
0 siblings, 1 reply; 42+ messages in thread
From: Aurelien Jarno @ 2015-08-09 10:13 UTC (permalink / raw)
To: Alex Bennée
Cc: mttcg, claudio.fontana, Alvise Rigo, qemu-devel, pbonzini,
jani.kokkonen, tech
On 2015-08-09 10:51, Alex Bennée wrote:
>
> Aurelien Jarno <aurelien@aurel32.net> writes:
>
> > On 2015-08-09 09:11, Alex Bennée wrote:
> >>
> >> Aurelien Jarno <aurelien@aurel32.net> writes:
> >>
> >> > On 2015-08-07 19:03, Alvise Rigo wrote:
> >> >> Introduce the new --enable-tcg-ldst-excl configure option to enable the
> >> >> LL/SC operations only for those backends that support them.
> >> >>
> >> >> Suggested-by: Jani Kokkonen <jani.kokkonen@huawei.com>
> >> >> Suggested-by: Claudio Fontana <claudio.fontana@huawei.com>
> >> >> Signed-off-by: Alvise Rigo <a.rigo@virtualopensystems.com>
> >> >> ---
> >> >> configure | 21 +++++++++++++++++++++
> >> >> 1 file changed, 21 insertions(+)
> >> >
> >> > We have seen that for this kind of patch, it's better to add support in
> >> > all backends, otherwise it takes ages to get all the backends converted.
> >> > I think you should involve the backend maintainers. I can try to provide
> >> > the corresponding patches for mips and ia64.
> >>
> >> We discussed this on the last MTTCG call and agree. However we will need
> >> help from the other TCG maintainers for the backends. The changes should
> >> be fairly mechanical though.
> >>
> >> However in the spirit of keeping trees building in the meantime should
> >> we change this from a configure option to just a static option for each
> >> given backend as it is converted?
> >
> > I am not even sure we need a static option. I guess providing we are
> > doing that early enough in the 2.5 cycle, we can just add the new ops
> > and start using them. Of course we should put the backends maintainers
> > in the loop so they can fix that quickly and don't get a surprise weeks
> > afters.
> >
> > That said, please see my other email, I am not sure we actually need to
> > modify backends, I think we can implement these new "ops" through
> > tcg-runtime.
>
> We still need to ensure "normal" ld/st operations trip the exclusive bit
> though.
Isn't that taken care of by the TLB_EXCL flag, causing pages in
exclusive mode to always go through the slow path? Then we should not
need to modify the backends as they already check for non zero bits in
this area.
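As a rough illustration (not the actual patch code), the check the backends
already generate for the write fast path is equivalent to something like
this, with TLB_EXCL assumed to sit among the low flag bits of addr_write
next to TLB_NOTDIRTY and TLB_MMIO:

static inline bool store_needs_slow_path(CPUArchState *env,
                                         target_ulong addr, int mmu_idx)
{
    int index = (addr >> TARGET_PAGE_BITS) & (CPU_TLB_SIZE - 1);
    target_ulong tlb_addr = env->tlb_table[mmu_idx][index].addr_write;

    /* Any flag bit kept in the low bits of addr_write (TLB_NOTDIRTY,
     * TLB_MMIO, and now TLB_EXCL) makes this compare fail, so the store
     * falls back to the helper, which can reset any pending LL/SC pair. */
    return (addr & TARGET_PAGE_MASK) != tlb_addr;
}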
--
Aurelien Jarno GPG: 4096R/1DDD8C9B
aurelien@aurel32.net http://www.aurel32.net
* Re: [Qemu-devel] [RFC v4 5/9] configure: Enable/disable new qemu_{ld, st} excl insns
2015-08-09 10:13 ` Aurelien Jarno
@ 2015-08-09 16:27 ` Alex Bennée
0 siblings, 0 replies; 42+ messages in thread
From: Alex Bennée @ 2015-08-09 16:27 UTC (permalink / raw)
To: Aurelien Jarno
Cc: mttcg, claudio.fontana, Alvise Rigo, qemu-devel, pbonzini,
jani.kokkonen, tech
Aurelien Jarno <aurelien@aurel32.net> writes:
> On 2015-08-09 10:51, Alex Bennée wrote:
>>
>> Aurelien Jarno <aurelien@aurel32.net> writes:
>>
>> > On 2015-08-09 09:11, Alex Bennée wrote:
>> >>
>> >> Aurelien Jarno <aurelien@aurel32.net> writes:
>> >>
>> >> > On 2015-08-07 19:03, Alvise Rigo wrote:
>> >> >> Introduce the new --enable-tcg-ldst-excl configure option to enable the
>> >> >> LL/SC operations only for those backends that support them.
>> >> >>
>> >> >> Suggested-by: Jani Kokkonen <jani.kokkonen@huawei.com>
>> >> >> Suggested-by: Claudio Fontana <claudio.fontana@huawei.com>
>> >> >> Signed-off-by: Alvise Rigo <a.rigo@virtualopensystems.com>
>> >> >> ---
>> >> >> configure | 21 +++++++++++++++++++++
>> >> >> 1 file changed, 21 insertions(+)
>> >> >
>> >> > We have seen that for this kind of patch, it's better to add support in
>> >> > all backends, otherwise it takes ages to get all the backends converted.
>> >> > I think you should involve the backend maintainers. I can try to provide
>> >> > the corresponding patches for mips and ia64.
>> >>
>> >> We discussed this on the last MTTCG call and agree. However we will need
>> >> help from the other TCG maintainers for the backends. The changes should
>> >> be fairly mechanical though.
>> >>
>> >> However in the spirit of keeping trees building in the meantime should
>> >> we change this from a configure option to just a static option for each
>> >> given backend as it is converted?
>> >
>> > I am not even sure we need a static option. I guess providing we are
>> > doing that early enough in the 2.5 cycle, we can just add the new ops
>> > and start using them. Of course we should put the backends maintainers
>> > in the loop so they can fix that quickly and don't get a surprise weeks
>> > afters.
>> >
>> > That said, please see my other email, I am not sure we actually need to
>> > modify backends, I think we can implement these new "ops" through
>> > tcg-runtime.
>>
>> We still need to ensure "normal" ld/st operations trip the exclusive bit
>> though.
>
> Isn't that taken care of by the TLB_EXCL flag, causing pages in
> exclusive mode to always go through the slow path? Then we should not
> need to modify the backends as they already check for non zero bits in
> this area.
Ahh I see. Yes of course that sounds like a good idea.
--
Alex Bennée
* Re: [Qemu-devel] [RFC v4 6/9] tcg-i386: Implement excl variants of qemu_{ld, st}
2015-08-08 13:00 ` Aurelien Jarno
@ 2015-08-10 7:50 ` alvise rigo
0 siblings, 0 replies; 42+ messages in thread
From: alvise rigo @ 2015-08-10 7:50 UTC (permalink / raw)
To: Alvise Rigo, QEMU Developers, mttcg, Alex Bennée,
Jani Kokkonen, VirtualOpenSystems Technical Team, Claudio Fontana,
Paolo Bonzini, Richard Henderson
Good point, I will look at the code you pointed to.
If we really can avoid modifying every backend, then it's worth more
than a look.
Thanks,
alvise
On Sat, Aug 8, 2015 at 3:00 PM, Aurelien Jarno <aurelien@aurel32.net> wrote:
> On 2015-08-07 19:03, Alvise Rigo wrote:
>> Implement exclusive variants of qemu_{ld,st}_{i32,i64} for tcg-i386.
>> The lookup for the proper memory helper has been rewritten to take
>> into account the new exclusive helpers.
>>
>> Suggested-by: Jani Kokkonen <jani.kokkonen@huawei.com>
>> Suggested-by: Claudio Fontana <claudio.fontana@huawei.com>
>> Signed-off-by: Alvise Rigo <a.rigo@virtualopensystems.com>
>> ---
>> tcg/i386/tcg-target.c | 148 +++++++++++++++++++++++++++++++++++++++++++-------
>> 1 file changed, 129 insertions(+), 19 deletions(-)
>>
>> diff --git a/tcg/i386/tcg-target.c b/tcg/i386/tcg-target.c
>> index ff4d9cf..011907c 100644
>> --- a/tcg/i386/tcg-target.c
>> +++ b/tcg/i386/tcg-target.c
>> @@ -1137,6 +1137,28 @@ static void * const qemu_ld_helpers[16] = {
>> [MO_BEQ] = helper_be_ldq_mmu,
>> };
>>
>> +/* LoadLink helpers, only unsigned. Use the macro below to access them. */
>> +static void * const qemu_ldex_helpers[16] = {
>> + [MO_UB] = helper_ret_ldlinkub_mmu,
>> +
>> + [MO_LEUW] = helper_le_ldlinkuw_mmu,
>> + [MO_LEUL] = helper_le_ldlinkul_mmu,
>> + [MO_LEQ] = helper_le_ldlinkq_mmu,
>> +
>> + [MO_BEUW] = helper_be_ldlinkuw_mmu,
>> + [MO_BEUL] = helper_be_ldlinkul_mmu,
>> + [MO_BEQ] = helper_be_ldlinkq_mmu,
>> +};
>> +
>> +static inline tcg_insn_unit *ld_helper(TCGMemOp opc)
>> +{
>> + if (opc & MO_EXCL) {
>> + return qemu_ldex_helpers[((int)opc - MO_EXCL) & (MO_BSWAP | MO_SSIZE)];
>> + }
>> +
>> + return qemu_ld_helpers[opc & ~MO_SIGN];
>> +}
>> +
>> /* helper signature: helper_ret_st_mmu(CPUState *env, target_ulong addr,
>> * uintxx_t val, int mmu_idx, uintptr_t ra)
>> */
>> @@ -1150,6 +1172,26 @@ static void * const qemu_st_helpers[16] = {
>> [MO_BEQ] = helper_be_stq_mmu,
>> };
>>
>> +/* StoreConditional helpers. Use the macro below to access them. */
>> +static void * const qemu_stex_helpers[16] = {
>> + [MO_UB] = helper_ret_stcondb_mmu,
>> + [MO_LEUW] = helper_le_stcondw_mmu,
>> + [MO_LEUL] = helper_le_stcondl_mmu,
>> + [MO_LEQ] = helper_le_stcondq_mmu,
>> + [MO_BEUW] = helper_be_stcondw_mmu,
>> + [MO_BEUL] = helper_be_stcondl_mmu,
>> + [MO_BEQ] = helper_be_stcondq_mmu,
>> +};
>> +
>> +static inline tcg_insn_unit *st_helper(TCGMemOp opc)
>> +{
>> + if (opc & MO_EXCL) {
>> + return qemu_stex_helpers[((int)opc - MO_EXCL) & (MO_BSWAP | MO_SSIZE)];
>> + }
>> +
>> + return qemu_st_helpers[opc];
>> +}
>> +
>> /* Perform the TLB load and compare.
>>
>> Inputs:
>> @@ -1245,6 +1287,7 @@ static inline void tcg_out_tlb_load(TCGContext *s, TCGReg addrlo, TCGReg addrhi,
>> * for a load or store, so that we can later generate the correct helper code
>> */
>> static void add_qemu_ldst_label(TCGContext *s, bool is_ld, TCGMemOpIdx oi,
>> + TCGReg llsc_success,
>> TCGReg datalo, TCGReg datahi,
>> TCGReg addrlo, TCGReg addrhi,
>> tcg_insn_unit *raddr,
>> @@ -1253,6 +1296,7 @@ static void add_qemu_ldst_label(TCGContext *s, bool is_ld, TCGMemOpIdx oi,
>> TCGLabelQemuLdst *label = new_ldst_label(s);
>>
>> label->is_ld = is_ld;
>> + label->llsc_success = llsc_success;
>> label->oi = oi;
>> label->datalo_reg = datalo;
>> label->datahi_reg = datahi;
>> @@ -1307,7 +1351,7 @@ static void tcg_out_qemu_ld_slow_path(TCGContext *s, TCGLabelQemuLdst *l)
>> (uintptr_t)l->raddr);
>> }
>>
>> - tcg_out_call(s, qemu_ld_helpers[opc & (MO_BSWAP | MO_SIZE)]);
>> + tcg_out_call(s, ld_helper(opc));
>>
>> data_reg = l->datalo_reg;
>> switch (opc & MO_SSIZE) {
>> @@ -1411,9 +1455,16 @@ static void tcg_out_qemu_st_slow_path(TCGContext *s, TCGLabelQemuLdst *l)
>> }
>> }
>>
>> - /* "Tail call" to the helper, with the return address back inline. */
>> - tcg_out_push(s, retaddr);
>> - tcg_out_jmp(s, qemu_st_helpers[opc & (MO_BSWAP | MO_SIZE)]);
>> + if (opc & MO_EXCL) {
>> + tcg_out_call(s, st_helper(opc));
>> + /* Save the output of the StoreConditional */
>> + tcg_out_mov(s, TCG_TYPE_I32, l->llsc_success, TCG_REG_EAX);
>> + tcg_out_jmp(s, l->raddr);
>> + } else {
>> + /* "Tail call" to the helper, with the return address back inline. */
>> + tcg_out_push(s, retaddr);
>> + tcg_out_jmp(s, st_helper(opc));
>> + }
>> }
>
> I am not sure it's a good idea to try to use the existing code. For
> LL/SC ops, we don't have the slow path and the fast path, as we always
> call the helpers. It's probably better to use dedicated code for calling
> the helper.
>
> Also given that TCG is already able to handle call helpers, and that the
> LL/SC ops basically always call an helper, I do wonder if we can handle
> LL/SC ops just like we do for some arithmetic ops, see in tcg-runtime.c
> and tcg/tcg-runtime.h. That way the op will be implemented automatically
> for all backends. Of course we might still want a wrapper so that the
> tcg_gen_* functions transparently call the right helper.
>
>
> --
> Aurelien Jarno GPG: 4096R/1DDD8C9B
> aurelien@aurel32.net http://www.aurel32.net
* Re: [Qemu-devel] [RFC v4 3/9] softmmu: Add helpers for a new slowpath
2015-08-07 17:03 ` [Qemu-devel] [RFC v4 3/9] softmmu: Add helpers for a new slowpath Alvise Rigo
@ 2015-08-11 13:32 ` alvise rigo
2015-08-11 13:52 ` Paolo Bonzini
2015-08-12 12:43 ` Paolo Bonzini
1 sibling, 1 reply; 42+ messages in thread
From: alvise rigo @ 2015-08-11 13:32 UTC (permalink / raw)
To: QEMU Developers, mttcg
Cc: Alex Bennée, Jani Kokkonen,
VirtualOpenSystems Technical Team, Claudio Fontana, Paolo Bonzini
On Fri, Aug 7, 2015 at 7:03 PM, Alvise Rigo
<a.rigo@virtualopensystems.com> wrote:
> The new helpers rely on the legacy ones to perform the actual read/write.
>
> The LoadLink helper (helper_ldlink_name) prepares the way for the
> following SC operation. It sets the linked address and the size of the
> access.
> These helper also update the TLB entry of the page involved in the
> LL/SC for those vCPUs that have the bit set (dirty), so that the
> following accesses made by all the vCPUs will follow the slow path.
>
> The StoreConditional helper (helper_stcond_name) returns 1 if the
> store has to fail due to a concurrent access to the same page by
> another vCPU. A 'concurrent access' can be a store made by *any* vCPU
> (although, some implementations allow stores made by the CPU that issued
> the LoadLink).
>
> Suggested-by: Jani Kokkonen <jani.kokkonen@huawei.com>
> Suggested-by: Claudio Fontana <claudio.fontana@huawei.com>
> Signed-off-by: Alvise Rigo <a.rigo@virtualopensystems.com>
> ---
> cputlb.c | 3 ++
> softmmu_llsc_template.h | 124 ++++++++++++++++++++++++++++++++++++++++++++++++
> softmmu_template.h | 12 +++++
> 3 files changed, 139 insertions(+)
> create mode 100644 softmmu_llsc_template.h
>
> diff --git a/cputlb.c b/cputlb.c
> index 251eec8..f45afc6 100644
> --- a/cputlb.c
> +++ b/cputlb.c
> @@ -415,6 +415,8 @@ static inline void lookup_and_reset_cpus_ll_addr(hwaddr addr, hwaddr size)
>
> #define MMUSUFFIX _mmu
>
> +/* Generates LoadLink/StoreConditional helpers in softmmu_template.h */
> +#define GEN_EXCLUSIVE_HELPERS
> #define SHIFT 0
> #include "softmmu_template.h"
>
> @@ -427,6 +429,7 @@ static inline void lookup_and_reset_cpus_ll_addr(hwaddr addr, hwaddr size)
> #define SHIFT 3
> #include "softmmu_template.h"
> #undef MMUSUFFIX
> +#undef GEN_EXCLUSIVE_HELPERS
>
> #define MMUSUFFIX _cmmu
> #undef GETPC_ADJ
> diff --git a/softmmu_llsc_template.h b/softmmu_llsc_template.h
> new file mode 100644
> index 0000000..d2e92b4
> --- /dev/null
> +++ b/softmmu_llsc_template.h
> @@ -0,0 +1,124 @@
> +/*
> + * Software MMU support (esclusive load/store operations)
> + *
> + * Generate helpers used by TCG for qemu_ldlink/stcond ops.
> + *
> + * Included from softmmu_template.h only.
> + *
> + * Copyright (c) 2015 Virtual Open Systems
> + *
> + * Authors:
> + * Alvise Rigo <a.rigo@virtualopensystems.com>
> + *
> + * This library is free software; you can redistribute it and/or
> + * modify it under the terms of the GNU Lesser General Public
> + * License as published by the Free Software Foundation; either
> + * version 2 of the License, or (at your option) any later version.
> + *
> + * This library is distributed in the hope that it will be useful,
> + * but WITHOUT ANY WARRANTY; without even the implied warranty of
> + * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU
> + * Lesser General Public License for more details.
> + *
> + * You should have received a copy of the GNU Lesser General Public
> + * License along with this library; if not, see <http://www.gnu.org/licenses/>.
> + */
> +
> +/* This template does not generate together the le and be version, but only one
> + * of the two depending on whether BIGENDIAN_EXCLUSIVE_HELPERS has been set.
> + * The same nomenclature as softmmu_template.h is used for the exclusive
> + * helpers. */
> +
> +#ifdef BIGENDIAN_EXCLUSIVE_HELPERS
> +
> +#define helper_ldlink_name glue(glue(helper_be_ldlink, USUFFIX), MMUSUFFIX)
> +#define helper_stcond_name glue(glue(helper_be_stcond, SUFFIX), MMUSUFFIX)
> +#define helper_ld_legacy glue(glue(helper_be_ld, USUFFIX), MMUSUFFIX)
> +#define helper_st_legacy glue(glue(helper_be_st, SUFFIX), MMUSUFFIX)
> +
> +#else /* LE helpers + 8bit helpers (generated only once for both LE end BE) */
> +
> +#if DATA_SIZE > 1
> +#define helper_ldlink_name glue(glue(helper_le_ldlink, USUFFIX), MMUSUFFIX)
> +#define helper_stcond_name glue(glue(helper_le_stcond, SUFFIX), MMUSUFFIX)
> +#define helper_ld_legacy glue(glue(helper_le_ld, USUFFIX), MMUSUFFIX)
> +#define helper_st_legacy glue(glue(helper_le_st, SUFFIX), MMUSUFFIX)
Wouldn't it be better here not to use the helpers from softmmu_template.h?
Doing so, in softmmu_template.h we wouldn't need to differentiate the
normal stores from the SCs.
This comes at the cost of some additional boilerplate in softmmu_llsc_template.h.
Thanks,
alvise
> +#else /* DATA_SIZE <= 1 */
> +#define helper_ldlink_name glue(glue(helper_ret_ldlink, USUFFIX), MMUSUFFIX)
> +#define helper_stcond_name glue(glue(helper_ret_stcond, SUFFIX), MMUSUFFIX)
> +#define helper_ld_legacy glue(glue(helper_ret_ld, USUFFIX), MMUSUFFIX)
> +#define helper_st_legacy glue(glue(helper_ret_st, SUFFIX), MMUSUFFIX)
> +#endif
> +
> +#endif
> +
> +WORD_TYPE helper_ldlink_name(CPUArchState *env, target_ulong addr,
> + TCGMemOpIdx oi, uintptr_t retaddr)
> +{
> + WORD_TYPE ret;
> + int index;
> + CPUState *cpu;
> + hwaddr hw_addr;
> + unsigned mmu_idx = get_mmuidx(oi);
> +
> + /* Use the proper load helper from cpu_ldst.h */
> + ret = helper_ld_legacy(env, addr, mmu_idx, retaddr);
> +
> + index = (addr >> TARGET_PAGE_BITS) & (CPU_TLB_SIZE - 1);
> +
> + /* hw_addr = hwaddr of the page (i.e. section->mr->ram_addr + xlat)
> + * plus the offset (i.e. addr & ~TARGET_PAGE_MASK) */
> + hw_addr = (env->iotlb[mmu_idx][index].addr & TARGET_PAGE_MASK) + addr;
> +
> + cpu_physical_memory_clear_excl_dirty(hw_addr, ENV_GET_CPU(env)->cpu_index);
> + /* If all the vCPUs have the EXCL bit set for this page there is no need
> + * to request any flush. */
> + if (cpu_physical_memory_excl_is_dirty(hw_addr, smp_cpus)) {
> + CPU_FOREACH(cpu) {
> + if (current_cpu != cpu) {
> + if (cpu_physical_memory_excl_is_dirty(hw_addr,
> + cpu->cpu_index)) {
> + cpu_physical_memory_clear_excl_dirty(hw_addr,
> + cpu->cpu_index);
> + tlb_flush(cpu, 1);
> + }
> + }
> + }
> + }
> +
> + env->excl_protected_range.begin = hw_addr;
> + env->excl_protected_range.end = hw_addr + DATA_SIZE;
> +
> + /* For this vCPU, just update the TLB entry, no need to flush. */
> + env->tlb_table[mmu_idx][index].addr_write |= TLB_EXCL;
> +
> + return ret;
> +}
> +
> +WORD_TYPE helper_stcond_name(CPUArchState *env, target_ulong addr,
> + DATA_TYPE val, TCGMemOpIdx oi,
> + uintptr_t retaddr)
> +{
> + WORD_TYPE ret;
> + unsigned mmu_idx = get_mmuidx(oi);
> +
> + /* We set it preventively to true to distinguish the following legacy
> + * access as one made by the store conditional wrapper. If the store
> + * conditional does not succeed, the value will be set to 0.*/
> + env->excl_succeeded = 1;
> + helper_st_legacy(env, addr, val, mmu_idx, retaddr);
> +
> + if (env->excl_succeeded) {
> + env->excl_succeeded = 0;
> + ret = 0;
> + } else {
> + ret = 1;
> + }
> +
> + return ret;
> +}
> +
> +#undef helper_ldlink_name
> +#undef helper_stcond_name
> +#undef helper_ld_legacy
> +#undef helper_st_legacy
> diff --git a/softmmu_template.h b/softmmu_template.h
> index e4431e8..ad65d20 100644
> --- a/softmmu_template.h
> +++ b/softmmu_template.h
> @@ -640,6 +640,18 @@ void probe_write(CPUArchState *env, target_ulong addr, int mmu_idx,
> #endif
> #endif /* !defined(SOFTMMU_CODE_ACCESS) */
>
> +#ifdef GEN_EXCLUSIVE_HELPERS
> +
> +#if DATA_SIZE > 1 /* The 8-bit helpers are generate along with LE helpers */
> +#define BIGENDIAN_EXCLUSIVE_HELPERS
> +#include "softmmu_llsc_template.h"
> +#undef BIGENDIAN_EXCLUSIVE_HELPERS
> +#endif
> +
> +#include "softmmu_llsc_template.h"
> +
> +#endif /* !defined(GEN_EXCLUSIVE_HELPERS) */
> +
> #undef READ_ACCESS_TYPE
> #undef SHIFT
> #undef DATA_TYPE
> --
> 2.5.0
>
* Re: [Qemu-devel] [RFC v4 3/9] softmmu: Add helpers for a new slowpath
2015-08-11 13:32 ` alvise rigo
@ 2015-08-11 13:52 ` Paolo Bonzini
2015-08-11 15:55 ` alvise rigo
0 siblings, 1 reply; 42+ messages in thread
From: Paolo Bonzini @ 2015-08-11 13:52 UTC (permalink / raw)
To: alvise rigo, QEMU Developers, mttcg
Cc: Claudio Fontana, Jani Kokkonen, VirtualOpenSystems Technical Team,
Alex Bennée
On 11/08/2015 15:32, alvise rigo wrote:
>> > +#if DATA_SIZE > 1
>> > +#define helper_ldlink_name glue(glue(helper_le_ldlink, USUFFIX), MMUSUFFIX)
>> > +#define helper_stcond_name glue(glue(helper_le_stcond, SUFFIX), MMUSUFFIX)
>> > +#define helper_ld_legacy glue(glue(helper_le_ld, USUFFIX), MMUSUFFIX)
>> > +#define helper_st_legacy glue(glue(helper_le_st, SUFFIX), MMUSUFFIX)
> Wouldn't be better here to not use the helpers from softmmu_template.h?
> Doing so, in softmmu_template.h we don't need to differentiate the
> normal stores from the SCs
> This at a cost of some additional boilerplate in softmmu_llsc_template.h.
A lot more boilerplate I think?
This patch looks good, I would just s/_legacy// because it's not really
legacy.
Paolo
* Re: [Qemu-devel] [RFC v4 1/9] exec.c: Add new exclusive bitmap to ram_list
2015-08-07 17:03 ` [Qemu-devel] [RFC v4 1/9] exec.c: Add new exclusive bitmap to ram_list Alvise Rigo
@ 2015-08-11 13:52 ` Paolo Bonzini
2015-08-11 14:24 ` Peter Maydell
2015-08-11 15:54 ` alvise rigo
0 siblings, 2 replies; 42+ messages in thread
From: Paolo Bonzini @ 2015-08-11 13:52 UTC (permalink / raw)
To: Alvise Rigo, qemu-devel, mttcg
Cc: claudio.fontana, jani.kokkonen, tech, alex.bennee
On 07/08/2015 19:03, Alvise Rigo wrote:
> +static inline int cpu_physical_memory_excl_atleast_one_clean(ram_addr_t addr)
> +{
> + unsigned long *bitmap = ram_list.dirty_memory[DIRTY_MEMORY_EXCLUSIVE];
> + unsigned long next, end;
> +
> + if (likely(smp_cpus <= BITS_PER_LONG)) {
This only works if smp_cpus divides BITS_PER_LONG, i.e. BITS_PER_LONG %
smp_cpus == 0.
> + unsigned long mask = (1 << smp_cpus) - 1;
> +
> + return
> + (mask & (bitmap[BIT_WORD(EXCL_BITMAP_GET_OFFSET(addr))] >>
> + (EXCL_BITMAP_GET_OFFSET(addr) & (BITS_PER_LONG-1)))) != mask;
> + }
> +
> + end = BIT_WORD(EXCL_BITMAP_GET_OFFSET(addr)) + smp_cpus;
> + next = find_next_zero_bit(bitmap, end,
> + BIT_WORD(EXCL_BITMAP_GET_OFFSET(addr)));
> +
> + return next < end;
> +static inline int cpu_physical_memory_excl_is_dirty(ram_addr_t addr,
> + unsigned long cpu)
> +{
> + unsigned long *bitmap = ram_list.dirty_memory[DIRTY_MEMORY_EXCLUSIVE];
> + unsigned long end, next;
> + uint32_t add;
> +
> + assert(cpu <= smp_cpus);
> +
> + if (likely(smp_cpus <= BITS_PER_LONG)) {
> + cpu = (cpu == smp_cpus) ? (1 << cpu) - 1 : (1 << cpu);
> +
> + return cpu & (bitmap[BIT_WORD(EXCL_BITMAP_GET_OFFSET(addr))] >>
> + (EXCL_BITMAP_GET_OFFSET(addr) & (BITS_PER_LONG-1)));
> + }
> +
> + add = (cpu == smp_cpus) ? 0 : 1;
Why not have a separate function for the cpu == smp_cpus case?
I don't think real hardware has ll/sc per CPU. Can we have the bitmap as:
- 0 if one or more CPUs have the address set to exclusive, _and_ no CPU
has done a concurrent access
- 1 if no CPUs have the address set to exclusive, _or_ one CPU has done
a concurrent access.
Then:
- ll sets the bit to 0, and requests a flush if it was 1
- when setting a TLB entry, set it to TLB_EXCL if the bitmap has 0
- in the TLB_EXCL slow path, set the bit to 1 and, for conditional
stores, succeed if the bit was 0
- when removing an exclusive entry, set the bit to 1
Paolo
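To make the BITS_PER_LONG % smp_cpus == 0 remark above concrete: assuming a layout where each page owns smp_cpus consecutive bits of the exclusive bitmap, a single shift-and-mask read of one long is only safe while a page's bits never straddle a long boundary, which is guaranteed exactly when smp_cpus divides BITS_PER_LONG. A standalone sketch of the failure mode (the layout and helper names are assumptions for illustration, not the patch's macros):

    #include <stdio.h>

    #define BITS_PER_LONG (8 * (int)sizeof(unsigned long))

    /* Single-word read: loses the bits that spill into the next long. */
    static unsigned long field_one_word(const unsigned long *bm, int off, int n)
    {
        unsigned long mask = (n == BITS_PER_LONG) ? ~0UL : (1UL << n) - 1;
        return (bm[off / BITS_PER_LONG] >> (off % BITS_PER_LONG)) & mask;
    }

    /* Bit-by-bit read: correct even when the field crosses a boundary. */
    static unsigned long field_exact(const unsigned long *bm, int off, int n)
    {
        unsigned long v = 0;
        for (int i = 0; i < n; i++) {
            v |= ((bm[(off + i) / BITS_PER_LONG] >> ((off + i) % BITS_PER_LONG)) & 1UL) << i;
        }
        return v;
    }

    int main(void)
    {
        unsigned long bm[2] = { ~0UL, ~0UL };   /* every bit set */
        int n = 3;                              /* e.g. smp_cpus = 3 */
        int off = BITS_PER_LONG - 2;            /* page's bits straddle the boundary */

        printf("one-word read: %#lx\n", field_one_word(bm, off, n));  /* 0x3: top bit lost */
        printf("exact read:    %#lx\n", field_exact(bm, off, n));     /* 0x7 */
        return 0;
    }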
* Re: [Qemu-devel] [RFC v4 1/9] exec.c: Add new exclusive bitmap to ram_list
2015-08-11 13:52 ` Paolo Bonzini
@ 2015-08-11 14:24 ` Peter Maydell
2015-08-11 14:34 ` alvise rigo
2015-08-11 15:54 ` alvise rigo
1 sibling, 1 reply; 42+ messages in thread
From: Peter Maydell @ 2015-08-11 14:24 UTC (permalink / raw)
To: Paolo Bonzini
Cc: mttcg, Claudio Fontana, Alvise Rigo, QEMU Developers,
Jani Kokkonen, tech@virtualopensystems.com, Alex Bennée
On 11 August 2015 at 14:52, Paolo Bonzini <pbonzini@redhat.com> wrote:
>
> I don't think real hardware has ll/sc per CPU.
On ARM, the exclusives are handled by the 'global monitor', which
supports tracking an exclusive access per CPU.
> Can we have the bitmap as:
>
> - 0 if one or more CPUs have the address set to exclusive, _and_ no CPU
> has done a concurrent access
>
> - 1 if no CPUs have the address set to exclusive, _or_ one CPU has done
> a concurrent access.
>
> Then:
>
> - ll sets the bit to 0, and requests a flush if it was 1
>
> - when setting a TLB entry, set it to TLB_EXCL if the bitmap has 0
>
> - in the TLB_EXCL slow path, set the bit to 1 and, for conditional
> stores, succeed if the bit was 0
>
> - when removing an exclusive entry, set the bit to 1
This doesn't sound like it has the correct semantics for ARM:
wouldn't it mean that CPU 1 could do an LL, and then CPU 2
could do an SC and have it succeed even if it hadn't
previously done an LL itself?
thanks
-- PMM
* Re: [Qemu-devel] [RFC v4 1/9] exec.c: Add new exclusive bitmap to ram_list
2015-08-11 14:24 ` Peter Maydell
@ 2015-08-11 14:34 ` alvise rigo
0 siblings, 0 replies; 42+ messages in thread
From: alvise rigo @ 2015-08-11 14:34 UTC (permalink / raw)
To: Peter Maydell
Cc: mttcg, Claudio Fontana, QEMU Developers, Paolo Bonzini,
Jani Kokkonen, tech@virtualopensystems.com, Alex Bennée
On Tue, Aug 11, 2015 at 4:24 PM, Peter Maydell <peter.maydell@linaro.org> wrote:
> On 11 August 2015 at 14:52, Paolo Bonzini <pbonzini@redhat.com> wrote:
>>
>> I don't think real hardware has ll/sc per CPU.
>
> On ARM, the exclusives are handled by the 'global monitor', which
> supports tracking an exclusive access per CPU.
>
>> Can we have the bitmap as:
>>
>> - 0 if one or more CPUs have the address set to exclusive, _and_ no CPU
>> has done a concurrent access
>>
>> - 1 if no CPUs have the address set to exclusive, _or_ one CPU has done
>> a concurrent access.
>>
>> Then:
>>
>> - ll sets the bit to 0, and requests a flush if it was 1
>>
>> - when setting a TLB entry, set it to TLB_EXCL if the bitmap has 0
>>
>> - in the TLB_EXCL slow path, set the bit to 1 and, for conditional
>> stores, succeed if the bit was 0
>>
>> - when removing an exclusive entry, set the bit to 1
>
> This doesn't sound like it has the correct semantics for ARM:
> wouldn't it mean that CPU 1 could do an LL, and then CPU 2
> could do an SC and have it succeed even if it hadn't
> previously done an LL itself?
This case is handled correctly now: if CPU 2 has not previously
done an LL, its SC will fail without even entering the helper.
I think Paolo omitted this case from the list.
Regards,
alvise
>
> thanks
> -- PMM
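Peter's point and the answer above boil down to the per-CPU reservation kept by the global monitor: an SC can only succeed on a CPU whose own reservation is still valid. A small standalone model of that rule, illustrative only and not the architectural definition nor the patch's data structures:

    #include <stdio.h>
    #include <stdint.h>
    #include <stdbool.h>

    #define NCPUS 2

    /* One reservation per CPU: LL sets it, a store from another CPU to the
     * same address clears it, SC succeeds only if it is still in place. */
    typedef struct { bool valid; uint64_t addr; } Monitor;
    static Monitor mon[NCPUS];

    static void ll(int cpu, uint64_t addr)
    {
        mon[cpu].valid = true;
        mon[cpu].addr = addr;
    }

    static void break_other_reservations(int cpu, uint64_t addr)
    {
        for (int c = 0; c < NCPUS; c++) {
            if (c != cpu && mon[c].valid && mon[c].addr == addr) {
                mon[c].valid = false;
            }
        }
    }

    static bool sc(int cpu, uint64_t addr)
    {
        if (mon[cpu].valid && mon[cpu].addr == addr) {
            mon[cpu].valid = false;
            break_other_reservations(cpu, addr);   /* the store is observed */
            return true;                           /* store performed */
        }
        return false;
    }

    int main(void)
    {
        ll(0, 0x1000);                                             /* CPU 0 links */
        printf("SC by CPU 1 without an LL: %d\n", sc(1, 0x1000));  /* 0: fails    */
        printf("SC by CPU 0:               %d\n", sc(0, 0x1000));  /* 1: succeeds */
        return 0;
    }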
* Re: [Qemu-devel] [RFC v4 1/9] exec.c: Add new exclusive bitmap to ram_list
2015-08-11 13:52 ` Paolo Bonzini
2015-08-11 14:24 ` Peter Maydell
@ 2015-08-11 15:54 ` alvise rigo
2015-08-11 15:55 ` Paolo Bonzini
1 sibling, 1 reply; 42+ messages in thread
From: alvise rigo @ 2015-08-11 15:54 UTC (permalink / raw)
To: Paolo Bonzini
Cc: mttcg, Claudio Fontana, QEMU Developers, Jani Kokkonen,
VirtualOpenSystems Technical Team, Alex Bennée
On Tue, Aug 11, 2015 at 3:52 PM, Paolo Bonzini <pbonzini@redhat.com> wrote:
>
>
> On 07/08/2015 19:03, Alvise Rigo wrote:
>> +static inline int cpu_physical_memory_excl_atleast_one_clean(ram_addr_t addr)
>> +{
>> + unsigned long *bitmap = ram_list.dirty_memory[DIRTY_MEMORY_EXCLUSIVE];
>> + unsigned long next, end;
>> +
>> + if (likely(smp_cpus <= BITS_PER_LONG)) {
>
> This only works if smp_cpus divides BITS_PER_LONG, i.e. BITS_PER_LONG %
> smp_cpus == 0.
You are right.
>
> Why not have a separate function for the cpu == smp_cpus case?
I'll rework this part a bit.
>
> I don't think real hardware has ll/sc per CPU. Can we have the bitmap as:
>
> - 0 if one or more CPUs have the address set to exclusive, _and_ no CPU
> has done a concurrent access
>
> - 1 if no CPUs have the address set to exclusive, _or_ one CPU has done
> a concurrent access.
>
> Then:
>
> - ll sets the bit to 0, and requests a flush if it was 1
>
> - when setting a TLB entry, set it to TLB_EXCL if the bitmap has 0
>
> - in the TLB_EXCL slow path, set the bit to 1 and, for conditional
> stores, succeed if the bit was 0
>
> - when removing an exclusive entry, set the bit to 1
This can lead to an excessive rate of flush requests, since for one
CPU that removes the TLB_EXCL flag, all the others that are competing
for the same excl address will need to flush the entire cache and
start all over again.
If for a start we want to have a simpler implementation, I can revert
back to the one-bit design.
alvise
>
> Paolo
* Re: [Qemu-devel] [RFC v4 3/9] softmmu: Add helpers for a new slowpath
2015-08-11 13:52 ` Paolo Bonzini
@ 2015-08-11 15:55 ` alvise rigo
0 siblings, 0 replies; 42+ messages in thread
From: alvise rigo @ 2015-08-11 15:55 UTC (permalink / raw)
To: Paolo Bonzini
Cc: mttcg, Claudio Fontana, QEMU Developers, Jani Kokkonen,
VirtualOpenSystems Technical Team, Alex Bennée
On Tue, Aug 11, 2015 at 3:52 PM, Paolo Bonzini <pbonzini@redhat.com> wrote:
>
>
> On 11/08/2015 15:32, alvise rigo wrote:
>>> > +#if DATA_SIZE > 1
>>> > +#define helper_ldlink_name glue(glue(helper_le_ldlink, USUFFIX), MMUSUFFIX)
>>> > +#define helper_stcond_name glue(glue(helper_le_stcond, SUFFIX), MMUSUFFIX)
>>> > +#define helper_ld_legacy glue(glue(helper_le_ld, USUFFIX), MMUSUFFIX)
>>> > +#define helper_st_legacy glue(glue(helper_le_st, SUFFIX), MMUSUFFIX)
>> Wouldn't it be better here not to use the helpers from softmmu_template.h?
>> That way, in softmmu_template.h we wouldn't need to differentiate the
>> normal stores from the SCs, at the cost of some additional boilerplate
>> in softmmu_llsc_template.h.
>
> A lot more boilerplate I think?
>
> This patch looks good, I would just s/_legacy// because it's not really
> legacy.
Agreed.
alvise
>
> Paolo
* Re: [Qemu-devel] [RFC v4 1/9] exec.c: Add new exclusive bitmap to ram_list
2015-08-11 15:54 ` alvise rigo
@ 2015-08-11 15:55 ` Paolo Bonzini
2015-08-11 16:11 ` alvise rigo
0 siblings, 1 reply; 42+ messages in thread
From: Paolo Bonzini @ 2015-08-11 15:55 UTC (permalink / raw)
To: alvise rigo
Cc: mttcg, Claudio Fontana, QEMU Developers, Jani Kokkonen,
VirtualOpenSystems Technical Team, Alex Bennée
On 11/08/2015 17:54, alvise rigo wrote:
> This can lead to an excessive rate of flush requests, since for one
> CPU that removes the TLB_EXCL flag, all the others that are competing
> for the same excl address will need to flush the entire cache and
> start all over again.
Why flush the entire cache (I understand you mean TLB)?
Paolo
> If for a start we want to have a simpler implementation, I can revert
> back to the one-bit design.
* Re: [Qemu-devel] [RFC v4 1/9] exec.c: Add new exclusive bitmap to ram_list
2015-08-11 15:55 ` Paolo Bonzini
@ 2015-08-11 16:11 ` alvise rigo
2015-08-11 16:32 ` Paolo Bonzini
0 siblings, 1 reply; 42+ messages in thread
From: alvise rigo @ 2015-08-11 16:11 UTC (permalink / raw)
To: Paolo Bonzini
Cc: mttcg, Claudio Fontana, QEMU Developers, Jani Kokkonen,
VirtualOpenSystems Technical Team, Alex Bennée
On Tue, Aug 11, 2015 at 5:55 PM, Paolo Bonzini <pbonzini@redhat.com> wrote:
>
>
> On 11/08/2015 17:54, alvise rigo wrote:
>> This can lead to an excessive rate of flush requests, since for one
>> CPU that removes the TLB_EXCL flag, all the others that are competing
>> for the same excl address will need to flush the entire cache and
>> start all over again.
>
> Why flush the entire cache (I understand you mean TLB)?
Sorry, I meant the TLB.
If for each removal of an exclusive entry we set also the bit to 1, we
force the following LL to make a tlb_flush() on every vCPU.
alvise
>
> Paolo
>
>> If for a start we want to have a simpler implementation, I can revert
>> back to the one-bit design.
* Re: [Qemu-devel] [RFC v4 1/9] exec.c: Add new exclusive bitmap to ram_list
2015-08-11 16:11 ` alvise rigo
@ 2015-08-11 16:32 ` Paolo Bonzini
2015-08-12 7:31 ` alvise rigo
0 siblings, 1 reply; 42+ messages in thread
From: Paolo Bonzini @ 2015-08-11 16:32 UTC (permalink / raw)
To: alvise rigo
Cc: mttcg, Claudio Fontana, QEMU Developers, Jani Kokkonen,
VirtualOpenSystems Technical Team, Alex Bennée
On 11/08/2015 18:11, alvise rigo wrote:
>> > Why flush the entire cache (I understand you mean TLB)?
> Sorry, I meant the TLB.
> If for each removal of an exclusive entry we set also the bit to 1, we
> force the following LL to make a tlb_flush() on every vCPU.
What if you only flush one entry with tlb_flush_entry (on every vCPU)?
Paolo
* Re: [Qemu-devel] [RFC v4 1/9] exec.c: Add new exclusive bitmap to ram_list
2015-08-11 16:32 ` Paolo Bonzini
@ 2015-08-12 7:31 ` alvise rigo
2015-08-12 12:36 ` Paolo Bonzini
0 siblings, 1 reply; 42+ messages in thread
From: alvise rigo @ 2015-08-12 7:31 UTC (permalink / raw)
To: Paolo Bonzini
Cc: mttcg, Claudio Fontana, QEMU Developers, Jani Kokkonen,
VirtualOpenSystems Technical Team, Alex Bennée
I think that tlb_flush_entry is not enough, since in theory another
vCPU could have a different TLB address referring the same phys
address.
alvise
On Tue, Aug 11, 2015 at 6:32 PM, Paolo Bonzini <pbonzini@redhat.com> wrote:
>
>
> On 11/08/2015 18:11, alvise rigo wrote:
>>> > Why flush the entire cache (I understand you mean TLB)?
>> Sorry, I meant the TLB.
>> If for each removal of an exclusive entry we set also the bit to 1, we
>> force the following LL to make a tlb_flush() on every vCPU.
>
> What if you only flush one entry with tlb_flush_entry (on every vCPU)?
>
> Paolo
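The aliasing problem described above is easy to picture with a toy virtually-indexed TLB: flushing one virtual address leaves a second mapping of the same physical page intact, so a fast-path store through the alias would still bypass the exclusive machinery. Purely illustrative, not QEMU's TLB layout or API:

    #include <stdio.h>
    #include <stdint.h>

    #define TLB_SIZE 8

    typedef struct { uint64_t vaddr; uint64_t paddr; int valid; } ToyTLBEntry;

    static ToyTLBEntry tlb[TLB_SIZE];

    static void tlb_set(int i, uint64_t va, uint64_t pa)
    {
        tlb[i] = (ToyTLBEntry){ va, pa, 1 };
    }

    /* Flush by virtual address only: any alias of the page survives. */
    static void flush_by_vaddr(uint64_t va)
    {
        for (int i = 0; i < TLB_SIZE; i++) {
            if (tlb[i].valid && tlb[i].vaddr == va) {
                tlb[i].valid = 0;
            }
        }
    }

    static int phys_still_mapped(uint64_t pa)
    {
        for (int i = 0; i < TLB_SIZE; i++) {
            if (tlb[i].valid && tlb[i].paddr == pa) {
                return 1;
            }
        }
        return 0;
    }

    int main(void)
    {
        tlb_set(0, 0x10000, 0x80000);   /* vaddr x -> phys P */
        tlb_set(1, 0x20000, 0x80000);   /* vaddr y -> phys P (alias) */
        flush_by_vaddr(0x10000);
        printf("phys page still reachable: %d\n", phys_still_mapped(0x80000)); /* 1 */
        return 0;
    }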
* Re: [Qemu-devel] [RFC v4 1/9] exec.c: Add new exclusive bitmap to ram_list
2015-08-12 7:31 ` alvise rigo
@ 2015-08-12 12:36 ` Paolo Bonzini
2015-08-12 13:02 ` Peter Maydell
` (2 more replies)
0 siblings, 3 replies; 42+ messages in thread
From: Paolo Bonzini @ 2015-08-12 12:36 UTC (permalink / raw)
To: alvise rigo
Cc: mttcg, Claudio Fontana, QEMU Developers, Jani Kokkonen,
VirtualOpenSystems Technical Team, Alex Bennée
On 12/08/2015 09:31, alvise rigo wrote:
> I think that tlb_flush_entry is not enough, since in theory another
> vCPU could have a different TLB address referring the same phys
> address.
You're right, this is a TLB so it's virtually-indexed. :( I'm not sure
what happens on ARM, since it has a virtually indexed (VIVT or VIPT)
cache, but indeed it would be a problem when implementing e.g. CMPXCHG
using the TCG ll/sc ops.
I'm a bit worried about adding such a big bitmap. It's only used on
TCG, but it is also allocated on KVM and on KVM you can have hundreds
of VCPUs. Wasting 200 bits per guest memory page (i.e. ~0.6% of guest
memory) is obviously not a great idea. :(
Perhaps we can use a bytemap instead:
- 0..253 = TLB_EXCL must be set in all VCPUs except CPU n. A VCPU that
loads the TLB for this vaddr does not have to set it.
- 254 = TLB_EXCL must be set in all VCPUs. A VCPU that
loads the TLB for this vaddr has to set it.
- 255 = TLB_EXCL not set in at least two VCPUs
Transitions:
- ll transitions: anything -> 254
- sc transitions: 254 -> current CPU_ID
- TLB_EXCL store transitions: 254 -> current CPU_ID
- tlb_st_page transitions: CPU_ID other than current -> 255
The initial value is 255 on SMP guests, 0 on UP guests.
The algorithms are very similar to yours, just using this approximate
representation.
ll algorithm:
llsc_value = bytemap[vaddr]
if llsc_value == CPU_ID
do nothing
elseif llsc_value < 254
flush TLB of CPU llsc_value
elseif llsc_value == 255
flush all TLBs
set TLB_EXCL
bytemap[vaddr] = 254
load
tlb_set_page algorithm:
llsc_value = bytemap[vaddr]
if llsc_value == CPU_ID or llsc_value == 255
do nothing
else if llsc_value == 254
set TLB_EXCL
else
# two CPUs without TLB_EXCL
bytemap[vaddr] = 255
TLB_EXCL slow path algorithm:
if bytemap[vaddr] == 254
bytemap[vaddr] = CPU_ID
else
# two CPUs without TLB_EXCL
bytemap[vaddr] = 255
clear TLB_EXCL in this CPU
store
sc algorithm:
if bytemap[vaddr] == CPU_ID or bytemap[vaddr] == 254
bytemap[vaddr] = CPU_ID
clear TLB_EXCL in this CPU
store
succeed
else
fail
clear algorithm:
if bytemap[vaddr] == 254
bytemap[vaddr] = CPU_ID
The UP case is optimized because bytemap[vaddr] will always be 0 or 254.
The algorithm requires the LL to be cleared e.g. on exceptions.
Paolo
> alvise
>
> On Tue, Aug 11, 2015 at 6:32 PM, Paolo Bonzini <pbonzini@redhat.com> wrote:
>>
>>
>> On 11/08/2015 18:11, alvise rigo wrote:
>>>>> Why flush the entire cache (I understand you mean TLB)?
>>> Sorry, I meant the TLB.
>>> If for each removal of an exclusive entry we set also the bit to 1, we
>>> force the following LL to make a tlb_flush() on every vCPU.
>>
>> What if you only flush one entry with tlb_flush_entry (on every vCPU)?
>>
>> Paolo
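The bytemap transitions proposed above translate almost mechanically into C. The sketch below is a direct transcription of the ll, TLB_EXCL slow-path and sc pseudocode, with a toy Sim struct standing in for the real per-CPU TLB state; the tlb_set_page and clear transitions follow the same pattern, and the per-vCPU linked-address check discussed elsewhere in the thread is deliberately left out, exactly as in the pseudocode:

    #include <stdio.h>
    #include <stdint.h>
    #include <stdbool.h>
    #include <string.h>

    enum { EXCL_ALL = 254, EXCL_MANY = 255 };   /* 0..253 = a CPU id */

    #define NPAGES 4
    #define NCPUS  4

    typedef struct {
        uint8_t bytemap[NPAGES];
        bool tlb_excl[NCPUS][NPAGES];   /* stand-in for per-CPU TLB_EXCL flags */
    } Sim;

    static void flush_cpu(Sim *s, int cpu)
    {
        memset(s->tlb_excl[cpu], 0, sizeof s->tlb_excl[cpu]);
    }

    static void flush_all(Sim *s)
    {
        for (int c = 0; c < NCPUS; c++) {
            flush_cpu(s, c);
        }
    }

    /* ll: flush whoever may lack TLB_EXCL, then force it everywhere. */
    static void ll(Sim *s, int cpu, int page)
    {
        uint8_t v = s->bytemap[page];
        if (v == cpu) {
            /* nothing to do */
        } else if (v < EXCL_ALL) {
            flush_cpu(s, v);
        } else if (v == EXCL_MANY) {
            flush_all(s);
        }
        s->tlb_excl[cpu][page] = true;
        s->bytemap[page] = EXCL_ALL;
        /* ... perform the load ... */
    }

    /* A plain store that hit TLB_EXCL (the slow path). */
    static void excl_slow_store(Sim *s, int cpu, int page)
    {
        s->bytemap[page] = (s->bytemap[page] == EXCL_ALL) ? cpu : EXCL_MANY;
        s->tlb_excl[cpu][page] = false;
        /* ... perform the store ... */
    }

    /* sc: returns true on success. */
    static bool sc(Sim *s, int cpu, int page)
    {
        uint8_t v = s->bytemap[page];
        if (v == cpu || v == EXCL_ALL) {
            s->bytemap[page] = cpu;
            s->tlb_excl[cpu][page] = false;
            /* ... perform the store ... */
            return true;
        }
        return false;
    }

    int main(void)
    {
        Sim s = { .bytemap = { EXCL_MANY, EXCL_MANY, EXCL_MANY, EXCL_MANY } };

        ll(&s, 0, 0);
        printf("uncontended SC on CPU 0: %s\n", sc(&s, 0, 0) ? "succeeds" : "fails");

        ll(&s, 0, 1);
        excl_slow_store(&s, 1, 1);          /* concurrent store from CPU 1 */
        printf("contended SC on CPU 0:   %s\n", sc(&s, 0, 1) ? "succeeds" : "fails");
        return 0;
    }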
* Re: [Qemu-devel] [RFC v4 3/9] softmmu: Add helpers for a new slowpath
2015-08-07 17:03 ` [Qemu-devel] [RFC v4 3/9] softmmu: Add helpers for a new slowpath Alvise Rigo
2015-08-11 13:32 ` alvise rigo
@ 2015-08-12 12:43 ` Paolo Bonzini
2015-08-12 13:09 ` alvise rigo
1 sibling, 1 reply; 42+ messages in thread
From: Paolo Bonzini @ 2015-08-12 12:43 UTC (permalink / raw)
To: Alvise Rigo, qemu-devel, mttcg
Cc: claudio.fontana, jani.kokkonen, tech, alex.bennee
On 07/08/2015 19:03, Alvise Rigo wrote:
> +
> + /* For this vCPU, just update the TLB entry, no need to flush. */
> + env->tlb_table[mmu_idx][index].addr_write |= TLB_EXCL;
Couldn't this vCPU also have two aliasing entries in the TLB?
Paolo
* Re: [Qemu-devel] [RFC v4 1/9] exec.c: Add new exclusive bitmap to ram_list
2015-08-12 12:36 ` Paolo Bonzini
@ 2015-08-12 13:02 ` Peter Maydell
2015-08-12 14:04 ` alvise rigo
2015-09-10 13:04 ` alvise rigo
2 siblings, 0 replies; 42+ messages in thread
From: Peter Maydell @ 2015-08-12 13:02 UTC (permalink / raw)
To: Paolo Bonzini
Cc: mttcg, Claudio Fontana, alvise rigo, QEMU Developers,
Jani Kokkonen, VirtualOpenSystems Technical Team,
Alex Bennée
On 12 August 2015 at 13:36, Paolo Bonzini <pbonzini@redhat.com> wrote:
>
>
> On 12/08/2015 09:31, alvise rigo wrote:
>> I think that tlb_flush_entry is not enough, since in theory another
>> vCPU could have a different TLB address referring the same phys
>> address.
>
> You're right, this is a TLB so it's virtually-indexed. :( I'm not sure
> what happens on ARM, since it has a virtually indexed (VIVT or VIPT)
> cache, but indeed it would be a problem when implementing e.g. CMPXCHG
> using the TCG ll/sc ops.
On ARM the exclusives operate on physical addresses, not virtual.
-- PMM
* Re: [Qemu-devel] [RFC v4 3/9] softmmu: Add helpers for a new slowpath
2015-08-12 12:43 ` Paolo Bonzini
@ 2015-08-12 13:09 ` alvise rigo
2015-08-12 13:14 ` Paolo Bonzini
0 siblings, 1 reply; 42+ messages in thread
From: alvise rigo @ 2015-08-12 13:09 UTC (permalink / raw)
To: Paolo Bonzini
Cc: mttcg, Claudio Fontana, QEMU Developers, Jani Kokkonen,
VirtualOpenSystems Technical Team, Alex Bennée
Yes, it could. However, it's really unlikely that a vCPU, after
issuing an LL to virtual address x, then stores to the same phys
address using a different virtual address y.
I'm not sure we really need to handle these cases.
alvise
On Wed, Aug 12, 2015 at 2:43 PM, Paolo Bonzini <pbonzini@redhat.com> wrote:
>
>
> On 07/08/2015 19:03, Alvise Rigo wrote:
>> +
>> + /* For this vCPU, just update the TLB entry, no need to flush. */
>> + env->tlb_table[mmu_idx][index].addr_write |= TLB_EXCL;
>
> Couldn't this vCPU also have two aliasing entries in the TLB?
>
> Paolo
* Re: [Qemu-devel] [RFC v4 3/9] softmmu: Add helpers for a new slowpath
2015-08-12 13:09 ` alvise rigo
@ 2015-08-12 13:14 ` Paolo Bonzini
0 siblings, 0 replies; 42+ messages in thread
From: Paolo Bonzini @ 2015-08-12 13:14 UTC (permalink / raw)
To: alvise rigo
Cc: mttcg, Claudio Fontana, QEMU Developers, Jani Kokkonen,
VirtualOpenSystems Technical Team, Alex Bennée
On 12/08/2015 15:09, alvise rigo wrote:
> Yes, it could. However, it's really unlikely that a vCPU, after
> issuing an LL to virtual address x, then stores to the same phys
> address using a different virtual address y.
>
> I'm not sure we really need to handle these cases.
Ok, if we had to it's just a matter of adding a TLB flush here.
Paolo
* Re: [Qemu-devel] [RFC v4 1/9] exec.c: Add new exclusive bitmap to ram_list
2015-08-12 12:36 ` Paolo Bonzini
2015-08-12 13:02 ` Peter Maydell
@ 2015-08-12 14:04 ` alvise rigo
2015-08-12 14:10 ` Paolo Bonzini
2015-09-10 13:04 ` alvise rigo
2 siblings, 1 reply; 42+ messages in thread
From: alvise rigo @ 2015-08-12 14:04 UTC (permalink / raw)
To: Paolo Bonzini
Cc: mttcg, Claudio Fontana, QEMU Developers, Jani Kokkonen,
VirtualOpenSystems Technical Team, Alex Bennée
On Wed, Aug 12, 2015 at 2:36 PM, Paolo Bonzini <pbonzini@redhat.com> wrote:
>
>
> On 12/08/2015 09:31, alvise rigo wrote:
>> I think that tlb_flush_entry is not enough, since in theory another
>> vCPU could have a different TLB address referring the same phys
>> address.
>
> You're right, this is a TLB so it's virtually-indexed. :( I'm not sure
> what happens on ARM, since it has a virtually indexed (VIVT or VIPT)
> cache, but indeed it would be a problem when implementing e.g. CMPXCHG
> using the TCG ll/sc ops.
>
> I'm a bit worried about adding such a big bitmap. It's only used on
> TCG, but it is also allocated on KVM and on KVM you can have hundreds
> of VCPUs. Wasting 200 bits per guest memory page (i.e. ~0.6% of guest
> memory) is obviously not a great idea. :(
I agree, it's a waste of memory.
>
> Perhaps we can use a bytemap instead:
>
> - 0..253 = TLB_EXCL must be set in all VCPUs except CPU n. A VCPU that
> loads the TLB for this vaddr does not have to set it.
>
> - 254 = TLB_EXCL must be set in all VCPUs. A VCPU that
> loads the TLB for this vaddr has to set it.
>
> - 255 = TLB_EXCL not set in at least two VCPUs
>
> Transitions:
>
> - ll transitions: anything -> 254
>
> - sc transitions: 254 -> current CPU_ID
>
> - TLB_EXCL store transitions: 254 -> current CPU_ID
>
> - tlb_st_page transitions: CPU_ID other than current -> 255
>
> The initial value is 255 on SMP guests, 0 on UP guests.
>
> The algorithms are very similar to yours, just using this approximate
> representation.
>
> ll algorithm:
> llsc_value = bytemap[vaddr]
> if llsc_value == CPU_ID
> do nothing
> elseif llsc_value < 254
> flush TLB of CPU llsc_value
> elseif llsc_value == 255
> flush all TLBs
> set TLB_EXCL
> bytemap[vaddr] = 254
> load
>
> tlb_set_page algorithm:
> llsc_value = bytemap[vaddr]
> if llsc_value == CPU_ID or llsc_value == 255
> do nothing
> else if llsc_value == 254
> set TLB_EXCL
> else
> # two CPUs without TLB_EXCL
> bytemap[vaddr] = 255
>
> TLB_EXCL slow path algorithm:
> if bytemap[vaddr] == 254
> bytemap[vaddr] = CPU_ID
> else
> # two CPUs without TLB_EXCL
> bytemap[vaddr] = 255
> clear TLB_EXCL in this CPU
> store
>
> sc algorithm:
> if bytemap[vaddr] == CPU_ID or bytemap[vaddr] == 254
> bytemap[vaddr] = CPU_ID
> clear TLB_EXCL in this CPU
> store
> succeed
> else
> fail
>
> clear algorithm:
> if bytemap[vaddr] == 254
> bytemap[vaddr] = CPU_ID
Isn't something like this also required in the clear algorithm?
if bytemap[vaddr] < 254
/* this can happen as a side effect of the TLB_EXCL slow path */
bytemap[vaddr] = 255
The whole idea makes sense, I will consider it for the next iteration
of the patches.
Thanks,
alvise
>
> The UP case is optimized because bytemap[vaddr] will always be 0 or 254.
>
> The algorithm requires the LL to be cleared e.g. on exceptions.
> Paolo
>
>> alvise
>>
>> On Tue, Aug 11, 2015 at 6:32 PM, Paolo Bonzini <pbonzini@redhat.com> wrote:
>>>
>>>
>>> On 11/08/2015 18:11, alvise rigo wrote:
>>>>>> Why flush the entire cache (I understand you mean TLB)?
>>>> Sorry, I meant the TLB.
>>>> If for each removal of an exclusive entry we set also the bit to 1, we
>>>> force the following LL to make a tlb_flush() on every vCPU.
>>>
>>> What if you only flush one entry with tlb_flush_entry (on every vCPU)?
>>>
>>> Paolo
* Re: [Qemu-devel] [RFC v4 1/9] exec.c: Add new exclusive bitmap to ram_list
2015-08-12 14:04 ` alvise rigo
@ 2015-08-12 14:10 ` Paolo Bonzini
2015-08-12 14:32 ` alvise rigo
0 siblings, 1 reply; 42+ messages in thread
From: Paolo Bonzini @ 2015-08-12 14:10 UTC (permalink / raw)
To: alvise rigo
Cc: mttcg, Claudio Fontana, QEMU Developers, Jani Kokkonen,
VirtualOpenSystems Technical Team, Alex Bennée
On 12/08/2015 16:04, alvise rigo wrote:
>> > clear algorithm:
>> > if bytemap[vaddr] == 254
>> > bytemap[vaddr] = CPU_ID
> Isn't this also required for the clear algorithm?
>
> if bytemap[vaddr] < 254
> /* this can happen for the TLB_EXCL slow path effect */
> bytemap[vaddr] = 255
I don't think so because clear doesn't clear TLB_EXCL. But maybe we're
talking about two different things. :)
Paolo
* Re: [Qemu-devel] [RFC v4 1/9] exec.c: Add new exclusive bitmap to ram_list
2015-08-12 14:10 ` Paolo Bonzini
@ 2015-08-12 14:32 ` alvise rigo
0 siblings, 0 replies; 42+ messages in thread
From: alvise rigo @ 2015-08-12 14:32 UTC (permalink / raw)
To: Paolo Bonzini
Cc: mttcg, Claudio Fontana, QEMU Developers, Jani Kokkonen,
VirtualOpenSystems Technical Team, Alex Bennée
On Wed, Aug 12, 2015 at 4:10 PM, Paolo Bonzini <pbonzini@redhat.com> wrote:
>
>
> On 12/08/2015 16:04, alvise rigo wrote:
>>> > clear algorithm:
>>> > if bytemap[vaddr] == 254
>>> > bytemap[vaddr] = CPU_ID
>> Isn't something like this also required in the clear algorithm?
>>
>> if bytemap[vaddr] < 254
>> /* this can happen as a side effect of the TLB_EXCL slow path */
>> bytemap[vaddr] = 255
>
> I don't think so because clear doesn't clear TLB_EXCL. But maybe we're
> talking about two different things. :)
Let me go through it again, I might have missed something :)
alvise
>
> Paolo
* Re: [Qemu-devel] [RFC v4 1/9] exec.c: Add new exclusive bitmap to ram_list
2015-08-12 12:36 ` Paolo Bonzini
2015-08-12 13:02 ` Peter Maydell
2015-08-12 14:04 ` alvise rigo
@ 2015-09-10 13:04 ` alvise rigo
2015-09-10 16:19 ` Alex Bennée
2015-09-10 16:25 ` Paolo Bonzini
2 siblings, 2 replies; 42+ messages in thread
From: alvise rigo @ 2015-09-10 13:04 UTC (permalink / raw)
To: Paolo Bonzini
Cc: mttcg, Claudio Fontana, QEMU Developers, Jani Kokkonen,
VirtualOpenSystems Technical Team, Alex Bennée
Hi Paolo,
A brief update on this. I have a first implementation of the idea you
proposed, though it's not working really well. The failing rate of SCs
for some reason is very high.
Instead of trying to fix it, I came up with this alternative design:
we still use 8 bits per page and we group the smp_cpus in clusters of
maximum size ceiling(smp_cpus / 8).
Given a byte describing a certain page, its first bit will represent
the vCPUs of index 0, 9, 17, ... and so on for the other bits.
Note that systems with eight vCPUs or less will not be affected by
this simplified model.
Now, unless I'm missing some advantages of your proposal except the
less aggressive memory consumption, I would go with this design.
Regards,
alvise
On Wed, Aug 12, 2015 at 2:36 PM, Paolo Bonzini <pbonzini@redhat.com> wrote:
>
>
> On 12/08/2015 09:31, alvise rigo wrote:
>> I think that tlb_flush_entry is not enough, since in theory another
>> vCPU could have a different TLB address referring the same phys
>> address.
>
> You're right, this is a TLB so it's virtually-indexed. :( I'm not sure
> what happens on ARM, since it has a virtually indexed (VIVT or VIPT)
> cache, but indeed it would be a problem when implementing e.g. CMPXCHG
> using the TCG ll/sc ops.
>
> I'm a bit worried about adding such a big bitmap. It's only used on
> TCG, but it is also allocated on KVM and on KVM you can have hundreds
> of VCPUs. Wasting 200 bits per guest memory page (i.e. ~0.6% of guest
> memory) is obviously not a great idea. :(
>
> Perhaps we can use a bytemap instead:
>
> - 0..253 = TLB_EXCL must be set in all VCPUs except CPU n. A VCPU that
> loads the TLB for this vaddr does not have to set it.
>
> - 254 = TLB_EXCL must be set in all VCPUs. A VCPU that
> loads the TLB for this vaddr has to set it.
>
> - 255 = TLB_EXCL not set in at least two VCPUs
>
> Transitions:
>
> - ll transitions: anything -> 254
>
> - sc transitions: 254 -> current CPU_ID
>
> - TLB_EXCL store transitions: 254 -> current CPU_ID
>
> - tlb_st_page transitions: CPU_ID other than current -> 255
>
> The initial value is 255 on SMP guests, 0 on UP guests.
>
> The algorithms are very similar to yours, just using this approximate
> representation.
>
> ll algorithm:
> llsc_value = bytemap[vaddr]
> if llsc_value == CPU_ID
> do nothing
> elseif llsc_value < 254
> flush TLB of CPU llsc_value
> elseif llsc_value == 255
> flush all TLBs
> set TLB_EXCL
> bytemap[vaddr] = 254
> load
>
> tlb_set_page algorithm:
> llsc_value = bytemap[vaddr]
> if llsc_value == CPU_ID or llsc_value == 255
> do nothing
> else if llsc_value == 254
> set TLB_EXCL
> else
> # two CPUs without TLB_EXCL
> bytemap[vaddr] = 255
>
> TLB_EXCL slow path algorithm:
> if bytemap[vaddr] == 254
> bytemap[vaddr] = CPU_ID
> else
> # two CPUs without TLB_EXCL
> bytemap[vaddr] = 255
> clear TLB_EXCL in this CPU
> store
>
> sc algorithm:
> if bytemap[vaddr] == CPU_ID or bytemap[vaddr] == 254
> bytemap[vaddr] = CPU_ID
> clear TLB_EXCL in this CPU
> store
> succeed
> else
> fail
>
> clear algorithm:
> if bytemap[vaddr] == 254
> bytemap[vaddr] = CPU_ID
>
> The UP case is optimized because bytemap[vaddr] will always be 0 or 254.
>
> The algorithm requires the LL to be cleared e.g. on exceptions.
> Paolo
>
>> alvise
>>
>> On Tue, Aug 11, 2015 at 6:32 PM, Paolo Bonzini <pbonzini@redhat.com> wrote:
>>>
>>>
>>> On 11/08/2015 18:11, alvise rigo wrote:
>>>>>> Why flush the entire cache (I understand you mean TLB)?
>>>> Sorry, I meant the TLB.
>>>> If for each removal of an exclusive entry we set also the bit to 1, we
>>>> force the following LL to make a tlb_flush() on every vCPU.
>>>
>>> What if you only flush one entry with tlb_flush_entry (on every vCPU)?
>>>
>>> Paolo
* Re: [Qemu-devel] [RFC v4 1/9] exec.c: Add new exclusive bitmap to ram_list
2015-09-10 13:04 ` alvise rigo
@ 2015-09-10 16:19 ` Alex Bennée
2015-09-10 17:36 ` alvise rigo
2015-09-10 16:25 ` Paolo Bonzini
1 sibling, 1 reply; 42+ messages in thread
From: Alex Bennée @ 2015-09-10 16:19 UTC (permalink / raw)
To: alvise rigo
Cc: mttcg, Claudio Fontana, QEMU Developers, Paolo Bonzini,
Jani Kokkonen, VirtualOpenSystems Technical Team
alvise rigo <a.rigo@virtualopensystems.com> writes:
> Hi Paolo,
>
> A brief update on this. I have a first implementation of the idea you
> proposed, though it's not working really well. The failing rate of SCs
> for some reason is very high.
Due to high memory contention on the EXCL page?
> Instead of trying to fix it, I came up with this alternative design:
> we still use 8 bits per page and we group the smp_cpus in clusters of
> maximum size ceiling(smp_cpus / 8).
> Given a byte describing a certain page, its first bit will represent
> the vCPUs of index 0, 9, 17, ... and so on for the other bits.
>
> Note that systems with eight vCPUs or less will not be affected by
> this simplified model.
>
> Now, unless I'm missing some advantages of your proposal except the
> less aggressive memory consumption, I would go with this design.
BTW have you had a look at Emilio's series? I'm about half way through
reviewing it but I'm looking forward to seeing how his implementation
differs from yours. Maybe there are some ideas in there that will
inspire?
>
> Regards,
> alvise
>
> On Wed, Aug 12, 2015 at 2:36 PM, Paolo Bonzini <pbonzini@redhat.com> wrote:
>>
>>
>> On 12/08/2015 09:31, alvise rigo wrote:
>>> I think that tlb_flush_entry is not enough, since in theory another
>>> vCPU could have a different TLB address referring the same phys
>>> address.
>>
>> You're right, this is a TLB so it's virtually-indexed. :( I'm not sure
>> what happens on ARM, since it has a virtually indexed (VIVT or VIPT)
>> cache, but indeed it would be a problem when implementing e.g. CMPXCHG
>> using the TCG ll/sc ops.
>>
>> I'm a bit worried about adding such a big bitmap. It's only used on
>> TCG, but it is also allocated on KVM and on KVM you can have hundreds
>> of VCPUs. Wasting 200 bits per guest memory page (i.e. ~0.6% of guest
>> memory) is obviously not a great idea. :(
>>
>> Perhaps we can use a bytemap instead:
>>
>> - 0..253 = TLB_EXCL must be set in all VCPUs except CPU n. A VCPU that
>> loads the TLB for this vaddr does not have to set it.
>>
>> - 254 = TLB_EXCL must be set in all VCPUs. A VCPU that
>> loads the TLB for this vaddr has to set it.
>>
>> - 255 = TLB_EXCL not set in at least two VCPUs
>>
>> Transitions:
>>
>> - ll transitions: anything -> 254
>>
>> - sc transitions: 254 -> current CPU_ID
>>
>> - TLB_EXCL store transitions: 254 -> current CPU_ID
>>
>> - tlb_st_page transitions: CPU_ID other than current -> 255
>>
>> The initial value is 255 on SMP guests, 0 on UP guests.
>>
>> The algorithms are very similar to yours, just using this approximate
>> representation.
>>
>> ll algorithm:
>> llsc_value = bytemap[vaddr]
>> if llsc_value == CPU_ID
>> do nothing
>> elseif llsc_value < 254
>> flush TLB of CPU llsc_value
>> elseif llsc_value == 255
>> flush all TLBs
>> set TLB_EXCL
>> bytemap[vaddr] = 254
>> load
>>
>> tlb_set_page algorithm:
>> llsc_value = bytemap[vaddr]
>> if llsc_value == CPU_ID or llsc_value == 255
>> do nothing
>> else if llsc_value == 254
>> set TLB_EXCL
>> else
>> # two CPUs without TLB_EXCL
>> bytemap[vaddr] = 255
>>
>> TLB_EXCL slow path algorithm:
>> if bytemap[vaddr] == 254
>> bytemap[vaddr] = CPU_ID
>> else
>> # two CPUs without TLB_EXCL
>> bytemap[vaddr] = 255
>> clear TLB_EXCL in this CPU
>> store
>>
>> sc algorithm:
>> if bytemap[vaddr] == CPU_ID or bytemap[vaddr] == 254
>> bytemap[vaddr] = CPU_ID
>> clear TLB_EXCL in this CPU
>> store
>> succeed
>> else
>> fail
>>
>> clear algorithm:
>> if bytemap[vaddr] == 254
>> bytemap[vaddr] = CPU_ID
>>
>> The UP case is optimized because bytemap[vaddr] will always be 0 or 254.
>>
>> The algorithm requires the LL to be cleared e.g. on exceptions.
>> Paolo
>>
>>> alvise
>>>
>>> On Tue, Aug 11, 2015 at 6:32 PM, Paolo Bonzini <pbonzini@redhat.com> wrote:
>>>>
>>>>
>>>> On 11/08/2015 18:11, alvise rigo wrote:
>>>>>>> Why flush the entire cache (I understand you mean TLB)?
>>>>> Sorry, I meant the TLB.
>>>>> If for each removal of an exclusive entry we set also the bit to 1, we
>>>>> force the following LL to make a tlb_flush() on every vCPU.
>>>>
>>>> What if you only flush one entry with tlb_flush_entry (on every vCPU)?
>>>>
>>>> Paolo
--
Alex Bennée
* Re: [Qemu-devel] [RFC v4 1/9] exec.c: Add new exclusive bitmap to ram_list
2015-09-10 13:04 ` alvise rigo
2015-09-10 16:19 ` Alex Bennée
@ 2015-09-10 16:25 ` Paolo Bonzini
1 sibling, 0 replies; 42+ messages in thread
From: Paolo Bonzini @ 2015-09-10 16:25 UTC (permalink / raw)
To: alvise rigo
Cc: mttcg, Claudio Fontana, QEMU Developers, Jani Kokkonen,
VirtualOpenSystems Technical Team, Alex Bennée
On 10/09/2015 15:04, alvise rigo wrote:
> Hi Paolo,
>
> A brief update on this. I have a first implementation of the idea you
> proposed, though it's not working really well. The failing rate of SCs
> for some reason is very high.
> Instead of trying to fix it, I came up with this alternative design:
> we still use 8 bits per page and we group the smp_cpus in clusters of
> maximum size ceiling(smp_cpus / 8).
> Given a byte describing a certain page, its first bit will represent
> the vCPUs of index 0, 9, 17, ... and so on for the other bits.
>
> Note that systems with eight vCPUs or less will not be affected by
> this simplified model.
>
> Now, unless I'm missing some advantages of your proposal except the
> less aggressive memory consumption, I would go with this design.
Sounds great.
Paolo
* Re: [Qemu-devel] [RFC v4 1/9] exec.c: Add new exclusive bitmap to ram_list
2015-09-10 16:19 ` Alex Bennée
@ 2015-09-10 17:36 ` alvise rigo
0 siblings, 0 replies; 42+ messages in thread
From: alvise rigo @ 2015-09-10 17:36 UTC (permalink / raw)
To: Alex Bennée
Cc: mttcg, Claudio Fontana, QEMU Developers, Paolo Bonzini,
Jani Kokkonen, VirtualOpenSystems Technical Team
Hi Alex,
On Thu, Sep 10, 2015 at 6:19 PM, Alex Bennée <alex.bennee@linaro.org> wrote:
>
> alvise rigo <a.rigo@virtualopensystems.com> writes:
>
>> Hi Paolo,
>>
>> A brief update on this. I have a first implementation of the idea you
>> proposed, though it's not working really well. The failing rate of SCs
>> for some reason is very high.
>
> Due to high memory contention on the EXCL page?
I guess it's also due to the fact that we were using the bitmap itself
to determine whether an SC operation should fail or not.
The SC should rather fail when another vCPU has an intersecting
exclusive range set, not because of the state of the bitmap.
>
>> Instead of trying to fix it, I came up with this alternative design:
>> we still use 8 bits per page and we group the smp_cpus in clusters of
>> maximum size ceiling(smp_cpus / 8).
>> Given a byte describing a certain page, its first bit will represent
>> the vCPUs of index 0, 9, 17, ... and so on for the other bits.
>>
>> Note that systems with eight vCPUs or less will not be affected by
>> this simplified model.
>>
>> Now, unless I'm missing some advantages of your proposal except the
>> less aggressive memory consumption, I would go with this design.
>
> BTW have you had a look at Emilio's series? I'm about half way through
> reviewing it but I'm looking forward to seeing how his implementation
> differs from yours. Maybe there are some ideas in there that will
> inspire?
I still have to take an in-depth look at them.
However, unless I'm missing something, the problem of the fast-path
not being interceptable is still there, isn't it?
Regards,
alvise
>
>>
>> Regards,
>> alvise
>>
>> On Wed, Aug 12, 2015 at 2:36 PM, Paolo Bonzini <pbonzini@redhat.com> wrote:
>>>
>>>
>>> On 12/08/2015 09:31, alvise rigo wrote:
>>>> I think that tlb_flush_entry is not enough, since in theory another
>>>> vCPU could have a different TLB address referring the same phys
>>>> address.
>>>
>>> You're right, this is a TLB so it's virtually-indexed. :( I'm not sure
>>> what happens on ARM, since it has a virtually indexed (VIVT or VIPT)
>>> cache, but indeed it would be a problem when implementing e.g. CMPXCHG
>>> using the TCG ll/sc ops.
>>>
>>> I'm a bit worried about adding such a big bitmap. It's only used on
>>> TCG, but it is also allocated on KVM and on KVM you can have hundreds
>>> of VCPUs. Wasting 200 bits per guest memory page (i.e. ~0.6% of guest
>>> memory) is obviously not a great idea. :(
>>>
>>> Perhaps we can use a bytemap instead:
>>>
>>> - 0..253 = TLB_EXCL must be set in all VCPUs except CPU n. A VCPU that
>>> loads the TLB for this vaddr does not have to set it.
>>>
>>> - 254 = TLB_EXCL must be set in all VCPUs. A VCPU that
>>> loads the TLB for this vaddr has to set it.
>>>
>>> - 255 = TLB_EXCL not set in at least two VCPUs
>>>
>>> Transitions:
>>>
>>> - ll transitions: anything -> 254
>>>
>>> - sc transitions: 254 -> current CPU_ID
>>>
>>> - TLB_EXCL store transitions: 254 -> current CPU_ID
>>>
>>> - tlb_st_page transitions: CPU_ID other than current -> 255
>>>
>>> The initial value is 255 on SMP guests, 0 on UP guests.
>>>
>>> The algorithms are very similar to yours, just using this approximate
>>> representation.
>>>
>>> ll algorithm:
>>> llsc_value = bytemap[vaddr]
>>> if llsc_value == CPU_ID
>>> do nothing
>>> elseif llsc_value < 254
>>> flush TLB of CPU llsc_value
>>> elseif llsc_value == 255
>>> flush all TLBs
>>> set TLB_EXCL
>>> bytemap[vaddr] = 254
>>> load
>>>
>>> tlb_set_page algorithm:
>>> llsc_value = bytemap[vaddr]
>>> if llsc_value == CPU_ID or llsc_value == 255
>>> do nothing
>>> else if llsc_value == 254
>>> set TLB_EXCL
>>> else
>>> # two CPUs without TLB_EXCL
>>> bytemap[vaddr] = 255
>>>
>>> TLB_EXCL slow path algorithm:
>>> if bytemap[vaddr] == 254
>>> bytemap[vaddr] = CPU_ID
>>> else
>>> # two CPUs without TLB_EXCL
>>> bytemap[vaddr] = 255
>>> clear TLB_EXCL in this CPU
>>> store
>>>
>>> sc algorithm:
>>> if bytemap[vaddr] == CPU_ID or bytemap[vaddr] == 254
>>> bytemap[vaddr] = CPU_ID
>>> clear TLB_EXCL in this CPU
>>> store
>>> succeed
>>> else
>>> fail
>>>
>>> clear algorithm:
>>> if bytemap[vaddr] == 254
>>> bytemap[vaddr] = CPU_ID
>>>
>>> The UP case is optimized because bytemap[vaddr] will always be 0 or 254.
>>>
>>> The algorithm requires the LL to be cleared e.g. on exceptions.
>>> Paolo
>>>
>>>> alvise
>>>>
>>>> On Tue, Aug 11, 2015 at 6:32 PM, Paolo Bonzini <pbonzini@redhat.com> wrote:
>>>>>
>>>>>
>>>>> On 11/08/2015 18:11, alvise rigo wrote:
>>>>>>>> Why flush the entire cache (I understand you mean TLB)?
>>>>>> Sorry, I meant the TLB.
>>>>>> If for each removal of an exclusive entry we set also the bit to 1, we
>>>>>> force the following LL to make a tlb_flush() on every vCPU.
>>>>>
>>>>> What if you only flush one entry with tlb_flush_entry (on every vCPU)?
>>>>>
>>>>> Paolo
>
> --
> Alex Bennée
Thread overview: 42+ messages
2015-08-07 17:03 [Qemu-devel] [RFC v4 0/9] Slow-path for atomic instruction translation Alvise Rigo
2015-08-07 17:03 ` [Qemu-devel] [RFC v4 1/9] exec.c: Add new exclusive bitmap to ram_list Alvise Rigo
2015-08-11 13:52 ` Paolo Bonzini
2015-08-11 14:24 ` Peter Maydell
2015-08-11 14:34 ` alvise rigo
2015-08-11 15:54 ` alvise rigo
2015-08-11 15:55 ` Paolo Bonzini
2015-08-11 16:11 ` alvise rigo
2015-08-11 16:32 ` Paolo Bonzini
2015-08-12 7:31 ` alvise rigo
2015-08-12 12:36 ` Paolo Bonzini
2015-08-12 13:02 ` Peter Maydell
2015-08-12 14:04 ` alvise rigo
2015-08-12 14:10 ` Paolo Bonzini
2015-08-12 14:32 ` alvise rigo
2015-09-10 13:04 ` alvise rigo
2015-09-10 16:19 ` Alex Bennée
2015-09-10 17:36 ` alvise rigo
2015-09-10 16:25 ` Paolo Bonzini
2015-08-07 17:03 ` [Qemu-devel] [RFC v4 2/9] softmmu: Add new TLB_EXCL flag Alvise Rigo
2015-08-07 17:03 ` [Qemu-devel] [RFC v4 3/9] softmmu: Add helpers for a new slowpath Alvise Rigo
2015-08-11 13:32 ` alvise rigo
2015-08-11 13:52 ` Paolo Bonzini
2015-08-11 15:55 ` alvise rigo
2015-08-12 12:43 ` Paolo Bonzini
2015-08-12 13:09 ` alvise rigo
2015-08-12 13:14 ` Paolo Bonzini
2015-08-07 17:03 ` [Qemu-devel] [RFC v4 4/9] tcg-op: create new TCG qemu_{ld, st} excl variants Alvise Rigo
2015-08-07 17:03 ` [Qemu-devel] [RFC v4 5/9] configure: Enable/disable new qemu_{ld, st} excl insns Alvise Rigo
2015-08-08 12:44 ` Aurelien Jarno
2015-08-08 13:57 ` Peter Maydell
2015-08-09 8:11 ` Alex Bennée
2015-08-09 8:40 ` Aurelien Jarno
2015-08-09 9:51 ` Alex Bennée
2015-08-09 10:13 ` Aurelien Jarno
2015-08-09 16:27 ` Alex Bennée
2015-08-07 17:03 ` [Qemu-devel] [RFC v4 6/9] tcg-i386: Implement excl variants of qemu_{ld, st} Alvise Rigo
2015-08-08 13:00 ` Aurelien Jarno
2015-08-10 7:50 ` alvise rigo
2015-08-07 17:03 ` [Qemu-devel] [RFC v4 7/9] tcg-arm: " Alvise Rigo
2015-08-07 17:03 ` [Qemu-devel] [RFC v4 8/9] tcg-aarch64: " Alvise Rigo
2015-08-07 17:03 ` [Qemu-devel] [RFC v4 9/9] target-arm: translate: Use ld/st excl for atomic insns Alvise Rigo