All of lore.kernel.org
 help / color / mirror / Atom feed
From: "Alex Bennée" <alex.bennee@linaro.org>
To: Richard Henderson <rth@twiddle.net>
Cc: qemu-devel@nongnu.org, "Emilio G. Cota" <cota@braap.org>
Subject: Re: [Qemu-devel] [PATCH v6 28/35] target-arm: emulate LL/SC using cmpxchg helpers
Date: Thu, 13 Oct 2016 12:43:41 +0100	[thread overview]
Message-ID: <87twcgfp8y.fsf@linaro.org> (raw)
In-Reply-To: <1476214861-31658-29-git-send-email-rth@twiddle.net>


Richard Henderson <rth@twiddle.net> writes:

> From: "Emilio G. Cota" <cota@braap.org>
>
> Emulating LL/SC with cmpxchg is not correct, since it can
> suffer from the ABA problem. Portable parallel code, however,
> is written assuming only cmpxchg--and not LL/SC--is available.
> This means that in practice emulating LL/SC with cmpxchg is
> a viable alternative.
>
> The appended emulates LL/SC pairs in ARM with cmpxchg helpers.
> This works in both user and system mode. In usermode, it avoids
> pausing all other CPUs to perform the LL/SC pair. The subsequent
> performance and scalability improvement is significant, as the
> plots below show. They plot the throughput of atomic_add-bench
> compiled for ARM and executed on a 64-core x86 machine.
>
> Hi-res plots: http://imgur.com/a/aNQpB
>
>                atomic_add-bench: 1000000 ops/thread, [0,1] range
>
>   9 ++---------+----------+----------+----------+----------+----------+---++
>     +cmpxchg +-E--+       +          +          +          +          +    |
>   8 +Emaster +-H--+                                                       ++
>     | |                                                                    |
>   7 ++E                                                                   ++
>     | |                                                                    |
>   6 ++++                                                                  ++
>     |  |                                                                   |
>   5 ++ |                                                                  ++
>   4 ++ |                                                                  ++
>     |  |                                                                   |
>   3 ++ |                                                                  ++
>     |   |                                                                  |
>   2 ++  |                                                                 ++
>     |H++E+---                                  +++  ---+E+------+E+------+E|
>   1 +++     +E+-----+E+------+E+------+E+------+E+--   +++      +++       ++
>     ++H+       +    +++   +  +++     ++++       +          +          +    |
>   0 ++--H----H-+-----H----+----------+----------+----------+----------+---++
>     0          10         20         30         40         50         60
>                                Number of threads
>
>                 atomic_add-bench: 1000000 ops/thread, [0,2] range
>
>   16 ++---------+----------+---------+----------+----------+----------+---++
>      +cmpxchg +-E--+       +         +          +          +          +    |
>   14 ++master +-H--+                                                      ++
>      | |                                                                   |
>   12 ++|                                                                  ++
>      | E                                                                   |
>   10 ++|                                                                  ++
>      | |                                                                   |
>    8 ++++                                                                 ++
>      |E+|                                                                  |
>      |  |                                                                  |
>    6 ++ |                                                                 ++
>      |   |                                                                 |
>    4 ++  |                                                                ++
>      |  +E+---       +++      +++              +++           ---+E+------+E|
>    2 +H+     +E+------E-------+E+-----+E+------+E+------+E+--            +++
>      + |        +    +++   +         ++++       +          +          +    |
>    0 ++H-H----H-+-----H----+---------+----------+----------+----------+---++
>      0          10         20        30         40         50         60
>                                 Number of threads
>
>                atomic_add-bench: 1000000 ops/thread, [0,128] range
>
>   70 ++---------+----------+---------+----------+----------+----------+---++
>      +cmpxchg +-E--+       +         +          +       ++++          +    |
>   60 ++master +-H--+                                 ----E------+E+-------++
>      |                                        -+E+---   +++     +++      +E|
>      |                                +++ ---- +++                       ++|
>   50 ++                       +++  ---+E+-                                ++
>      |                        -E---                                        |
>   40 ++                    ---+++                                         ++
>      |               +++---                                                |
>      |              -+E+                                                   |
>   30 ++      +++----                                                      ++
>      |       +E+                                                           |
>   20 ++ +++--                                                             ++
>      |  +E+                                                                |
>      |+E+                                                                  |
>   10 +E+                                                                  ++
>      +          +          +         +          +          +          +    |
>    0 +HH-H----H-+-----H----+---------+----------+----------+----------+---++
>      0          10         20        30         40         50         60
>                                 Number of threads
>
>               atomic_add-bench: 1000000 ops/thread, [0,1024] range
>
>   120 ++---------+---------+----------+---------+----------+----------+---++
>       +cmpxchg +-E--+      +          +         +          +          +    |
>       | master +-H--+                                                    ++|
>   100 ++                                                              ----E+
>       |                                                 +++  ---+E+---   ++|
>       |                                                --E---   +++        |
>    80 ++                                           ---- +++               ++
>       |                                     ---+E+-                        |
>    60 ++                              -+E+--                              ++
>       |                       +++ ---- +++                                 |
>       |                      -+E+-                                         |
>    40 ++              +++----                                             ++
>       |      +++   ---+E+                                                  |
>       |     -+E+---                                                        |
>    20 ++ +E+                                                              ++
>       |+E+++                                                               |
>       +E+        +         +          +         +          +          +    |
>     0 +HH-H---H--+-----H---+----------+---------+----------+----------+---++
>       0          10        20         30        40         50         60
>                                 Number of threads
>
> [rth: Enforce alignment for ldrexd.]
>
> Reviewed-by: Alex Bennée <alex.bennee@linaro.org>
> Signed-off-by: Emilio G. Cota <cota@braap.org>
> Message-Id: <1467054136-10430-23-git-send-email-cota@braap.org>
> Signed-off-by: Richard Henderson <rth@twiddle.net>
> ---
>  target-arm/translate.c | 136 ++++++++++++++++---------------------------------
>  1 file changed, 45 insertions(+), 91 deletions(-)
>
> diff --git a/target-arm/translate.c b/target-arm/translate.c
> index f745c37..e0e29d9 100644
> --- a/target-arm/translate.c
> +++ b/target-arm/translate.c
> @@ -7683,45 +7683,31 @@ static void gen_logicq_cc(TCGv_i32 lo, TCGv_i32 hi)
>
>  /* Load/Store exclusive instructions are implemented by remembering
>     the value/address loaded, and seeing if these are the same
> -   when the store is performed. This should be sufficient to implement
> +   when the store is performed.  This should be sufficient to implement
>     the architecturally mandated semantics, and avoids having to monitor
> -   regular stores.
> -
> -   In system emulation mode only one CPU will be running at once, so
> -   this sequence is effectively atomic.  In user emulation mode we
> -   throw an exception and handle the atomic operation elsewhere.  */
> +   regular stores.  The compare vs the remembered value is done during
> +   the cmpxchg operation, but we must compare the addresses manually.  */
>  static void gen_load_exclusive(DisasContext *s, int rt, int rt2,
>                                 TCGv_i32 addr, int size)
>  {
>      TCGv_i32 tmp = tcg_temp_new_i32();
> +    TCGMemOp opc = size | MO_ALIGN | s->be_data;
>
>      s->is_ldex = true;
>
> -    switch (size) {
> -    case 0:
> -        gen_aa32_ld8u(s, tmp, addr, get_mem_index(s));
> -        break;
> -    case 1:
> -        gen_aa32_ld16ua(s, tmp, addr, get_mem_index(s));
> -        break;
> -    case 2:
> -    case 3:
> -        gen_aa32_ld32ua(s, tmp, addr, get_mem_index(s));
> -        break;
> -    default:
> -        abort();
> -    }
> -
>      if (size == 3) {
>          TCGv_i32 tmp2 = tcg_temp_new_i32();
> -        TCGv_i32 tmp3 = tcg_temp_new_i32();
> +        TCGv_i64 t64 = tcg_temp_new_i64();
> +
> +        gen_aa32_ld_i64(s, t64, addr, get_mem_index(s), opc);
> +        tcg_gen_mov_i64(cpu_exclusive_val, t64);
> +        tcg_gen_extr_i64_i32(tmp, tmp2, t64);
> +        tcg_temp_free_i64(t64);
>
> -        tcg_gen_addi_i32(tmp2, addr, 4);
> -        gen_aa32_ld32u(s, tmp3, tmp2, get_mem_index(s));
> +        store_reg(s, rt2, tmp2);
>          tcg_temp_free_i32(tmp2);
> -        tcg_gen_concat_i32_i64(cpu_exclusive_val, tmp, tmp3);
> -        store_reg(s, rt2, tmp3);

Magically store_reg frees the tmp it is passed which means this trips up
TCG when debugging in turned on.

>      } else {
> +        gen_aa32_ld_i32(s, tmp, addr, get_mem_index(s), opc);
>          tcg_gen_extu_i32_i64(cpu_exclusive_val, tmp);
>      }
>
> @@ -7734,23 +7720,15 @@ static void gen_clrex(DisasContext *s)
>      tcg_gen_movi_i64(cpu_exclusive_addr, -1);
>  }
>
> -#ifdef CONFIG_USER_ONLY
>  static void gen_store_exclusive(DisasContext *s, int rd, int rt, int rt2,
>                                  TCGv_i32 addr, int size)
>  {
> -    tcg_gen_extu_i32_i64(cpu_exclusive_test, addr);
> -    tcg_gen_movi_i32(cpu_exclusive_info,
> -                     size | (rd << 4) | (rt << 8) | (rt2 << 12));
> -    gen_exception_internal_insn(s, 4, EXCP_STREX);
> -}
> -#else
> -static void gen_store_exclusive(DisasContext *s, int rd, int rt, int rt2,
> -                                TCGv_i32 addr, int size)
> -{
> -    TCGv_i32 tmp;
> -    TCGv_i64 val64, extaddr;
> +    TCGv_i32 t0, t1, t2;
> +    TCGv_i64 extaddr;
> +    TCGv taddr;
>      TCGLabel *done_label;
>      TCGLabel *fail_label;
> +    TCGMemOp opc = size | MO_ALIGN | s->be_data;
>
>      /* if (env->exclusive_addr == addr && env->exclusive_val == [addr]) {
>           [addr] = {Rt};
> @@ -7765,69 +7743,45 @@ static void gen_store_exclusive(DisasContext *s, int rd, int rt, int rt2,
>      tcg_gen_brcond_i64(TCG_COND_NE, extaddr, cpu_exclusive_addr, fail_label);
>      tcg_temp_free_i64(extaddr);
>
> -    tmp = tcg_temp_new_i32();
> -    switch (size) {
> -    case 0:
> -        gen_aa32_ld8u(s, tmp, addr, get_mem_index(s));
> -        break;
> -    case 1:
> -        gen_aa32_ld16u(s, tmp, addr, get_mem_index(s));
> -        break;
> -    case 2:
> -    case 3:
> -        gen_aa32_ld32u(s, tmp, addr, get_mem_index(s));
> -        break;
> -    default:
> -        abort();
> -    }
> -
> -    val64 = tcg_temp_new_i64();
> +    taddr = gen_aa32_addr(s, addr, opc);
> +    t0 = tcg_temp_new_i32();
> +    t1 = load_reg(s, rt);
>      if (size == 3) {
> -        TCGv_i32 tmp2 = tcg_temp_new_i32();
> -        TCGv_i32 tmp3 = tcg_temp_new_i32();
> -        tcg_gen_addi_i32(tmp2, addr, 4);
> -        gen_aa32_ld32u(s, tmp3, tmp2, get_mem_index(s));
> -        tcg_temp_free_i32(tmp2);
> -        tcg_gen_concat_i32_i64(val64, tmp, tmp3);
> -        tcg_temp_free_i32(tmp3);
> -    } else {
> -        tcg_gen_extu_i32_i64(val64, tmp);
> -    }
> -    tcg_temp_free_i32(tmp);
> +        TCGv_i64 o64 = tcg_temp_new_i64();
> +        TCGv_i64 n64 = tcg_temp_new_i64();
>
> -    tcg_gen_brcond_i64(TCG_COND_NE, val64, cpu_exclusive_val, fail_label);
> -    tcg_temp_free_i64(val64);
> +        t2 = load_reg(s, rt2);
> +        tcg_gen_concat_i32_i64(n64, t1, t2);
> +        tcg_temp_free_i32(t2);
> +        gen_aa32_frob64(s, n64);
>
> -    tmp = load_reg(s, rt);
> -    switch (size) {
> -    case 0:
> -        gen_aa32_st8(s, tmp, addr, get_mem_index(s));
> -        break;
> -    case 1:
> -        gen_aa32_st16(s, tmp, addr, get_mem_index(s));
> -        break;
> -    case 2:
> -    case 3:
> -        gen_aa32_st32(s, tmp, addr, get_mem_index(s));
> -        break;
> -    default:
> -        abort();
> -    }
> -    tcg_temp_free_i32(tmp);
> -    if (size == 3) {
> -        tcg_gen_addi_i32(addr, addr, 4);
> -        tmp = load_reg(s, rt2);
> -        gen_aa32_st32(s, tmp, addr, get_mem_index(s));
> -        tcg_temp_free_i32(tmp);
> +        tcg_gen_atomic_cmpxchg_i64(o64, taddr, cpu_exclusive_val, n64,
> +                                   get_mem_index(s), opc);
> +        tcg_temp_free_i64(n64);
> +
> +        gen_aa32_frob64(s, o64);
> +        tcg_gen_setcond_i64(TCG_COND_NE, o64, o64, cpu_exclusive_val);
> +        tcg_gen_extrl_i64_i32(t0, o64);
> +
> +        tcg_temp_free_i64(o64);
> +    } else {
> +        t2 = tcg_temp_new_i32();
> +        tcg_gen_extrl_i64_i32(t2, cpu_exclusive_val);
> +        tcg_gen_atomic_cmpxchg_i32(t0, taddr, t2, t1, get_mem_index(s), opc);
> +        tcg_gen_setcond_i32(TCG_COND_NE, t0, t0, t2);
> +        tcg_temp_free_i32(t2);
>      }
> -    tcg_gen_movi_i32(cpu_R[rd], 0);
> +    tcg_temp_free_i32(t1);
> +    tcg_temp_free(taddr);
> +    tcg_gen_mov_i32(cpu_R[rd], t0);
> +    tcg_temp_free_i32(t0);
>      tcg_gen_br(done_label);
> +
>      gen_set_label(fail_label);
>      tcg_gen_movi_i32(cpu_R[rd], 1);
>      gen_set_label(done_label);
>      tcg_gen_movi_i64(cpu_exclusive_addr, -1);
>  }
> -#endif
>
>  /* gen_srs:
>   * @env: CPUARMState


--
Alex Bennée

  reply	other threads:[~2016-10-13 11:43 UTC|newest]

Thread overview: 51+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2016-10-11 19:40 [Qemu-devel] [PATCH v6 00/35] cmpxchg-based emulation of atomics Richard Henderson
2016-10-11 19:40 ` [Qemu-devel] [PATCH v6 01/35] atomics: add atomic_xor Richard Henderson
2016-10-11 19:40 ` [Qemu-devel] [PATCH v6 02/35] atomics: add atomic_op_fetch variants Richard Henderson
2016-10-11 19:40 ` [Qemu-devel] [PATCH v6 03/35] exec: Avoid direct references to Int128 parts Richard Henderson
2016-10-11 19:40 ` [Qemu-devel] [PATCH v6 04/35] int128: Use __int128 if available Richard Henderson
2016-10-11 19:40 ` [Qemu-devel] [PATCH v6 05/35] int128: Add int128_make128 Richard Henderson
2016-10-11 19:40 ` [Qemu-devel] [PATCH v6 07/35] linux-user: enable parallel code generation on clone Richard Henderson
2016-10-11 19:40 ` [Qemu-devel] [PATCH v6 08/35] cputlb: Replace SHIFT with DATA_SIZE Richard Henderson
2016-10-11 19:40 ` [Qemu-devel] [PATCH v6 09/35] cputlb: Move probe_write out of softmmu_template.h Richard Henderson
2016-10-11 19:40 ` [Qemu-devel] [PATCH v6 10/35] cputlb: Remove includes from softmmu_template.h Richard Henderson
2016-10-11 19:40 ` [Qemu-devel] [PATCH v6 11/35] cputlb: Move most of iotlb code out of line Richard Henderson
2016-10-11 19:40 ` [Qemu-devel] [PATCH v6 12/35] cputlb: Tidy some macros Richard Henderson
2016-10-11 19:40 ` [Qemu-devel] [PATCH v6 13/35] tcg: Add atomic helpers Richard Henderson
2016-10-12 16:16   ` Alex Bennée
2016-10-16 22:17   ` Emilio G. Cota
2016-10-17  1:09     ` Richard Henderson
2016-10-17  1:40     ` Richard Henderson
2016-10-17  3:23       ` Emilio G. Cota
2016-10-11 19:40 ` [Qemu-devel] [PATCH v6 14/35] tcg: Add atomic128 helpers Richard Henderson
2016-10-11 19:40 ` [Qemu-devel] [PATCH v6 15/35] tcg: Add CONFIG_ATOMIC64 Richard Henderson
2016-10-12 16:16   ` Alex Bennée
2016-10-11 19:40 ` [Qemu-devel] [PATCH v6 16/35] tcg: Emit barriers with parallel_cpus Richard Henderson
2016-10-11 19:40 ` [Qemu-devel] [PATCH v6 17/35] target-i386: emulate LOCK'ed cmpxchg using cmpxchg helpers Richard Henderson
2016-10-11 19:40 ` [Qemu-devel] [PATCH v6 18/35] target-i386: emulate LOCK'ed OP instructions using atomic helpers Richard Henderson
2016-10-11 19:40 ` [Qemu-devel] [PATCH v6 19/35] target-i386: emulate LOCK'ed INC using atomic helper Richard Henderson
2016-10-11 19:40 ` [Qemu-devel] [PATCH v6 20/35] target-i386: emulate LOCK'ed NOT " Richard Henderson
2016-10-11 19:40 ` [Qemu-devel] [PATCH v6 21/35] target-i386: emulate LOCK'ed NEG using cmpxchg helper Richard Henderson
2016-10-11 19:40 ` [Qemu-devel] [PATCH v6 22/35] target-i386: emulate LOCK'ed XADD using atomic helper Richard Henderson
2016-10-11 19:40 ` [Qemu-devel] [PATCH v6 23/35] target-i386: emulate LOCK'ed BTX ops using atomic helpers Richard Henderson
2016-10-11 19:40 ` [Qemu-devel] [PATCH v6 24/35] target-i386: emulate XCHG using atomic helper Richard Henderson
2016-10-11 19:40 ` [Qemu-devel] [PATCH v6 25/35] target-i386: remove helper_lock() Richard Henderson
2016-10-11 19:40 ` [Qemu-devel] [PATCH v6 26/35] tests: add atomic_add-bench Richard Henderson
2016-10-14 21:19   ` Emilio G. Cota
2016-10-17  1:49     ` Richard Henderson
2016-10-11 19:40 ` [Qemu-devel] [PATCH v6 27/35] target-arm: Rearrange aa32 load and store functions Richard Henderson
2016-10-11 19:40 ` [Qemu-devel] [PATCH v6 28/35] target-arm: emulate LL/SC using cmpxchg helpers Richard Henderson
2016-10-13 11:43   ` Alex Bennée [this message]
2016-10-11 19:40 ` [Qemu-devel] [PATCH v6 29/35] target-arm: emulate SWP with atomic_xchg helper Richard Henderson
2016-10-11 19:40 ` [Qemu-devel] [PATCH v6 30/35] target-arm: emulate aarch64's LL/SC using cmpxchg helpers Richard Henderson
2016-10-11 19:40 ` [Qemu-devel] [PATCH v6 31/35] linux-user: remove handling of ARM's EXCP_STREX Richard Henderson
2016-10-11 19:40 ` [Qemu-devel] [PATCH v6 32/35] linux-user: remove handling of aarch64's EXCP_STREX Richard Henderson
2016-10-11 19:40 ` [Qemu-devel] [PATCH v6 33/35] target-arm: remove EXCP_STREX + cpu_exclusive_{test, info} Richard Henderson
2016-10-11 19:41 ` [Qemu-devel] [PATCH v6 34/35] target-alpha: Introduce MMU_PHYS_IDX Richard Henderson
2016-10-11 19:41 ` [Qemu-devel] [PATCH v6 35/35] target-alpha: Emulate LL/SC using cmpxchg helpers Richard Henderson
2016-10-16 22:38 ` [Qemu-devel] [PATCH v6 00/35] cmpxchg-based emulation of atomics Emilio G. Cota
2016-10-17  8:17   ` Alex Bennée
2016-10-17 14:37     ` Richard Henderson
2016-10-17 15:33       ` Alex Bennée
2016-10-17 17:56     ` Emilio G. Cota
2016-10-18  8:28       ` Alex Bennée
2016-10-18 18:01         ` Emilio G. Cota

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=87twcgfp8y.fsf@linaro.org \
    --to=alex.bennee@linaro.org \
    --cc=cota@braap.org \
    --cc=qemu-devel@nongnu.org \
    --cc=rth@twiddle.net \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line before the message body.
This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.