* [PATCH 0/5] Add LoongArch v1.1 instructions
@ 2023-10-23 15:29 Jiajie Chen
2023-10-23 15:29 ` [PATCH 1/5] include/exec/memop.h: Add MO_TESB Jiajie Chen
` (5 more replies)
0 siblings, 6 replies; 28+ messages in thread
From: Jiajie Chen @ 2023-10-23 15:29 UTC (permalink / raw)
To: qemu-devel; +Cc: richard.henderson, gaosong, git, Jiajie Chen
Latest revision of LoongArch ISA is out at
https://www.loongson.cn/uploads/images/2023102309132647981.%E9%BE%99%E8%8A%AF%E6%9E%B6%E6%9E%84%E5%8F%82%E8%80%83%E6%89%8B%E5%86%8C%E5%8D%B7%E4%B8%80_r1p10.pdf
(Chinese only). The revision includes the following updates:
- estimated FP reciprocal instructions: frecip -> frecipe, frsqrt ->
frsqrte
- 128-bit width store-conditional instruction: sc.q
- ll.w/d with acquire semantics: llacq.w/d, sc.w/d with release
semantics: screl.w/d
- compare and swap instructions: amcas[_db].b/w/h/d
- byte- and halfword-wide amswap/amadd instructions: am{swap/add}[_db].{b/h}
- new definition for dbar hints
- clarify 32-bit division instruction behavior
- clarify load ordering when accessing the same address
- introduce message-signaled interrupts
- introduce hardware page table walker
The new revision is implemented in the soon-to-be-released Loongson
3A6000 processor.
This patch series implements the new instructions except sc.q, because I
do not know how to match a pair of ll.d to sc.q.
Jiajie Chen (5):
include/exec/memop.h: Add MO_TESB
target/loongarch: Add am{swap/add}[_db].{b/h}
target/loongarch: Add amcas[_db].{b/h/w/d}
target/loongarch: Add estimated reciprocal instructions
target/loongarch: Add llacq/screl instructions
include/exec/memop.h | 1 +
target/loongarch/cpu.h | 4 ++
target/loongarch/disas.c | 32 ++++++++++++
.../loongarch/insn_trans/trans_atomic.c.inc | 52 +++++++++++++++++++
.../loongarch/insn_trans/trans_farith.c.inc | 4 ++
target/loongarch/insn_trans/trans_vec.c.inc | 8 +++
target/loongarch/insns.decode | 32 ++++++++++++
target/loongarch/translate.h | 27 +++++++---
8 files changed, 152 insertions(+), 8 deletions(-)
--
2.42.0
^ permalink raw reply [flat|nested] 28+ messages in thread
* [PATCH 1/5] include/exec/memop.h: Add MO_TESB
2023-10-23 15:29 [PATCH 0/5] Add LoongArch v1.1 instructions Jiajie Chen
@ 2023-10-23 15:29 ` Jiajie Chen
2023-10-23 15:49 ` David Hildenbrand
2023-10-23 15:29 ` [PATCH 2/5] target/loongarch: Add am{swap/add}[_db].{b/h} Jiajie Chen
` (4 subsequent siblings)
5 siblings, 1 reply; 28+ messages in thread
From: Jiajie Chen @ 2023-10-23 15:29 UTC (permalink / raw)
To: qemu-devel
Cc: richard.henderson, gaosong, git, Jiajie Chen, Paolo Bonzini,
Peter Xu, David Hildenbrand, Philippe Mathieu-Daudé
Signed-off-by: Jiajie Chen <c@jia.je>
---
include/exec/memop.h | 1 +
1 file changed, 1 insertion(+)
diff --git a/include/exec/memop.h b/include/exec/memop.h
index a86dc6743a..834327c62d 100644
--- a/include/exec/memop.h
+++ b/include/exec/memop.h
@@ -140,6 +140,7 @@ typedef enum MemOp {
MO_TEUL = MO_TE | MO_UL,
MO_TEUQ = MO_TE | MO_UQ,
MO_TEUO = MO_TE | MO_UO,
+ MO_TESB = MO_TE | MO_SB,
MO_TESW = MO_TE | MO_SW,
MO_TESL = MO_TE | MO_SL,
MO_TESQ = MO_TE | MO_SQ,
--
2.42.0
* [PATCH 2/5] target/loongarch: Add am{swap/add}[_db].{b/h}
2023-10-23 15:29 [PATCH 0/5] Add LoongArch v1.1 instructions Jiajie Chen
2023-10-23 15:29 ` [PATCH 1/5] include/exec/memop.h: Add MO_TESB Jiajie Chen
@ 2023-10-23 15:29 ` Jiajie Chen
2023-10-23 22:50 ` Richard Henderson
2023-10-23 15:29 ` [PATCH 3/5] target/loongarch: Add amcas[_db].{b/h/w/d} Jiajie Chen
` (3 subsequent siblings)
5 siblings, 1 reply; 28+ messages in thread
From: Jiajie Chen @ 2023-10-23 15:29 UTC (permalink / raw)
To: qemu-devel; +Cc: richard.henderson, gaosong, git, Jiajie Chen
The new instructions are introduced in LoongArch v1.1:
- amswap.b
- amswap.h
- amadd.b
- amadd.h
- amswap_db.b
- amswap_db.h
- amadd_db.b
- amadd_db.h
The instructions are gated by CPUCFG2.LAM_BH.
Signed-off-by: Jiajie Chen <c@jia.je>
---
target/loongarch/cpu.h | 1 +
target/loongarch/disas.c | 8 ++++++++
target/loongarch/insn_trans/trans_atomic.c.inc | 8 ++++++++
target/loongarch/insns.decode | 8 ++++++++
target/loongarch/translate.h | 17 +++++++++--------
5 files changed, 34 insertions(+), 8 deletions(-)
diff --git a/target/loongarch/cpu.h b/target/loongarch/cpu.h
index 8b54cf109c..7166c07756 100644
--- a/target/loongarch/cpu.h
+++ b/target/loongarch/cpu.h
@@ -155,6 +155,7 @@ FIELD(CPUCFG2, LBT_ARM, 19, 1)
FIELD(CPUCFG2, LBT_MIPS, 20, 1)
FIELD(CPUCFG2, LSPW, 21, 1)
FIELD(CPUCFG2, LAM, 22, 1)
+FIELD(CPUCFG2, LAM_BH, 27, 1)
/* cpucfg[3] bits */
FIELD(CPUCFG3, CCDMA, 0, 1)
diff --git a/target/loongarch/disas.c b/target/loongarch/disas.c
index 2040f3e44d..d33aa8173a 100644
--- a/target/loongarch/disas.c
+++ b/target/loongarch/disas.c
@@ -575,6 +575,14 @@ INSN(fldx_s, frr)
INSN(fldx_d, frr)
INSN(fstx_s, frr)
INSN(fstx_d, frr)
+INSN(amswap_b, rrr)
+INSN(amswap_h, rrr)
+INSN(amadd_b, rrr)
+INSN(amadd_h, rrr)
+INSN(amswap_db_b, rrr)
+INSN(amswap_db_h, rrr)
+INSN(amadd_db_b, rrr)
+INSN(amadd_db_h, rrr)
INSN(amswap_w, rrr)
INSN(amswap_d, rrr)
INSN(amadd_w, rrr)
diff --git a/target/loongarch/insn_trans/trans_atomic.c.inc b/target/loongarch/insn_trans/trans_atomic.c.inc
index 80c2e286fd..cd28e217ad 100644
--- a/target/loongarch/insn_trans/trans_atomic.c.inc
+++ b/target/loongarch/insn_trans/trans_atomic.c.inc
@@ -73,6 +73,14 @@ TRANS(ll_w, ALL, gen_ll, MO_TESL)
TRANS(sc_w, ALL, gen_sc, MO_TESL)
TRANS(ll_d, 64, gen_ll, MO_TEUQ)
TRANS(sc_d, 64, gen_sc, MO_TEUQ)
+TRANS(amswap_b, LAM_BH, gen_am, tcg_gen_atomic_xchg_tl, MO_TESB)
+TRANS(amswap_h, LAM_BH, gen_am, tcg_gen_atomic_xchg_tl, MO_TESW)
+TRANS(amadd_b, LAM_BH, gen_am, tcg_gen_atomic_fetch_add_tl, MO_TESB)
+TRANS(amadd_h, LAM_BH, gen_am, tcg_gen_atomic_fetch_add_tl, MO_TESW)
+TRANS(amswap_db_b, LAM_BH, gen_am, tcg_gen_atomic_xchg_tl, MO_TESB)
+TRANS(amswap_db_h, LAM_BH, gen_am, tcg_gen_atomic_xchg_tl, MO_TESW)
+TRANS(amadd_db_b, LAM_BH, gen_am, tcg_gen_atomic_fetch_add_tl, MO_TESB)
+TRANS(amadd_db_h, LAM_BH, gen_am, tcg_gen_atomic_fetch_add_tl, MO_TESW)
TRANS(amswap_w, LAM, gen_am, tcg_gen_atomic_xchg_tl, MO_TESL)
TRANS(amswap_d, LAM, gen_am, tcg_gen_atomic_xchg_tl, MO_TEUQ)
TRANS(amadd_w, LAM, gen_am, tcg_gen_atomic_fetch_add_tl, MO_TESL)
diff --git a/target/loongarch/insns.decode b/target/loongarch/insns.decode
index 62f58cc541..678ce42038 100644
--- a/target/loongarch/insns.decode
+++ b/target/loongarch/insns.decode
@@ -261,6 +261,14 @@ ll_w 0010 0000 .............. ..... ..... @rr_i14s2
sc_w 0010 0001 .............. ..... ..... @rr_i14s2
ll_d 0010 0010 .............. ..... ..... @rr_i14s2
sc_d 0010 0011 .............. ..... ..... @rr_i14s2
+amswap_b 0011 10000101 11000 ..... ..... ..... @rrr
+amswap_h 0011 10000101 11001 ..... ..... ..... @rrr
+amadd_b 0011 10000101 11010 ..... ..... ..... @rrr
+amadd_h 0011 10000101 11011 ..... ..... ..... @rrr
+amswap_db_b 0011 10000101 11100 ..... ..... ..... @rrr
+amswap_db_h 0011 10000101 11101 ..... ..... ..... @rrr
+amadd_db_b 0011 10000101 11110 ..... ..... ..... @rrr
+amadd_db_h 0011 10000101 11111 ..... ..... ..... @rrr
amswap_w 0011 10000110 00000 ..... ..... ..... @rrr
amswap_d 0011 10000110 00001 ..... ..... ..... @rrr
amadd_w 0011 10000110 00010 ..... ..... ..... @rrr
diff --git a/target/loongarch/translate.h b/target/loongarch/translate.h
index 195f53573a..0b230530e7 100644
--- a/target/loongarch/translate.h
+++ b/target/loongarch/translate.h
@@ -17,14 +17,15 @@
#define avail_ALL(C) true
#define avail_64(C) (FIELD_EX32((C)->cpucfg1, CPUCFG1, ARCH) == \
CPUCFG1_ARCH_LA64)
-#define avail_FP(C) (FIELD_EX32((C)->cpucfg2, CPUCFG2, FP))
-#define avail_FP_SP(C) (FIELD_EX32((C)->cpucfg2, CPUCFG2, FP_SP))
-#define avail_FP_DP(C) (FIELD_EX32((C)->cpucfg2, CPUCFG2, FP_DP))
-#define avail_LSPW(C) (FIELD_EX32((C)->cpucfg2, CPUCFG2, LSPW))
-#define avail_LAM(C) (FIELD_EX32((C)->cpucfg2, CPUCFG2, LAM))
-#define avail_LSX(C) (FIELD_EX32((C)->cpucfg2, CPUCFG2, LSX))
-#define avail_LASX(C) (FIELD_EX32((C)->cpucfg2, CPUCFG2, LASX))
-#define avail_IOCSR(C) (FIELD_EX32((C)->cpucfg1, CPUCFG1, IOCSR))
+#define avail_FP(C) (FIELD_EX32((C)->cpucfg2, CPUCFG2, FP))
+#define avail_FP_SP(C) (FIELD_EX32((C)->cpucfg2, CPUCFG2, FP_SP))
+#define avail_FP_DP(C) (FIELD_EX32((C)->cpucfg2, CPUCFG2, FP_DP))
+#define avail_LSPW(C) (FIELD_EX32((C)->cpucfg2, CPUCFG2, LSPW))
+#define avail_LAM(C) (FIELD_EX32((C)->cpucfg2, CPUCFG2, LAM))
+#define avail_LAM_BH(C) (FIELD_EX32((C)->cpucfg2, CPUCFG2, LAM_BH))
+#define avail_LSX(C) (FIELD_EX32((C)->cpucfg2, CPUCFG2, LSX))
+#define avail_LASX(C) (FIELD_EX32((C)->cpucfg2, CPUCFG2, LASX))
+#define avail_IOCSR(C) (FIELD_EX32((C)->cpucfg1, CPUCFG1, IOCSR))
/*
* If an operation is being performed on less than TARGET_LONG_BITS,
--
2.42.0
* [PATCH 3/5] target/loongarch: Add amcas[_db].{b/h/w/d}
2023-10-23 15:29 [PATCH 0/5] Add LoongArch v1.1 instructions Jiajie Chen
2023-10-23 15:29 ` [PATCH 1/5] include/exec/memop.h: Add MO_TESB Jiajie Chen
2023-10-23 15:29 ` [PATCH 2/5] target/loongarch: Add am{swap/add}[_db].{b/h} Jiajie Chen
@ 2023-10-23 15:29 ` Jiajie Chen
2023-10-23 15:35 ` Jiajie Chen
2023-10-23 22:59 ` Richard Henderson
2023-10-23 15:29 ` [PATCH 4/5] target/loongarch: Add estimated reciprocal instructions Jiajie Chen
` (2 subsequent siblings)
5 siblings, 2 replies; 28+ messages in thread
From: Jiajie Chen @ 2023-10-23 15:29 UTC (permalink / raw)
To: qemu-devel; +Cc: richard.henderson, gaosong, git, Jiajie Chen
The new instructions are introduced in LoongArch v1.1:
- amcas.b
- amcas.h
- amcas.w
- amcas.d
- amcas_db.b
- amcas_db.h
- amcas_db.w
- amcas_db.d
The new instructions are gated by CPUCFG2.LAMCAS.
Signed-off-by: Jiajie Chen <c@jia.je>
---
target/loongarch/cpu.h | 1 +
target/loongarch/disas.c | 8 +++++++
.../loongarch/insn_trans/trans_atomic.c.inc | 24 +++++++++++++++++++
target/loongarch/insns.decode | 8 +++++++
target/loongarch/translate.h | 1 +
5 files changed, 42 insertions(+)
diff --git a/target/loongarch/cpu.h b/target/loongarch/cpu.h
index 7166c07756..80a476c3f8 100644
--- a/target/loongarch/cpu.h
+++ b/target/loongarch/cpu.h
@@ -156,6 +156,7 @@ FIELD(CPUCFG2, LBT_MIPS, 20, 1)
FIELD(CPUCFG2, LSPW, 21, 1)
FIELD(CPUCFG2, LAM, 22, 1)
FIELD(CPUCFG2, LAM_BH, 27, 1)
+FIELD(CPUCFG2, LAMCAS, 28, 1)
/* cpucfg[3] bits */
FIELD(CPUCFG3, CCDMA, 0, 1)
diff --git a/target/loongarch/disas.c b/target/loongarch/disas.c
index d33aa8173a..4aa67749cf 100644
--- a/target/loongarch/disas.c
+++ b/target/loongarch/disas.c
@@ -575,6 +575,14 @@ INSN(fldx_s, frr)
INSN(fldx_d, frr)
INSN(fstx_s, frr)
INSN(fstx_d, frr)
+INSN(amcas_b, rrr)
+INSN(amcas_h, rrr)
+INSN(amcas_w, rrr)
+INSN(amcas_d, rrr)
+INSN(amcas_db_b, rrr)
+INSN(amcas_db_h, rrr)
+INSN(amcas_db_w, rrr)
+INSN(amcas_db_d, rrr)
INSN(amswap_b, rrr)
INSN(amswap_h, rrr)
INSN(amadd_b, rrr)
diff --git a/target/loongarch/insn_trans/trans_atomic.c.inc b/target/loongarch/insn_trans/trans_atomic.c.inc
index cd28e217ad..bea567fdaf 100644
--- a/target/loongarch/insn_trans/trans_atomic.c.inc
+++ b/target/loongarch/insn_trans/trans_atomic.c.inc
@@ -45,6 +45,22 @@ static bool gen_sc(DisasContext *ctx, arg_rr_i *a, MemOp mop)
return true;
}
+static bool gen_cas(DisasContext *ctx, arg_rrr *a,
+ void (*func)(TCGv, TCGv, TCGv, TCGv, TCGArg, MemOp),
+ MemOp mop)
+{
+ TCGv dest = gpr_dst(ctx, a->rd, EXT_NONE);
+ TCGv addr = gpr_src(ctx, a->rj, EXT_NONE);
+ TCGv val = gpr_src(ctx, a->rk, EXT_NONE);
+
+ addr = make_address_i(ctx, addr, 0);
+
+ func(dest, addr, dest, val, ctx->mem_idx, mop);
+ gen_set_gpr(a->rd, dest, EXT_NONE);
+
+ return true;
+}
+
static bool gen_am(DisasContext *ctx, arg_rrr *a,
void (*func)(TCGv, TCGv, TCGv, TCGArg, MemOp),
MemOp mop)
@@ -73,6 +89,14 @@ TRANS(ll_w, ALL, gen_ll, MO_TESL)
TRANS(sc_w, ALL, gen_sc, MO_TESL)
TRANS(ll_d, 64, gen_ll, MO_TEUQ)
TRANS(sc_d, 64, gen_sc, MO_TEUQ)
+TRANS(amcas_b, LAMCAS, gen_cas, tcg_gen_atomic_cmpxchg_tl, MO_TESB)
+TRANS(amcas_h, LAMCAS, gen_cas, tcg_gen_atomic_cmpxchg_tl, MO_TESW)
+TRANS(amcas_w, LAMCAS, gen_cas, tcg_gen_atomic_cmpxchg_tl, MO_TESL)
+TRANS(amcas_d, LAMCAS, gen_cas, tcg_gen_atomic_cmpxchg_tl, MO_TEUQ)
+TRANS(amcas_db_b, LAMCAS, gen_cas, tcg_gen_atomic_cmpxchg_tl, MO_TESB)
+TRANS(amcas_db_h, LAMCAS, gen_cas, tcg_gen_atomic_cmpxchg_tl, MO_TESW)
+TRANS(amcas_db_w, LAMCAS, gen_cas, tcg_gen_atomic_cmpxchg_tl, MO_TESL)
+TRANS(amcas_db_d, LAMCAS, gen_cas, tcg_gen_atomic_cmpxchg_tl, MO_TEUQ)
TRANS(amswap_b, LAM_BH, gen_am, tcg_gen_atomic_xchg_tl, MO_TESB)
TRANS(amswap_h, LAM_BH, gen_am, tcg_gen_atomic_xchg_tl, MO_TESW)
TRANS(amadd_b, LAM_BH, gen_am, tcg_gen_atomic_fetch_add_tl, MO_TESB)
diff --git a/target/loongarch/insns.decode b/target/loongarch/insns.decode
index 678ce42038..cf4123cd46 100644
--- a/target/loongarch/insns.decode
+++ b/target/loongarch/insns.decode
@@ -261,6 +261,14 @@ ll_w 0010 0000 .............. ..... ..... @rr_i14s2
sc_w 0010 0001 .............. ..... ..... @rr_i14s2
ll_d 0010 0010 .............. ..... ..... @rr_i14s2
sc_d 0010 0011 .............. ..... ..... @rr_i14s2
+amcas_b 0011 10000101 10000 ..... ..... ..... @rrr
+amcas_h 0011 10000101 10001 ..... ..... ..... @rrr
+amcas_w 0011 10000101 10010 ..... ..... ..... @rrr
+amcas_d 0011 10000101 10011 ..... ..... ..... @rrr
+amcas_db_b 0011 10000101 10100 ..... ..... ..... @rrr
+amcas_db_h 0011 10000101 10101 ..... ..... ..... @rrr
+amcas_db_w 0011 10000101 10110 ..... ..... ..... @rrr
+amcas_db_d 0011 10000101 10111 ..... ..... ..... @rrr
amswap_b 0011 10000101 11000 ..... ..... ..... @rrr
amswap_h 0011 10000101 11001 ..... ..... ..... @rrr
amadd_b 0011 10000101 11010 ..... ..... ..... @rrr
diff --git a/target/loongarch/translate.h b/target/loongarch/translate.h
index 0b230530e7..3affefdafc 100644
--- a/target/loongarch/translate.h
+++ b/target/loongarch/translate.h
@@ -23,6 +23,7 @@
#define avail_LSPW(C) (FIELD_EX32((C)->cpucfg2, CPUCFG2, LSPW))
#define avail_LAM(C) (FIELD_EX32((C)->cpucfg2, CPUCFG2, LAM))
#define avail_LAM_BH(C) (FIELD_EX32((C)->cpucfg2, CPUCFG2, LAM_BH))
+#define avail_LAMCAS(C) (FIELD_EX32((C)->cpucfg2, CPUCFG2, LAMCAS))
#define avail_LSX(C) (FIELD_EX32((C)->cpucfg2, CPUCFG2, LSX))
#define avail_LASX(C) (FIELD_EX32((C)->cpucfg2, CPUCFG2, LASX))
#define avail_IOCSR(C) (FIELD_EX32((C)->cpucfg1, CPUCFG1, IOCSR))
--
2.42.0
* [PATCH 4/5] target/loongarch: Add estimated reciprocal instructions
2023-10-23 15:29 [PATCH 0/5] Add LoongArch v1.1 instructions Jiajie Chen
` (2 preceding siblings ...)
2023-10-23 15:29 ` [PATCH 3/5] target/loongarch: Add amcas[_db].{b/h/w/d} Jiajie Chen
@ 2023-10-23 15:29 ` Jiajie Chen
2023-10-23 23:02 ` Richard Henderson
2023-10-23 15:29 ` [PATCH 5/5] target/loongarch: Add llacq/screl instructions Jiajie Chen
2023-10-23 23:26 ` [PATCH 0/5] Add LoongArch v1.1 instructions Richard Henderson
5 siblings, 1 reply; 28+ messages in thread
From: Jiajie Chen @ 2023-10-23 15:29 UTC (permalink / raw)
To: qemu-devel; +Cc: richard.henderson, gaosong, git, Jiajie Chen
Add the following new instructions in LoongArch v1.1:
- frecipe.s
- frecipe.d
- frsqrte.s
- frsqrte.d
- vfrecipe.s
- vfrecipe.d
- vfrsqrte.s
- vfrsqrte.d
- xvfrecipe.s
- xvfrecipe.d
- xvfrsqrte.s
- xvfrsqrte.d
They are guarded by CPUCFG2.FRECIPE. Although the instructions allow an
implementation to improve performance by reducing precision, we use the
existing softfloat implementation.
Signed-off-by: Jiajie Chen <c@jia.je>
---
target/loongarch/cpu.h | 1 +
target/loongarch/disas.c | 12 ++++++++++++
target/loongarch/insn_trans/trans_farith.c.inc | 4 ++++
target/loongarch/insn_trans/trans_vec.c.inc | 8 ++++++++
target/loongarch/insns.decode | 12 ++++++++++++
target/loongarch/translate.h | 6 ++++++
6 files changed, 43 insertions(+)
diff --git a/target/loongarch/cpu.h b/target/loongarch/cpu.h
index 80a476c3f8..8f938effa8 100644
--- a/target/loongarch/cpu.h
+++ b/target/loongarch/cpu.h
@@ -155,6 +155,7 @@ FIELD(CPUCFG2, LBT_ARM, 19, 1)
FIELD(CPUCFG2, LBT_MIPS, 20, 1)
FIELD(CPUCFG2, LSPW, 21, 1)
FIELD(CPUCFG2, LAM, 22, 1)
+FIELD(CPUCFG2, FRECIPE, 25, 1)
FIELD(CPUCFG2, LAM_BH, 27, 1)
FIELD(CPUCFG2, LAMCAS, 28, 1)
diff --git a/target/loongarch/disas.c b/target/loongarch/disas.c
index 4aa67749cf..9eb49fb5e3 100644
--- a/target/loongarch/disas.c
+++ b/target/loongarch/disas.c
@@ -473,6 +473,10 @@ INSN(frecip_s, ff)
INSN(frecip_d, ff)
INSN(frsqrt_s, ff)
INSN(frsqrt_d, ff)
+INSN(frecipe_s, ff)
+INSN(frecipe_d, ff)
+INSN(frsqrte_s, ff)
+INSN(frsqrte_d, ff)
INSN(fmov_s, ff)
INSN(fmov_d, ff)
INSN(movgr2fr_w, fr)
@@ -1424,6 +1428,10 @@ INSN_LSX(vfrecip_s, vv)
INSN_LSX(vfrecip_d, vv)
INSN_LSX(vfrsqrt_s, vv)
INSN_LSX(vfrsqrt_d, vv)
+INSN_LSX(vfrecipe_s, vv)
+INSN_LSX(vfrecipe_d, vv)
+INSN_LSX(vfrsqrte_s, vv)
+INSN_LSX(vfrsqrte_d, vv)
INSN_LSX(vfcvtl_s_h, vv)
INSN_LSX(vfcvth_s_h, vv)
@@ -2338,6 +2346,10 @@ INSN_LASX(xvfrecip_s, vv)
INSN_LASX(xvfrecip_d, vv)
INSN_LASX(xvfrsqrt_s, vv)
INSN_LASX(xvfrsqrt_d, vv)
+INSN_LASX(xvfrecipe_s, vv)
+INSN_LASX(xvfrecipe_d, vv)
+INSN_LASX(xvfrsqrte_s, vv)
+INSN_LASX(xvfrsqrte_d, vv)
INSN_LASX(xvfcvtl_s_h, vv)
INSN_LASX(xvfcvth_s_h, vv)
diff --git a/target/loongarch/insn_trans/trans_farith.c.inc b/target/loongarch/insn_trans/trans_farith.c.inc
index f4a0dea727..356cdf99b7 100644
--- a/target/loongarch/insn_trans/trans_farith.c.inc
+++ b/target/loongarch/insn_trans/trans_farith.c.inc
@@ -191,6 +191,10 @@ TRANS(frecip_s, FP_SP, gen_ff, gen_helper_frecip_s)
TRANS(frecip_d, FP_DP, gen_ff, gen_helper_frecip_d)
TRANS(frsqrt_s, FP_SP, gen_ff, gen_helper_frsqrt_s)
TRANS(frsqrt_d, FP_DP, gen_ff, gen_helper_frsqrt_d)
+TRANS(frecipe_s, FRECIPE_FP_SP, gen_ff, gen_helper_frecip_s)
+TRANS(frecipe_d, FRECIPE_FP_DP, gen_ff, gen_helper_frecip_d)
+TRANS(frsqrte_s, FRECIPE_FP_SP, gen_ff, gen_helper_frsqrt_s)
+TRANS(frsqrte_d, FRECIPE_FP_DP, gen_ff, gen_helper_frsqrt_d)
TRANS(flogb_s, FP_SP, gen_ff, gen_helper_flogb_s)
TRANS(flogb_d, FP_DP, gen_ff, gen_helper_flogb_d)
TRANS(fclass_s, FP_SP, gen_ff, gen_helper_fclass_s)
diff --git a/target/loongarch/insn_trans/trans_vec.c.inc b/target/loongarch/insn_trans/trans_vec.c.inc
index 98f856bb29..1c93e19ac4 100644
--- a/target/loongarch/insn_trans/trans_vec.c.inc
+++ b/target/loongarch/insn_trans/trans_vec.c.inc
@@ -4409,12 +4409,20 @@ TRANS(vfrecip_s, LSX, gen_vv_ptr, gen_helper_vfrecip_s)
TRANS(vfrecip_d, LSX, gen_vv_ptr, gen_helper_vfrecip_d)
TRANS(vfrsqrt_s, LSX, gen_vv_ptr, gen_helper_vfrsqrt_s)
TRANS(vfrsqrt_d, LSX, gen_vv_ptr, gen_helper_vfrsqrt_d)
+TRANS(vfrecipe_s, FRECIPE_LSX, gen_vv_ptr, gen_helper_vfrecip_s)
+TRANS(vfrecipe_d, FRECIPE_LSX, gen_vv_ptr, gen_helper_vfrecip_d)
+TRANS(vfrsqrte_s, FRECIPE_LSX, gen_vv_ptr, gen_helper_vfrsqrt_s)
+TRANS(vfrsqrte_d, FRECIPE_LSX, gen_vv_ptr, gen_helper_vfrsqrt_d)
TRANS(xvfsqrt_s, LASX, gen_xx_ptr, gen_helper_vfsqrt_s)
TRANS(xvfsqrt_d, LASX, gen_xx_ptr, gen_helper_vfsqrt_d)
TRANS(xvfrecip_s, LASX, gen_xx_ptr, gen_helper_vfrecip_s)
TRANS(xvfrecip_d, LASX, gen_xx_ptr, gen_helper_vfrecip_d)
TRANS(xvfrsqrt_s, LASX, gen_xx_ptr, gen_helper_vfrsqrt_s)
TRANS(xvfrsqrt_d, LASX, gen_xx_ptr, gen_helper_vfrsqrt_d)
+TRANS(xvfrecipe_s, FRECIPE_LASX, gen_xx_ptr, gen_helper_vfrecip_s)
+TRANS(xvfrecipe_d, FRECIPE_LASX, gen_xx_ptr, gen_helper_vfrecip_d)
+TRANS(xvfrsqrte_s, FRECIPE_LASX, gen_xx_ptr, gen_helper_vfrsqrt_s)
+TRANS(xvfrsqrte_d, FRECIPE_LASX, gen_xx_ptr, gen_helper_vfrsqrt_d)
TRANS(vfcvtl_s_h, LSX, gen_vv_ptr, gen_helper_vfcvtl_s_h)
TRANS(vfcvth_s_h, LSX, gen_vv_ptr, gen_helper_vfcvth_s_h)
diff --git a/target/loongarch/insns.decode b/target/loongarch/insns.decode
index cf4123cd46..92078f0f9f 100644
--- a/target/loongarch/insns.decode
+++ b/target/loongarch/insns.decode
@@ -371,6 +371,10 @@ frecip_s 0000 00010001 01000 10101 ..... ..... @ff
frecip_d 0000 00010001 01000 10110 ..... ..... @ff
frsqrt_s 0000 00010001 01000 11001 ..... ..... @ff
frsqrt_d 0000 00010001 01000 11010 ..... ..... @ff
+frecipe_s 0000 00010001 01000 11101 ..... ..... @ff
+frecipe_d 0000 00010001 01000 11110 ..... ..... @ff
+frsqrte_s 0000 00010001 01001 00001 ..... ..... @ff
+frsqrte_d 0000 00010001 01001 00010 ..... ..... @ff
fscaleb_s 0000 00010001 00001 ..... ..... ..... @fff
fscaleb_d 0000 00010001 00010 ..... ..... ..... @fff
flogb_s 0000 00010001 01000 01001 ..... ..... @ff
@@ -1115,6 +1119,10 @@ vfrecip_s 0111 00101001 11001 11101 ..... ..... @vv
vfrecip_d 0111 00101001 11001 11110 ..... ..... @vv
vfrsqrt_s 0111 00101001 11010 00001 ..... ..... @vv
vfrsqrt_d 0111 00101001 11010 00010 ..... ..... @vv
+vfrecipe_s 0111 00101001 11010 00101 ..... ..... @vv
+vfrecipe_d 0111 00101001 11010 00110 ..... ..... @vv
+vfrsqrte_s 0111 00101001 11010 01001 ..... ..... @vv
+vfrsqrte_d 0111 00101001 11010 01010 ..... ..... @vv
vfcvtl_s_h 0111 00101001 11011 11010 ..... ..... @vv
vfcvth_s_h 0111 00101001 11011 11011 ..... ..... @vv
@@ -1879,6 +1887,10 @@ xvfrecip_s 0111 01101001 11001 11101 ..... ..... @vv
xvfrecip_d 0111 01101001 11001 11110 ..... ..... @vv
xvfrsqrt_s 0111 01101001 11010 00001 ..... ..... @vv
xvfrsqrt_d 0111 01101001 11010 00010 ..... ..... @vv
+xvfrecipe_s 0111 01101001 11010 00101 ..... ..... @vv
+xvfrecipe_d 0111 01101001 11010 00110 ..... ..... @vv
+xvfrsqrte_s 0111 01101001 11010 01001 ..... ..... @vv
+xvfrsqrte_d 0111 01101001 11010 01010 ..... ..... @vv
xvfcvtl_s_h 0111 01101001 11011 11010 ..... ..... @vv
xvfcvth_s_h 0111 01101001 11011 11011 ..... ..... @vv
diff --git a/target/loongarch/translate.h b/target/loongarch/translate.h
index 3affefdafc..651c5796ca 100644
--- a/target/loongarch/translate.h
+++ b/target/loongarch/translate.h
@@ -28,6 +28,12 @@
#define avail_LASX(C) (FIELD_EX32((C)->cpucfg2, CPUCFG2, LASX))
#define avail_IOCSR(C) (FIELD_EX32((C)->cpucfg1, CPUCFG1, IOCSR))
+#define avail_FRECIPE(C) (FIELD_EX32((C)->cpucfg2, CPUCFG2, FRECIPE))
+#define avail_FRECIPE_FP_SP(C) (avail_FRECIPE(C) && avail_FP_SP(C))
+#define avail_FRECIPE_FP_DP(C) (avail_FRECIPE(C) && avail_FP_DP(C))
+#define avail_FRECIPE_LSX(C) (avail_FRECIPE(C) && avail_LSX(C))
+#define avail_FRECIPE_LASX(C) (avail_FRECIPE(C) && avail_LASX(C))
+
/*
* If an operation is being performed on less than TARGET_LONG_BITS,
* it may require the inputs to be sign- or zero-extended; which will
--
2.42.0
* [PATCH 5/5] target/loongarch: Add llacq/screl instructions
2023-10-23 15:29 [PATCH 0/5] Add LoongArch v1.1 instructions Jiajie Chen
` (3 preceding siblings ...)
2023-10-23 15:29 ` [PATCH 4/5] target/loongarch: Add estimated reciprocal instructions Jiajie Chen
@ 2023-10-23 15:29 ` Jiajie Chen
2023-10-23 23:19 ` Richard Henderson
2023-10-23 23:26 ` [PATCH 0/5] Add LoongArch v1.1 instructions Richard Henderson
5 siblings, 1 reply; 28+ messages in thread
From: Jiajie Chen @ 2023-10-23 15:29 UTC (permalink / raw)
To: qemu-devel; +Cc: richard.henderson, gaosong, git, Jiajie Chen
Add the following instructions in LoongArch v1.1:
- llacq.w
- screl.w
- llacq.d
- screl.d
They are guarded by CPUCFG2.LLACQ_SCREL.
Signed-off-by: Jiajie Chen <c@jia.je>
---
target/loongarch/cpu.h | 1 +
target/loongarch/disas.c | 4 ++++
.../loongarch/insn_trans/trans_atomic.c.inc | 20 +++++++++++++++++++
target/loongarch/insns.decode | 4 ++++
target/loongarch/translate.h | 3 +++
5 files changed, 32 insertions(+)
diff --git a/target/loongarch/cpu.h b/target/loongarch/cpu.h
index 8f938effa8..f0a63d5484 100644
--- a/target/loongarch/cpu.h
+++ b/target/loongarch/cpu.h
@@ -158,6 +158,7 @@ FIELD(CPUCFG2, LAM, 22, 1)
FIELD(CPUCFG2, FRECIPE, 25, 1)
FIELD(CPUCFG2, LAM_BH, 27, 1)
FIELD(CPUCFG2, LAMCAS, 28, 1)
+FIELD(CPUCFG2, LLACQ_SCREL, 29, 1)
/* cpucfg[3] bits */
FIELD(CPUCFG3, CCDMA, 0, 1)
diff --git a/target/loongarch/disas.c b/target/loongarch/disas.c
index 9eb49fb5e3..8e02f51ddc 100644
--- a/target/loongarch/disas.c
+++ b/target/loongarch/disas.c
@@ -579,6 +579,10 @@ INSN(fldx_s, frr)
INSN(fldx_d, frr)
INSN(fstx_s, frr)
INSN(fstx_d, frr)
+INSN(llacq_w, rr)
+INSN(screl_w, rr)
+INSN(llacq_d, rr)
+INSN(screl_d, rr)
INSN(amcas_b, rrr)
INSN(amcas_h, rrr)
INSN(amcas_w, rrr)
diff --git a/target/loongarch/insn_trans/trans_atomic.c.inc b/target/loongarch/insn_trans/trans_atomic.c.inc
index bea567fdaf..0c81fbd745 100644
--- a/target/loongarch/insn_trans/trans_atomic.c.inc
+++ b/target/loongarch/insn_trans/trans_atomic.c.inc
@@ -17,6 +17,14 @@ static bool gen_ll(DisasContext *ctx, arg_rr_i *a, MemOp mop)
return true;
}
+static bool gen_llacq(DisasContext *ctx, arg_rr *a, MemOp mop)
+{
+ arg_rr_i tmp_a = {
+ .rd = a->rd, .rj = a->rj, .imm = 0
+ };
+ return gen_ll(ctx, &tmp_a, mop);
+}
+
static bool gen_sc(DisasContext *ctx, arg_rr_i *a, MemOp mop)
{
TCGv dest = gpr_dst(ctx, a->rd, EXT_NONE);
@@ -45,6 +53,14 @@ static bool gen_sc(DisasContext *ctx, arg_rr_i *a, MemOp mop)
return true;
}
+static bool gen_screl(DisasContext *ctx, arg_rr *a, MemOp mop)
+{
+ arg_rr_i tmp_a = {
+ .rd = a->rd, .rj = a->rj, .imm = 0
+ };
+ return gen_sc(ctx, &tmp_a, mop);
+}
+
static bool gen_cas(DisasContext *ctx, arg_rrr *a,
void (*func)(TCGv, TCGv, TCGv, TCGv, TCGArg, MemOp),
MemOp mop)
@@ -89,6 +105,10 @@ TRANS(ll_w, ALL, gen_ll, MO_TESL)
TRANS(sc_w, ALL, gen_sc, MO_TESL)
TRANS(ll_d, 64, gen_ll, MO_TEUQ)
TRANS(sc_d, 64, gen_sc, MO_TEUQ)
+TRANS(llacq_w, LLACQ_SCREL, gen_llacq, MO_TESL)
+TRANS(screl_w, LLACQ_SCREL, gen_screl, MO_TESL)
+TRANS(llacq_d, LLACQ_SCREL_64, gen_llacq, MO_TEUQ)
+TRANS(screl_d, LLACQ_SCREL_64, gen_screl, MO_TEUQ)
TRANS(amcas_b, LAMCAS, gen_cas, tcg_gen_atomic_cmpxchg_tl, MO_TESB)
TRANS(amcas_h, LAMCAS, gen_cas, tcg_gen_atomic_cmpxchg_tl, MO_TESW)
TRANS(amcas_w, LAMCAS, gen_cas, tcg_gen_atomic_cmpxchg_tl, MO_TESL)
diff --git a/target/loongarch/insns.decode b/target/loongarch/insns.decode
index 92078f0f9f..e056d492d3 100644
--- a/target/loongarch/insns.decode
+++ b/target/loongarch/insns.decode
@@ -261,6 +261,10 @@ ll_w 0010 0000 .............. ..... ..... @rr_i14s2
sc_w 0010 0001 .............. ..... ..... @rr_i14s2
ll_d 0010 0010 .............. ..... ..... @rr_i14s2
sc_d 0010 0011 .............. ..... ..... @rr_i14s2
+llacq_w 0011 10000101 01111 00000 ..... ..... @rr
+screl_w 0011 10000101 01111 00001 ..... ..... @rr
+llacq_d 0011 10000101 01111 00010 ..... ..... @rr
+screl_d 0011 10000101 01111 00011 ..... ..... @rr
amcas_b 0011 10000101 10000 ..... ..... ..... @rrr
amcas_h 0011 10000101 10001 ..... ..... ..... @rrr
amcas_w 0011 10000101 10010 ..... ..... ..... @rrr
diff --git a/target/loongarch/translate.h b/target/loongarch/translate.h
index 651c5796ca..3d13d40ca6 100644
--- a/target/loongarch/translate.h
+++ b/target/loongarch/translate.h
@@ -34,6 +34,9 @@
#define avail_FRECIPE_LSX(C) (avail_FRECIPE(C) && avail_LSX(C))
#define avail_FRECIPE_LASX(C) (avail_FRECIPE(C) && avail_LASX(C))
+#define avail_LLACQ_SCREL(C) (FIELD_EX32((C)->cpucfg2, CPUCFG2, LLACQ_SCREL))
+#define avail_LLACQ_SCREL_64(C) (avail_64(C) && avail_LLACQ_SCREL(C))
+
/*
* If an operation is being performed on less than TARGET_LONG_BITS,
* it may require the inputs to be sign- or zero-extended; which will
--
2.42.0
* Re: [PATCH 3/5] target/loongarch: Add amcas[_db].{b/h/w/d}
2023-10-23 15:29 ` [PATCH 3/5] target/loongarch: Add amcas[_db].{b/h/w/d} Jiajie Chen
@ 2023-10-23 15:35 ` Jiajie Chen
2023-10-23 23:00 ` Richard Henderson
2023-10-23 22:59 ` Richard Henderson
1 sibling, 1 reply; 28+ messages in thread
From: Jiajie Chen @ 2023-10-23 15:35 UTC (permalink / raw)
To: qemu-devel; +Cc: richard.henderson, gaosong, git
On 2023/10/23 23:29, Jiajie Chen wrote:
> The new instructions are introduced in LoongArch v1.1:
>
> - amcas.b
> - amcas.h
> - amcas.w
> - amcas.d
> - amcas_db.b
> - amcas_db.h
> - amcas_db.w
> - amcas_db.d
>
> The new instructions are gated by CPUCFG2.LAMCAS.
>
> Signed-off-by: Jiajie Chen <c@jia.je>
> ---
> target/loongarch/cpu.h | 1 +
> target/loongarch/disas.c | 8 +++++++
> .../loongarch/insn_trans/trans_atomic.c.inc | 24 +++++++++++++++++++
> target/loongarch/insns.decode | 8 +++++++
> target/loongarch/translate.h | 1 +
> 5 files changed, 42 insertions(+)
>
> diff --git a/target/loongarch/cpu.h b/target/loongarch/cpu.h
> index 7166c07756..80a476c3f8 100644
> --- a/target/loongarch/cpu.h
> +++ b/target/loongarch/cpu.h
> @@ -156,6 +156,7 @@ FIELD(CPUCFG2, LBT_MIPS, 20, 1)
> FIELD(CPUCFG2, LSPW, 21, 1)
> FIELD(CPUCFG2, LAM, 22, 1)
> FIELD(CPUCFG2, LAM_BH, 27, 1)
> +FIELD(CPUCFG2, LAMCAS, 28, 1)
>
> /* cpucfg[3] bits */
> FIELD(CPUCFG3, CCDMA, 0, 1)
> diff --git a/target/loongarch/disas.c b/target/loongarch/disas.c
> index d33aa8173a..4aa67749cf 100644
> --- a/target/loongarch/disas.c
> +++ b/target/loongarch/disas.c
> @@ -575,6 +575,14 @@ INSN(fldx_s, frr)
> INSN(fldx_d, frr)
> INSN(fstx_s, frr)
> INSN(fstx_d, frr)
> +INSN(amcas_b, rrr)
> +INSN(amcas_h, rrr)
> +INSN(amcas_w, rrr)
> +INSN(amcas_d, rrr)
> +INSN(amcas_db_b, rrr)
> +INSN(amcas_db_h, rrr)
> +INSN(amcas_db_w, rrr)
> +INSN(amcas_db_d, rrr)
> INSN(amswap_b, rrr)
> INSN(amswap_h, rrr)
> INSN(amadd_b, rrr)
> diff --git a/target/loongarch/insn_trans/trans_atomic.c.inc b/target/loongarch/insn_trans/trans_atomic.c.inc
> index cd28e217ad..bea567fdaf 100644
> --- a/target/loongarch/insn_trans/trans_atomic.c.inc
> +++ b/target/loongarch/insn_trans/trans_atomic.c.inc
> @@ -45,6 +45,22 @@ static bool gen_sc(DisasContext *ctx, arg_rr_i *a, MemOp mop)
> return true;
> }
>
> +static bool gen_cas(DisasContext *ctx, arg_rrr *a,
> + void (*func)(TCGv, TCGv, TCGv, TCGv, TCGArg, MemOp),
> + MemOp mop)
> +{
> + TCGv dest = gpr_dst(ctx, a->rd, EXT_NONE);
> + TCGv addr = gpr_src(ctx, a->rj, EXT_NONE);
> + TCGv val = gpr_src(ctx, a->rk, EXT_NONE);
> +
> + addr = make_address_i(ctx, addr, 0);
> +
I'm unsure whether I can use the same TCGv for the first and the third
arguments here. If it violates the assumption, a temporary register
can be used.
> + func(dest, addr, dest, val, ctx->mem_idx, mop);
> + gen_set_gpr(a->rd, dest, EXT_NONE);
> +
> + return true;
> +}
> +
> static bool gen_am(DisasContext *ctx, arg_rrr *a,
> void (*func)(TCGv, TCGv, TCGv, TCGArg, MemOp),
> MemOp mop)
> @@ -73,6 +89,14 @@ TRANS(ll_w, ALL, gen_ll, MO_TESL)
> TRANS(sc_w, ALL, gen_sc, MO_TESL)
> TRANS(ll_d, 64, gen_ll, MO_TEUQ)
> TRANS(sc_d, 64, gen_sc, MO_TEUQ)
> +TRANS(amcas_b, LAMCAS, gen_cas, tcg_gen_atomic_cmpxchg_tl, MO_TESB)
> +TRANS(amcas_h, LAMCAS, gen_cas, tcg_gen_atomic_cmpxchg_tl, MO_TESW)
> +TRANS(amcas_w, LAMCAS, gen_cas, tcg_gen_atomic_cmpxchg_tl, MO_TESL)
> +TRANS(amcas_d, LAMCAS, gen_cas, tcg_gen_atomic_cmpxchg_tl, MO_TEUQ)
> +TRANS(amcas_db_b, LAMCAS, gen_cas, tcg_gen_atomic_cmpxchg_tl, MO_TESB)
> +TRANS(amcas_db_h, LAMCAS, gen_cas, tcg_gen_atomic_cmpxchg_tl, MO_TESW)
> +TRANS(amcas_db_w, LAMCAS, gen_cas, tcg_gen_atomic_cmpxchg_tl, MO_TESL)
> +TRANS(amcas_db_d, LAMCAS, gen_cas, tcg_gen_atomic_cmpxchg_tl, MO_TEUQ)
> TRANS(amswap_b, LAM_BH, gen_am, tcg_gen_atomic_xchg_tl, MO_TESB)
> TRANS(amswap_h, LAM_BH, gen_am, tcg_gen_atomic_xchg_tl, MO_TESW)
> TRANS(amadd_b, LAM_BH, gen_am, tcg_gen_atomic_fetch_add_tl, MO_TESB)
> diff --git a/target/loongarch/insns.decode b/target/loongarch/insns.decode
> index 678ce42038..cf4123cd46 100644
> --- a/target/loongarch/insns.decode
> +++ b/target/loongarch/insns.decode
> @@ -261,6 +261,14 @@ ll_w 0010 0000 .............. ..... ..... @rr_i14s2
> sc_w 0010 0001 .............. ..... ..... @rr_i14s2
> ll_d 0010 0010 .............. ..... ..... @rr_i14s2
> sc_d 0010 0011 .............. ..... ..... @rr_i14s2
> +amcas_b 0011 10000101 10000 ..... ..... ..... @rrr
> +amcas_h 0011 10000101 10001 ..... ..... ..... @rrr
> +amcas_w 0011 10000101 10010 ..... ..... ..... @rrr
> +amcas_d 0011 10000101 10011 ..... ..... ..... @rrr
> +amcas_db_b 0011 10000101 10100 ..... ..... ..... @rrr
> +amcas_db_h 0011 10000101 10101 ..... ..... ..... @rrr
> +amcas_db_w 0011 10000101 10110 ..... ..... ..... @rrr
> +amcas_db_d 0011 10000101 10111 ..... ..... ..... @rrr
> amswap_b 0011 10000101 11000 ..... ..... ..... @rrr
> amswap_h 0011 10000101 11001 ..... ..... ..... @rrr
> amadd_b 0011 10000101 11010 ..... ..... ..... @rrr
> diff --git a/target/loongarch/translate.h b/target/loongarch/translate.h
> index 0b230530e7..3affefdafc 100644
> --- a/target/loongarch/translate.h
> +++ b/target/loongarch/translate.h
> @@ -23,6 +23,7 @@
> #define avail_LSPW(C) (FIELD_EX32((C)->cpucfg2, CPUCFG2, LSPW))
> #define avail_LAM(C) (FIELD_EX32((C)->cpucfg2, CPUCFG2, LAM))
> #define avail_LAM_BH(C) (FIELD_EX32((C)->cpucfg2, CPUCFG2, LAM_BH))
> +#define avail_LAMCAS(C) (FIELD_EX32((C)->cpucfg2, CPUCFG2, LAMCAS))
> #define avail_LSX(C) (FIELD_EX32((C)->cpucfg2, CPUCFG2, LSX))
> #define avail_LASX(C) (FIELD_EX32((C)->cpucfg2, CPUCFG2, LASX))
> #define avail_IOCSR(C) (FIELD_EX32((C)->cpucfg1, CPUCFG1, IOCSR))
^ permalink raw reply [flat|nested] 28+ messages in thread
* Re: [PATCH 1/5] include/exec/memop.h: Add MO_TESB
2023-10-23 15:29 ` [PATCH 1/5] include/exec/memop.h: Add MO_TESB Jiajie Chen
@ 2023-10-23 15:49 ` David Hildenbrand
2023-10-23 15:52 ` Jiajie Chen
0 siblings, 1 reply; 28+ messages in thread
From: David Hildenbrand @ 2023-10-23 15:49 UTC (permalink / raw)
To: Jiajie Chen, qemu-devel
Cc: richard.henderson, gaosong, git, Paolo Bonzini, Peter Xu,
Philippe Mathieu-Daudé
Why?
On 23.10.23 17:29, Jiajie Chen wrote:
> Signed-off-by: Jiajie Chen <c@jia.je>
> ---
> include/exec/memop.h | 1 +
> 1 file changed, 1 insertion(+)
>
> diff --git a/include/exec/memop.h b/include/exec/memop.h
> index a86dc6743a..834327c62d 100644
> --- a/include/exec/memop.h
> +++ b/include/exec/memop.h
> @@ -140,6 +140,7 @@ typedef enum MemOp {
> MO_TEUL = MO_TE | MO_UL,
> MO_TEUQ = MO_TE | MO_UQ,
> MO_TEUO = MO_TE | MO_UO,
> + MO_TESB = MO_TE | MO_SB,
> MO_TESW = MO_TE | MO_SW,
> MO_TESL = MO_TE | MO_SL,
> MO_TESQ = MO_TE | MO_SQ,
I recall that the reason for not having this is that the target
endianness doesn't matter for single bytes.
--
Cheers,
David / dhildenb
* Re: [PATCH 1/5] include/exec/memop.h: Add MO_TESB
2023-10-23 15:49 ` David Hildenbrand
@ 2023-10-23 15:52 ` Jiajie Chen
0 siblings, 0 replies; 28+ messages in thread
From: Jiajie Chen @ 2023-10-23 15:52 UTC (permalink / raw)
To: David Hildenbrand, qemu-devel
Cc: richard.henderson, gaosong, git, Paolo Bonzini, Peter Xu,
Philippe Mathieu-Daudé
On 2023/10/23 23:49, David Hildenbrand wrote:
>
> Why?
>
> On 23.10.23 17:29, Jiajie Chen wrote:
>> Signed-off-by: Jiajie Chen <c@jia.je>
>> ---
>> include/exec/memop.h | 1 +
>> 1 file changed, 1 insertion(+)
>>
>> diff --git a/include/exec/memop.h b/include/exec/memop.h
>> index a86dc6743a..834327c62d 100644
>> --- a/include/exec/memop.h
>> +++ b/include/exec/memop.h
>> @@ -140,6 +140,7 @@ typedef enum MemOp {
>> MO_TEUL = MO_TE | MO_UL,
>> MO_TEUQ = MO_TE | MO_UQ,
>> MO_TEUO = MO_TE | MO_UO,
>> + MO_TESB = MO_TE | MO_SB,
>> MO_TESW = MO_TE | MO_SW,
>> MO_TESL = MO_TE | MO_SL,
>> MO_TESQ = MO_TE | MO_SQ,
>
>
>
> I recall that the reason for not having this is that the target
> endianness doesn't matter for single bytes.
Thanks, you are right, I was copying some code using MO_TESW only to
find that MO_TESB is missing... I should simply use MO_SB then.
* Re: [PATCH 2/5] target/loongarch: Add am{swap/add}[_db].{b/h}
2023-10-23 15:29 ` [PATCH 2/5] target/loongarch: Add am{swap/add}[_db].{b/h} Jiajie Chen
@ 2023-10-23 22:50 ` Richard Henderson
0 siblings, 0 replies; 28+ messages in thread
From: Richard Henderson @ 2023-10-23 22:50 UTC (permalink / raw)
To: Jiajie Chen, qemu-devel; +Cc: gaosong, git
On 10/23/23 08:29, Jiajie Chen wrote:
> The new instructions are introduced in LoongArch v1.1:
>
> - amswap.b
> - amswap.h
> - amadd.b
> - amadd.h
> - amswap_db.b
> - amswap_db.h
> - amadd_db.b
> - amadd_db.h
>
> The instructions are gated by CPUCFG2.LAM_BH.
>
> Signed-off-by: Jiajie Chen <c@jia.je>
Except for the use of MO_TESB,
Reviewed-by: Richard Henderson <richard.henderson@linaro.org>
r~
* Re: [PATCH 3/5] target/loongarch: Add amcas[_db].{b/h/w/d}
2023-10-23 15:29 ` [PATCH 3/5] target/loongarch: Add amcas[_db].{b/h/w/d} Jiajie Chen
2023-10-23 15:35 ` Jiajie Chen
@ 2023-10-23 22:59 ` Richard Henderson
1 sibling, 0 replies; 28+ messages in thread
From: Richard Henderson @ 2023-10-23 22:59 UTC (permalink / raw)
To: Jiajie Chen, qemu-devel; +Cc: gaosong, git
On 10/23/23 08:29, Jiajie Chen wrote:
> +static bool gen_cas(DisasContext *ctx, arg_rrr *a,
> + void (*func)(TCGv, TCGv, TCGv, TCGv, TCGArg, MemOp),
> + MemOp mop)
> +{
> + TCGv dest = gpr_dst(ctx, a->rd, EXT_NONE);
> + TCGv addr = gpr_src(ctx, a->rj, EXT_NONE);
> + TCGv val = gpr_src(ctx, a->rk, EXT_NONE);
> +
> + addr = make_address_i(ctx, addr, 0);
> +
> + func(dest, addr, dest, val, ctx->mem_idx, mop);
You need
TCGv old = gpr_src(ctx, a->rd, EXT_NONE);
func(dest, addr, old, val, ...);
as otherwise rd=0 will abort.
Correct emulation requires that you perform the memory operation, and then discard the
result. But you must provide the (initialized) source of zero for that case.
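The write-back rule can be sketched with a toy register-file model (illustrative names, not the QEMU API): register 0 reads as zero and discards writes, but the compare-and-swap on memory must still happen.

```c
#include <assert.h>
#include <stdint.h>

/* Toy model of the amcas semantics under review: rd supplies the
 * compare value and receives the old memory value; rk supplies the
 * swap value.  Register 0 reads as zero and discards writes, but the
 * memory operation must still be performed in that case. */
static uint64_t gpr[32];

static uint64_t read_gpr(int r) { return r ? gpr[r] : 0; }
static void write_gpr(int r, uint64_t v) { if (r) gpr[r] = v; }

static void amcas_d(int rd, int rk, uint64_t *addr)
{
    uint64_t cmp = read_gpr(rd);   /* must come from the source, */
    uint64_t new = read_gpr(rk);   /* not the (uninitialized) dest */
    uint64_t old = *addr;

    if (old == cmp) {
        *addr = new;               /* the cmpxchg proper */
    }
    write_gpr(rd, old);            /* discarded when rd == 0 */
}
```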
Do any or all of the AM, LL, SC instructions require aligned memory?
I suspect that they do.
I think probably gen_ll, gen_sc, gen_am, and now gen_cas are missing "mop | MO_ALIGN"
applied to the memory operation(s).
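As a toy illustration of what MO_ALIGN enforces (just the arithmetic, not the QEMU plumbing):

```c
#include <assert.h>
#include <stdint.h>

/* An access of `size` bytes (a power of two) is aligned when the
 * address is a multiple of that size; MO_ALIGN makes the memory
 * subsystem raise an alignment fault otherwise, instead of letting
 * the atomic operation proceed on a misaligned address. */
static int aligned_ok(uint64_t addr, unsigned size)
{
    return (addr & (uint64_t)(size - 1)) == 0;
}
```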
r~
* Re: [PATCH 3/5] target/loongarch: Add amcas[_db].{b/h/w/d}
2023-10-23 15:35 ` Jiajie Chen
@ 2023-10-23 23:00 ` Richard Henderson
0 siblings, 0 replies; 28+ messages in thread
From: Richard Henderson @ 2023-10-23 23:00 UTC (permalink / raw)
To: Jiajie Chen, qemu-devel; +Cc: gaosong, git
On 10/23/23 08:35, Jiajie Chen wrote:
>> +static bool gen_cas(DisasContext *ctx, arg_rrr *a,
>> + void (*func)(TCGv, TCGv, TCGv, TCGv, TCGArg, MemOp),
>> + MemOp mop)
>> +{
>> + TCGv dest = gpr_dst(ctx, a->rd, EXT_NONE);
>> + TCGv addr = gpr_src(ctx, a->rj, EXT_NONE);
>> + TCGv val = gpr_src(ctx, a->rk, EXT_NONE);
>> +
>> + addr = make_address_i(ctx, addr, 0);
>> +
>
> I'm unsure if I can use the same TCGv for the first and the third argument here. If it
> violates with the assumption, a temporary register can be used.
>
>> + func(dest, addr, dest, val, ctx->mem_idx, mop);
Correct, you cannot use dest in both places.
I just replied to the patch itself with that. :-)
r~
* Re: [PATCH 4/5] target/loongarch: Add estimated reciprocal instructions
2023-10-23 15:29 ` [PATCH 4/5] target/loongarch: Add estimated reciprocal instructions Jiajie Chen
@ 2023-10-23 23:02 ` Richard Henderson
0 siblings, 0 replies; 28+ messages in thread
From: Richard Henderson @ 2023-10-23 23:02 UTC (permalink / raw)
To: Jiajie Chen, qemu-devel; +Cc: gaosong, git
On 10/23/23 08:29, Jiajie Chen wrote:
> Add the following new instructions in LoongArch v1.1:
>
> - frecipe.s
> - frecipe.d
> - frsqrte.s
> - frsqrte.d
> - vfrecipe.s
> - vfrecipe.d
> - vfrsqrte.s
> - vfrsqrte.d
> - xvfrecipe.s
> - xvfrecipe.d
> - xvfrsqrte.s
> - xvfrsqrte.d
>
> They are guarded by CPUCFG2.FRECIPE. Although the instructions allow
> implementations to improve performance by reducing precision, we use the
> existing softfloat implementation.
>
> Signed-off-by: Jiajie Chen <c@jia.je>
> ---
> target/loongarch/cpu.h | 1 +
> target/loongarch/disas.c | 12 ++++++++++++
> target/loongarch/insn_trans/trans_farith.c.inc | 4 ++++
> target/loongarch/insn_trans/trans_vec.c.inc | 8 ++++++++
> target/loongarch/insns.decode | 12 ++++++++++++
> target/loongarch/translate.h | 6 ++++++
> 6 files changed, 43 insertions(+)
Acked-by: Richard Henderson <richard.henderson@linaro.org>
r~
* Re: [PATCH 5/5] target/loongarch: Add llacq/screl instructions
2023-10-23 15:29 ` [PATCH 5/5] target/loongarch: Add llacq/screl instructions Jiajie Chen
@ 2023-10-23 23:19 ` Richard Henderson
0 siblings, 0 replies; 28+ messages in thread
From: Richard Henderson @ 2023-10-23 23:19 UTC (permalink / raw)
To: Jiajie Chen, qemu-devel; +Cc: gaosong, git
On 10/23/23 08:29, Jiajie Chen wrote:
> --- a/target/loongarch/insn_trans/trans_atomic.c.inc
> +++ b/target/loongarch/insn_trans/trans_atomic.c.inc
> @@ -17,6 +17,14 @@ static bool gen_ll(DisasContext *ctx, arg_rr_i *a, MemOp mop)
> return true;
> }
>
> +static bool gen_llacq(DisasContext *ctx, arg_rr *a, MemOp mop)
> +{
> + arg_rr_i tmp_a = {
> + .rd = a->rd, .rj = a->rj, .imm = 0
> + };
> + return gen_ll(ctx, &tmp_a, mop);
> +}
> +
> static bool gen_sc(DisasContext *ctx, arg_rr_i *a, MemOp mop)
> {
> TCGv dest = gpr_dst(ctx, a->rd, EXT_NONE);
> @@ -45,6 +53,14 @@ static bool gen_sc(DisasContext *ctx, arg_rr_i *a, MemOp mop)
> return true;
> }
>
> +static bool gen_screl(DisasContext *ctx, arg_rr *a, MemOp mop)
> +{
> + arg_rr_i tmp_a = {
> + .rd = a->rd, .rj = a->rj, .imm = 0
> + };
> + return gen_sc(ctx, &tmp_a, mop);
> +}
This is incorrect. You need to add the required memory barriers.
Should be like
- static bool gen_ll(DisasContext *ctx, arg_rr_i *a, MemOp mop)
+ static bool gen_ll(DisasContext *ctx, arg_rr_i *a, MemOp mop, bool acq)
{
...
+ if (acq) {
+ tcg_gen_mb(TCG_MO_ALL | TCG_BAR_LDAQ);
+ }
return true;
}
- static bool gen_sc(DisasContext *ctx, arg_rr_i *a, MemOp mop)
+ static bool gen_sc(DisasContext *ctx, arg_rr_i *a, MemOp mop, bool rel)
{
...
+ if (rel) {
+ tcg_gen_mb(TCG_MO_ALL | TCG_BAR_STRL);
+ }
tcg_gen_brcond_tl(TCG_COND_EQ, t0, cpu_lladdr, l1);
...
}
TRANS(ll_w, ALL, gen_ll, MO_TESL, false)
TRANS(sc_w, ALL, gen_sc, MO_TESL, false)
TRANS(ll_d, 64, gen_ll, MO_TEUQ, false)
TRANS(sc_d, 64, gen_sc, MO_TEUQ, false)
TRANS(llacq_w, LLACQ_SCREL, gen_ll, MO_TESL, true)
TRANS(screl_w, LLACQ_SCREL, gen_sc, MO_TESL, true)
TRANS(llacq_d, LLACQ_SCREL_64, gen_ll, MO_TEUQ, true)
TRANS(screl_d, LLACQ_SCREL_64, gen_sc, MO_TEUQ, true)
You should decode into a common argument format, rather than doing it by hand.
@rr_i0 .... ........ ..... ..... rj:5 rd:5 &rr_i imm=0
llacq_w 0011 10000101 01111 00000 ..... ..... @rr_i0
screl_w 0011 10000101 01111 00001 ..... ..... @rr_i0
llacq_d 0011 10000101 01111 00010 ..... ..... @rr_i0
screl_d 0011 10000101 01111 00011 ..... ..... @rr_i0
r~
* Re: [PATCH 0/5] Add LoongArch v1.1 instructions
2023-10-23 15:29 [PATCH 0/5] Add LoongArch v1.1 instructions Jiajie Chen
` (4 preceding siblings ...)
2023-10-23 15:29 ` [PATCH 5/5] target/loongarch: Add llacq/screl instructions Jiajie Chen
@ 2023-10-23 23:26 ` Richard Henderson
2023-10-24 6:10 ` Jiajie Chen
5 siblings, 1 reply; 28+ messages in thread
From: Richard Henderson @ 2023-10-23 23:26 UTC (permalink / raw)
To: Jiajie Chen, qemu-devel; +Cc: gaosong, git
On 10/23/23 08:29, Jiajie Chen wrote:
> This patch series implements the new instructions except sc.q, because I do not know how
> to match a pair of ll.d to sc.q.
There are a couple of examples within the tree.
See target/arm/tcg/translate-a64.c, gen_store_exclusive, TCGv_i128 block.
See target/ppc/translate.c, gen_stqcx_.
r~
* Re: [PATCH 0/5] Add LoongArch v1.1 instructions
2023-10-23 23:26 ` [PATCH 0/5] Add LoongArch v1.1 instructions Richard Henderson
@ 2023-10-24 6:10 ` Jiajie Chen
2023-10-25 17:13 ` Jiajie Chen
0 siblings, 1 reply; 28+ messages in thread
From: Jiajie Chen @ 2023-10-24 6:10 UTC (permalink / raw)
To: Richard Henderson, qemu-devel; +Cc: gaosong, git
On 2023/10/24 07:26, Richard Henderson wrote:
> On 10/23/23 08:29, Jiajie Chen wrote:
>> This patch series implements the new instructions except sc.q,
>> because I do not know how to match a pair of ll.d to sc.q.
>
> There are a couple of examples within the tree.
>
> See target/arm/tcg/translate-a64.c, gen_store_exclusive, TCGv_i128 block.
> See target/ppc/translate.c, gen_stqcx_.
The situation here is slightly different: aarch64 and ppc64 have both
128-bit ll and sc, however LoongArch v1.1 only has 64-bit ll and 128-bit
sc. I guess the intended usage of sc.q is:
ll.d lo, base, 0
ll.d hi, base, 4
# do some computation
sc.q lo, hi, base
# try again if sc failed
>
>
> r~
* Re: [PATCH 0/5] Add LoongArch v1.1 instructions
2023-10-24 6:10 ` Jiajie Chen
@ 2023-10-25 17:13 ` Jiajie Chen
2023-10-25 19:04 ` Richard Henderson
0 siblings, 1 reply; 28+ messages in thread
From: Jiajie Chen @ 2023-10-25 17:13 UTC (permalink / raw)
To: Richard Henderson, qemu-devel; +Cc: gaosong, git
On 2023/10/24 14:10, Jiajie Chen wrote:
>
> On 2023/10/24 07:26, Richard Henderson wrote:
>> On 10/23/23 08:29, Jiajie Chen wrote:
>>> This patch series implements the new instructions except sc.q,
>>> because I do not know how to match a pair of ll.d to sc.q.
>>
>> There are a couple of examples within the tree.
>>
>> See target/arm/tcg/translate-a64.c, gen_store_exclusive, TCGv_i128
>> block.
>> See target/ppc/translate.c, gen_stqcx_.
>
>
> The situation here is slightly different: aarch64 and ppc64 have both
> 128-bit ll and sc, however LoongArch v1.1 only has 64-bit ll and
> 128-bit sc. I guess the intended usage of sc.q is:
>
>
> ll.d lo, base, 0
>
> ll.d hi, base, 4
>
> # do some computation
>
> sc.q lo, hi, base
>
> # try again if sc failed
Possibly use the combination of ll.d and ld.d:
ll.d lo, base, 0
ld.d hi, base, 4
# do some computation
sc.q lo, hi, base
# try again if sc failed
Then a possible implementation of gen_ll() would be: align base to
128-bit boundary, read 128-bit from memory, save 64-bit part to rd and
record whole 128-bit data in llval. Then, in gen_sc_q(), it uses a
128-bit cmpxchg.
But what about the reversed instruction pattern: ll.d hi, base, 4; ld.d
lo, base, 0?
Since there is no existing code utilizing the new sc.q instruction, I
don't know what we should consider here.
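The aligned-128-bit idea can be sketched in plain C (a model, not QEMU code; little-endian layout assumed, with byte-address bit 3 selecting the 64-bit half of the aligned pair):

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Model: ll.d aligns the address down to 16 bytes, records the whole
 * 128-bit quantity as the link value, and returns whichever half the
 * original address named.  Either ordering of the pair of loads then
 * sees a consistent snapshot. */
static uint64_t llval[2];   /* recorded {lo, hi} */
static uint64_t lladdr;     /* recorded aligned address */

static uint64_t ll_d(const uint8_t *mem, uint64_t addr)
{
    uint64_t aligned = addr & ~(uint64_t)15;

    memcpy(llval, mem + aligned, 16);
    lladdr = aligned;
    return llval[(addr >> 3) & 1];
}
```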
>
>
>
>>
>>
>> r~
* Re: [PATCH 0/5] Add LoongArch v1.1 instructions
2023-10-25 17:13 ` Jiajie Chen
@ 2023-10-25 19:04 ` Richard Henderson
2023-10-26 1:38 ` Jiajie Chen
0 siblings, 1 reply; 28+ messages in thread
From: Richard Henderson @ 2023-10-25 19:04 UTC (permalink / raw)
To: Jiajie Chen, qemu-devel; +Cc: gaosong, git
On 10/25/23 10:13, Jiajie Chen wrote:
>> On 2023/10/24 07:26, Richard Henderson wrote:
>>> See target/arm/tcg/translate-a64.c, gen_store_exclusive, TCGv_i128 block.
>>> See target/ppc/translate.c, gen_stqcx_.
>>
>> The situation here is slightly different: aarch64 and ppc64 have both 128-bit ll and sc,
>> however LoongArch v1.1 only has 64-bit ll and 128-bit sc.
Ah, that does complicate things.
> Possibly use the combination of ll.d and ld.d:
>
>
> ll.d lo, base, 0
> ld.d hi, base, 4
>
> # do some computation
>
> sc.q lo, hi, base
>
> # try again if sc failed
>
> Then a possible implementation of gen_ll() would be: align base to 128-bit boundary, read
> 128-bit from memory, save 64-bit part to rd and record whole 128-bit data in llval. Then,
> in gen_sc_q(), it uses a 128-bit cmpxchg.
>
>
> But what about the reversed instruction pattern: ll.d hi, base, 4; ld.d lo, base 0?
It would be worth asking your hardware engineers about the bounds of legal behaviour.
Ideally there would be some very explicit language, similar to
https://developer.arm.com/documentation/ddi0487/latest/
B2.9.5 Load-Exclusive and Store-Exclusive instruction usage restrictions
But you could do the same thing, aligning and recording the entire 128-bit quantity, then
extract the ll.d result based on address bit 6. This would complicate the implementation
of sc.d as well, but would perhaps bring us "close enough" to the actual architecture.
Note that our Arm store-exclusive implementation isn't quite in spec either. There is
quite a large comment within translate-a64.c store_exclusive() about the ways things are
not quite right. But it seems to be close enough for actual usage to succeed.
r~
* Re: [PATCH 0/5] Add LoongArch v1.1 instructions
2023-10-25 19:04 ` Richard Henderson
@ 2023-10-26 1:38 ` Jiajie Chen
2023-10-26 6:54 ` gaosong
0 siblings, 1 reply; 28+ messages in thread
From: Jiajie Chen @ 2023-10-26 1:38 UTC (permalink / raw)
To: Richard Henderson, qemu-devel; +Cc: gaosong, git
On 2023/10/26 03:04, Richard Henderson wrote:
> On 10/25/23 10:13, Jiajie Chen wrote:
>>> On 2023/10/24 07:26, Richard Henderson wrote:
>>>> See target/arm/tcg/translate-a64.c, gen_store_exclusive, TCGv_i128
>>>> block.
>>>> See target/ppc/translate.c, gen_stqcx_.
>>>
>>> The situation here is slightly different: aarch64 and ppc64 have
>>> both 128-bit ll and sc, however LoongArch v1.1 only has 64-bit ll
>>> and 128-bit sc.
>
> Ah, that does complicate things.
>
>> Possibly use the combination of ll.d and ld.d:
>>
>>
>> ll.d lo, base, 0
>> ld.d hi, base, 4
>>
>> # do some computation
>>
>> sc.q lo, hi, base
>>
>> # try again if sc failed
>>
>> Then a possible implementation of gen_ll() would be: align base to
>> 128-bit boundary, read 128-bit from memory, save 64-bit part to rd
>> and record whole 128-bit data in llval. Then, in gen_sc_q(), it uses
>> a 128-bit cmpxchg.
>>
>>
>> But what about the reversed instruction pattern: ll.d hi, base, 4;
>> ld.d lo, base 0?
>
> It would be worth asking your hardware engineers about the bounds of
> legal behaviour. Ideally there would be some very explicit language,
> similar to
I'm a community developer not affiliated with Loongson. Song Gao, could
you provide some detail from Loongson Inc.?
>
> https://developer.arm.com/documentation/ddi0487/latest/
> B2.9.5 Load-Exclusive and Store-Exclusive instruction usage restrictions
>
> But you could do the same thing, aligning and recording the entire
> 128-bit quantity, then extract the ll.d result based on address bit
> 6. This would complicate the implementation of sc.d as well, but
> would perhaps bring us "close enough" to the actual architecture.
>
> Note that our Arm store-exclusive implementation isn't quite in spec
> either. There is quite a large comment within translate-a64.c
> store_exclusive() about the ways things are not quite right. But it
> seems to be close enough for actual usage to succeed.
>
>
> r~
* Re: [PATCH 0/5] Add LoongArch v1.1 instructions
2023-10-26 1:38 ` Jiajie Chen
@ 2023-10-26 6:54 ` gaosong
2023-10-28 13:09 ` Jiajie Chen
0 siblings, 1 reply; 28+ messages in thread
From: gaosong @ 2023-10-26 6:54 UTC (permalink / raw)
To: Jiajie Chen, Richard Henderson, qemu-devel; +Cc: git, bibo mao
在 2023/10/26 上午9:38, Jiajie Chen 写道:
>
> On 2023/10/26 03:04, Richard Henderson wrote:
>> On 10/25/23 10:13, Jiajie Chen wrote:
>>>> On 2023/10/24 07:26, Richard Henderson wrote:
>>>>> See target/arm/tcg/translate-a64.c, gen_store_exclusive, TCGv_i128
>>>>> block.
>>>>> See target/ppc/translate.c, gen_stqcx_.
>>>>
>>>> The situation here is slightly different: aarch64 and ppc64 have
>>>> both 128-bit ll and sc, however LoongArch v1.1 only has 64-bit ll
>>>> and 128-bit sc.
>>
>> Ah, that does complicate things.
>>
>>> Possibly use the combination of ll.d and ld.d:
>>>
>>>
>>> ll.d lo, base, 0
>>> ld.d hi, base, 4
>>>
>>> # do some computation
>>>
>>> sc.q lo, hi, base
>>>
>>> # try again if sc failed
>>>
>>> Then a possible implementation of gen_ll() would be: align base to
>>> 128-bit boundary, read 128-bit from memory, save 64-bit part to rd
>>> and record whole 128-bit data in llval. Then, in gen_sc_q(), it uses
>>> a 128-bit cmpxchg.
>>>
>>>
>>> But what about the reversed instruction pattern: ll.d hi, base, 4;
>>> ld.d lo, base 0?
>>
>> It would be worth asking your hardware engineers about the bounds of
>> legal behaviour. Ideally there would be some very explicit language,
>> similar to
>
>
> I'm a community developer not affiliated with Loongson. Song Gao,
> could you provide some detail from Loongson Inc.?
>
>
ll.d r1, base, 0
dbar 0x700 ==> see 2.2.8.1
ld.d r2, base, 8
...
sc.q r1, r2, base
For this series,
I think we need to set the new config bits for the 'max' cpu, and change
linux-user/target_elf.h 'any' to 'max', so that we can use these new
instructions in linux-user mode.
Thanks
Song Gao
>>
>> https://developer.arm.com/documentation/ddi0487/latest/
>> B2.9.5 Load-Exclusive and Store-Exclusive instruction usage restrictions
>>
>> But you could do the same thing, aligning and recording the entire
>> 128-bit quantity, then extract the ll.d result based on address bit
>> 6. This would complicate the implementation of sc.d as well, but
>> would perhaps bring us "close enough" to the actual architecture.
>>
>> Note that our Arm store-exclusive implementation isn't quite in spec
>> either. There is quite a large comment within translate-a64.c
>> store_exclusive() about the ways things are not quite right. But it
>> seems to be close enough for actual usage to succeed.
>>
>>
>> r~
* Re: [PATCH 0/5] Add LoongArch v1.1 instructions
2023-10-26 6:54 ` gaosong
@ 2023-10-28 13:09 ` Jiajie Chen
2023-10-30 8:23 ` gaosong
0 siblings, 1 reply; 28+ messages in thread
From: Jiajie Chen @ 2023-10-28 13:09 UTC (permalink / raw)
To: gaosong, Richard Henderson, qemu-devel; +Cc: git, bibo mao
On 2023/10/26 14:54, gaosong wrote:
> 在 2023/10/26 上午9:38, Jiajie Chen 写道:
>>
>> On 2023/10/26 03:04, Richard Henderson wrote:
>>> On 10/25/23 10:13, Jiajie Chen wrote:
>>>>> On 2023/10/24 07:26, Richard Henderson wrote:
>>>>>> See target/arm/tcg/translate-a64.c, gen_store_exclusive,
>>>>>> TCGv_i128 block.
>>>>>> See target/ppc/translate.c, gen_stqcx_.
>>>>>
>>>>> The situation here is slightly different: aarch64 and ppc64 have
>>>>> both 128-bit ll and sc, however LoongArch v1.1 only has 64-bit ll
>>>>> and 128-bit sc.
>>>
>>> Ah, that does complicate things.
>>>
>>>> Possibly use the combination of ll.d and ld.d:
>>>>
>>>>
>>>> ll.d lo, base, 0
>>>> ld.d hi, base, 4
>>>>
>>>> # do some computation
>>>>
>>>> sc.q lo, hi, base
>>>>
>>>> # try again if sc failed
>>>>
>>>> Then a possible implementation of gen_ll() would be: align base to
>>>> 128-bit boundary, read 128-bit from memory, save 64-bit part to rd
>>>> and record whole 128-bit data in llval. Then, in gen_sc_q(), it
>>>> uses a 128-bit cmpxchg.
>>>>
>>>>
>>>> But what about the reversed instruction pattern: ll.d hi, base, 4;
>>>> ld.d lo, base 0?
>>>
>>> It would be worth asking your hardware engineers about the bounds of
>>> legal behaviour. Ideally there would be some very explicit language,
>>> similar to
>>
>>
>> I'm a community developer not affiliated with Loongson. Song Gao,
>> could you provide some detail from Loongson Inc.?
>>
>>
>
> ll.d r1, base, 0
> dbar 0x700 ==> see 2.2.8.1
> ld.d r2, base, 8
> ...
> sc.q r1, r2, base
Thanks! I think we may need to detect the ll.d-dbar-ld.d sequence and
translate the sequence into one tcg_gen_qemu_ld_i128 and split the
result into two 64-bit parts. Can we do this in QEMU?
>
>
> For this series,
> I think we need set the new config bits to the 'max cpu', and change
> linux-user/target_elf.h ''any' to 'max', so that we can use these new
> instructions on linux-user mode.
I will work on it.
>
> Thanks
> Song Gao
>>>
>>> https://developer.arm.com/documentation/ddi0487/latest/
>>> B2.9.5 Load-Exclusive and Store-Exclusive instruction usage
>>> restrictions
>>>
>>> But you could do the same thing, aligning and recording the entire
>>> 128-bit quantity, then extract the ll.d result based on address bit
>>> 6. This would complicate the implementation of sc.d as well, but
>>> would perhaps bring us "close enough" to the actual architecture.
>>>
>>> Note that our Arm store-exclusive implementation isn't quite in spec
>>> either. There is quite a large comment within translate-a64.c
>>> store_exclusive() about the ways things are not quite right. But it
>>> seems to be close enough for actual usage to succeed.
>>>
>>>
>>> r~
>
* Re: [PATCH 0/5] Add LoongArch v1.1 instructions
2023-10-28 13:09 ` Jiajie Chen
@ 2023-10-30 8:23 ` gaosong
2023-10-30 11:54 ` Jiajie Chen
0 siblings, 1 reply; 28+ messages in thread
From: gaosong @ 2023-10-30 8:23 UTC (permalink / raw)
To: Jiajie Chen, Richard Henderson, qemu-devel; +Cc: git, bibo mao
在 2023/10/28 下午9:09, Jiajie Chen 写道:
>
> On 2023/10/26 14:54, gaosong wrote:
>> 在 2023/10/26 上午9:38, Jiajie Chen 写道:
>>>
>>> On 2023/10/26 03:04, Richard Henderson wrote:
>>>> On 10/25/23 10:13, Jiajie Chen wrote:
>>>>>> On 2023/10/24 07:26, Richard Henderson wrote:
>>>>>>> See target/arm/tcg/translate-a64.c, gen_store_exclusive,
>>>>>>> TCGv_i128 block.
>>>>>>> See target/ppc/translate.c, gen_stqcx_.
>>>>>>
>>>>>> The situation here is slightly different: aarch64 and ppc64 have
>>>>>> both 128-bit ll and sc, however LoongArch v1.1 only has 64-bit ll
>>>>>> and 128-bit sc.
>>>>
>>>> Ah, that does complicate things.
>>>>
>>>>> Possibly use the combination of ll.d and ld.d:
>>>>>
>>>>>
>>>>> ll.d lo, base, 0
>>>>> ld.d hi, base, 4
>>>>>
>>>>> # do some computation
>>>>>
>>>>> sc.q lo, hi, base
>>>>>
>>>>> # try again if sc failed
>>>>>
>>>>> Then a possible implementation of gen_ll() would be: align base to
>>>>> 128-bit boundary, read 128-bit from memory, save 64-bit part to rd
>>>>> and record whole 128-bit data in llval. Then, in gen_sc_q(), it
>>>>> uses a 128-bit cmpxchg.
>>>>>
>>>>>
>>>>> But what about the reversed instruction pattern: ll.d hi, base, 4;
>>>>> ld.d lo, base 0?
>>>>
>>>> It would be worth asking your hardware engineers about the bounds
>>>> of legal behaviour. Ideally there would be some very explicit
>>>> language, similar to
>>>
>>>
>>> I'm a community developer not affiliated with Loongson. Song Gao,
>>> could you provide some detail from Loongson Inc.?
>>>
>>>
>>
>> ll.d r1, base, 0
>> dbar 0x700 ==> see 2.2.8.1
>> ld.d r2, base, 8
>> ...
>> sc.q r1, r2, base
>
>
> Thanks! I think we may need to detect the ll.d-dbar-ld.d sequence and
> translate the sequence into one tcg_gen_qemu_ld_i128 and split the
> result into two 64-bit parts. Can do this in QEMU?
>
>
Oh, I'm not sure.
I think we just need to implement sc.q. We don't need to care about
'll.d-dbar-ld.d'. It's just like 'll.q'.
It needs the user to ensure that.
'll.q' is:
1) ll.d r1 base, 0 ==> set LLbit, load the low 64 bits into r1
2) dbar 0x700
3) ld.d r2 base, 8 ==> load the high 64 bits to r2
sc.q needs to
1) Use 64-bit cmpxchg.
2) Write 128 bits to memory.
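The two points can be sketched as a single-threaded model (illustrative, not the QEMU implementation): compare only the 64-bit value recorded by ll.d, and on a match store both halves.

```c
#include <assert.h>
#include <stdint.h>

/* ll.d records the low 64-bit value; sc.q compares only that value
 * and, on a match, stores both 64-bit halves. */
static int       llbit;
static uint64_t  llval;
static uint64_t *lladdr;

static void ll_d(uint64_t *addr)
{
    llbit  = 1;
    llval  = *addr;
    lladdr = addr;
}

static int sc_q(uint64_t *addr, uint64_t lo, uint64_t hi)
{
    int ok = llbit && lladdr == addr && addr[0] == llval;

    if (ok) {               /* the 64-bit compare succeeded: */
        addr[0] = lo;       /* write all 128 bits */
        addr[1] = hi;
    }
    llbit = 0;              /* LLbit is consumed either way */
    return ok;
}
```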
Thanks.
Song Gao
>>
>>
>> For this series,
>> I think we need set the new config bits to the 'max cpu', and change
>> linux-user/target_elf.h ''any' to 'max', so that we can use these new
>> instructions on linux-user mode.
>
> I will work on it.
>
>
>>
>> Thanks
>> Song Gao
>>>>
>>>> https://developer.arm.com/documentation/ddi0487/latest/
>>>> B2.9.5 Load-Exclusive and Store-Exclusive instruction usage
>>>> restrictions
>>>>
>>>> But you could do the same thing, aligning and recording the entire
>>>> 128-bit quantity, then extract the ll.d result based on address bit
>>>> 6. This would complicate the implementation of sc.d as well, but
>>>> would perhaps bring us "close enough" to the actual architecture.
>>>>
>>>> Note that our Arm store-exclusive implementation isn't quite in
>>>> spec either. There is quite a large comment within translate-a64.c
>>>> store_exclusive() about the ways things are not quite right. But
>>>> it seems to be close enough for actual usage to succeed.
>>>>
>>>>
>>>> r~
>>
* Re: [PATCH 0/5] Add LoongArch v1.1 instructions
2023-10-30 8:23 ` gaosong
@ 2023-10-30 11:54 ` Jiajie Chen
2023-10-31 9:11 ` gaosong
0 siblings, 1 reply; 28+ messages in thread
From: Jiajie Chen @ 2023-10-30 11:54 UTC (permalink / raw)
To: gaosong, Richard Henderson, qemu-devel; +Cc: git, bibo mao
On 2023/10/30 16:23, gaosong wrote:
> 在 2023/10/28 下午9:09, Jiajie Chen 写道:
>>
>> On 2023/10/26 14:54, gaosong wrote:
>>> 在 2023/10/26 上午9:38, Jiajie Chen 写道:
>>>>
>>>> On 2023/10/26 03:04, Richard Henderson wrote:
>>>>> On 10/25/23 10:13, Jiajie Chen wrote:
>>>>>>> On 2023/10/24 07:26, Richard Henderson wrote:
>>>>>>>> See target/arm/tcg/translate-a64.c, gen_store_exclusive,
>>>>>>>> TCGv_i128 block.
>>>>>>>> See target/ppc/translate.c, gen_stqcx_.
>>>>>>>
>>>>>>> The situation here is slightly different: aarch64 and ppc64 have
>>>>>>> both 128-bit ll and sc, however LoongArch v1.1 only has 64-bit
>>>>>>> ll and 128-bit sc.
>>>>>
>>>>> Ah, that does complicate things.
>>>>>
>>>>>> Possibly use the combination of ll.d and ld.d:
>>>>>>
>>>>>>
>>>>>> ll.d lo, base, 0
>>>>>> ld.d hi, base, 4
>>>>>>
>>>>>> # do some computation
>>>>>>
>>>>>> sc.q lo, hi, base
>>>>>>
>>>>>> # try again if sc failed
>>>>>>
>>>>>> Then a possible implementation of gen_ll() would be: align base
>>>>>> to 128-bit boundary, read 128-bit from memory, save 64-bit part
>>>>>> to rd and record whole 128-bit data in llval. Then, in
>>>>>> gen_sc_q(), it uses a 128-bit cmpxchg.
>>>>>>
>>>>>>
>>>>>> But what about the reversed instruction pattern: ll.d hi, base,
>>>>>> 4; ld.d lo, base 0?
>>>>>
>>>>> It would be worth asking your hardware engineers about the bounds
>>>>> of legal behaviour. Ideally there would be some very explicit
>>>>> language, similar to
>>>>
>>>>
>>>> I'm a community developer not affiliated with Loongson. Song Gao,
>>>> could you provide some detail from Loongson Inc.?
>>>>
>>>>
>>>
>>> ll.d r1, base, 0
>>> dbar 0x700 ==> see 2.2.8.1
>>> ld.d r2, base, 8
>>> ...
>>> sc.q r1, r2, base
>>
>>
>> Thanks! I think we may need to detect the ll.d-dbar-ld.d sequence and
>> translate the sequence into one tcg_gen_qemu_ld_i128 and split the
>> result into two 64-bit parts. Can do this in QEMU?
>>
>>
> Oh, I'm not sure.
>
> I think we just need to implement sc.q. We don't need to care about
> 'll.d-dbar-ld.d'. It's just like 'll.q'.
> It needs the user to ensure that.
>
> 'll.q' is:
> 1) ll.d r1 base, 0 ==> set LLbit, load the low 64 bits into r1
> 2) dbar 0x700
> 3) ld.d r2 base, 8 ==> load the high 64 bits to r2
>
> sc.q needs to
> 1) Use 64-bit cmpxchg.
> 2) Write 128 bits to memory.
Consider the following code:
ll.d r1, base, 0
dbar 0x700
ld.d r2, base, 8
addi.d r2, r2, 1
sc.q r1, r2, base
We translate them into native code:
ld.d r1, base, 0
mv LLbit, 1
mv LLaddr, base
mv LLval, r1
dbar 0x700
ld.d r2, base, 8
addi.d r2, r2, 1
if (LLbit == 1 && LLaddr == base) {
cmpxchg addr=base compare=LLval new=r1
128-bit write {r2, r1} to base if cmpxchg succeeded
}
set r1 if sc.q succeeded
If the memory content of base+8 has changed between the ld.d and the
sc.q, atomicity is not guaranteed: a store that changed only the high
part, leaving the low part intact, goes undetected by the cmpxchg.
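A toy model makes the hazard concrete (a hypothetical helper compressing the pseudocode above, not QEMU code):

```c
#include <assert.h>
#include <stdint.h>

/* sc.q modeled as a compare on only the low half: a concurrent store
 * that touched only the high 64 bits slips through undetected. */
static int sc_q_low_only(uint64_t mem[2], uint64_t llval,
                         uint64_t lo, uint64_t hi)
{
    if (mem[0] != llval) {
        return 0;           /* low half changed: fail as expected */
    }
    mem[0] = lo;            /* low half unchanged: store 128 bits, */
    mem[1] = hi;            /* even if the high half was modified  */
    return 1;
}
```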
>
> Thanks.
> Song Gao
>>>
>>>
>>> For this series,
>>> I think we need set the new config bits to the 'max cpu', and change
>>> linux-user/target_elf.h ''any' to 'max', so that we can use these
>>> new instructions on linux-user mode.
>>
>> I will work on it.
>>
>>
>>>
>>> Thanks
>>> Song Gao
>>>>>
>>>>> https://developer.arm.com/documentation/ddi0487/latest/
>>>>> B2.9.5 Load-Exclusive and Store-Exclusive instruction usage
>>>>> restrictions
>>>>>
>>>>> But you could do the same thing, aligning and recording the entire
>>>>> 128-bit quantity, then extract the ll.d result based on address
>>>>> bit 6. This would complicate the implementation of sc.d as well,
>>>>> but would perhaps bring us "close enough" to the actual architecture.
>>>>>
>>>>> Note that our Arm store-exclusive implementation isn't quite in
>>>>> spec either. There is quite a large comment within
>>>>> translate-a64.c store_exclusive() about the ways things are not
>>>>> quite right. But it seems to be close enough for actual usage to
>>>>> succeed.
>>>>>
>>>>>
>>>>> r~
>>>
>
* Re: [PATCH 0/5] Add LoongArch v1.1 instructions
2023-10-30 11:54 ` Jiajie Chen
@ 2023-10-31 9:11 ` gaosong
2023-10-31 9:13 ` Jiajie Chen
0 siblings, 1 reply; 28+ messages in thread
From: gaosong @ 2023-10-31 9:11 UTC (permalink / raw)
To: Jiajie Chen, Richard Henderson, qemu-devel; +Cc: git, bibo mao
在 2023/10/30 下午7:54, Jiajie Chen 写道:
>
> On 2023/10/30 16:23, gaosong wrote:
>> 在 2023/10/28 下午9:09, Jiajie Chen 写道:
>>>
>>> On 2023/10/26 14:54, gaosong wrote:
>>>> 在 2023/10/26 上午9:38, Jiajie Chen 写道:
>>>>>
>>>>> On 2023/10/26 03:04, Richard Henderson wrote:
>>>>>> On 10/25/23 10:13, Jiajie Chen wrote:
>>>>>>>> On 2023/10/24 07:26, Richard Henderson wrote:
>>>>>>>>> See target/arm/tcg/translate-a64.c, gen_store_exclusive,
>>>>>>>>> TCGv_i128 block.
>>>>>>>>> See target/ppc/translate.c, gen_stqcx_.
>>>>>>>>
>>>>>>>> The situation here is slightly different: aarch64 and ppc64
>>>>>>>> have both 128-bit ll and sc, however LoongArch v1.1 only has
>>>>>>>> 64-bit ll and 128-bit sc.
>>>>>>
>>>>>> Ah, that does complicate things.
>>>>>>
>>>>>>> Possibly use the combination of ll.d and ld.d:
>>>>>>>
>>>>>>>
>>>>>>> ll.d lo, base, 0
>>>>>>> ld.d hi, base, 4
>>>>>>>
>>>>>>> # do some computation
>>>>>>>
>>>>>>> sc.q lo, hi, base
>>>>>>>
>>>>>>> # try again if sc failed
>>>>>>>
>>>>>>> Then a possible implementation of gen_ll() would be: align base
>>>>>>> to 128-bit boundary, read 128-bit from memory, save 64-bit part
>>>>>>> to rd and record whole 128-bit data in llval. Then, in
>>>>>>> gen_sc_q(), it uses a 128-bit cmpxchg.
>>>>>>>
>>>>>>>
>>>>>>> But what about the reversed instruction pattern: ll.d hi, base,
>>>>>>> 4; ld.d lo, base 0?
>>>>>>
>>>>>> It would be worth asking your hardware engineers about the bounds
>>>>>> of legal behaviour. Ideally there would be some very explicit
>>>>>> language, similar to
>>>>>
>>>>>
>>>>> I'm a community developer not affiliated with Loongson. Song Gao,
>>>>> could you provide some detail from Loongson Inc.?
>>>>>
>>>>>
>>>>
>>>> ll.d r1, base, 0
>>>> dbar 0x700 ==> see 2.2.8.1
>>>> ld.d r2, base, 8
>>>> ...
>>>> sc.q r1, r2, base
>>>
>>>
>>> Thanks! I think we may need to detect the ll.d-dbar-ld.d sequence
>>> and translate the sequence into one tcg_gen_qemu_ld_i128 and split
>>> the result into two 64-bit parts. Can do this in QEMU?
>>>
>>>
>> Oh, I'm not sure.
>>
>> I think we just need to implement sc.q. We don't need to care about
>> 'll.d-dbar-ld.d'. It's just like 'll.q'.
>> It needs the user to ensure that .
>>
>> ll.q' is
>> 1) ll.d r1 base, 0 ==> set LLbit, load the low 64 bits into r1
>> 2) dbar 0x700
>> 3) ld.d r2 base, 8 ==> load the high 64 bits to r2
>>
>> sc.q needs to
>> 1) Use 64-bit cmpxchg.
>> 2) Write 128 bits to memory.
>
> Consider the following code:
>
>
> ll.d r1, base, 0
>
> dbar 0x700
>
> ld.d r2, base, 8
>
> addi.d r2, r2, 1
>
> sc.q r1, r2, base
>
>
> We translate them into native code:
>
>
> ld.d r1, base, 0
>
> mv LLbit, 1
>
> mv LLaddr, base
>
> mv LLval, r1
>
> dbar 0x700
>
> ld.d r2, base, 8
>
> addi.d r2, r2, 1
>
> if (LLbit == 1 && LLaddr == base) {
>
> cmpxchg addr=base compare=LLval new=r1
>
> 128-bit write {r2, r1} to base if cmpxchg succeeded
>
> }
>
> set r1 if sc.q succeeded
>
>
>
> If the memory content of base+8 has changed between ld.d r2 and addi.d
> r2, the atomicity is not guaranteed, i.e. only the high part has
> changed, the low part hasn't.
>
>
Sorry, my mistake; we need to use cmpxchg_i128. See
target/arm/tcg/translate-a64.c gen_store_exclusive().
gen_scq(rd, rk, rj)
{
    ...
    TCGv_i128 t16 = tcg_temp_new_i128();
    TCGv_i128 c16 = tcg_temp_new_i128();
    TCGv_i64 low = tcg_temp_new_i64();
    TCGv_i64 high = tcg_temp_new_i64();
    TCGv_i64 temp = tcg_temp_new_i64();

    tcg_gen_concat_i64_i128(t16, cpu_gpr[rd], cpu_gpr[rk]);

    tcg_gen_qemu_ld(low, cpu_lladdr, ctx->mem_idx, MO_TEUQ);
    tcg_gen_addi_tl(temp, cpu_lladdr, 8);
    tcg_gen_mb(TCG_BAR_SC | TCG_MO_LD_LD);
    tcg_gen_qemu_ld(high, temp, ctx->mem_idx, MO_TEUQ);
    tcg_gen_concat_i64_i128(c16, low, high);
    tcg_gen_atomic_cmpxchg_i128(t16, cpu_lladdr, c16, t16,
                                ctx->mem_idx, MO_128);
    ...
}
I am not sure this is right.
I think Richard can give you more suggestions. @Richard
Thanks.
Song Gao
>
>> Thanks.
>> Song Gao
>>>>
>>>>
>>>> For this series,
>>>> I think we need set the new config bits to the 'max cpu', and
>>>> change linux-user/target_elf.h ''any' to 'max', so that we can use
>>>> these new instructions on linux-user mode.
>>>
>>> I will work on it.
>>>
>>>
>>>>
>>>> Thanks
>>>> Song Gao
>>>>>>
>>>>>> https://developer.arm.com/documentation/ddi0487/latest/
>>>>>> B2.9.5 Load-Exclusive and Store-Exclusive instruction usage
>>>>>> restrictions
>>>>>>
>>>>>> But you could do the same thing, aligning and recording the
>>>>>> entire 128-bit quantity, then extract the ll.d result based on
>>>>>> address bit 6. This would complicate the implementation of sc.d
>>>>>> as well, but would perhaps bring us "close enough" to the actual
>>>>>> architecture.
>>>>>>
>>>>>> Note that our Arm store-exclusive implementation isn't quite in
>>>>>> spec either. There is quite a large comment within
>>>>>> translate-a64.c store_exclusive() about the ways things are not
>>>>>> quite right. But it seems to be close enough for actual usage to
>>>>>> succeed.
>>>>>>
>>>>>>
>>>>>> r~
>>>>
>>
* Re: [PATCH 0/5] Add LoongArch v1.1 instructions
2023-10-31 9:11 ` gaosong
@ 2023-10-31 9:13 ` Jiajie Chen
2023-10-31 11:06 ` gaosong
0 siblings, 1 reply; 28+ messages in thread
From: Jiajie Chen @ 2023-10-31 9:13 UTC (permalink / raw)
To: gaosong, Richard Henderson, qemu-devel; +Cc: git, bibo mao
On 2023/10/31 17:11, gaosong wrote:
> 在 2023/10/30 下午7:54, Jiajie Chen 写道:
>>
>> On 2023/10/30 16:23, gaosong wrote:
>>> 在 2023/10/28 下午9:09, Jiajie Chen 写道:
>>>>
>>>> On 2023/10/26 14:54, gaosong wrote:
>>>>> 在 2023/10/26 上午9:38, Jiajie Chen 写道:
>>>>>>
>>>>>> On 2023/10/26 03:04, Richard Henderson wrote:
>>>>>>> On 10/25/23 10:13, Jiajie Chen wrote:
>>>>>>>>> On 2023/10/24 07:26, Richard Henderson wrote:
>>>>>>>>>> See target/arm/tcg/translate-a64.c, gen_store_exclusive,
>>>>>>>>>> TCGv_i128 block.
>>>>>>>>>> See target/ppc/translate.c, gen_stqcx_.
>>>>>>>>>
>>>>>>>>> The situation here is slightly different: aarch64 and ppc64
>>>>>>>>> have both 128-bit ll and sc, however LoongArch v1.1 only has
>>>>>>>>> 64-bit ll and 128-bit sc.
>>>>>>>
>>>>>>> Ah, that does complicate things.
>>>>>>>
>>>>>>>> Possibly use the combination of ll.d and ld.d:
>>>>>>>>
>>>>>>>>
>>>>>>>> ll.d lo, base, 0
>>>>>>>> ld.d hi, base, 4
>>>>>>>>
>>>>>>>> # do some computation
>>>>>>>>
>>>>>>>> sc.q lo, hi, base
>>>>>>>>
>>>>>>>> # try again if sc failed
>>>>>>>>
>>>>>>>> Then a possible implementation of gen_ll() would be: align base
>>>>>>>> to 128-bit boundary, read 128-bit from memory, save 64-bit part
>>>>>>>> to rd and record whole 128-bit data in llval. Then, in
>>>>>>>> gen_sc_q(), it uses a 128-bit cmpxchg.
>>>>>>>>
>>>>>>>>
>>>>>>>> But what about the reversed instruction pattern: ll.d hi, base,
>>>>>>>> 4; ld.d lo, base 0?
>>>>>>>
>>>>>>> It would be worth asking your hardware engineers about the
>>>>>>> bounds of legal behaviour. Ideally there would be some very
>>>>>>> explicit language, similar to
>>>>>>
>>>>>>
>>>>>> I'm a community developer not affiliated with Loongson. Song Gao,
>>>>>> could you provide some detail from Loongson Inc.?
>>>>>>
>>>>>>
>>>>>
>>>>> ll.d r1, base, 0
>>>>> dbar 0x700 ==> see 2.2.8.1
>>>>> ld.d r2, base, 8
>>>>> ...
>>>>> sc.q r1, r2, base
>>>>
>>>>
>>>> Thanks! I think we may need to detect the ll.d-dbar-ld.d sequence
>>>> and translate the sequence into one tcg_gen_qemu_ld_i128 and split
>>>> the result into two 64-bit parts. Can do this in QEMU?
>>>>
>>>>
>>> Oh, I'm not sure.
>>>
>>> I think we just need to implement sc.q. We don't need to care about
>>> 'll.d-dbar-ld.d'. It's just like 'll.q'.
>>> It needs the user to ensure that .
>>>
>>> ll.q' is
>>> 1) ll.d r1 base, 0 ==> set LLbit, load the low 64 bits into r1
>>> 2) dbar 0x700
>>> 3) ld.d r2 base, 8 ==> load the high 64 bits to r2
>>>
>>> sc.q needs to
>>> 1) Use 64-bit cmpxchg.
>>> 2) Write 128 bits to memory.
>>
>> Consider the following code:
>>
>>
>> ll.d r1, base, 0
>>
>> dbar 0x700
>>
>> ld.d r2, base, 8
>>
>> addi.d r2, r2, 1
>>
>> sc.q r1, r2, base
>>
>>
>> We translate them into native code:
>>
>>
>> ld.d r1, base, 0
>>
>> mv LLbit, 1
>>
>> mv LLaddr, base
>>
>> mv LLval, r1
>>
>> dbar 0x700
>>
>> ld.d r2, base, 8
>>
>> addi.d r2, r2, 1
>>
>> if (LLbit == 1 && LLaddr == base) {
>>
>> cmpxchg addr=base compare=LLval new=r1
>>
>> 128-bit write {r2, r1} to base if cmpxchg succeeded
>>
>> }
>>
>> set r1 if sc.q succeeded
>>
>>
>>
>> If the memory content of base+8 has changed between ld.d r2 and
>> addi.d r2, the atomicity is not guaranteed, i.e. only the high part
>> has changed, the low part hasn't.
>>
>>
> Sorry, my mistake. need use cmpxchg_i128. See
> target/arm/tcg/translate-a64.c gen_store_exclusive().
>
> gen_scq(rd, rk, rj)
> {
> ...
> TCGv_i128 t16 = tcg_temp_new_i128();
> TCGv_i128 c16 = tcg_temp_new_i128();
> TCGv_i64 low = tcg_temp_new_i64();
> TCGv_i64 high= tcg_temp_new_i64();
> TCGv_i64 temp = tcg_temp_new_i64();
>
> tcg_gen_concat_i64_i128(t16, cpu_gpr[rd], cpu_gpr[rk]));
>
> tcg_gen_qemu_ld(low, cpu_lladdr, ctx->mem_idx, MO_TEUQ);
> tcg_gen_addi_tl(temp, cpu_lladdr, 8);
> tcg_gen_mb(TCG_BAR_SC | TCG_MO_LD_LD);
> tcg_gen_qemu_ld(high, temp, ctx->mem_idx, MO_TEUQ);
The problem is that the high value read here might not equal the one
previously read by the ld.d r2, base, 8 instruction.
> tcg_gen_concat_i64_i128(c16, low, high);
>
> tcg_gen_atomic_cmpxchg_i128(t16, cpu_lladdr, c16, t16,
> ctx->mem_idx, MO_128);
>
> ...
> }
>
> I am not sure this is right.
>
> I think Richard can give you more suggestions. @Richard
>
> Thanks.
> Song Gao
>>
>>> Thanks.
>>> Song Gao
>>>>>
>>>>>
>>>>> For this series,
>>>>> I think we need set the new config bits to the 'max cpu', and
>>>>> change linux-user/target_elf.h ''any' to 'max', so that we can use
>>>>> these new instructions on linux-user mode.
>>>>
>>>> I will work on it.
>>>>
>>>>
>>>>>
>>>>> Thanks
>>>>> Song Gao
>>>>>>>
>>>>>>> https://developer.arm.com/documentation/ddi0487/latest/
>>>>>>> B2.9.5 Load-Exclusive and Store-Exclusive instruction usage
>>>>>>> restrictions
>>>>>>>
>>>>>>> But you could do the same thing, aligning and recording the
>>>>>>> entire 128-bit quantity, then extract the ll.d result based on
>>>>>>> address bit 6. This would complicate the implementation of sc.d
>>>>>>> as well, but would perhaps bring us "close enough" to the actual
>>>>>>> architecture.
>>>>>>>
>>>>>>> Note that our Arm store-exclusive implementation isn't quite in
>>>>>>> spec either. There is quite a large comment within
>>>>>>> translate-a64.c store_exclusive() about the ways things are not
>>>>>>> quite right. But it seems to be close enough for actual usage
>>>>>>> to succeed.
>>>>>>>
>>>>>>>
>>>>>>> r~
>>>>>
>>>
>
* Re: [PATCH 0/5] Add LoongArch v1.1 instructions
2023-10-31 9:13 ` Jiajie Chen
@ 2023-10-31 11:06 ` gaosong
2023-10-31 11:10 ` Jiajie Chen
0 siblings, 1 reply; 28+ messages in thread
From: gaosong @ 2023-10-31 11:06 UTC (permalink / raw)
To: Jiajie Chen, Richard Henderson, qemu-devel; +Cc: git, bibo mao
在 2023/10/31 下午5:13, Jiajie Chen 写道:
>
> On 2023/10/31 17:11, gaosong wrote:
>> 在 2023/10/30 下午7:54, Jiajie Chen 写道:
>>>
>>> On 2023/10/30 16:23, gaosong wrote:
>>>> 在 2023/10/28 下午9:09, Jiajie Chen 写道:
>>>>>
>>>>> On 2023/10/26 14:54, gaosong wrote:
>>>>>> 在 2023/10/26 上午9:38, Jiajie Chen 写道:
>>>>>>>
>>>>>>> On 2023/10/26 03:04, Richard Henderson wrote:
>>>>>>>> On 10/25/23 10:13, Jiajie Chen wrote:
>>>>>>>>>> On 2023/10/24 07:26, Richard Henderson wrote:
>>>>>>>>>>> See target/arm/tcg/translate-a64.c, gen_store_exclusive,
>>>>>>>>>>> TCGv_i128 block.
>>>>>>>>>>> See target/ppc/translate.c, gen_stqcx_.
>>>>>>>>>>
>>>>>>>>>> The situation here is slightly different: aarch64 and ppc64
>>>>>>>>>> have both 128-bit ll and sc, however LoongArch v1.1 only has
>>>>>>>>>> 64-bit ll and 128-bit sc.
>>>>>>>>
>>>>>>>> Ah, that does complicate things.
>>>>>>>>
>>>>>>>>> Possibly use the combination of ll.d and ld.d:
>>>>>>>>>
>>>>>>>>>
>>>>>>>>> ll.d lo, base, 0
>>>>>>>>> ld.d hi, base, 4
>>>>>>>>>
>>>>>>>>> # do some computation
>>>>>>>>>
>>>>>>>>> sc.q lo, hi, base
>>>>>>>>>
>>>>>>>>> # try again if sc failed
>>>>>>>>>
>>>>>>>>> Then a possible implementation of gen_ll() would be: align
>>>>>>>>> base to 128-bit boundary, read 128-bit from memory, save
>>>>>>>>> 64-bit part to rd and record whole 128-bit data in llval.
>>>>>>>>> Then, in gen_sc_q(), it uses a 128-bit cmpxchg.
>>>>>>>>>
>>>>>>>>>
>>>>>>>>> But what about the reversed instruction pattern: ll.d hi,
>>>>>>>>> base, 4; ld.d lo, base 0?
>>>>>>>>
>>>>>>>> It would be worth asking your hardware engineers about the
>>>>>>>> bounds of legal behaviour. Ideally there would be some very
>>>>>>>> explicit language, similar to
>>>>>>>
>>>>>>>
>>>>>>> I'm a community developer not affiliated with Loongson. Song
>>>>>>> Gao, could you provide some detail from Loongson Inc.?
>>>>>>>
>>>>>>>
>>>>>>
>>>>>> ll.d r1, base, 0
>>>>>> dbar 0x700 ==> see 2.2.8.1
>>>>>> ld.d r2, base, 8
>>>>>> ...
>>>>>> sc.q r1, r2, base
>>>>>
>>>>>
>>>>> Thanks! I think we may need to detect the ll.d-dbar-ld.d sequence
>>>>> and translate the sequence into one tcg_gen_qemu_ld_i128 and split
>>>>> the result into two 64-bit parts. Can do this in QEMU?
>>>>>
>>>>>
>>>> Oh, I'm not sure.
>>>>
>>>> I think we just need to implement sc.q. We don't need to care about
>>>> 'll.d-dbar-ld.d'. It's just like 'll.q'.
>>>> It needs the user to ensure that .
>>>>
>>>> ll.q' is
>>>> 1) ll.d r1 base, 0 ==> set LLbit, load the low 64 bits into r1
>>>> 2) dbar 0x700
>>>> 3) ld.d r2 base, 8 ==> load the high 64 bits to r2
>>>>
>>>> sc.q needs to
>>>> 1) Use 64-bit cmpxchg.
>>>> 2) Write 128 bits to memory.
>>>
>>> Consider the following code:
>>>
>>>
>>> ll.d r1, base, 0
>>>
>>> dbar 0x700
>>>
>>> ld.d r2, base, 8
>>>
>>> addi.d r2, r2, 1
>>>
>>> sc.q r1, r2, base
>>>
>>>
>>> We translate them into native code:
>>>
>>>
>>> ld.d r1, base, 0
>>>
>>> mv LLbit, 1
>>>
>>> mv LLaddr, base
>>>
>>> mv LLval, r1
>>>
>>> dbar 0x700
>>>
>>> ld.d r2, base, 8
>>>
>>> addi.d r2, r2, 1
>>>
>>> if (LLbit == 1 && LLaddr == base) {
>>>
>>> cmpxchg addr=base compare=LLval new=r1
>>>
>>> 128-bit write {r2, r1} to base if cmpxchg succeeded
>>>
>>> }
>>>
>>> set r1 if sc.q succeeded
>>>
>>>
>>>
>>> If the memory content of base+8 has changed between ld.d r2 and
>>> addi.d r2, the atomicity is not guaranteed, i.e. only the high part
>>> has changed, the low part hasn't.
>>>
>>>
>> Sorry, my mistake. need use cmpxchg_i128. See
>> target/arm/tcg/translate-a64.c gen_store_exclusive().
>>
>> gen_scq(rd, rk, rj)
>> {
>> ...
>> TCGv_i128 t16 = tcg_temp_new_i128();
>> TCGv_i128 c16 = tcg_temp_new_i128();
>> TCGv_i64 low = tcg_temp_new_i64();
>> TCGv_i64 high= tcg_temp_new_i64();
>> TCGv_i64 temp = tcg_temp_new_i64();
>>
>> tcg_gen_concat_i64_i128(t16, cpu_gpr[rd], cpu_gpr[rk]));
>>
>> tcg_gen_qemu_ld(low, cpu_lladdr, ctx->mem_idx, MO_TEUQ);
>> tcg_gen_addi_tl(temp, cpu_lladdr, 8);
>> tcg_gen_mb(TCG_BAR_SC | TCG_MO_LD_LD);
>> tcg_gen_qemu_ld(high, temp, ctx->mem_idx, MO_TEUQ);
>
>
> The problem is that, the high value read here might not equal to the
> previously read one in ll.d r2, base 8 instruction.
I think dbar 0x700 ensures that the two loads in 'll.q' form a 128-bit
atomic operation.
Thanks.
Song Gao
>> tcg_gen_concat_i64_i128(c16, low, high);
>>
>> tcg_gen_atomic_cmpxchg_i128(t16, cpu_lladdr, c16, t16,
>> ctx->mem_idx, MO_128);
>>
>> ...
>> }
>>
>> I am not sure this is right.
>>
>> I think Richard can give you more suggestions. @Richard
>>
>> Thanks.
>> Song Gao
>>>
>>>> Thanks.
>>>> Song Gao
>>>>>>
>>>>>>
>>>>>> For this series,
>>>>>> I think we need set the new config bits to the 'max cpu', and
>>>>>> change linux-user/target_elf.h ''any' to 'max', so that we can
>>>>>> use these new instructions on linux-user mode.
>>>>>
>>>>> I will work on it.
>>>>>
>>>>>
>>>>>>
>>>>>> Thanks
>>>>>> Song Gao
>>>>>>>>
>>>>>>>> https://developer.arm.com/documentation/ddi0487/latest/
>>>>>>>> B2.9.5 Load-Exclusive and Store-Exclusive instruction usage
>>>>>>>> restrictions
>>>>>>>>
>>>>>>>> But you could do the same thing, aligning and recording the
>>>>>>>> entire 128-bit quantity, then extract the ll.d result based on
>>>>>>>> address bit 6. This would complicate the implementation of
>>>>>>>> sc.d as well, but would perhaps bring us "close enough" to the
>>>>>>>> actual architecture.
>>>>>>>>
>>>>>>>> Note that our Arm store-exclusive implementation isn't quite in
>>>>>>>> spec either. There is quite a large comment within
>>>>>>>> translate-a64.c store_exclusive() about the ways things are not
>>>>>>>> quite right. But it seems to be close enough for actual usage
>>>>>>>> to succeed.
>>>>>>>>
>>>>>>>>
>>>>>>>> r~
>>>>>>
>>>>
>>
* Re: [PATCH 0/5] Add LoongArch v1.1 instructions
2023-10-31 11:06 ` gaosong
@ 2023-10-31 11:10 ` Jiajie Chen
2023-10-31 12:12 ` gaosong
0 siblings, 1 reply; 28+ messages in thread
From: Jiajie Chen @ 2023-10-31 11:10 UTC (permalink / raw)
To: gaosong, Richard Henderson, qemu-devel; +Cc: git, bibo mao
On 2023/10/31 19:06, gaosong wrote:
> 在 2023/10/31 下午5:13, Jiajie Chen 写道:
>>
>> On 2023/10/31 17:11, gaosong wrote:
>>> 在 2023/10/30 下午7:54, Jiajie Chen 写道:
>>>>
>>>> On 2023/10/30 16:23, gaosong wrote:
>>>>> 在 2023/10/28 下午9:09, Jiajie Chen 写道:
>>>>>>
>>>>>> On 2023/10/26 14:54, gaosong wrote:
>>>>>>> 在 2023/10/26 上午9:38, Jiajie Chen 写道:
>>>>>>>>
>>>>>>>> On 2023/10/26 03:04, Richard Henderson wrote:
>>>>>>>>> On 10/25/23 10:13, Jiajie Chen wrote:
>>>>>>>>>>> On 2023/10/24 07:26, Richard Henderson wrote:
>>>>>>>>>>>> See target/arm/tcg/translate-a64.c, gen_store_exclusive,
>>>>>>>>>>>> TCGv_i128 block.
>>>>>>>>>>>> See target/ppc/translate.c, gen_stqcx_.
>>>>>>>>>>>
>>>>>>>>>>> The situation here is slightly different: aarch64 and ppc64
>>>>>>>>>>> have both 128-bit ll and sc, however LoongArch v1.1 only has
>>>>>>>>>>> 64-bit ll and 128-bit sc.
>>>>>>>>>
>>>>>>>>> Ah, that does complicate things.
>>>>>>>>>
>>>>>>>>>> Possibly use the combination of ll.d and ld.d:
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>> ll.d lo, base, 0
>>>>>>>>>> ld.d hi, base, 4
>>>>>>>>>>
>>>>>>>>>> # do some computation
>>>>>>>>>>
>>>>>>>>>> sc.q lo, hi, base
>>>>>>>>>>
>>>>>>>>>> # try again if sc failed
>>>>>>>>>>
>>>>>>>>>> Then a possible implementation of gen_ll() would be: align
>>>>>>>>>> base to 128-bit boundary, read 128-bit from memory, save
>>>>>>>>>> 64-bit part to rd and record whole 128-bit data in llval.
>>>>>>>>>> Then, in gen_sc_q(), it uses a 128-bit cmpxchg.
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>> But what about the reversed instruction pattern: ll.d hi,
>>>>>>>>>> base, 4; ld.d lo, base 0?
>>>>>>>>>
>>>>>>>>> It would be worth asking your hardware engineers about the
>>>>>>>>> bounds of legal behaviour. Ideally there would be some very
>>>>>>>>> explicit language, similar to
>>>>>>>>
>>>>>>>>
>>>>>>>> I'm a community developer not affiliated with Loongson. Song
>>>>>>>> Gao, could you provide some detail from Loongson Inc.?
>>>>>>>>
>>>>>>>>
>>>>>>>
>>>>>>> ll.d r1, base, 0
>>>>>>> dbar 0x700 ==> see 2.2.8.1
>>>>>>> ld.d r2, base, 8
>>>>>>> ...
>>>>>>> sc.q r1, r2, base
>>>>>>
>>>>>>
>>>>>> Thanks! I think we may need to detect the ll.d-dbar-ld.d sequence
>>>>>> and translate the sequence into one tcg_gen_qemu_ld_i128 and
>>>>>> split the result into two 64-bit parts. Can do this in QEMU?
>>>>>>
>>>>>>
>>>>> Oh, I'm not sure.
>>>>>
>>>>> I think we just need to implement sc.q. We don't need to care
>>>>> about 'll.d-dbar-ld.d'. It's just like 'll.q'.
>>>>> It needs the user to ensure that .
>>>>>
>>>>> ll.q' is
>>>>> 1) ll.d r1 base, 0 ==> set LLbit, load the low 64 bits into r1
>>>>> 2) dbar 0x700
>>>>> 3) ld.d r2 base, 8 ==> load the high 64 bits to r2
>>>>>
>>>>> sc.q needs to
>>>>> 1) Use 64-bit cmpxchg.
>>>>> 2) Write 128 bits to memory.
>>>>
>>>> Consider the following code:
>>>>
>>>>
>>>> ll.d r1, base, 0
>>>>
>>>> dbar 0x700
>>>>
>>>> ld.d r2, base, 8
>>>>
>>>> addi.d r2, r2, 1
>>>>
>>>> sc.q r1, r2, base
>>>>
>>>>
>>>> We translate them into native code:
>>>>
>>>>
>>>> ld.d r1, base, 0
>>>>
>>>> mv LLbit, 1
>>>>
>>>> mv LLaddr, base
>>>>
>>>> mv LLval, r1
>>>>
>>>> dbar 0x700
>>>>
>>>> ld.d r2, base, 8
>>>>
>>>> addi.d r2, r2, 1
>>>>
>>>> if (LLbit == 1 && LLaddr == base) {
>>>>
>>>> cmpxchg addr=base compare=LLval new=r1
>>>>
>>>> 128-bit write {r2, r1} to base if cmpxchg succeeded
>>>>
>>>> }
>>>>
>>>> set r1 if sc.q succeeded
>>>>
>>>>
>>>>
>>>> If the memory content of base+8 has changed between ld.d r2 and
>>>> addi.d r2, the atomicity is not guaranteed, i.e. only the high part
>>>> has changed, the low part hasn't.
>>>>
>>>>
>>> Sorry, my mistake. need use cmpxchg_i128. See
>>> target/arm/tcg/translate-a64.c gen_store_exclusive().
>>>
>>> gen_scq(rd, rk, rj)
>>> {
>>> ...
>>> TCGv_i128 t16 = tcg_temp_new_i128();
>>> TCGv_i128 c16 = tcg_temp_new_i128();
>>> TCGv_i64 low = tcg_temp_new_i64();
>>> TCGv_i64 high= tcg_temp_new_i64();
>>> TCGv_i64 temp = tcg_temp_new_i64();
>>>
>>> tcg_gen_concat_i64_i128(t16, cpu_gpr[rd], cpu_gpr[rk]));
>>>
>>> tcg_gen_qemu_ld(low, cpu_lladdr, ctx->mem_idx, MO_TEUQ);
>>> tcg_gen_addi_tl(temp, cpu_lladdr, 8);
>>> tcg_gen_mb(TCG_BAR_SC | TCG_MO_LD_LD);
>>> tcg_gen_qemu_ld(high, temp, ctx->mem_idx, MO_TEUQ);
>>
>>
>> The problem is that, the high value read here might not equal to the
>> previously read one in ll.d r2, base 8 instruction.
> I think dbar 0x7000 ensures that the 2 loads in 'll.q' are a 128bit
> atomic operation.
The code does work on a real LoongArch machine. However, since we are
emulating LoongArch in QEMU, we have to make it atomic ourselves, and
currently it isn't.
>
> Thanks.
> Song Gao
>>> tcg_gen_concat_i64_i128(c16, low, high);
>>>
>>> tcg_gen_atomic_cmpxchg_i128(t16, cpu_lladdr, c16, t16,
>>> ctx->mem_idx, MO_128);
>>>
>>> ...
>>> }
>>>
>>> I am not sure this is right.
>>>
>>> I think Richard can give you more suggestions. @Richard
>>>
>>> Thanks.
>>> Song Gao
>>>>
>>>>> Thanks.
>>>>> Song Gao
>>>>>>>
>>>>>>>
>>>>>>> For this series,
>>>>>>> I think we need set the new config bits to the 'max cpu', and
>>>>>>> change linux-user/target_elf.h ''any' to 'max', so that we can
>>>>>>> use these new instructions on linux-user mode.
>>>>>>
>>>>>> I will work on it.
>>>>>>
>>>>>>
>>>>>>>
>>>>>>> Thanks
>>>>>>> Song Gao
>>>>>>>>>
>>>>>>>>> https://developer.arm.com/documentation/ddi0487/latest/
>>>>>>>>> B2.9.5 Load-Exclusive and Store-Exclusive instruction usage
>>>>>>>>> restrictions
>>>>>>>>>
>>>>>>>>> But you could do the same thing, aligning and recording the
>>>>>>>>> entire 128-bit quantity, then extract the ll.d result based on
>>>>>>>>> address bit 6. This would complicate the implementation of
>>>>>>>>> sc.d as well, but would perhaps bring us "close enough" to the
>>>>>>>>> actual architecture.
>>>>>>>>>
>>>>>>>>> Note that our Arm store-exclusive implementation isn't quite
>>>>>>>>> in spec either. There is quite a large comment within
>>>>>>>>> translate-a64.c store_exclusive() about the ways things are
>>>>>>>>> not quite right. But it seems to be close enough for actual
>>>>>>>>> usage to succeed.
>>>>>>>>>
>>>>>>>>>
>>>>>>>>> r~
>>>>>>>
>>>>>
>>>
>
* Re: [PATCH 0/5] Add LoongArch v1.1 instructions
2023-10-31 11:10 ` Jiajie Chen
@ 2023-10-31 12:12 ` gaosong
0 siblings, 0 replies; 28+ messages in thread
From: gaosong @ 2023-10-31 12:12 UTC (permalink / raw)
To: Jiajie Chen, Richard Henderson, qemu-devel; +Cc: git, bibo mao
在 2023/10/31 下午7:10, Jiajie Chen 写道:
>
> On 2023/10/31 19:06, gaosong wrote:
>> 在 2023/10/31 下午5:13, Jiajie Chen 写道:
>>>
>>> On 2023/10/31 17:11, gaosong wrote:
>>>> 在 2023/10/30 下午7:54, Jiajie Chen 写道:
>>>>>
>>>>> On 2023/10/30 16:23, gaosong wrote:
>>>>>> 在 2023/10/28 下午9:09, Jiajie Chen 写道:
>>>>>>>
>>>>>>> On 2023/10/26 14:54, gaosong wrote:
>>>>>>>> 在 2023/10/26 上午9:38, Jiajie Chen 写道:
>>>>>>>>>
>>>>>>>>> On 2023/10/26 03:04, Richard Henderson wrote:
>>>>>>>>>> On 10/25/23 10:13, Jiajie Chen wrote:
>>>>>>>>>>>> On 2023/10/24 07:26, Richard Henderson wrote:
>>>>>>>>>>>>> See target/arm/tcg/translate-a64.c, gen_store_exclusive,
>>>>>>>>>>>>> TCGv_i128 block.
>>>>>>>>>>>>> See target/ppc/translate.c, gen_stqcx_.
>>>>>>>>>>>>
>>>>>>>>>>>> The situation here is slightly different: aarch64 and ppc64
>>>>>>>>>>>> have both 128-bit ll and sc, however LoongArch v1.1 only
>>>>>>>>>>>> has 64-bit ll and 128-bit sc.
>>>>>>>>>>
>>>>>>>>>> Ah, that does complicate things.
>>>>>>>>>>
>>>>>>>>>>> Possibly use the combination of ll.d and ld.d:
>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>> ll.d lo, base, 0
>>>>>>>>>>> ld.d hi, base, 4
>>>>>>>>>>>
>>>>>>>>>>> # do some computation
>>>>>>>>>>>
>>>>>>>>>>> sc.q lo, hi, base
>>>>>>>>>>>
>>>>>>>>>>> # try again if sc failed
>>>>>>>>>>>
>>>>>>>>>>> Then a possible implementation of gen_ll() would be: align
>>>>>>>>>>> base to 128-bit boundary, read 128-bit from memory, save
>>>>>>>>>>> 64-bit part to rd and record whole 128-bit data in llval.
>>>>>>>>>>> Then, in gen_sc_q(), it uses a 128-bit cmpxchg.
>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>> But what about the reversed instruction pattern: ll.d hi,
>>>>>>>>>>> base, 4; ld.d lo, base 0?
>>>>>>>>>>
>>>>>>>>>> It would be worth asking your hardware engineers about the
>>>>>>>>>> bounds of legal behaviour. Ideally there would be some very
>>>>>>>>>> explicit language, similar to
>>>>>>>>>
>>>>>>>>>
>>>>>>>>> I'm a community developer not affiliated with Loongson. Song
>>>>>>>>> Gao, could you provide some detail from Loongson Inc.?
>>>>>>>>>
>>>>>>>>>
>>>>>>>>
>>>>>>>> ll.d r1, base, 0
>>>>>>>> dbar 0x700 ==> see 2.2.8.1
>>>>>>>> ld.d r2, base, 8
>>>>>>>> ...
>>>>>>>> sc.q r1, r2, base
>>>>>>>
>>>>>>>
>>>>>>> Thanks! I think we may need to detect the ll.d-dbar-ld.d
>>>>>>> sequence and translate the sequence into one
>>>>>>> tcg_gen_qemu_ld_i128 and split the result into two 64-bit parts.
>>>>>>> Can do this in QEMU?
>>>>>>>
>>>>>>>
>>>>>> Oh, I'm not sure.
>>>>>>
>>>>>> I think we just need to implement sc.q. We don't need to care
>>>>>> about 'll.d-dbar-ld.d'. It's just like 'll.q'.
>>>>>> It needs the user to ensure that .
>>>>>>
>>>>>> ll.q' is
>>>>>> 1) ll.d r1 base, 0 ==> set LLbit, load the low 64 bits into r1
>>>>>> 2) dbar 0x700
>>>>>> 3) ld.d r2 base, 8 ==> load the high 64 bits to r2
>>>>>>
>>>>>> sc.q needs to
>>>>>> 1) Use 64-bit cmpxchg.
>>>>>> 2) Write 128 bits to memory.
>>>>>
>>>>> Consider the following code:
>>>>>
>>>>>
>>>>> ll.d r1, base, 0
>>>>>
>>>>> dbar 0x700
>>>>>
>>>>> ld.d r2, base, 8
>>>>>
>>>>> addi.d r2, r2, 1
>>>>>
>>>>> sc.q r1, r2, base
>>>>>
>>>>>
>>>>> We translate them into native code:
>>>>>
>>>>>
>>>>> ld.d r1, base, 0
>>>>>
>>>>> mv LLbit, 1
>>>>>
>>>>> mv LLaddr, base
>>>>>
>>>>> mv LLval, r1
>>>>>
>>>>> dbar 0x700
>>>>>
>>>>> ld.d r2, base, 8
>>>>>
>>>>> addi.d r2, r2, 1
>>>>>
>>>>> if (LLbit == 1 && LLaddr == base) {
>>>>>
>>>>> cmpxchg addr=base compare=LLval new=r1
>>>>>
>>>>> 128-bit write {r2, r1} to base if cmpxchg succeeded
>>>>>
>>>>> }
>>>>>
>>>>> set r1 if sc.q succeeded
>>>>>
>>>>>
>>>>>
>>>>> If the memory content of base+8 has changed between ld.d r2 and
>>>>> addi.d r2, the atomicity is not guaranteed, i.e. only the high
>>>>> part has changed, the low part hasn't.
>>>>>
>>>>>
>>>> Sorry, my mistake. need use cmpxchg_i128. See
>>>> target/arm/tcg/translate-a64.c gen_store_exclusive().
>>>>
>>>> gen_scq(rd, rk, rj)
>>>> {
>>>> ...
>>>> TCGv_i128 t16 = tcg_temp_new_i128();
>>>> TCGv_i128 c16 = tcg_temp_new_i128();
>>>> TCGv_i64 low = tcg_temp_new_i64();
>>>> TCGv_i64 high= tcg_temp_new_i64();
>>>> TCGv_i64 temp = tcg_temp_new_i64();
>>>>
>>>> tcg_gen_concat_i64_i128(t16, cpu_gpr[rd], cpu_gpr[rk]));
>>>>
>>>> tcg_gen_qemu_ld(low, cpu_lladdr, ctx->mem_idx, MO_TEUQ);
>>>> tcg_gen_addi_tl(temp, cpu_lladdr, 8);
>>>> tcg_gen_mb(TCG_BAR_SC | TCG_MO_LD_LD);
>>>> tcg_gen_qemu_ld(high, temp, ctx->mem_idx, MO_TEUQ);
>>>
>>>
>>> The problem is that, the high value read here might not equal to the
>>> previously read one in ll.d r2, base 8 instruction.
>> I think dbar 0x7000 ensures that the 2 loads in 'll.q' are a 128bit
>> atomic operation.
>
>
> The code does work in real LoongArch machine. However, we are
> emulating LoongArch in qemu, we have to make it atomic, yet it isn't now.
>
>
Yes, I know. As I said before, we needn't care about 'll.q'; it is up
to the user to ensure that.
In QEMU, I think the dbar instruction can make it atomic, but I am
not sure this is right.
static bool trans_dbar(DisasContext *ctx, arg_dbar *a)
{
    tcg_gen_mb(TCG_BAR_SC | TCG_MO_ALL);
    return true;
}
Maybe this is already enough, or like this:
static bool trans_dbar(DisasContext *ctx, arg_dbar *a)
{
    TCGBar bar;

    if (a->hint == 0x700) {
        bar = TCG_BAR_SC | TCG_MO_LD_LD;
    } else {
        bar = TCG_BAR_SC | TCG_MO_ALL;
    }
    tcg_gen_mb(bar);
    return true;
}
Thanks.
Song Gao
>>
>> Thanks.
>> Song Gao
>>>> tcg_gen_concat_i64_i128(c16, low, high);
>>>>
>>>> tcg_gen_atomic_cmpxchg_i128(t16, cpu_lladdr, c16, t16,
>>>> ctx->mem_idx, MO_128);
>>>>
>>>> ...
>>>> }
>>>>
>>>> I am not sure this is right.
>>>>
>>>> I think Richard can give you more suggestions. @Richard
>>>>
>>>> Thanks.
>>>> Song Gao
>>>>>
>>>>>> Thanks.
>>>>>> Song Gao
>>>>>>>>
>>>>>>>>
>>>>>>>> For this series,
>>>>>>>> I think we need set the new config bits to the 'max cpu', and
>>>>>>>> change linux-user/target_elf.h ''any' to 'max', so that we can
>>>>>>>> use these new instructions on linux-user mode.
>>>>>>>
>>>>>>> I will work on it.
>>>>>>>
>>>>>>>
>>>>>>>>
>>>>>>>> Thanks
>>>>>>>> Song Gao
>>>>>>>>>>
>>>>>>>>>> https://developer.arm.com/documentation/ddi0487/latest/
>>>>>>>>>> B2.9.5 Load-Exclusive and Store-Exclusive instruction usage
>>>>>>>>>> restrictions
>>>>>>>>>>
>>>>>>>>>> But you could do the same thing, aligning and recording the
>>>>>>>>>> entire 128-bit quantity, then extract the ll.d result based
>>>>>>>>>> on address bit 6. This would complicate the implementation of
>>>>>>>>>> sc.d as well, but would perhaps bring us "close enough" to
>>>>>>>>>> the actual architecture.
>>>>>>>>>>
>>>>>>>>>> Note that our Arm store-exclusive implementation isn't quite
>>>>>>>>>> in spec either. There is quite a large comment within
>>>>>>>>>> translate-a64.c store_exclusive() about the ways things are
>>>>>>>>>> not quite right. But it seems to be close enough for actual
>>>>>>>>>> usage to succeed.
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>> r~
>>>>>>>>
>>>>>>
>>>>
>>
end of thread, other threads:[~2023-10-31 12:13 UTC | newest]
Thread overview: 28+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2023-10-23 15:29 [PATCH 0/5] Add LoongArch v1.1 instructions Jiajie Chen
2023-10-23 15:29 ` [PATCH 1/5] include/exec/memop.h: Add MO_TESB Jiajie Chen
2023-10-23 15:49 ` David Hildenbrand
2023-10-23 15:52 ` Jiajie Chen
2023-10-23 15:29 ` [PATCH 2/5] target/loongarch: Add am{swap/add}[_db].{b/h} Jiajie Chen
2023-10-23 22:50 ` Richard Henderson
2023-10-23 15:29 ` [PATCH 3/5] target/loongarch: Add amcas[_db].{b/h/w/d} Jiajie Chen
2023-10-23 15:35 ` Jiajie Chen
2023-10-23 23:00 ` Richard Henderson
2023-10-23 22:59 ` Richard Henderson
2023-10-23 15:29 ` [PATCH 4/5] target/loongarch: Add estimated reciprocal instructions Jiajie Chen
2023-10-23 23:02 ` Richard Henderson
2023-10-23 15:29 ` [PATCH 5/5] target/loongarch: Add llacq/screl instructions Jiajie Chen
2023-10-23 23:19 ` Richard Henderson
2023-10-23 23:26 ` [PATCH 0/5] Add LoongArch v1.1 instructions Richard Henderson
2023-10-24 6:10 ` Jiajie Chen
2023-10-25 17:13 ` Jiajie Chen
2023-10-25 19:04 ` Richard Henderson
2023-10-26 1:38 ` Jiajie Chen
2023-10-26 6:54 ` gaosong
2023-10-28 13:09 ` Jiajie Chen
2023-10-30 8:23 ` gaosong
2023-10-30 11:54 ` Jiajie Chen
2023-10-31 9:11 ` gaosong
2023-10-31 9:13 ` Jiajie Chen
2023-10-31 11:06 ` gaosong
2023-10-31 11:10 ` Jiajie Chen
2023-10-31 12:12 ` gaosong
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).