* [PATCH 0/8] softfloat: Implement float128_muladd
@ 2020-09-24  1:24 Richard Henderson
  2020-09-24  1:24 ` [PATCH 1/8] softfloat: Use mulu64 for mul64To128 Richard Henderson
                   ` (8 more replies)
  0 siblings, 9 replies; 18+ messages in thread
From: Richard Henderson @ 2020-09-24  1:24 UTC (permalink / raw)
  To: qemu-devel; +Cc: bharata, alex.bennee, david

This series implements float128_muladd for softfloat, plus assorted
cleanups; it passes tests/fp/fp-test.
I will eventually fill in ppc and s390x assembly bits.
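
For reference, the float128_muladd interface added in patch 6 mirrors the
existing float64_muladd signature.  A minimal sketch of a caller (the
wrapper name below is mine, not part of the series):

    #include "qemu/osdep.h"
    #include "fpu/softfloat.h"

    /* Illustrative only: compute -(a * b) + c by negating the product
     * before the addition; rounding is controlled by *status. */
    static float128 f128_negprod_muladd(float128 a, float128 b, float128 c,
                                        float_status *status)
    {
        return float128_muladd(a, b, c, float_muladd_negate_product, status);
    }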


r~


Richard Henderson (8):
  softfloat: Use mulu64 for mul64To128
  softfloat: Use int128.h for some operations
  softfloat: Tidy a * b + inf return
  softfloat: Add float_cmask and constants
  softfloat: Inline pick_nan_muladd into its caller
  softfloat: Implement float128_muladd
  softfloat: Use x86_64 assembly for {add,sub}{192,256}
  softfloat: Use aarch64 assembly for {add,sub}{192,256}

 include/fpu/softfloat-macros.h |  95 +++---
 include/fpu/softfloat.h        |   2 +
 fpu/softfloat.c                | 520 +++++++++++++++++++++++++++++----
 tests/fp/fp-test.c             |   2 +-
 tests/fp/wrap.c.inc            |  12 +
 5 files changed, 538 insertions(+), 93 deletions(-)

-- 
2.25.1




* [PATCH 1/8] softfloat: Use mulu64 for mul64To128
  2020-09-24  1:24 [PATCH 0/8] softfloat: Implement float128_muladd Richard Henderson
@ 2020-09-24  1:24 ` Richard Henderson
  2020-09-24  7:32   ` David Hildenbrand
  2020-09-24  1:24 ` [PATCH 2/8] softfloat: Use int128.h for some operations Richard Henderson
                   ` (7 subsequent siblings)
  8 siblings, 1 reply; 18+ messages in thread
From: Richard Henderson @ 2020-09-24  1:24 UTC (permalink / raw)
  To: qemu-devel; +Cc: bharata, alex.bennee, david

Via host-utils.h, we use a host widening multiply for
64-bit hosts, and a common subroutine for 32-bit hosts.
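
As a standalone illustration of the widening multiply (this sketch uses the
compiler's unsigned __int128 directly rather than QEMU's mulu64; note that
mul64To128 returns the high half through z0Ptr and the low half through
z1Ptr):

    #include <stdint.h>

    /* Illustrative only: 64 x 64 -> 128 multiply, split into halves
     * the same way mul64To128 does. */
    static void mul64to128_ref(uint64_t a, uint64_t b,
                               uint64_t *hi, uint64_t *lo)
    {
        unsigned __int128 p = (unsigned __int128)a * b;

        *hi = (uint64_t)(p >> 64);   /* what goes to *z0Ptr */
        *lo = (uint64_t)p;           /* what goes to *z1Ptr */
    }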

Signed-off-by: Richard Henderson <richard.henderson@linaro.org>
---
 include/fpu/softfloat-macros.h | 24 ++++--------------------
 1 file changed, 4 insertions(+), 20 deletions(-)

diff --git a/include/fpu/softfloat-macros.h b/include/fpu/softfloat-macros.h
index a35ec2893a..57845f8af0 100644
--- a/include/fpu/softfloat-macros.h
+++ b/include/fpu/softfloat-macros.h
@@ -83,6 +83,7 @@ this code that are retained.
 #define FPU_SOFTFLOAT_MACROS_H
 
 #include "fpu/softfloat-types.h"
+#include "qemu/host-utils.h"
 
 /*----------------------------------------------------------------------------
 | Shifts `a' right by the number of bits given in `count'.  If any nonzero
@@ -515,27 +516,10 @@ static inline void
 | `z0Ptr' and `z1Ptr'.
 *----------------------------------------------------------------------------*/
 
-static inline void mul64To128( uint64_t a, uint64_t b, uint64_t *z0Ptr, uint64_t *z1Ptr )
+static inline void
+mul64To128(uint64_t a, uint64_t b, uint64_t *z0Ptr, uint64_t *z1Ptr)
 {
-    uint32_t aHigh, aLow, bHigh, bLow;
-    uint64_t z0, zMiddleA, zMiddleB, z1;
-
-    aLow = a;
-    aHigh = a>>32;
-    bLow = b;
-    bHigh = b>>32;
-    z1 = ( (uint64_t) aLow ) * bLow;
-    zMiddleA = ( (uint64_t) aLow ) * bHigh;
-    zMiddleB = ( (uint64_t) aHigh ) * bLow;
-    z0 = ( (uint64_t) aHigh ) * bHigh;
-    zMiddleA += zMiddleB;
-    z0 += ( ( (uint64_t) ( zMiddleA < zMiddleB ) )<<32 ) + ( zMiddleA>>32 );
-    zMiddleA <<= 32;
-    z1 += zMiddleA;
-    z0 += ( z1 < zMiddleA );
-    *z1Ptr = z1;
-    *z0Ptr = z0;
-
+    mulu64(z1Ptr, z0Ptr, a, b);
 }
 
 /*----------------------------------------------------------------------------
-- 
2.25.1




* [PATCH 2/8] softfloat: Use int128.h for some operations
  2020-09-24  1:24 [PATCH 0/8] softfloat: Implement float128_muladd Richard Henderson
  2020-09-24  1:24 ` [PATCH 1/8] softfloat: Use mulu64 for mul64To128 Richard Henderson
@ 2020-09-24  1:24 ` Richard Henderson
  2020-09-24  7:35   ` David Hildenbrand
  2020-09-24  1:24 ` [PATCH 3/8] softfloat: Tidy a * b + inf return Richard Henderson
                   ` (6 subsequent siblings)
  8 siblings, 1 reply; 18+ messages in thread
From: Richard Henderson @ 2020-09-24  1:24 UTC (permalink / raw)
  To: qemu-devel; +Cc: bharata, alex.bennee, david

Use our Int128, which wraps the compiler's __int128_t,
instead of open-coding left shifts and arithmetic.
We'd need to extend Int128 to have unsigned operations
to replace more than these three.
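
As a reference point, here is what the add128 case boils down to when Int128
is backed by the compiler's __int128 (a sketch only; as in the diff below,
the softfloat interface passes the high halves as a0/b0 and the low halves
as a1/b1):

    #include <stdint.h>

    /* Illustrative only: z = a + b on 128-bit values split into
     * 64-bit halves, high half first. */
    static void add128_ref(uint64_t a0, uint64_t a1, uint64_t b0, uint64_t b1,
                           uint64_t *z0, uint64_t *z1)
    {
        unsigned __int128 a = ((unsigned __int128)a0 << 64) | a1;
        unsigned __int128 b = ((unsigned __int128)b0 << 64) | b1;
        unsigned __int128 z = a + b;

        *z0 = (uint64_t)(z >> 64);   /* high half */
        *z1 = (uint64_t)z;           /* low half */
    }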

Signed-off-by: Richard Henderson <richard.henderson@linaro.org>
---
 include/fpu/softfloat-macros.h | 39 +++++++++++++++++-----------------
 1 file changed, 20 insertions(+), 19 deletions(-)

diff --git a/include/fpu/softfloat-macros.h b/include/fpu/softfloat-macros.h
index 57845f8af0..95d88d05b8 100644
--- a/include/fpu/softfloat-macros.h
+++ b/include/fpu/softfloat-macros.h
@@ -84,6 +84,7 @@ this code that are retained.
 
 #include "fpu/softfloat-types.h"
 #include "qemu/host-utils.h"
+#include "qemu/int128.h"
 
 /*----------------------------------------------------------------------------
 | Shifts `a' right by the number of bits given in `count'.  If any nonzero
@@ -352,13 +353,11 @@ static inline void shortShift128Left(uint64_t a0, uint64_t a1, int count,
 static inline void shift128Left(uint64_t a0, uint64_t a1, int count,
                                 uint64_t *z0Ptr, uint64_t *z1Ptr)
 {
-    if (count < 64) {
-        *z1Ptr = a1 << count;
-        *z0Ptr = count == 0 ? a0 : (a0 << count) | (a1 >> (-count & 63));
-    } else {
-        *z1Ptr = 0;
-        *z0Ptr = a1 << (count - 64);
-    }
+    Int128 a = int128_make128(a1, a0);
+    Int128 z = int128_lshift(a, count);
+
+    *z0Ptr = int128_gethi(z);
+    *z1Ptr = int128_getlo(z);
 }
 
 /*----------------------------------------------------------------------------
@@ -405,15 +404,15 @@ static inline void
 *----------------------------------------------------------------------------*/
 
 static inline void
- add128(
-     uint64_t a0, uint64_t a1, uint64_t b0, uint64_t b1, uint64_t *z0Ptr, uint64_t *z1Ptr )
+add128(uint64_t a0, uint64_t a1, uint64_t b0, uint64_t b1,
+       uint64_t *z0Ptr, uint64_t *z1Ptr)
 {
-    uint64_t z1;
-
-    z1 = a1 + b1;
-    *z1Ptr = z1;
-    *z0Ptr = a0 + b0 + ( z1 < a1 );
+    Int128 a = int128_make128(a1, a0);
+    Int128 b = int128_make128(b1, b0);
+    Int128 z = int128_add(a, b);
 
+    *z0Ptr = int128_gethi(z);
+    *z1Ptr = int128_getlo(z);
 }
 
 /*----------------------------------------------------------------------------
@@ -463,13 +462,15 @@ static inline void
 *----------------------------------------------------------------------------*/
 
 static inline void
- sub128(
-     uint64_t a0, uint64_t a1, uint64_t b0, uint64_t b1, uint64_t *z0Ptr, uint64_t *z1Ptr )
+sub128(uint64_t a0, uint64_t a1, uint64_t b0, uint64_t b1,
+       uint64_t *z0Ptr, uint64_t *z1Ptr)
 {
+    Int128 a = int128_make128(a1, a0);
+    Int128 b = int128_make128(b1, b0);
+    Int128 z = int128_sub(a, b);
 
-    *z1Ptr = a1 - b1;
-    *z0Ptr = a0 - b0 - ( a1 < b1 );
-
+    *z0Ptr = int128_gethi(z);
+    *z1Ptr = int128_getlo(z);
 }
 
 /*----------------------------------------------------------------------------
-- 
2.25.1




* [PATCH 3/8] softfloat: Tidy a * b + inf return
  2020-09-24  1:24 [PATCH 0/8] softfloat: Implement float128_muladd Richard Henderson
  2020-09-24  1:24 ` [PATCH 1/8] softfloat: Use mulu64 for mul64To128 Richard Henderson
  2020-09-24  1:24 ` [PATCH 2/8] softfloat: Use int128.h for some operations Richard Henderson
@ 2020-09-24  1:24 ` Richard Henderson
  2020-09-24  7:37   ` David Hildenbrand
  2020-09-24  1:24 ` [PATCH 4/8] softfloat: Add float_cmask and constants Richard Henderson
                   ` (5 subsequent siblings)
  8 siblings, 1 reply; 18+ messages in thread
From: Richard Henderson @ 2020-09-24  1:24 UTC (permalink / raw)
  To: qemu-devel; +Cc: bharata, alex.bennee, david

There is no reason to set values in 'a' when we already have
float_class_inf in 'c' and can flip that sign.

Signed-off-by: Richard Henderson <richard.henderson@linaro.org>
---
 fpu/softfloat.c | 5 ++---
 1 file changed, 2 insertions(+), 3 deletions(-)

diff --git a/fpu/softfloat.c b/fpu/softfloat.c
index 67cfa0fd82..9db55d2b11 100644
--- a/fpu/softfloat.c
+++ b/fpu/softfloat.c
@@ -1380,9 +1380,8 @@ static FloatParts muladd_floats(FloatParts a, FloatParts b, FloatParts c,
             s->float_exception_flags |= float_flag_invalid;
             return parts_default_nan(s);
         } else {
-            a.cls = float_class_inf;
-            a.sign = c.sign ^ sign_flip;
-            return a;
+            c.sign ^= sign_flip;
+            return c;
         }
     }
 
-- 
2.25.1




* [PATCH 4/8] softfloat: Add float_cmask and constants
  2020-09-24  1:24 [PATCH 0/8] softfloat: Implement float128_muladd Richard Henderson
                   ` (2 preceding siblings ...)
  2020-09-24  1:24 ` [PATCH 3/8] softfloat: Tidy a * b + inf return Richard Henderson
@ 2020-09-24  1:24 ` Richard Henderson
  2020-09-24  7:40   ` David Hildenbrand
  2020-09-24  1:24 ` [PATCH 5/8] softfloat: Inline pick_nan_muladd into its caller Richard Henderson
                   ` (4 subsequent siblings)
  8 siblings, 1 reply; 18+ messages in thread
From: Richard Henderson @ 2020-09-24  1:24 UTC (permalink / raw)
  To: qemu-devel; +Cc: bharata, alex.bennee, david

Testing more than one class at a time is better done with masks.
This reduces the static branch count.
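
The idea in isolation, as a sketch (the names here are made up, not the ones
the patch adds): once each class maps to one bit, several per-operand class
checks collapse into a single mask test.

    #include <stdbool.h>

    enum { CLS_ZERO, CLS_NORMAL, CLS_INF, CLS_QNAN, CLS_SNAN };

    #define CMASK(c)      (1u << (c))
    #define CMASK_ANYNAN  (CMASK(CLS_QNAN) | CMASK(CLS_SNAN))

    /* One test instead of "a_cls == CLS_INF || b_cls == CLS_INF". */
    static bool either_inf(int a_cls, int b_cls)
    {
        return (CMASK(a_cls) | CMASK(b_cls)) & CMASK(CLS_INF);
    }

    /* One test instead of three separate is_nan() checks. */
    static bool any_nan(int a_cls, int b_cls, int c_cls)
    {
        return (CMASK(a_cls) | CMASK(b_cls) | CMASK(c_cls)) & CMASK_ANYNAN;
    }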

Signed-off-by: Richard Henderson <richard.henderson@linaro.org>
---
 fpu/softfloat.c | 31 ++++++++++++++++++++++++-------
 1 file changed, 24 insertions(+), 7 deletions(-)

diff --git a/fpu/softfloat.c b/fpu/softfloat.c
index 9db55d2b11..3e625c47cd 100644
--- a/fpu/softfloat.c
+++ b/fpu/softfloat.c
@@ -469,6 +469,20 @@ typedef enum __attribute__ ((__packed__)) {
     float_class_snan,
 } FloatClass;
 
+#define float_cmask(bit)  (1u << (bit))
+
+enum {
+    float_cmask_zero    = float_cmask(float_class_zero),
+    float_cmask_normal  = float_cmask(float_class_normal),
+    float_cmask_inf     = float_cmask(float_class_inf),
+    float_cmask_qnan    = float_cmask(float_class_qnan),
+    float_cmask_snan    = float_cmask(float_class_snan),
+
+    float_cmask_infzero = float_cmask_zero | float_cmask_inf,
+    float_cmask_anynan  = float_cmask_qnan | float_cmask_snan,
+};
+
+
 /* Simple helpers for checking if, or what kind of, NaN we have */
 static inline __attribute__((unused)) bool is_nan(FloatClass c)
 {
@@ -1335,24 +1349,27 @@ bfloat16 QEMU_FLATTEN bfloat16_mul(bfloat16 a, bfloat16 b, float_status *status)
 static FloatParts muladd_floats(FloatParts a, FloatParts b, FloatParts c,
                                 int flags, float_status *s)
 {
-    bool inf_zero = ((1 << a.cls) | (1 << b.cls)) ==
-                    ((1 << float_class_inf) | (1 << float_class_zero));
-    bool p_sign;
+    bool inf_zero, p_sign;
     bool sign_flip = flags & float_muladd_negate_result;
     FloatClass p_class;
     uint64_t hi, lo;
     int p_exp;
+    int ab_mask, abc_mask;
+
+    ab_mask = float_cmask(a.cls) | float_cmask(b.cls);
+    abc_mask = float_cmask(c.cls) | ab_mask;
+    inf_zero = ab_mask == float_cmask_infzero;
 
     /* It is implementation-defined whether the cases of (0,inf,qnan)
      * and (inf,0,qnan) raise InvalidOperation or not (and what QNaN
      * they return if they do), so we have to hand this information
      * off to the target-specific pick-a-NaN routine.
      */
-    if (is_nan(a.cls) || is_nan(b.cls) || is_nan(c.cls)) {
+    if (unlikely(abc_mask & float_cmask_anynan)) {
         return pick_nan_muladd(a, b, c, inf_zero, s);
     }
 
-    if (inf_zero) {
+    if (unlikely(inf_zero)) {
         s->float_exception_flags |= float_flag_invalid;
         return parts_default_nan(s);
     }
@@ -1367,9 +1384,9 @@ static FloatParts muladd_floats(FloatParts a, FloatParts b, FloatParts c,
         p_sign ^= 1;
     }
 
-    if (a.cls == float_class_inf || b.cls == float_class_inf) {
+    if (ab_mask & float_cmask_inf) {
         p_class = float_class_inf;
-    } else if (a.cls == float_class_zero || b.cls == float_class_zero) {
+    } else if (ab_mask & float_cmask_zero) {
         p_class = float_class_zero;
     } else {
         p_class = float_class_normal;
-- 
2.25.1




* [PATCH 5/8] softfloat: Inline pick_nan_muladd into its caller
  2020-09-24  1:24 [PATCH 0/8] softfloat: Implement float128_muladd Richard Henderson
                   ` (3 preceding siblings ...)
  2020-09-24  1:24 ` [PATCH 4/8] softfloat: Add float_cmask and constants Richard Henderson
@ 2020-09-24  1:24 ` Richard Henderson
  2020-09-24  7:42   ` David Hildenbrand
  2020-09-24  1:24 ` [PATCH 6/8] softfloat: Implement float128_muladd Richard Henderson
                   ` (3 subsequent siblings)
  8 siblings, 1 reply; 18+ messages in thread
From: Richard Henderson @ 2020-09-24  1:24 UTC (permalink / raw)
  To: qemu-devel; +Cc: bharata, alex.bennee, david

Because of FloatParts, there will only ever be one caller.
Inlining allows us to re-use abc_mask for the snan test.

Signed-off-by: Richard Henderson <richard.henderson@linaro.org>
---
 fpu/softfloat.c | 75 +++++++++++++++++++++++--------------------------
 1 file changed, 35 insertions(+), 40 deletions(-)

diff --git a/fpu/softfloat.c b/fpu/softfloat.c
index 3e625c47cd..e038434a07 100644
--- a/fpu/softfloat.c
+++ b/fpu/softfloat.c
@@ -929,45 +929,6 @@ static FloatParts pick_nan(FloatParts a, FloatParts b, float_status *s)
     return a;
 }
 
-static FloatParts pick_nan_muladd(FloatParts a, FloatParts b, FloatParts c,
-                                  bool inf_zero, float_status *s)
-{
-    int which;
-
-    if (is_snan(a.cls) || is_snan(b.cls) || is_snan(c.cls)) {
-        s->float_exception_flags |= float_flag_invalid;
-    }
-
-    which = pickNaNMulAdd(a.cls, b.cls, c.cls, inf_zero, s);
-
-    if (s->default_nan_mode) {
-        /* Note that this check is after pickNaNMulAdd so that function
-         * has an opportunity to set the Invalid flag.
-         */
-        which = 3;
-    }
-
-    switch (which) {
-    case 0:
-        break;
-    case 1:
-        a = b;
-        break;
-    case 2:
-        a = c;
-        break;
-    case 3:
-        return parts_default_nan(s);
-    default:
-        g_assert_not_reached();
-    }
-
-    if (is_snan(a.cls)) {
-        return parts_silence_nan(a, s);
-    }
-    return a;
-}
-
 /*
  * Returns the result of adding or subtracting the values of the
  * floating-point values `a' and `b'. The operation is performed
@@ -1366,7 +1327,41 @@ static FloatParts muladd_floats(FloatParts a, FloatParts b, FloatParts c,
      * off to the target-specific pick-a-NaN routine.
      */
     if (unlikely(abc_mask & float_cmask_anynan)) {
-        return pick_nan_muladd(a, b, c, inf_zero, s);
+        int which;
+
+        if (unlikely(abc_mask & float_cmask_snan)) {
+            float_raise(float_flag_invalid, s);
+        }
+
+        which = pickNaNMulAdd(a.cls, b.cls, c.cls, inf_zero, s);
+
+        if (s->default_nan_mode) {
+            /*
+             * Note that this check is after pickNaNMulAdd so that function
+             * has an opportunity to set the Invalid flag for inf_zero.
+             */
+            which = 3;
+        }
+
+        switch (which) {
+        case 0:
+            break;
+        case 1:
+            a = b;
+            break;
+        case 2:
+            a = c;
+            break;
+        case 3:
+            return parts_default_nan(s);
+        default:
+            g_assert_not_reached();
+        }
+
+        if (is_snan(a.cls)) {
+            return parts_silence_nan(a, s);
+        }
+        return a;
     }
 
     if (unlikely(inf_zero)) {
-- 
2.25.1




* [PATCH 6/8] softfloat: Implement float128_muladd
  2020-09-24  1:24 [PATCH 0/8] softfloat: Implement float128_muladd Richard Henderson
                   ` (4 preceding siblings ...)
  2020-09-24  1:24 ` [PATCH 5/8] softfloat: Inline pick_nan_muladd into its caller Richard Henderson
@ 2020-09-24  1:24 ` Richard Henderson
  2020-09-24  7:56   ` David Hildenbrand
  2020-09-24  1:24 ` [PATCH 7/8] softfloat: Use x86_64 assembly for {add,sub}{192,256} Richard Henderson
                   ` (2 subsequent siblings)
  8 siblings, 1 reply; 18+ messages in thread
From: Richard Henderson @ 2020-09-24  1:24 UTC (permalink / raw)
  To: qemu-devel; +Cc: bharata, alex.bennee, david

Signed-off-by: Richard Henderson <richard.henderson@linaro.org>
---
 include/fpu/softfloat.h |   2 +
 fpu/softfloat.c         | 356 +++++++++++++++++++++++++++++++++++++++-
 tests/fp/fp-test.c      |   2 +-
 tests/fp/wrap.c.inc     |  12 ++
 4 files changed, 370 insertions(+), 2 deletions(-)

diff --git a/include/fpu/softfloat.h b/include/fpu/softfloat.h
index 78ad5ca738..a38433deb4 100644
--- a/include/fpu/softfloat.h
+++ b/include/fpu/softfloat.h
@@ -1196,6 +1196,8 @@ float128 float128_sub(float128, float128, float_status *status);
 float128 float128_mul(float128, float128, float_status *status);
 float128 float128_div(float128, float128, float_status *status);
 float128 float128_rem(float128, float128, float_status *status);
+float128 float128_muladd(float128, float128, float128, int,
+                         float_status *status);
 float128 float128_sqrt(float128, float_status *status);
 FloatRelation float128_compare(float128, float128, float_status *status);
 FloatRelation float128_compare_quiet(float128, float128, float_status *status);
diff --git a/fpu/softfloat.c b/fpu/softfloat.c
index e038434a07..5b714fbd82 100644
--- a/fpu/softfloat.c
+++ b/fpu/softfloat.c
@@ -512,11 +512,19 @@ static inline __attribute__((unused)) bool is_qnan(FloatClass c)
 
 typedef struct {
     uint64_t frac;
-    int32_t  exp;
+    int32_t exp;
     FloatClass cls;
     bool sign;
 } FloatParts;
 
+/* Similar for float128.  */
+typedef struct {
+    uint64_t frac0, frac1;
+    int32_t exp;
+    FloatClass cls;
+    bool sign;
+} FloatParts128;
+
 #define DECOMPOSED_BINARY_POINT    (64 - 2)
 #define DECOMPOSED_IMPLICIT_BIT    (1ull << DECOMPOSED_BINARY_POINT)
 #define DECOMPOSED_OVERFLOW_BIT    (DECOMPOSED_IMPLICIT_BIT << 1)
@@ -4574,6 +4582,46 @@ static void
 
 }
 
+/*----------------------------------------------------------------------------
+| Returns the parts of floating-point value `a'.
+*----------------------------------------------------------------------------*/
+
+static void float128_unpack(FloatParts128 *p, float128 a, float_status *status)
+{
+    p->sign = extractFloat128Sign(a);
+    p->exp = extractFloat128Exp(a);
+    p->frac0 = extractFloat128Frac0(a);
+    p->frac1 = extractFloat128Frac1(a);
+
+    if (p->exp == 0) {
+        if ((p->frac0 | p->frac1) == 0) {
+            p->cls = float_class_zero;
+        } else if (status->flush_inputs_to_zero) {
+            float_raise(float_flag_input_denormal, status);
+            p->cls = float_class_zero;
+            p->frac0 = p->frac1 = 0;
+        } else {
+            normalizeFloat128Subnormal(p->frac0, p->frac1, &p->exp,
+                                       &p->frac0, &p->frac1);
+            p->exp -= 0x3fff;
+            p->cls = float_class_normal;
+        }
+    } else if (p->exp == 0x7fff) {
+        if ((p->frac0 | p->frac1) == 0) {
+            p->cls = float_class_inf;
+        } else if (float128_is_signaling_nan(a, status)) {
+            p->cls = float_class_snan;
+        } else {
+            p->cls = float_class_qnan;
+        }
+    } else {
+        /* Add the implicit bit. */
+        p->frac0 |= UINT64_C(0x0001000000000000);
+        p->exp -= 0x3fff;
+        p->cls = float_class_normal;
+    }
+}
+
 /*----------------------------------------------------------------------------
 | Packs the sign `zSign', the exponent `zExp', and the significand formed
 | by the concatenation of `zSig0' and `zSig1' into a quadruple-precision
@@ -7205,6 +7253,312 @@ float128 float128_mul(float128 a, float128 b, float_status *status)
 
 }
 
+static void shortShift256Left(uint64_t p[4], unsigned count)
+{
+    int negcount = -count & 63;
+
+    if (count == 0) {
+        return;
+    }
+    g_assert(count < 64);
+    p[0] = (p[0] << count) | (p[1] >> negcount);
+    p[1] = (p[1] << count) | (p[2] >> negcount);
+    p[2] = (p[2] << count) | (p[3] >> negcount);
+    p[3] = (p[3] << count);
+}
+
+static void shift256RightJamming(uint64_t p[4], int count)
+{
+    uint64_t in = 0;
+
+    g_assert(count >= 0);
+
+    count = MIN(count, 256);
+    for (; count >= 64; count -= 64) {
+        in |= p[3];
+        p[3] = p[2];
+        p[2] = p[1];
+        p[1] = p[0];
+        p[0] = 0;
+    }
+
+    if (count) {
+        int negcount = -count & 63;
+
+        in |= p[3] << negcount;
+        p[3] = (p[2] << negcount) | (p[3] >> count);
+        p[2] = (p[1] << negcount) | (p[2] >> count);
+        p[1] = (p[0] << negcount) | (p[1] >> count);
+        p[0] = p[0] >> count;
+    }
+    p[3] |= (in != 0);
+}
+
+/* R = A - B */
+static void sub256(uint64_t r[4], uint64_t a[4], uint64_t b[4])
+{
+    bool borrow = false;
+
+    for (int i = 3; i >= 0; --i) {
+        if (borrow) {
+            borrow = a[i] <= b[i];
+            r[i] = a[i] - b[i] - 1;
+        } else {
+            borrow = a[i] < b[i];
+            r[i] = a[i] - b[i];
+        }
+    }
+}
+
+/* A = -A */
+static void neg256(uint64_t a[4])
+{
+    a[3] = -a[3];
+    if (likely(a[3])) {
+        goto not2;
+    }
+    a[2] = -a[2];
+    if (likely(a[2])) {
+        goto not1;
+    }
+    a[1] = -a[1];
+    if (likely(a[1])) {
+        goto not0;
+    }
+    a[0] = -a[0];
+    return;
+ not2:
+    a[2] = ~a[2];
+ not1:
+    a[1] = ~a[1];
+ not0:
+    a[0] = ~a[0];
+}
+
+/* A += B */
+static void add256(uint64_t a[4], uint64_t b[4])
+{
+    bool carry = false;
+
+    for (int i = 3; i >= 0; --i) {
+        uint64_t t = a[i] + b[i];
+        if (carry) {
+            t += 1;
+            carry = t <= a[i];
+        } else {
+            carry = t < a[i];
+        }
+        a[i] = t;
+    }
+}
+
+float128 float128_muladd(float128 a_f, float128 b_f, float128 c_f,
+                         int flags, float_status *status)
+{
+    bool inf_zero, p_sign, sign_flip;
+    uint64_t p_frac[4];
+    FloatParts128 a, b, c;
+    int p_exp, exp_diff, shift, ab_mask, abc_mask;
+    FloatClass p_cls;
+
+    float128_unpack(&a, a_f, status);
+    float128_unpack(&b, b_f, status);
+    float128_unpack(&c, c_f, status);
+
+    ab_mask = float_cmask(a.cls) | float_cmask(b.cls);
+    abc_mask = float_cmask(c.cls) | ab_mask;
+    inf_zero = ab_mask == float_cmask_infzero;
+
+    /* If any input is a NaN, select the required result. */
+    if (unlikely(abc_mask & float_cmask_anynan)) {
+        if (unlikely(abc_mask & float_cmask_snan)) {
+            float_raise(float_flag_invalid, status);
+        }
+
+        int which = pickNaNMulAdd(a.cls, b.cls, c.cls, inf_zero, status);
+        if (status->default_nan_mode) {
+            which = 3;
+        }
+        switch (which) {
+        case 0:
+            break;
+        case 1:
+            a_f = b_f;
+            a.cls = b.cls;
+            break;
+        case 2:
+            a_f = c_f;
+            a.cls = c.cls;
+            break;
+        case 3:
+            return float128_default_nan(status);
+        }
+        if (is_snan(a.cls)) {
+            return float128_silence_nan(a_f, status);
+        }
+        return a_f;
+    }
+
+    /* After dealing with input NaNs, look for Inf * Zero. */
+    if (unlikely(inf_zero)) {
+        float_raise(float_flag_invalid, status);
+        return float128_default_nan(status);
+    }
+
+    p_sign = a.sign ^ b.sign;
+
+    if (flags & float_muladd_negate_c) {
+        c.sign ^= 1;
+    }
+    if (flags & float_muladd_negate_product) {
+        p_sign ^= 1;
+    }
+    sign_flip = (flags & float_muladd_negate_result);
+
+    if (ab_mask & float_cmask_inf) {
+        p_cls = float_class_inf;
+    } else if (ab_mask & float_cmask_zero) {
+        p_cls = float_class_zero;
+    } else {
+        p_cls = float_class_normal;
+    }
+
+    if (c.cls == float_class_inf) {
+        if (p_cls == float_class_inf && p_sign != c.sign) {
+            /* +Inf + -Inf = NaN */
+            float_raise(float_flag_invalid, status);
+            return float128_default_nan(status);
+        }
+        /* Inf + Inf = Inf of the proper sign; reuse the return below. */
+        p_cls = float_class_inf;
+        p_sign = c.sign;
+    }
+
+    if (p_cls == float_class_inf) {
+        return packFloat128(p_sign ^ sign_flip, 0x7fff, 0, 0);
+    }
+
+    if (p_cls == float_class_zero) {
+        if (c.cls == float_class_zero) {
+            if (p_sign != c.sign) {
+                p_sign = status->float_rounding_mode == float_round_down;
+            }
+            return packFloat128(p_sign ^ sign_flip, 0, 0, 0);
+        }
+
+        if (flags & float_muladd_halve_result) {
+            c.exp -= 1;
+        }
+        return roundAndPackFloat128(c.sign ^ sign_flip,
+                                    c.exp + 0x3fff - 1,
+                                    c.frac0, c.frac1, 0, status);
+    }
+
+    /* a & b should be normals now... */
+    assert(a.cls == float_class_normal && b.cls == float_class_normal);
+
+    /* Multiply of 2 113-bit numbers produces a 226-bit result.  */
+    mul128To256(a.frac0, a.frac1, b.frac0, b.frac1,
+                &p_frac[0], &p_frac[1], &p_frac[2], &p_frac[3]);
+
+    /* Realign the binary point at bit 48 of p_frac[0].  */
+    shift = clz64(p_frac[0]) - 15;
+    g_assert(shift == 15 || shift == 16);
+    shortShift256Left(p_frac, shift);
+    p_exp = a.exp + b.exp - (shift - 16);
+    exp_diff = p_exp - c.exp;
+
+    uint64_t c_frac[4] = { c.frac0, c.frac1, 0, 0 };
+
+    /* Add or subtract C from the intermediate product. */
+    if (c.cls == float_class_zero) {
+        /* Fall through to rounding after addition (with zero). */
+    } else if (p_sign != c.sign) {
+        /* Subtraction */
+        if (exp_diff < 0) {
+            shift256RightJamming(p_frac, -exp_diff);
+            sub256(p_frac, c_frac, p_frac);
+            p_exp = c.exp;
+            p_sign ^= 1;
+        } else if (exp_diff > 0) {
+            shift256RightJamming(c_frac, exp_diff);
+            sub256(p_frac, p_frac, c_frac);
+        } else {
+            /* Low 128 bits of C are known to be zero. */
+            sub128(p_frac[0], p_frac[1], c_frac[0], c_frac[1],
+                   &p_frac[0], &p_frac[1]);
+            /*
+             * Since we have normalized to bit 48 of p_frac[0],
+             * a negative result means C > P and we need to invert.
+             */
+            if ((int64_t)p_frac[0] < 0) {
+                neg256(p_frac);
+                p_sign ^= 1;
+            }
+        }
+
+        /*
+         * Gross normalization of the 256-bit subtraction result.
+         * Fine tuning below shared with addition.
+         */
+        if (p_frac[0] != 0) {
+            /* nothing to do */
+        } else if (p_frac[1] != 0) {
+            p_exp -= 64;
+            p_frac[0] = p_frac[1];
+            p_frac[1] = p_frac[2];
+            p_frac[2] = p_frac[3];
+            p_frac[3] = 0;
+        } else if (p_frac[2] != 0) {
+            p_exp -= 128;
+            p_frac[0] = p_frac[2];
+            p_frac[1] = p_frac[3];
+            p_frac[2] = 0;
+            p_frac[3] = 0;
+        } else if (p_frac[3] != 0) {
+            p_exp -= 192;
+            p_frac[0] = p_frac[3];
+            p_frac[1] = 0;
+            p_frac[2] = 0;
+            p_frac[3] = 0;
+        } else {
+            /* Subtraction was exact: result is zero. */
+            p_sign = status->float_rounding_mode == float_round_down;
+            return packFloat128(p_sign ^ sign_flip, 0, 0, 0);
+        }
+    } else {
+        /* Addition */
+        if (exp_diff <= 0) {
+            shift256RightJamming(p_frac, -exp_diff);
+            /* Low 128 bits of C are known to be zero. */
+            add128(p_frac[0], p_frac[1], c_frac[0], c_frac[1],
+                   &p_frac[0], &p_frac[1]);
+            p_exp = c.exp;
+        } else {
+            shift256RightJamming(c_frac, exp_diff);
+            add256(p_frac, c_frac);
+        }
+    }
+
+    /* Fine normalization of the 256-bit result: p_frac[0] != 0. */
+    shift = clz64(p_frac[0]) - 15;
+    if (shift < 0) {
+        shift256RightJamming(p_frac, -shift);
+    } else if (shift > 0) {
+        shortShift256Left(p_frac, shift);
+    }
+    p_exp -= shift;
+
+    if (flags & float_muladd_halve_result) {
+        p_exp -= 1;
+    }
+    return roundAndPackFloat128(p_sign ^ sign_flip,
+                                p_exp + 0x3fff - 1,
+                                p_frac[0], p_frac[1],
+                                p_frac[2] | (p_frac[3] != 0),
+                                status);
+}
+
 /*----------------------------------------------------------------------------
 | Returns the result of dividing the quadruple-precision floating-point value
 | `a' by the corresponding value `b'.  The operation is performed according to
diff --git a/tests/fp/fp-test.c b/tests/fp/fp-test.c
index 06ffebd6db..9bbb0dba67 100644
--- a/tests/fp/fp-test.c
+++ b/tests/fp/fp-test.c
@@ -717,7 +717,7 @@ static void do_testfloat(int op, int rmode, bool exact)
         test_abz_f128(true_abz_f128M, subj_abz_f128M);
         break;
     case F128_MULADD:
-        not_implemented();
+        test_abcz_f128(slow_f128M_mulAdd, qemu_f128_mulAdd);
         break;
     case F128_SQRT:
         test_az_f128(slow_f128M_sqrt, qemu_f128M_sqrt);
diff --git a/tests/fp/wrap.c.inc b/tests/fp/wrap.c.inc
index 0cbd20013e..65a713deae 100644
--- a/tests/fp/wrap.c.inc
+++ b/tests/fp/wrap.c.inc
@@ -574,6 +574,18 @@ WRAP_MULADD(qemu_f32_mulAdd, float32_muladd, float32)
 WRAP_MULADD(qemu_f64_mulAdd, float64_muladd, float64)
 #undef WRAP_MULADD
 
+static void qemu_f128_mulAdd(const float128_t *ap, const float128_t *bp,
+                             const float128_t *cp, float128_t *res)
+{
+    float128 a, b, c, ret;
+
+    a = soft_to_qemu128(*ap);
+    b = soft_to_qemu128(*bp);
+    c = soft_to_qemu128(*cp);
+    ret = float128_muladd(a, b, c, 0, &qsf);
+    *res = qemu_to_soft128(ret);
+}
+
 #define WRAP_CMP16(name, func, retcond)         \
     static bool name(float16_t a, float16_t b)  \
     {                                           \
-- 
2.25.1




* [PATCH 7/8] softfloat: Use x86_64 assembly for {add,sub}{192,256}
  2020-09-24  1:24 [PATCH 0/8] softfloat: Implement float128_muladd Richard Henderson
                   ` (5 preceding siblings ...)
  2020-09-24  1:24 ` [PATCH 6/8] softfloat: Implement float128_muladd Richard Henderson
@ 2020-09-24  1:24 ` Richard Henderson
  2020-09-24  1:24 ` [PATCH 8/8] softfloat: Use aarch64 " Richard Henderson
  2020-09-24  8:00 ` [PATCH 0/8] softfloat: Implement float128_muladd David Hildenbrand
  8 siblings, 0 replies; 18+ messages in thread
From: Richard Henderson @ 2020-09-24  1:24 UTC (permalink / raw)
  To: qemu-devel; +Cc: bharata, alex.bennee, david

The compiler cannot chain more than two additions together.
Use inline assembly for 3 or 4 additions.
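
For comparison, a portable add192 written with __builtin_add_overflow (a
sketch only; it is neither the existing C fallback nor what this patch
emits).  The observation above is that the compiler will not fuse a chain
like this into a single add/adc/adc sequence, hence the inline assembly:

    #include <stdbool.h>
    #include <stdint.h>

    /* Illustrative only: z = a + b on 192-bit values, limbs passed
     * most-significant first (a0/b0/z0 are the high limbs). */
    static void add192_portable(uint64_t a0, uint64_t a1, uint64_t a2,
                                uint64_t b0, uint64_t b1, uint64_t b2,
                                uint64_t *z0, uint64_t *z1, uint64_t *z2)
    {
        uint64_t s2, s1;
        bool c2, c1a, c1b;

        c2  = __builtin_add_overflow(a2, b2, &s2);            /* low limb  */
        c1a = __builtin_add_overflow(a1, b1, &s1);            /* middle    */
        c1b = __builtin_add_overflow(s1, (uint64_t)c2, &s1);  /* carry in  */

        *z2 = s2;
        *z1 = s1;
        *z0 = a0 + b0 + (c1a | c1b);                          /* high limb */
    }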

Signed-off-by: Richard Henderson <richard.henderson@linaro.org>
---
 include/fpu/softfloat-macros.h | 18 ++++++++++++++++--
 fpu/softfloat.c                | 28 ++++++++++++++++++++++++++++
 2 files changed, 44 insertions(+), 2 deletions(-)

diff --git a/include/fpu/softfloat-macros.h b/include/fpu/softfloat-macros.h
index 95d88d05b8..99fa124e56 100644
--- a/include/fpu/softfloat-macros.h
+++ b/include/fpu/softfloat-macros.h
@@ -436,6 +436,13 @@ static inline void
      uint64_t *z2Ptr
  )
 {
+#ifdef __x86_64__
+    asm("add %5, %2\n\t"
+        "adc %4, %1\n\t"
+        "adc %3, %0"
+        : "=&r"(*z0Ptr), "=&r"(*z1Ptr), "=&r"(*z2Ptr)
+        : "rm"(b0), "rm"(b1), "rm"(b2), "0"(a0), "1"(a1), "2"(a2));
+#else
     uint64_t z0, z1, z2;
     int8_t carry0, carry1;
 
@@ -450,7 +457,7 @@ static inline void
     *z2Ptr = z2;
     *z1Ptr = z1;
     *z0Ptr = z0;
-
+#endif
 }
 
 /*----------------------------------------------------------------------------
@@ -494,6 +501,13 @@ static inline void
      uint64_t *z2Ptr
  )
 {
+#ifdef __x86_64__
+    asm("sub %5, %2\n\t"
+        "sbb %4, %1\n\t"
+        "sbb %3, %0"
+        : "=&r"(*z0Ptr), "=&r"(*z1Ptr), "=&r"(*z2Ptr)
+        : "rm"(b0), "rm"(b1), "rm"(b2), "0"(a0), "1"(a1), "2"(a2));
+#else
     uint64_t z0, z1, z2;
     int8_t borrow0, borrow1;
 
@@ -508,7 +522,7 @@ static inline void
     *z2Ptr = z2;
     *z1Ptr = z1;
     *z0Ptr = z0;
-
+#endif
 }
 
 /*----------------------------------------------------------------------------
diff --git a/fpu/softfloat.c b/fpu/softfloat.c
index 5b714fbd82..d8e5d90fd7 100644
--- a/fpu/softfloat.c
+++ b/fpu/softfloat.c
@@ -7297,6 +7297,15 @@ static void shift256RightJamming(uint64_t p[4], int count)
 /* R = A - B */
 static void sub256(uint64_t r[4], uint64_t a[4], uint64_t b[4])
 {
+#if defined(__x86_64__)
+    asm("sub %7, %3\n\t"
+        "sbb %6, %2\n\t"
+        "sbb %5, %1\n\t"
+        "sbb %4, %0"
+        : "=&r"(r[0]), "=&r"(r[1]), "=&r"(r[2]), "=&r"(r[3])
+        : "rme"(b[0]), "rme"(b[1]), "rme"(b[2]), "rme"(b[3]),
+            "0"(a[0]),   "1"(a[1]),   "2"(a[2]),   "3"(a[3]));
+#else
     bool borrow = false;
 
     for (int i = 3; i >= 0; --i) {
@@ -7308,11 +7317,20 @@ static void sub256(uint64_t r[4], uint64_t a[4], uint64_t b[4])
             r[i] = a[i] - b[i];
         }
     }
+#endif
 }
 
 /* A = -A */
 static void neg256(uint64_t a[4])
 {
+#if defined(__x86_64__)
+    asm("negq %3\n\t"
+        "sbb %6, %2\n\t"
+        "sbb %5, %1\n\t"
+        "sbb %4, %0"
+        : "=&r"(a[0]), "=&r"(a[1]), "=&r"(a[2]), "+rm"(a[3])
+        : "rme"(a[0]), "rme"(a[1]), "rme"(a[2]), "0"(0), "1"(0), "2"(0));
+#else
     a[3] = -a[3];
     if (likely(a[3])) {
         goto not2;
@@ -7333,11 +7351,20 @@ static void neg256(uint64_t a[4])
     a[1] = ~a[1];
  not0:
     a[0] = ~a[0];
+#endif
 }
 
 /* A += B */
 static void add256(uint64_t a[4], uint64_t b[4])
 {
+#if defined(__x86_64__)
+    asm("add %7, %3\n\t"
+        "adc %6, %2\n\t"
+        "adc %5, %1\n\t"
+        "adc %4, %0"
+        :  "+r"(a[0]),  "+r"(a[1]),  "+r"(a[2]),  "+r"(a[3])
+        : "rme"(b[0]), "rme"(b[1]), "rme"(b[2]), "rme"(b[3]));
+#else
     bool carry = false;
 
     for (int i = 3; i >= 0; --i) {
@@ -7350,6 +7377,7 @@ static void add256(uint64_t a[4], uint64_t b[4])
         }
         a[i] = t;
     }
+#endif
 }
 
 float128 float128_muladd(float128 a_f, float128 b_f, float128 c_f,
-- 
2.25.1




* [PATCH 8/8] softfloat: Use aarch64 assembly for {add,sub}{192,256}
  2020-09-24  1:24 [PATCH 0/8] softfloat: Implement float128_muladd Richard Henderson
                   ` (6 preceding siblings ...)
  2020-09-24  1:24 ` [PATCH 7/8] softfloat: Use x86_64 assembly for {add,sub}{192,256} Richard Henderson
@ 2020-09-24  1:24 ` Richard Henderson
  2020-09-24  8:00 ` [PATCH 0/8] softfloat: Implement float128_muladd David Hildenbrand
  8 siblings, 0 replies; 18+ messages in thread
From: Richard Henderson @ 2020-09-24  1:24 UTC (permalink / raw)
  To: qemu-devel; +Cc: bharata, alex.bennee, david

The compiler cannot chain more than two additions together.
Use inline assembly for 3 or 4 additions.

Signed-off-by: Richard Henderson <richard.henderson@linaro.org>
---
 include/fpu/softfloat-macros.h | 14 ++++++++++++++
 fpu/softfloat.c                | 25 +++++++++++++++++++++++++
 2 files changed, 39 insertions(+)

diff --git a/include/fpu/softfloat-macros.h b/include/fpu/softfloat-macros.h
index 99fa124e56..969a486fd2 100644
--- a/include/fpu/softfloat-macros.h
+++ b/include/fpu/softfloat-macros.h
@@ -442,6 +442,13 @@ static inline void
         "adc %3, %0"
         : "=&r"(*z0Ptr), "=&r"(*z1Ptr), "=&r"(*z2Ptr)
         : "rm"(b0), "rm"(b1), "rm"(b2), "0"(a0), "1"(a1), "2"(a2));
+#elif defined(__aarch64__)
+    asm("adds %2, %x5, %x8\n\t"
+        "adcs %1, %x4, %x7\n\t"
+        "adc  %0, %x3, %x6"
+        : "=&r"(*z0Ptr), "=&r"(*z1Ptr), "=&r"(*z2Ptr)
+        : "rZ"(a0), "rZ"(a1), "rZ"(a2), "rZ"(b0), "rZ"(b1), "rZ"(b2)
+        : "cc");
 #else
     uint64_t z0, z1, z2;
     int8_t carry0, carry1;
@@ -507,6 +514,13 @@ static inline void
         "sbb %3, %0"
         : "=&r"(*z0Ptr), "=&r"(*z1Ptr), "=&r"(*z2Ptr)
         : "rm"(b0), "rm"(b1), "rm"(b2), "0"(a0), "1"(a1), "2"(a2));
+#elif defined(__aarch64__)
+    asm("subs %2, %x5, %x8\n\t"
+        "sbcs %1, %x4, %x7\n\t"
+        "sbc  %0, %x3, %x6"
+        : "=&r"(*z0Ptr), "=&r"(*z1Ptr), "=&r"(*z2Ptr)
+        : "rZ"(a0), "rZ"(a1), "rZ"(a2), "rZ"(b0), "rZ"(b1), "rZ"(b2)
+        : "cc");
 #else
     uint64_t z0, z1, z2;
     int8_t borrow0, borrow1;
diff --git a/fpu/softfloat.c b/fpu/softfloat.c
index d8e5d90fd7..1601095d60 100644
--- a/fpu/softfloat.c
+++ b/fpu/softfloat.c
@@ -7305,6 +7305,16 @@ static void sub256(uint64_t r[4], uint64_t a[4], uint64_t b[4])
         : "=&r"(r[0]), "=&r"(r[1]), "=&r"(r[2]), "=&r"(r[3])
         : "rme"(b[0]), "rme"(b[1]), "rme"(b[2]), "rme"(b[3]),
             "0"(a[0]),   "1"(a[1]),   "2"(a[2]),   "3"(a[3]));
+#elif defined(__aarch64__)
+    asm("subs %[r3], %x[a3], %x[b3]\n\t"
+        "sbcs %[r2], %x[a2], %x[b2]\n\t"
+        "sbcs %[r1], %x[a1], %x[b1]\n\t"
+        "sbc  %[r0], %x[a0], %x[b0]"
+        : [r0] "=&r"(r[0]), [r1] "=&r"(r[1]),
+          [r2] "=&r"(r[2]), [r3] "=&r"(r[3])
+        : [a0] "rZ"(a[0]), [a1] "rZ"(a[1]), [a2] "rZ"(a[2]), [a3] "rZ"(a[3]),
+          [b0] "rZ"(b[0]), [b1] "rZ"(b[1]), [b2] "rZ"(b[2]), [b3] "rZ"(b[3])
+        : "cc");
 #else
     bool borrow = false;
 
@@ -7330,6 +7340,13 @@ static void neg256(uint64_t a[4])
         "sbb %4, %0"
         : "=&r"(a[0]), "=&r"(a[1]), "=&r"(a[2]), "+rm"(a[3])
         : "rme"(a[0]), "rme"(a[1]), "rme"(a[2]), "0"(0), "1"(0), "2"(0));
+#elif defined(__aarch64__)
+    asm("negs %3, %3\n\t"
+        "ngcs %2, %2\n\t"
+        "ngcs %1, %1\n\t"
+        "ngc  %0, %0"
+        : "+r"(a[0]), "+r"(a[1]), "+r"(a[2]), "+r"(a[3])
+        : : "cc");
 #else
     a[3] = -a[3];
     if (likely(a[3])) {
@@ -7364,6 +7381,14 @@ static void add256(uint64_t a[4], uint64_t b[4])
         "adc %4, %0"
         :  "+r"(a[0]),  "+r"(a[1]),  "+r"(a[2]),  "+r"(a[3])
         : "rme"(b[0]), "rme"(b[1]), "rme"(b[2]), "rme"(b[3]));
+#elif defined(__aarch64__)
+    asm("adds %3, %3, %x7\n\t"
+        "adcs %2, %2, %x6\n\t"
+        "adcs %1, %1, %x5\n\t"
+        "adc  %0, %0, %x4"
+        : "+r"(a[0]), "+r"(a[1]), "+r"(a[2]), "+r"(a[3])
+        : "rZ"(b[0]), "rZ"(b[1]), "rZ"(b[2]), "rZ"(b[3])
+        : "cc");
 #else
     bool carry = false;
 
-- 
2.25.1




* Re: [PATCH 1/8] softfloat: Use mulu64 for mul64To128
  2020-09-24  1:24 ` [PATCH 1/8] softfloat: Use mulu64 for mul64To128 Richard Henderson
@ 2020-09-24  7:32   ` David Hildenbrand
  0 siblings, 0 replies; 18+ messages in thread
From: David Hildenbrand @ 2020-09-24  7:32 UTC (permalink / raw)
  To: Richard Henderson, qemu-devel; +Cc: alex.bennee, bharata

On 24.09.20 03:24, Richard Henderson wrote:
> Via host-utils.h, we use a host widening multiply for
> 64-bit hosts, and a common subroutine for 32-bit hosts.
> 
> Signed-off-by: Richard Henderson <richard.henderson@linaro.org>
> ---
>  include/fpu/softfloat-macros.h | 24 ++++--------------------
>  1 file changed, 4 insertions(+), 20 deletions(-)
> 
> diff --git a/include/fpu/softfloat-macros.h b/include/fpu/softfloat-macros.h
> index a35ec2893a..57845f8af0 100644
> --- a/include/fpu/softfloat-macros.h
> +++ b/include/fpu/softfloat-macros.h
> @@ -83,6 +83,7 @@ this code that are retained.
>  #define FPU_SOFTFLOAT_MACROS_H
>  
>  #include "fpu/softfloat-types.h"
> +#include "qemu/host-utils.h"
>  
>  /*----------------------------------------------------------------------------
>  | Shifts `a' right by the number of bits given in `count'.  If any nonzero
> @@ -515,27 +516,10 @@ static inline void
>  | `z0Ptr' and `z1Ptr'.
>  *----------------------------------------------------------------------------*/
>  
> -static inline void mul64To128( uint64_t a, uint64_t b, uint64_t *z0Ptr, uint64_t *z1Ptr )
> +static inline void
> +mul64To128(uint64_t a, uint64_t b, uint64_t *z0Ptr, uint64_t *z1Ptr)
>  {
> -    uint32_t aHigh, aLow, bHigh, bLow;
> -    uint64_t z0, zMiddleA, zMiddleB, z1;
> -
> -    aLow = a;
> -    aHigh = a>>32;
> -    bLow = b;
> -    bHigh = b>>32;
> -    z1 = ( (uint64_t) aLow ) * bLow;
> -    zMiddleA = ( (uint64_t) aLow ) * bHigh;
> -    zMiddleB = ( (uint64_t) aHigh ) * bLow;
> -    z0 = ( (uint64_t) aHigh ) * bHigh;
> -    zMiddleA += zMiddleB;
> -    z0 += ( ( (uint64_t) ( zMiddleA < zMiddleB ) )<<32 ) + ( zMiddleA>>32 );
> -    zMiddleA <<= 32;
> -    z1 += zMiddleA;
> -    z0 += ( z1 < zMiddleA );
> -    *z1Ptr = z1;
> -    *z0Ptr = z0;
> -
> +    mulu64(z1Ptr, z0Ptr, a, b);
>  }
>  
>  /*----------------------------------------------------------------------------
> 

Reviewed-by: David Hildenbrand <david@redhat.com>

-- 
Thanks,

David / dhildenb




* Re: [PATCH 2/8] softfloat: Use int128.h for some operations
  2020-09-24  1:24 ` [PATCH 2/8] softfloat: Use int128.h for some operations Richard Henderson
@ 2020-09-24  7:35   ` David Hildenbrand
  0 siblings, 0 replies; 18+ messages in thread
From: David Hildenbrand @ 2020-09-24  7:35 UTC (permalink / raw)
  To: Richard Henderson, qemu-devel; +Cc: alex.bennee, bharata

On 24.09.20 03:24, Richard Henderson wrote:
> Use our Int128, which wraps the compiler's __int128_t,
> instead of open-coding left shifts and arithmetic.
> We'd need to extend Int128 to have unsigned operations
> to replace more than these three.
> 
> Signed-off-by: Richard Henderson <richard.henderson@linaro.org>
> ---
>  include/fpu/softfloat-macros.h | 39 +++++++++++++++++-----------------
>  1 file changed, 20 insertions(+), 19 deletions(-)
> 
> diff --git a/include/fpu/softfloat-macros.h b/include/fpu/softfloat-macros.h
> index 57845f8af0..95d88d05b8 100644
> --- a/include/fpu/softfloat-macros.h
> +++ b/include/fpu/softfloat-macros.h
> @@ -84,6 +84,7 @@ this code that are retained.
>  
>  #include "fpu/softfloat-types.h"
>  #include "qemu/host-utils.h"
> +#include "qemu/int128.h"
>  
>  /*----------------------------------------------------------------------------
>  | Shifts `a' right by the number of bits given in `count'.  If any nonzero
> @@ -352,13 +353,11 @@ static inline void shortShift128Left(uint64_t a0, uint64_t a1, int count,
>  static inline void shift128Left(uint64_t a0, uint64_t a1, int count,
>                                  uint64_t *z0Ptr, uint64_t *z1Ptr)
>  {
> -    if (count < 64) {
> -        *z1Ptr = a1 << count;
> -        *z0Ptr = count == 0 ? a0 : (a0 << count) | (a1 >> (-count & 63));
> -    } else {
> -        *z1Ptr = 0;
> -        *z0Ptr = a1 << (count - 64);
> -    }
> +    Int128 a = int128_make128(a1, a0);
> +    Int128 z = int128_lshift(a, count);
> +
> +    *z0Ptr = int128_gethi(z);
> +    *z1Ptr = int128_getlo(z);
>  }
>  
>  /*----------------------------------------------------------------------------
> @@ -405,15 +404,15 @@ static inline void
>  *----------------------------------------------------------------------------*/
>  
>  static inline void
> - add128(
> -     uint64_t a0, uint64_t a1, uint64_t b0, uint64_t b1, uint64_t *z0Ptr, uint64_t *z1Ptr )
> +add128(uint64_t a0, uint64_t a1, uint64_t b0, uint64_t b1,
> +       uint64_t *z0Ptr, uint64_t *z1Ptr)
>  {
> -    uint64_t z1;
> -
> -    z1 = a1 + b1;
> -    *z1Ptr = z1;
> -    *z0Ptr = a0 + b0 + ( z1 < a1 );
> +    Int128 a = int128_make128(a1, a0);
> +    Int128 b = int128_make128(b1, b0);
> +    Int128 z = int128_add(a, b);
>  
> +    *z0Ptr = int128_gethi(z);
> +    *z1Ptr = int128_getlo(z);
>  }
>  
>  /*----------------------------------------------------------------------------
> @@ -463,13 +462,15 @@ static inline void
>  *----------------------------------------------------------------------------*/
>  
>  static inline void
> - sub128(
> -     uint64_t a0, uint64_t a1, uint64_t b0, uint64_t b1, uint64_t *z0Ptr, uint64_t *z1Ptr )
> +sub128(uint64_t a0, uint64_t a1, uint64_t b0, uint64_t b1,
> +       uint64_t *z0Ptr, uint64_t *z1Ptr)
>  {
> +    Int128 a = int128_make128(a1, a0);
> +    Int128 b = int128_make128(b1, b0);
> +    Int128 z = int128_sub(a, b);
>  
> -    *z1Ptr = a1 - b1;
> -    *z0Ptr = a0 - b0 - ( a1 < b1 );
> -
> +    *z0Ptr = int128_gethi(z);
> +    *z1Ptr = int128_getlo(z);
>  }
>  
>  /*----------------------------------------------------------------------------
> 

Reviewed-by: David Hildenbrand <david@redhat.com>

-- 
Thanks,

David / dhildenb




* Re: [PATCH 3/8] softfloat: Tidy a * b + inf return
  2020-09-24  1:24 ` [PATCH 3/8] softfloat: Tidy a * b + inf return Richard Henderson
@ 2020-09-24  7:37   ` David Hildenbrand
  0 siblings, 0 replies; 18+ messages in thread
From: David Hildenbrand @ 2020-09-24  7:37 UTC (permalink / raw)
  To: Richard Henderson, qemu-devel; +Cc: alex.bennee, bharata

On 24.09.20 03:24, Richard Henderson wrote:
> No reason to set values in 'a', when we already
> have float_class_inf in 'c', and can flip that sign.
> 
> Signed-off-by: Richard Henderson <richard.henderson@linaro.org>
> ---
>  fpu/softfloat.c | 5 ++---
>  1 file changed, 2 insertions(+), 3 deletions(-)
> 
> diff --git a/fpu/softfloat.c b/fpu/softfloat.c
> index 67cfa0fd82..9db55d2b11 100644
> --- a/fpu/softfloat.c
> +++ b/fpu/softfloat.c
> @@ -1380,9 +1380,8 @@ static FloatParts muladd_floats(FloatParts a, FloatParts b, FloatParts c,
>              s->float_exception_flags |= float_flag_invalid;
>              return parts_default_nan(s);
>          } else {
> -            a.cls = float_class_inf;
> -            a.sign = c.sign ^ sign_flip;
> -            return a;
> +            c.sign ^= sign_flip;
> +            return c;
>          }
>      }
>  
> 

Reviewed-by: David Hildenbrand <david@redhat.com>

-- 
Thanks,

David / dhildenb




* Re: [PATCH 4/8] softfloat: Add float_cmask and constants
  2020-09-24  1:24 ` [PATCH 4/8] softfloat: Add float_cmask and constants Richard Henderson
@ 2020-09-24  7:40   ` David Hildenbrand
  0 siblings, 0 replies; 18+ messages in thread
From: David Hildenbrand @ 2020-09-24  7:40 UTC (permalink / raw)
  To: Richard Henderson, qemu-devel; +Cc: alex.bennee, bharata

On 24.09.20 03:24, Richard Henderson wrote:
> Testing more than one class at a time is better done with masks.
> This reduces the static branch count.
> 
> Signed-off-by: Richard Henderson <richard.henderson@linaro.org>
> ---
>  fpu/softfloat.c | 31 ++++++++++++++++++++++++-------
>  1 file changed, 24 insertions(+), 7 deletions(-)
> 
> diff --git a/fpu/softfloat.c b/fpu/softfloat.c
> index 9db55d2b11..3e625c47cd 100644
> --- a/fpu/softfloat.c
> +++ b/fpu/softfloat.c
> @@ -469,6 +469,20 @@ typedef enum __attribute__ ((__packed__)) {
>      float_class_snan,
>  } FloatClass;
>  
> +#define float_cmask(bit)  (1u << (bit))
> +
> +enum {
> +    float_cmask_zero    = float_cmask(float_class_zero),
> +    float_cmask_normal  = float_cmask(float_class_normal),
> +    float_cmask_inf     = float_cmask(float_class_inf),
> +    float_cmask_qnan    = float_cmask(float_class_qnan),
> +    float_cmask_snan    = float_cmask(float_class_snan),
> +
> +    float_cmask_infzero = float_cmask_zero | float_cmask_inf,
> +    float_cmask_anynan  = float_cmask_qnan | float_cmask_snan,
> +};
> +
> +
>  /* Simple helpers for checking if, or what kind of, NaN we have */
>  static inline __attribute__((unused)) bool is_nan(FloatClass c)
>  {
> @@ -1335,24 +1349,27 @@ bfloat16 QEMU_FLATTEN bfloat16_mul(bfloat16 a, bfloat16 b, float_status *status)
>  static FloatParts muladd_floats(FloatParts a, FloatParts b, FloatParts c,
>                                  int flags, float_status *s)
>  {
> -    bool inf_zero = ((1 << a.cls) | (1 << b.cls)) ==
> -                    ((1 << float_class_inf) | (1 << float_class_zero));
> -    bool p_sign;
> +    bool inf_zero, p_sign;
>      bool sign_flip = flags & float_muladd_negate_result;
>      FloatClass p_class;
>      uint64_t hi, lo;
>      int p_exp;
> +    int ab_mask, abc_mask;
> +
> +    ab_mask = float_cmask(a.cls) | float_cmask(b.cls);
> +    abc_mask = float_cmask(c.cls) | ab_mask;
> +    inf_zero = ab_mask == float_cmask_infzero;
>  
>      /* It is implementation-defined whether the cases of (0,inf,qnan)
>       * and (inf,0,qnan) raise InvalidOperation or not (and what QNaN
>       * they return if they do), so we have to hand this information
>       * off to the target-specific pick-a-NaN routine.
>       */
> -    if (is_nan(a.cls) || is_nan(b.cls) || is_nan(c.cls)) {
> +    if (unlikely(abc_mask & float_cmask_anynan)) {
>          return pick_nan_muladd(a, b, c, inf_zero, s);
>      }
>  
> -    if (inf_zero) {
> +    if (unlikely(inf_zero)) {
>          s->float_exception_flags |= float_flag_invalid;
>          return parts_default_nan(s);
>      }
> @@ -1367,9 +1384,9 @@ static FloatParts muladd_floats(FloatParts a, FloatParts b, FloatParts c,
>          p_sign ^= 1;
>      }
>  
> -    if (a.cls == float_class_inf || b.cls == float_class_inf) {
> +    if (ab_mask & float_cmask_inf) {
>          p_class = float_class_inf;
> -    } else if (a.cls == float_class_zero || b.cls == float_class_zero) {
> +    } else if (ab_mask & float_cmask_zero) {
>          p_class = float_class_zero;
>      } else {
>          p_class = float_class_normal;
> 

Reviewed-by: David Hildenbrand <david@redhat.com>

-- 
Thanks,

David / dhildenb




* Re: [PATCH 5/8] softfloat: Inline pick_nan_muladd into its caller
  2020-09-24  1:24 ` [PATCH 5/8] softfloat: Inline pick_nan_muladd into its caller Richard Henderson
@ 2020-09-24  7:42   ` David Hildenbrand
  0 siblings, 0 replies; 18+ messages in thread
From: David Hildenbrand @ 2020-09-24  7:42 UTC (permalink / raw)
  To: Richard Henderson, qemu-devel; +Cc: alex.bennee, bharata

On 24.09.20 03:24, Richard Henderson wrote:
> Because of FloatParts, there will only ever be one caller.
> Inlining allows us to re-use abc_mask for the snan test.
> 
> Signed-off-by: Richard Henderson <richard.henderson@linaro.org>
> ---
>  fpu/softfloat.c | 75 +++++++++++++++++++++++--------------------------
>  1 file changed, 35 insertions(+), 40 deletions(-)
> 
> diff --git a/fpu/softfloat.c b/fpu/softfloat.c
> index 3e625c47cd..e038434a07 100644
> --- a/fpu/softfloat.c
> +++ b/fpu/softfloat.c
> @@ -929,45 +929,6 @@ static FloatParts pick_nan(FloatParts a, FloatParts b, float_status *s)
>      return a;
>  }
>  
> -static FloatParts pick_nan_muladd(FloatParts a, FloatParts b, FloatParts c,
> -                                  bool inf_zero, float_status *s)
> -{
> -    int which;
> -
> -    if (is_snan(a.cls) || is_snan(b.cls) || is_snan(c.cls)) {
> -        s->float_exception_flags |= float_flag_invalid;
> -    }
> -
> -    which = pickNaNMulAdd(a.cls, b.cls, c.cls, inf_zero, s);
> -
> -    if (s->default_nan_mode) {
> -        /* Note that this check is after pickNaNMulAdd so that function
> -         * has an opportunity to set the Invalid flag.
> -         */
> -        which = 3;
> -    }
> -
> -    switch (which) {
> -    case 0:
> -        break;
> -    case 1:
> -        a = b;
> -        break;
> -    case 2:
> -        a = c;
> -        break;
> -    case 3:
> -        return parts_default_nan(s);
> -    default:
> -        g_assert_not_reached();
> -    }
> -
> -    if (is_snan(a.cls)) {
> -        return parts_silence_nan(a, s);
> -    }
> -    return a;
> -}
> -
>  /*
>   * Returns the result of adding or subtracting the values of the
>   * floating-point values `a' and `b'. The operation is performed
> @@ -1366,7 +1327,41 @@ static FloatParts muladd_floats(FloatParts a, FloatParts b, FloatParts c,
>       * off to the target-specific pick-a-NaN routine.
>       */
>      if (unlikely(abc_mask & float_cmask_anynan)) {
> -        return pick_nan_muladd(a, b, c, inf_zero, s);
> +        int which;
> +
> +        if (unlikely(abc_mask & float_cmask_snan)) {
> +            float_raise(float_flag_invalid, s);
> +        }
> +
> +        which = pickNaNMulAdd(a.cls, b.cls, c.cls, inf_zero, s);
> +
> +        if (s->default_nan_mode) {
> +            /*
> +             * Note that this check is after pickNaNMulAdd so that function
> +             * has an opportunity to set the Invalid flag for inf_zero.
> +             */
> +            which = 3;
> +        }
> +
> +        switch (which) {
> +        case 0:
> +            break;
> +        case 1:
> +            a = b;
> +            break;
> +        case 2:
> +            a = c;
> +            break;
> +        case 3:
> +            return parts_default_nan(s);
> +        default:
> +            g_assert_not_reached();
> +        }
> +
> +        if (is_snan(a.cls)) {
> +            return parts_silence_nan(a, s);
> +        }
> +        return a;
>      }
>  
>      if (unlikely(inf_zero)) {
> 

I'm not sure this increases the readability of muladd_floats() ... sometimes
there is good reason to factor code out into subfunctions to improve
readability.

But the change itself looks good to me.

Reviewed-by: David Hildenbrand <david@redhat.com>

-- 
Thanks,

David / dhildenb




* Re: [PATCH 6/8] softfloat: Implement float128_muladd
  2020-09-24  1:24 ` [PATCH 6/8] softfloat: Implement float128_muladd Richard Henderson
@ 2020-09-24  7:56   ` David Hildenbrand
  2020-09-24 13:30     ` Richard Henderson
  0 siblings, 1 reply; 18+ messages in thread
From: David Hildenbrand @ 2020-09-24  7:56 UTC (permalink / raw)
  To: Richard Henderson, qemu-devel; +Cc: alex.bennee, bharata

[...]

>  /*----------------------------------------------------------------------------
>  | Packs the sign `zSign', the exponent `zExp', and the significand formed
>  | by the concatenation of `zSig0' and `zSig1' into a quadruple-precision
> @@ -7205,6 +7253,312 @@ float128 float128_mul(float128 a, float128 b, float_status *status)
>  
>  }
>  

I do wonder if a type for Int256 would make sense - instead of manually
passing these arrays.

> +static void shortShift256Left(uint64_t p[4], unsigned count)
> +{
> +    int negcount = -count & 63;

That's the same as "64 - count", right? (which I find easier to get)
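
For count in 1..63 they are indeed the same value; the two only differ at
count == 0, where "-count & 63" gives 0 rather than 64, and both call sites
here guard count == 0 separately anyway.  A tiny standalone check, purely
for illustration:

    #include <assert.h>

    int main(void)
    {
        for (int count = 1; count < 64; count++) {
            assert((-count & 63) == 64 - count);
        }
        assert((-0 & 63) == 0);
        return 0;
    }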

> +
> +    if (count == 0) {
> +        return;
> +    }
> +    g_assert(count < 64);
> +    p[0] = (p[0] << count) | (p[1] >> negcount);
> +    p[1] = (p[1] << count) | (p[2] >> negcount);
> +    p[2] = (p[2] << count) | (p[3] >> negcount);
> +    p[3] = (p[3] << count);
> +}
> +
> +static void shift256RightJamming(uint64_t p[4], int count)
> +{
> +    uint64_t in = 0;
> +
> +    g_assert(count >= 0);
> +
> +    count = MIN(count, 256);
> +    for (; count >= 64; count -= 64) {
> +        in |= p[3];
> +        p[3] = p[2];
> +        p[2] = p[1];
> +        p[1] = p[0];
> +        p[0] = 0;
> +    }
> +
> +    if (count) {
> +        int negcount = -count & 63;

dito

> +
> +        in |= p[3] << negcount;
> +        p[3] = (p[2] << negcount) | (p[3] >> count);
> +        p[2] = (p[1] << negcount) | (p[2] >> count);
> +        p[1] = (p[0] << negcount) | (p[1] >> count);
> +        p[0] = p[0] >> count;
> +    }
> +    p[3] |= (in != 0);

Took me a bit longer to understand, but now I know why the function name
has "Jamming" in it :)
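
The "jamming" being the usual sticky-bit trick: whatever is shifted out gets
OR-ed back into the least significant bit, so that later rounding can still
tell the result was inexact.  A minimal 64-bit version of the same idea,
purely for illustration:

    #include <stdint.h>

    /* Shift right by count, OR-ing any shifted-out bits into bit 0. */
    static uint64_t shift64_right_jamming(uint64_t x, int count)
    {
        if (count == 0) {
            return x;
        }
        if (count >= 64) {
            return x != 0;
        }
        return (x >> count) | ((x << (-count & 63)) != 0);
    }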

[...]

> +
> +float128 float128_muladd(float128 a_f, float128 b_f, float128 c_f,
> +                         int flags, float_status *status)
> +{
> +    bool inf_zero, p_sign, sign_flip;
> +    uint64_t p_frac[4];
> +    FloatParts128 a, b, c;
> +    int p_exp, exp_diff, shift, ab_mask, abc_mask;
> +    FloatClass p_cls;
> +
> +    float128_unpack(&a, a_f, status);
> +    float128_unpack(&b, b_f, status);
> +    float128_unpack(&c, c_f, status);
> +
> +    ab_mask = float_cmask(a.cls) | float_cmask(b.cls);
> +    abc_mask = float_cmask(c.cls) | ab_mask;
> +    inf_zero = ab_mask == float_cmask_infzero;
> +
> +    /* If any input is a NaN, select the required result. */
> +    if (unlikely(abc_mask & float_cmask_anynan)) {
> +        if (unlikely(abc_mask & float_cmask_snan)) {
> +            float_raise(float_flag_invalid, status);
> +        }
> +
> +        int which = pickNaNMulAdd(a.cls, b.cls, c.cls, inf_zero, status);
> +        if (status->default_nan_mode) {
> +            which = 3;
> +        }
> +        switch (which) {
> +        case 0:
> +            break;
> +        case 1:
> +            a_f = b_f;
> +            a.cls = b.cls;
> +            break;
> +        case 2:
> +            a_f = c_f;
> +            a.cls = c.cls;
> +            break;
> +        case 3:
> +            return float128_default_nan(status);
> +        }
> +        if (is_snan(a.cls)) {
> +            return float128_silence_nan(a_f, status);
> +        }
> +        return a_f;
> +    }
> +
> +    /* After dealing with input NaNs, look for Inf * Zero. */
> +    if (unlikely(inf_zero)) {
> +        float_raise(float_flag_invalid, status);
> +        return float128_default_nan(status);
> +    }
> +
> +    p_sign = a.sign ^ b.sign;
> +
> +    if (flags & float_muladd_negate_c) {
> +        c.sign ^= 1;
> +    }
> +    if (flags & float_muladd_negate_product) {
> +        p_sign ^= 1;
> +    }
> +    sign_flip = (flags & float_muladd_negate_result);
> +
> +    if (ab_mask & float_cmask_inf) {
> +        p_cls = float_class_inf;
> +    } else if (ab_mask & float_cmask_zero) {
> +        p_cls = float_class_zero;
> +    } else {
> +        p_cls = float_class_normal;
> +    }
> +
> +    if (c.cls == float_class_inf) {
> +        if (p_cls == float_class_inf && p_sign != c.sign) {
> +            /* +Inf + -Inf = NaN */
> +            float_raise(float_flag_invalid, status);
> +            return float128_default_nan(status);
> +        }
> +        /* Inf + Inf = Inf of the proper sign; reuse the return below. */
> +        p_cls = float_class_inf;
> +        p_sign = c.sign;
> +    }
> +
> +    if (p_cls == float_class_inf) {
> +        return packFloat128(p_sign ^ sign_flip, 0x7fff, 0, 0);
> +    }
> +
> +    if (p_cls == float_class_zero) {
> +        if (c.cls == float_class_zero) {
> +            if (p_sign != c.sign) {
> +                p_sign = status->float_rounding_mode == float_round_down;
> +            }
> +            return packFloat128(p_sign ^ sign_flip, 0, 0, 0);
> +        }
> +
> +        if (flags & float_muladd_halve_result) {
> +            c.exp -= 1;
> +        }
> +        return roundAndPackFloat128(c.sign ^ sign_flip,
> +                                    c.exp + 0x3fff - 1,
> +                                    c.frac0, c.frac1, 0, status);
> +    }
> +
> +    /* a & b should be normals now... */
> +    assert(a.cls == float_class_normal && b.cls == float_class_normal);
> +
> +    /* Multiply of 2 113-bit numbers produces a 226-bit result.  */
> +    mul128To256(a.frac0, a.frac1, b.frac0, b.frac1,
> +                &p_frac[0], &p_frac[1], &p_frac[2], &p_frac[3]);
> +
> +    /* Realign the binary point at bit 48 of p_frac[0].  */
> +    shift = clz64(p_frac[0]) - 15;
> +    g_assert(shift == 15 || shift == 16);
> +    shortShift256Left(p_frac, shift);
> +    p_exp = a.exp + b.exp - (shift - 16);
> +    exp_diff = p_exp - c.exp;
> +
> +    uint64_t c_frac[4] = { c.frac0, c.frac1, 0, 0 };
> +
> +    /* Add or subtract C from the intermediate product. */
> +    if (c.cls == float_class_zero) {
> +        /* Fall through to rounding after addition (with zero). */
> +    } else if (p_sign != c.sign) {
> +        /* Subtraction */
> +        if (exp_diff < 0) {
> +            shift256RightJamming(p_frac, -exp_diff);
> +            sub256(p_frac, c_frac, p_frac);
> +            p_exp = c.exp;
> +            p_sign ^= 1;
> +        } else if (exp_diff > 0) {
> +            shift256RightJamming(c_frac, exp_diff);
> +            sub256(p_frac, p_frac, c_frac);
> +        } else {
> +            /* Low 128 bits of C are known to be zero. */
> +            sub128(p_frac[0], p_frac[1], c_frac[0], c_frac[1],
> +                   &p_frac[0], &p_frac[1]);
> +            /*
> +             * Since we have normalized to bit 48 of p_frac[0],
> +             * a negative result means C > P and we need to invert.
> +             */
> +            if ((int64_t)p_frac[0] < 0) {
> +                neg256(p_frac);
> +                p_sign ^= 1;
> +            }
> +        }
> +
> +        /*
> +         * Gross normalization of the 256-bit subtraction result.
> +         * Fine tuning below shared with addition.
> +         */
> +        if (p_frac[0] != 0) {
> +            /* nothing to do */
> +        } else if (p_frac[1] != 0) {
> +            p_exp -= 64;
> +            p_frac[0] = p_frac[1];
> +            p_frac[1] = p_frac[2];
> +            p_frac[2] = p_frac[3];
> +            p_frac[3] = 0;
> +        } else if (p_frac[2] != 0) {
> +            p_exp -= 128;
> +            p_frac[0] = p_frac[2];
> +            p_frac[1] = p_frac[3];
> +            p_frac[2] = 0;
> +            p_frac[3] = 0;
> +        } else if (p_frac[3] != 0) {
> +            p_exp -= 192;
> +            p_frac[0] = p_frac[3];
> +            p_frac[1] = 0;
> +            p_frac[2] = 0;
> +            p_frac[3] = 0;
> +        } else {
> +            /* Subtraction was exact: result is zero. */
> +            p_sign = status->float_rounding_mode == float_round_down;
> +            return packFloat128(p_sign ^ sign_flip, 0, 0, 0);
> +        }
> +    } else {
> +        /* Addition */
> +        if (exp_diff <= 0) {
> +            shift256RightJamming(p_frac, -exp_diff);
> +            /* Low 128 bits of C are known to be zero. */
> +            add128(p_frac[0], p_frac[1], c_frac[0], c_frac[1],
> +                   &p_frac[0], &p_frac[1]);
> +            p_exp = c.exp;
> +        } else {
> +            shift256RightJamming(c_frac, exp_diff);
> +            add256(p_frac, c_frac);
> +        }
> +    }
> +
> +    /* Fine normalization of the 256-bit result: p_frac[0] != 0. */
> +    shift = clz64(p_frac[0]) - 15;
> +    if (shift < 0) {
> +        shift256RightJamming(p_frac, -shift);
> +    } else if (shift > 0) {
> +        shortShift256Left(p_frac, shift);
> +    }
> +    p_exp -= shift;
> +
> +    if (flags & float_muladd_halve_result) {
> +        p_exp -= 1;
> +    }
> +    return roundAndPackFloat128(p_sign ^ sign_flip,
> +                                p_exp + 0x3fff - 1,
> +                                p_frac[0], p_frac[1],
> +                                p_frac[2] | (p_frac[3] != 0),
> +                                status);
> +}

Wow, that's a beast :)


-- 
Thanks,

David / dhildenb



^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: [PATCH 0/8] softfloat: Implement float128_muladd
  2020-09-24  1:24 [PATCH 0/8] softfloat: Implement float128_muladd Richard Henderson
                   ` (7 preceding siblings ...)
  2020-09-24  1:24 ` [PATCH 8/8] softfloat: Use aarch64 " Richard Henderson
@ 2020-09-24  8:00 ` David Hildenbrand
  8 siblings, 0 replies; 18+ messages in thread
From: David Hildenbrand @ 2020-09-24  8:00 UTC (permalink / raw)
  To: Richard Henderson, qemu-devel; +Cc: alex.bennee, bharata

On 24.09.20 03:24, Richard Henderson wrote:
> Plus assorted cleanups, passes tests/fp/fp-test.
> I will eventually fill in ppc and s390x assembly bits.
> 

Thanks for looking into this! Would have taken me ages to come up with
that :)


-- 
Thanks,

David / dhildenb



^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: [PATCH 6/8] softfloat: Implement float128_muladd
  2020-09-24  7:56   ` David Hildenbrand
@ 2020-09-24 13:30     ` Richard Henderson
  2020-09-25  9:17       ` David Hildenbrand
  0 siblings, 1 reply; 18+ messages in thread
From: Richard Henderson @ 2020-09-24 13:30 UTC (permalink / raw)
  To: David Hildenbrand, qemu-devel; +Cc: alex.bennee, bharata

On 9/24/20 12:56 AM, David Hildenbrand wrote:
> I do wonder if a type for Int256 would make sense - instead of manually
> passing these arrays.

I could do that.  It does name better, I suppose, in passing.  So long as
you're happy having the guts of the type be public, and not wrapping everything
like we do for Int128...

Either is better than the softfloat style, which would pass 12 arguments to
these helpers... ;-)

>> +static void shortShift256Left(uint64_t p[4], unsigned count)
>> +{
>> +    int negcount = -count & 63;
> 
> That's the same as "64 - count", right? (which I find easier to get)

In this case, yes.

Of course, more hosts have a "neg" instruction than they do a
"subtract-from-immediate" instruction.  When the shift instruction only
examines the low N bits, the "& 63" is optimized away, so this can result in 1
fewer instruction in the end.

Which is why I almost always use this form, and it's already all over the code
inherited from upstream.
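
As a concrete check of the identity (illustration only; the count == 0
case is handled before this point in the patch):

#include <assert.h>
#include <stdint.h>

/* For 1 <= count <= 63, (-count & 63) == 64 - count, so both shift
 * amounts pick up the same high bits of x. */
static void check_negcount_identity(uint64_t x, unsigned count)
{
    assert(count >= 1 && count <= 63);
    assert((-count & 63) == 64 - count);
    assert((x >> (64 - count)) == (x >> (-count & 63)));
}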

> Wow, that's a beast :)

But not much worse than muladd_floats(), I'm pleased to say.


r~


^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: [PATCH 6/8] softfloat: Implement float128_muladd
  2020-09-24 13:30     ` Richard Henderson
@ 2020-09-25  9:17       ` David Hildenbrand
  0 siblings, 0 replies; 18+ messages in thread
From: David Hildenbrand @ 2020-09-25  9:17 UTC (permalink / raw)
  To: Richard Henderson, qemu-devel; +Cc: alex.bennee, bharata

On 24.09.20 15:30, Richard Henderson wrote:
> On 9/24/20 12:56 AM, David Hildenbrand wrote:
>> I do wonder if a type for Int256 would make sense - instead of manually
>> passing these arrays.
> 
> I could do that.  It does name better, I suppose, in passing.  So long as
> you're happy having the guts of the type be public, and not wrapping everything
> like we do for Int128...

We can do that once we have hardware+compiler support for 256bit ;)

> 
> Either is better than the softfloat style, which would pass 12 arguments to
> these helpers... ;-)

Indeed

> 
>>> +static void shortShift256Left(uint64_t p[4], unsigned count)
>>> +{
>>> +    int negcount = -count & 63;
>>
>> That's the same as "64 - count", right? (which I find easier to get)
> 
> In this case, yes.
> 
> Of course, more hosts have a "neg" instruction than they do a
> "subtract-from-immediate" instruction.  When the shift instruction only
> examines the low N bits, the "& 63" is optimized away, so this can result in 1
> fewer instruction in the end.
> 
> Which is why I almost always use this form, and it's already all over the code
> inherited from upstream.
> 
>> Wow, that's a beast :)
> 
> But not much worse than muladd_floats(), I'm pleased to say.

Definitely!


-- 
Thanks,

David / dhildenb



^ permalink raw reply	[flat|nested] 18+ messages in thread

end of thread, other threads:[~2020-09-25  9:18 UTC | newest]

Thread overview: 18+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2020-09-24  1:24 [PATCH 0/8] softfloat: Implement float128_muladd Richard Henderson
2020-09-24  1:24 ` [PATCH 1/8] softfloat: Use mulu64 for mul64To128 Richard Henderson
2020-09-24  7:32   ` David Hildenbrand
2020-09-24  1:24 ` [PATCH 2/8] softfloat: Use int128.h for some operations Richard Henderson
2020-09-24  7:35   ` David Hildenbrand
2020-09-24  1:24 ` [PATCH 3/8] softfloat: Tidy a * b + inf return Richard Henderson
2020-09-24  7:37   ` David Hildenbrand
2020-09-24  1:24 ` [PATCH 4/8] softfloat: Add float_cmask and constants Richard Henderson
2020-09-24  7:40   ` David Hildenbrand
2020-09-24  1:24 ` [PATCH 5/8] softfloat: Inline pick_nan_muladd into its caller Richard Henderson
2020-09-24  7:42   ` David Hildenbrand
2020-09-24  1:24 ` [PATCH 6/8] softfloat: Implement float128_muladd Richard Henderson
2020-09-24  7:56   ` David Hildenbrand
2020-09-24 13:30     ` Richard Henderson
2020-09-25  9:17       ` David Hildenbrand
2020-09-24  1:24 ` [PATCH 7/8] softfloat: Use x86_64 assembly for {add,sub}{192,256} Richard Henderson
2020-09-24  1:24 ` [PATCH 8/8] softfloat: Use aarch64 " Richard Henderson
2020-09-24  8:00 ` [PATCH 0/8] softfloat: Implement float128_muladd David Hildenbrand
