* Re: [Qemu-devel] [PATCH v6 07/13] fpu: introduce hardfloat [not found] <mailman.26358.1543950647.1282.qemu-devel@nongnu.org> @ 2018-12-05 11:07 ` Programmingkid 2018-12-05 16:08 ` Emilio G. Cota 0 siblings, 1 reply; 10+ messages in thread From: Programmingkid @ 2018-12-05 11:07 UTC (permalink / raw) To: Alex Bennée, Emilio G. Cota, Richard Henderson; +Cc: QEMU Developers, qemu-ppc > On Dec 4, 2018, at 2:10 PM, qemu-devel-request@nongnu.org wrote: > > Emilio G. Cota <cota@braap.org> writes: > >> On Tue, Dec 04, 2018 at 13:52:16 +0000, Alex Bennée wrote: >>>> We could always >>>> >>>> #ifdef __FAST_MATH__ >>>> #error "Silliness like this will get you nowhere" >>>> #endif >>> >>> Emilio, are you happy to add that guard with a suitable pithy comment? >> >> Isn't it better to just disable hardfloat then? >> >> --- a/fpu/softfloat.c >> +++ b/fpu/softfloat.c >> @@ -220,7 +220,7 @@ GEN_INPUT_FLUSH3(float64_input_flush3, float64) >> * the use of hardfloat, since hardfloat relies on the inexact flag being >> * already set. >> */ >> -#if defined(TARGET_PPC) >> +#if defined(TARGET_PPC) || defined(__FAST_MATH__) >> # define QEMU_NO_HARDFLOAT 1 >> # define QEMU_SOFTFLOAT_ATTR QEMU_FLATTEN >> #else Why can't PowerPC also benefit from a hardfloat? It uses IEEE754 also. ^ permalink raw reply [flat|nested] 10+ messages in thread
* Re: [Qemu-devel] [PATCH v6 07/13] fpu: introduce hardfloat 2018-12-05 11:07 ` [Qemu-devel] [PATCH v6 07/13] fpu: introduce hardfloat Programmingkid @ 2018-12-05 16:08 ` Emilio G. Cota 0 siblings, 0 replies; 10+ messages in thread From: Emilio G. Cota @ 2018-12-05 16:08 UTC (permalink / raw) To: Programmingkid; +Cc: Alex Bennée, Richard Henderson, QEMU Developers, qemu-ppc On Wed, Dec 05, 2018 at 06:07:44 -0500, Programmingkid wrote: > > > On Dec 4, 2018, at 2:10 PM, qemu-devel-request@nongnu.org wrote: > > > > Emilio G. Cota <cota@braap.org> writes: > > > >> On Tue, Dec 04, 2018 at 13:52:16 +0000, Alex Bennée wrote: > >>>> We could always > >>>> > >>>> #ifdef __FAST_MATH__ > >>>> #error "Silliness like this will get you nowhere" > >>>> #endif > >>> > >>> Emilio, are you happy to add that guard with a suitable pithy comment? > >> > >> Isn't it better to just disable hardfloat then? > >> > >> --- a/fpu/softfloat.c > >> +++ b/fpu/softfloat.c > >> @@ -220,7 +220,7 @@ GEN_INPUT_FLUSH3(float64_input_flush3, float64) > >> * the use of hardfloat, since hardfloat relies on the inexact flag being > >> * already set. > >> */ > >> -#if defined(TARGET_PPC) > >> +#if defined(TARGET_PPC) || defined(__FAST_MATH__) > >> # define QEMU_NO_HARDFLOAT 1 > >> # define QEMU_SOFTFLOAT_ATTR QEMU_FLATTEN > >> #else > > Why can't PowerPC also benefit from a hardfloat? It uses IEEE754 also. Please see this message: https://lists.gnu.org/archive/html/qemu-devel/2018-11/msg04974.html Thanks, E. ^ permalink raw reply [flat|nested] 10+ messages in thread
* [Qemu-devel] [PATCH v6 00/13] hardfloat @ 2018-11-24 23:55 Emilio G. Cota 2018-11-24 23:55 ` [Qemu-devel] [PATCH v6 07/13] fpu: introduce hardfloat Emilio G. Cota 0 siblings, 1 reply; 10+ messages in thread From: Emilio G. Cota @ 2018-11-24 23:55 UTC (permalink / raw) To: qemu-devel; +Cc: Alex Bennée, Richard Henderson v5: https://lists.gnu.org/archive/html/qemu-devel/2018-10/msg02793.html Changes since v5: - Rebase on rth/tcg-next-for-4.0 - Use QEMU_FLATTEN instead of __attribute__((flatten)) - Merge rth's cleanups (thanks!). With this, we now use a union to hold {float|float32} or {double|float64} types, which gets rid of most macros. I added a few optimizations (i.e. likely hints in some branches, and not using temp variables to hold the result of fpclassify) to roughly match (and sometimes surpass) v5's performance. - float64_sqrt: use fpclassify, which gives a 1.5x speedup. This series introduces no regressions to fp-test. You can test hardfloat by passing "-f x" to fp-test (so that the inexact flag is set before each operation) and using even rounding (fp-test's default). Note that hardfloat does not affect operations with other rounding modes. Perf numbers for fp-bench running on several host machines are in each commit log; numbers for several benchmarks (NBench, SPEC06fp) are in the last patch's commit log. These numbers are a bit outdated (they're from v2 or so), but I've decided to keep them because they give a good idea of the speedups to expect, and I don't have time to re-run them =) I did re-run the numbers for sqrt and cmp, though, since the implementation has changed quite a bit since v5. I didn't re-run these on Aarch64 and PPC hosts due to lack of time, but I doubt they'd change significantly. You can fetch this series from: https://github.com/cota/qemu/tree/hardfloat-v6 Thanks, Emilio ^ permalink raw reply [flat|nested] 10+ messages in thread
* [Qemu-devel] [PATCH v6 07/13] fpu: introduce hardfloat 2018-11-24 23:55 [Qemu-devel] [PATCH v6 00/13] hardfloat Emilio G. Cota @ 2018-11-24 23:55 ` Emilio G. Cota 2018-11-25 0:25 ` Aleksandar Markovic 2018-12-04 12:28 ` Alex Bennée 0 siblings, 2 replies; 10+ messages in thread From: Emilio G. Cota @ 2018-11-24 23:55 UTC (permalink / raw) To: qemu-devel; +Cc: Alex Bennée, Richard Henderson The appended paves the way for leveraging the host FPU for a subset of guest FP operations. For most guest workloads (e.g. FP flags aren't ever cleared, inexact occurs often and rounding is set to the default [to nearest]) this will yield sizable performance speedups. The approach followed here avoids checking the FP exception flags register. See the added comment for details. This assumes that QEMU is running on an IEEE754-compliant FPU and that the rounding is set to the default (to nearest). The implementation-dependent specifics of the FPU should not matter; things like tininess detection and snan representation are still dealt with in soft-fp. However, this approach will break on most hosts if we compile QEMU with flags such as -ffast-math. We control the flags so this should be easy to enforce though. This patch just adds common code. Some operations will be migrated to hardfloat in subsequent patches to ease bisection. Note: some architectures (at least PPC, there might be others) clear the status flags passed to softfloat before most FP operations. This precludes the use of hardfloat, so to avoid introducing a performance regression for those targets, we add a flag to disable hardfloat. In the long run though it would be good to fix the targets so that at least the inexact flag passed to softfloat is indeed sticky. Signed-off-by: Emilio G. 
Cota <cota@braap.org> --- fpu/softfloat.c | 315 ++++++++++++++++++++++++++++++++++++++++++++++++ 1 file changed, 315 insertions(+) diff --git a/fpu/softfloat.c b/fpu/softfloat.c index ecdc00c633..306a12fa8d 100644 --- a/fpu/softfloat.c +++ b/fpu/softfloat.c @@ -83,6 +83,7 @@ this code that are retained. * target-dependent and needs the TARGET_* macros. */ #include "qemu/osdep.h" +#include <math.h> #include "qemu/bitops.h" #include "fpu/softfloat.h" @@ -95,6 +96,320 @@ this code that are retained. *----------------------------------------------------------------------------*/ #include "fpu/softfloat-macros.h" +/* + * Hardfloat + * + * Fast emulation of guest FP instructions is challenging for two reasons. + * First, FP instruction semantics are similar but not identical, particularly + * when handling NaNs. Second, emulating at reasonable speed the guest FP + * exception flags is not trivial: reading the host's flags register with a + * feclearexcept & fetestexcept pair is slow [slightly slower than soft-fp], + * and trapping on every FP exception is not fast nor pleasant to work with. + * + * We address these challenges by leveraging the host FPU for a subset of the + * operations. To do this we expand on the idea presented in this paper: + * + * Guo, Yu-Chuan, et al. "Translating the ARM Neon and VFP instructions in a + * binary translator." Software: Practice and Experience 46.12 (2016):1591-1615. + * + * The idea is thus to leverage the host FPU to (1) compute FP operations + * and (2) identify whether FP exceptions occurred while avoiding + * expensive exception flag register accesses. + * + * An important optimization shown in the paper is that given that exception + * flags are rarely cleared by the guest, we can avoid recomputing some flags. + * This is particularly useful for the inexact flag, which is very frequently + * raised in floating-point workloads. 
+ * + * We optimize the code further by deferring to soft-fp whenever FP exception + * detection might get hairy. Two examples: (1) when at least one operand is + * denormal/inf/NaN; (2) when operands are not guaranteed to lead to a 0 result + * and the result is < the minimum normal. + */ +#define GEN_INPUT_FLUSH__NOCHECK(name, soft_t) \ + static inline void name(soft_t *a, float_status *s) \ + { \ + if (unlikely(soft_t ## _is_denormal(*a))) { \ + *a = soft_t ## _set_sign(soft_t ## _zero, \ + soft_t ## _is_neg(*a)); \ + s->float_exception_flags |= float_flag_input_denormal; \ + } \ + } + +GEN_INPUT_FLUSH__NOCHECK(float32_input_flush__nocheck, float32) +GEN_INPUT_FLUSH__NOCHECK(float64_input_flush__nocheck, float64) +#undef GEN_INPUT_FLUSH__NOCHECK + +#define GEN_INPUT_FLUSH1(name, soft_t) \ + static inline void name(soft_t *a, float_status *s) \ + { \ + if (likely(!s->flush_inputs_to_zero)) { \ + return; \ + } \ + soft_t ## _input_flush__nocheck(a, s); \ + } + +GEN_INPUT_FLUSH1(float32_input_flush1, float32) +GEN_INPUT_FLUSH1(float64_input_flush1, float64) +#undef GEN_INPUT_FLUSH1 + +#define GEN_INPUT_FLUSH2(name, soft_t) \ + static inline void name(soft_t *a, soft_t *b, float_status *s) \ + { \ + if (likely(!s->flush_inputs_to_zero)) { \ + return; \ + } \ + soft_t ## _input_flush__nocheck(a, s); \ + soft_t ## _input_flush__nocheck(b, s); \ + } + +GEN_INPUT_FLUSH2(float32_input_flush2, float32) +GEN_INPUT_FLUSH2(float64_input_flush2, float64) +#undef GEN_INPUT_FLUSH2 + +#define GEN_INPUT_FLUSH3(name, soft_t) \ + static inline void name(soft_t *a, soft_t *b, soft_t *c, float_status *s) \ + { \ + if (likely(!s->flush_inputs_to_zero)) { \ + return; \ + } \ + soft_t ## _input_flush__nocheck(a, s); \ + soft_t ## _input_flush__nocheck(b, s); \ + soft_t ## _input_flush__nocheck(c, s); \ + } + +GEN_INPUT_FLUSH3(float32_input_flush3, float32) +GEN_INPUT_FLUSH3(float64_input_flush3, float64) +#undef GEN_INPUT_FLUSH3 + +/* + * Choose whether to use fpclassify or float32/64_* 
primitives in the generated + * hardfloat functions. Each combination of number of inputs and float size + * gets its own value. + */ +#if defined(__x86_64__) +# define QEMU_HARDFLOAT_1F32_USE_FP 0 +# define QEMU_HARDFLOAT_1F64_USE_FP 1 +# define QEMU_HARDFLOAT_2F32_USE_FP 0 +# define QEMU_HARDFLOAT_2F64_USE_FP 1 +# define QEMU_HARDFLOAT_3F32_USE_FP 0 +# define QEMU_HARDFLOAT_3F64_USE_FP 1 +#else +# define QEMU_HARDFLOAT_1F32_USE_FP 0 +# define QEMU_HARDFLOAT_1F64_USE_FP 0 +# define QEMU_HARDFLOAT_2F32_USE_FP 0 +# define QEMU_HARDFLOAT_2F64_USE_FP 0 +# define QEMU_HARDFLOAT_3F32_USE_FP 0 +# define QEMU_HARDFLOAT_3F64_USE_FP 0 +#endif + +/* + * QEMU_HARDFLOAT_USE_ISINF chooses whether to use isinf() over + * float{32,64}_is_infinity when !USE_FP. + * On x86_64/aarch64, using the former over the latter can yield a ~6% speedup. + * On power64 however, using isinf() reduces fp-bench performance by up to 50%. + */ +#if defined(__x86_64__) || defined(__aarch64__) +# define QEMU_HARDFLOAT_USE_ISINF 1 +#else +# define QEMU_HARDFLOAT_USE_ISINF 0 +#endif + +/* + * Some targets clear the FP flags before most FP operations. This prevents + * the use of hardfloat, since hardfloat relies on the inexact flag being + * already set. + */ +#if defined(TARGET_PPC) +# define QEMU_NO_HARDFLOAT 1 +# define QEMU_SOFTFLOAT_ATTR QEMU_FLATTEN +#else +# define QEMU_NO_HARDFLOAT 0 +# define QEMU_SOFTFLOAT_ATTR QEMU_FLATTEN __attribute__((noinline)) +#endif + +static inline bool can_use_fpu(const float_status *s) +{ + if (QEMU_NO_HARDFLOAT) { + return false; + } + return likely(s->float_exception_flags & float_flag_inexact && + s->float_rounding_mode == float_round_nearest_even); +} + +/* + * Hardfloat generation functions. Each operation can have two flavors: + * either using softfloat primitives (e.g. float32_is_zero_or_normal) for + * most condition checks, or native ones (e.g. fpclassify). + * + * The flavor is chosen by the callers. 
Instead of using macros, we rely on the + * compiler to propagate constants and inline everything into the callers. + * + * We only generate functions for operations with two inputs, since only + * these are common enough to justify consolidating them into common code. + */ + +typedef union { + float32 s; + float h; +} union_float32; + +typedef union { + float64 s; + double h; +} union_float64; + +typedef bool (*f32_check_fn)(union_float32 a, union_float32 b); +typedef bool (*f64_check_fn)(union_float64 a, union_float64 b); + +typedef float32 (*soft_f32_op2_fn)(float32 a, float32 b, float_status *s); +typedef float64 (*soft_f64_op2_fn)(float64 a, float64 b, float_status *s); +typedef float (*hard_f32_op2_fn)(float a, float b); +typedef double (*hard_f64_op2_fn)(double a, double b); + +/* 2-input is-zero-or-normal */ +static inline bool f32_is_zon2(union_float32 a, union_float32 b) +{ + if (QEMU_HARDFLOAT_2F32_USE_FP) { + /* + * Not using a temp variable for consecutive fpclassify calls ends up + * generating faster code. 
+ */ + return (fpclassify(a.h) == FP_NORMAL || fpclassify(a.h) == FP_ZERO) && + (fpclassify(b.h) == FP_NORMAL || fpclassify(b.h) == FP_ZERO); + } + return float32_is_zero_or_normal(a.s) && + float32_is_zero_or_normal(b.s); +} + +static inline bool f64_is_zon2(union_float64 a, union_float64 b) +{ + if (QEMU_HARDFLOAT_2F64_USE_FP) { + return (fpclassify(a.h) == FP_NORMAL || fpclassify(a.h) == FP_ZERO) && + (fpclassify(b.h) == FP_NORMAL || fpclassify(b.h) == FP_ZERO); + } + return float64_is_zero_or_normal(a.s) && + float64_is_zero_or_normal(b.s); +} + +/* 3-input is-zero-or-normal */ +static inline +bool f32_is_zon3(union_float32 a, union_float32 b, union_float32 c) +{ + if (QEMU_HARDFLOAT_3F32_USE_FP) { + return (fpclassify(a.h) == FP_NORMAL || fpclassify(a.h) == FP_ZERO) && + (fpclassify(b.h) == FP_NORMAL || fpclassify(b.h) == FP_ZERO) && + (fpclassify(c.h) == FP_NORMAL || fpclassify(c.h) == FP_ZERO); + } + return float32_is_zero_or_normal(a.s) && + float32_is_zero_or_normal(b.s) && + float32_is_zero_or_normal(c.s); +} + +static inline +bool f64_is_zon3(union_float64 a, union_float64 b, union_float64 c) +{ + if (QEMU_HARDFLOAT_3F64_USE_FP) { + return (fpclassify(a.h) == FP_NORMAL || fpclassify(a.h) == FP_ZERO) && + (fpclassify(b.h) == FP_NORMAL || fpclassify(b.h) == FP_ZERO) && + (fpclassify(c.h) == FP_NORMAL || fpclassify(c.h) == FP_ZERO); + } + return float64_is_zero_or_normal(a.s) && + float64_is_zero_or_normal(b.s) && + float64_is_zero_or_normal(c.s); +} + +static inline bool f32_is_inf(union_float32 a) +{ + if (QEMU_HARDFLOAT_USE_ISINF) { + return isinff(a.h); + } + return float32_is_infinity(a.s); +} + +static inline bool f64_is_inf(union_float64 a) +{ + if (QEMU_HARDFLOAT_USE_ISINF) { + return isinf(a.h); + } + return float64_is_infinity(a.s); +} + +/* Note: @fast_test and @post can be NULL */ +static inline float32 +float32_gen2(float32 xa, float32 xb, float_status *s, + hard_f32_op2_fn hard, soft_f32_op2_fn soft, + f32_check_fn pre, f32_check_fn post, + 
f32_check_fn fast_test, soft_f32_op2_fn fast_op) +{ + union_float32 ua, ub, ur; + + ua.s = xa; + ub.s = xb; + + if (unlikely(!can_use_fpu(s))) { + goto soft; + } + + float32_input_flush2(&ua.s, &ub.s, s); + if (unlikely(!pre(ua, ub))) { + goto soft; + } + if (fast_test && fast_test(ua, ub)) { + return fast_op(ua.s, ub.s, s); + } + + ur.h = hard(ua.h, ub.h); + if (unlikely(f32_is_inf(ur))) { + s->float_exception_flags |= float_flag_overflow; + } else if (unlikely(fabsf(ur.h) <= FLT_MIN)) { + if (post == NULL || post(ua, ub)) { + goto soft; + } + } + return ur.s; + + soft: + return soft(ua.s, ub.s, s); +} + +static inline float64 +float64_gen2(float64 xa, float64 xb, float_status *s, + hard_f64_op2_fn hard, soft_f64_op2_fn soft, + f64_check_fn pre, f64_check_fn post, + f64_check_fn fast_test, soft_f64_op2_fn fast_op) +{ + union_float64 ua, ub, ur; + + ua.s = xa; + ub.s = xb; + + if (unlikely(!can_use_fpu(s))) { + goto soft; + } + + float64_input_flush2(&ua.s, &ub.s, s); + if (unlikely(!pre(ua, ub))) { + goto soft; + } + if (fast_test && fast_test(ua, ub)) { + return fast_op(ua.s, ub.s, s); + } + + ur.h = hard(ua.h, ub.h); + if (unlikely(f64_is_inf(ur))) { + s->float_exception_flags |= float_flag_overflow; + } else if (unlikely(fabs(ur.h) <= DBL_MIN)) { + if (post == NULL || post(ua, ub)) { + goto soft; + } + } + return ur.s; + + soft: + return soft(ua.s, ub.s, s); +} + /*---------------------------------------------------------------------------- | Returns the fraction bits of the half-precision floating-point value `a'. *----------------------------------------------------------------------------*/ -- 2.17.1 ^ permalink raw reply related [flat|nested] 10+ messages in thread
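The float64_gen2() dispatch above can be illustrated outside QEMU with a stripped-down, self-contained sketch. All names here (u64, hard_add64, soft_add64) are hypothetical, not the actual QEMU API; the soft fallback is a placeholder so the sketch runs on its own. The shape is the same: type-pun the raw bits through a union, guard on zero-or-normal operands with fpclassify, run the host FPU op, and only trust the result if it neither overflowed nor possibly underflowed.

```c
#include <assert.h>
#include <float.h>
#include <math.h>
#include <stdint.h>

/* Raw-bits <-> host-float views of the same value, as in union_float64. */
typedef union {
    uint64_t s;   /* softfloat-style representation: raw IEEE754 bits */
    double h;     /* host representation */
} u64;

/* Placeholder soft fallback; in QEMU this would be soft_f64_op2_fn. */
static double soft_add64(double a, double b)
{
    return a + b;
}

/* Sketch of the hardfloat fast path for a 2-input op (addition). */
static uint64_t hard_add64(uint64_t xa, uint64_t xb)
{
    u64 a = { .s = xa }, b = { .s = xb }, r;

    /* pre-check: defer to soft-fp unless both inputs are zero or normal */
    if ((fpclassify(a.h) == FP_NORMAL || fpclassify(a.h) == FP_ZERO) &&
        (fpclassify(b.h) == FP_NORMAL || fpclassify(b.h) == FP_ZERO)) {
        r.h = a.h + b.h;   /* host FPU does the work */
        /* overflow, or a result small enough that underflow/denormal
         * handling might matter: defer to soft-fp */
        if (!isinf(r.h) && fabs(r.h) > DBL_MIN) {
            return r.s;
        }
    }
    r.h = soft_add64(a.h, b.h);
    return r.s;
}
```

Note the underflow check is deliberately conservative: any result with magnitude at or below DBL_MIN is punted to the soft path rather than reasoned about on the fast path, mirroring the `fabs(ur.h) <= DBL_MIN` test in the patch.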
* Re: [Qemu-devel] [PATCH v6 07/13] fpu: introduce hardfloat 2018-11-24 23:55 ` [Qemu-devel] [PATCH v6 07/13] fpu: introduce hardfloat Emilio G. Cota @ 2018-11-25 0:25 ` Aleksandar Markovic 2018-11-25 1:25 ` Emilio G. Cota 2018-12-04 12:28 ` Alex Bennée 1 sibling, 1 reply; 10+ messages in thread From: Aleksandar Markovic @ 2018-11-25 0:25 UTC (permalink / raw) To: Emilio G. Cota; +Cc: Richard Henderson, Alex Bennée, qemu-devel Hi, Emilio. > Note: some architectures (at least PPC, there might be others) clear > the status flags passed to softfloat before most FP operations. This > precludes the use of hardfloat, so to avoid introducing a performance > regression for those targets, we add a flag to disable hardfloat. > In the long run though it would be good to fix the targets so that > at least the inexact flag passed to softfloat is indeed sticky. Can you elaborate more on this paragraph? Thanks, Aleksandar Markovic On Nov 25, 2018 1:08 AM, "Emilio G. Cota" <cota@braap.org> wrote: > The appended paves the way for leveraging the host FPU for a subset > of guest FP operations. For most guest workloads (e.g. FP flags > aren't ever cleared, inexact occurs often and rounding is set to the > default [to nearest]) this will yield sizable performance speedups. > > The approach followed here avoids checking the FP exception flags register. > See the added comment for details. > > This assumes that QEMU is running on an IEEE754-compliant FPU and > that the rounding is set to the default (to nearest). The > implementation-dependent specifics of the FPU should not matter; things > like tininess detection and snan representation are still dealt with in > soft-fp. However, this approach will break on most hosts if we compile > QEMU with flags such as -ffast-math. We control the flags so this should > be easy to enforce though. > > This patch just adds common code. Some operations will be migrated > to hardfloat in subsequent patches to ease bisection. 
> [rest of quoted patch trimmed; it repeats the patch above verbatim] ^ permalink raw reply [flat|nested] 10+ messages in thread
* Re: [Qemu-devel] [PATCH v6 07/13] fpu: introduce hardfloat
  2018-11-25  0:25       ` Aleksandar Markovic
@ 2018-11-25  1:25         ` Emilio G. Cota
  0 siblings, 0 replies; 10+ messages in thread
From: Emilio G. Cota @ 2018-11-25  1:25 UTC (permalink / raw)
  To: Aleksandar Markovic; +Cc: Richard Henderson, Alex Bennée, qemu-devel

On Sun, Nov 25, 2018 at 01:25:25 +0100, Aleksandar Markovic wrote:
> > Note: some architectures (at least PPC, there might be others) clear
> > the status flags passed to softfloat before most FP operations. This
> > precludes the use of hardfloat, so to avoid introducing a performance
> > regression for those targets, we add a flag to disable hardfloat.
> > In the long run though it would be good to fix the targets so that
> > at least the inexact flag passed to softfloat is indeed sticky.
>
> Can you elaborate more on this paragraph?

Sure. We only use hardfloat when the inexact flag is already set. If it
isn't, we defer to softfloat. This is done for two reasons:

- Computing the inexact flag requires duplicating most of what softfloat
  does, so it's not worth doing. Note that clearing and reading the
  host's FP flags is even slower, so that's not an option.

- The inexact flag is raised *very* frequently. The flag remains set (in
  the guest) unless guest code explicitly clears it, which few guest
  workloads do. It therefore makes sense for hardfloat to only kick in
  once the inexact flag has already been set.

Most targets directly keep the guest's FP flags in the same struct
(float_status) that is passed to softfloat ops. PPC, however, keeps the
state of the guest FP flags in one place, and passes a pristine
float_status to softfloat code every time it calls it. Thus, given that
hardfloat is entirely implemented in softfloat.c, PPC targets cannot
currently take advantage of it.

Changing this in the PPC target is not impossible, but it will require
additional work that I'm not doing in this series, hence my note.

So for now, PPC targets just have hardfloat disabled at compile time,
which avoids adding overhead for a feature that they cannot use.

Let me know if anything is unclear.

Cheers,

		Emilio

^ permalink raw reply	[flat|nested] 10+ messages in thread
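[Editor's illustration: the dispatch Emilio describes above, where the host
FPU is used only once the inexact flag is already sticky and everything else
defers to softfloat, can be sketched in C. All names below merely mimic
QEMU's softfloat API; this is a simplified toy, not the actual
implementation.]

```c
#include <assert.h>

/* Illustrative sketch of the hardfloat dispatch described above.
 * The names mimic QEMU's softfloat, but this is not the real code. */

enum { float_flag_inexact = 1 };
enum { float_round_nearest_even = 0 };

typedef struct float_status {
    int float_exception_flags;   /* sticky flags, as most targets keep them */
    int float_rounding_mode;
} float_status;

/* Stand-in for the full softfloat implementation; the real one emulates
 * the operation bit-exactly and computes all the IEEE flags. */
static double soft_float64_add(double a, double b, float_status *s)
{
    double r = a + b;
    /* Crude inexactness check for addition (illustrative only): if the
     * result cannot reproduce both operands by subtraction, rounding
     * occurred. */
    if (r - b != a || r - a != b) {
        s->float_exception_flags |= float_flag_inexact;
    }
    return r;
}

static double float64_add(double a, double b, float_status *s)
{
    /* Hardfloat is only safe once inexact is already set: recomputing
     * the flag would require most of the softfloat work anyway. */
    if ((s->float_exception_flags & float_flag_inexact) &&
        s->float_rounding_mode == float_round_nearest_even) {
        return a + b;                   /* host FPU fast path */
    }
    return soft_float64_add(a, b, s);   /* defer to softfloat */
}
```

[Since PPC passes a pristine float_status on every call, the fast-path
condition above would never hold for it, which is why the series simply
disables hardfloat for PPC at compile time.]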
* Re: [Qemu-devel] [PATCH v6 07/13] fpu: introduce hardfloat
  2018-11-24 23:55   ` [Qemu-devel] [PATCH v6 07/13] fpu: introduce hardfloat Emilio G. Cota
  2018-11-25  0:25     ` Aleksandar Markovic
@ 2018-12-04 12:28     ` Alex Bennée
  2018-12-04 13:33       ` Richard Henderson
  1 sibling, 1 reply; 10+ messages in thread
From: Alex Bennée @ 2018-12-04 12:28 UTC (permalink / raw)
  To: Emilio G. Cota; +Cc: qemu-devel, Richard Henderson

Emilio G. Cota <cota@braap.org> writes:

> The appended paves the way for leveraging the host FPU for a subset
> of guest FP operations. For most guest workloads (e.g. FP flags
> aren't ever cleared, inexact occurs often and rounding is set to the
> default [to nearest]) this will yield sizable performance speedups.
>
> The approach followed here avoids checking the FP exception flags
> register. See the added comment for details.
>
> This assumes that QEMU is running on an IEEE754-compliant FPU and
> that the rounding is set to the default (to nearest). The
> implementation-dependent specifics of the FPU should not matter; things
> like tininess detection and snan representation are still dealt with in
> soft-fp. However, this approach will break on most hosts if we compile
> QEMU with flags such as -ffast-math. We control the flags so this should
> be easy to enforce though.

We don't currently enforce this, though, although maybe that would be too
much hand-holding for compiler ricers hell-bent on not understanding the
flags they use.

Reviewed-by: Alex Bennée <alex.bennee@linaro.org>

--
Alex Bennée

^ permalink raw reply	[flat|nested] 10+ messages in thread
* Re: [Qemu-devel] [PATCH v6 07/13] fpu: introduce hardfloat
  2018-12-04 12:28     ` Alex Bennée
@ 2018-12-04 13:33       ` Richard Henderson
  2018-12-04 13:52         ` Alex Bennée
  0 siblings, 1 reply; 10+ messages in thread
From: Richard Henderson @ 2018-12-04 13:33 UTC (permalink / raw)
  To: Alex Bennée, Emilio G. Cota; +Cc: qemu-devel

On 12/4/18 6:28 AM, Alex Bennée wrote:
> Emilio G. Cota <cota@braap.org> writes:
>> This assumes that QEMU is running on an IEEE754-compliant FPU and
>> that the rounding is set to the default (to nearest). The
>> implementation-dependent specifics of the FPU should not matter; things
>> like tininess detection and snan representation are still dealt with in
>> soft-fp. However, this approach will break on most hosts if we compile
>> QEMU with flags such as -ffast-math. We control the flags so this should
>> be easy to enforce though.
>
> We don't currently enforce this though although maybe that would be too
> much hand holding for compiler ricers hell bent on not understanding the
> flags they use.

We could always

#ifdef __FAST_MATH__
#error "Silliness like this will get you nowhere"
#endif


r~

^ permalink raw reply	[flat|nested] 10+ messages in thread
* Re: [Qemu-devel] [PATCH v6 07/13] fpu: introduce hardfloat
  2018-12-04 13:33       ` Richard Henderson
@ 2018-12-04 13:52         ` Alex Bennée
  2018-12-04 17:31           ` Emilio G. Cota
  0 siblings, 1 reply; 10+ messages in thread
From: Alex Bennée @ 2018-12-04 13:52 UTC (permalink / raw)
  To: Richard Henderson; +Cc: Emilio G. Cota, qemu-devel

Richard Henderson <richard.henderson@linaro.org> writes:

> On 12/4/18 6:28 AM, Alex Bennée wrote:
>> Emilio G. Cota <cota@braap.org> writes:
>>> This assumes that QEMU is running on an IEEE754-compliant FPU and
>>> that the rounding is set to the default (to nearest). The
>>> implementation-dependent specifics of the FPU should not matter; things
>>> like tininess detection and snan representation are still dealt with in
>>> soft-fp. However, this approach will break on most hosts if we compile
>>> QEMU with flags such as -ffast-math. We control the flags so this should
>>> be easy to enforce though.
>>
>> We don't currently enforce this though although maybe that would be too
>> much hand holding for compiler ricers hell bent on not understanding the
>> flags they use.
>
> We could always
>
> #ifdef __FAST_MATH__
> #error "Silliness like this will get you nowhere"
> #endif

Emilio, are you happy to add that guard with a suitable pithy comment?

--
Alex Bennée

^ permalink raw reply	[flat|nested] 10+ messages in thread
* Re: [Qemu-devel] [PATCH v6 07/13] fpu: introduce hardfloat
  2018-12-04 13:52         ` Alex Bennée
@ 2018-12-04 17:31           ` Emilio G. Cota
  2018-12-04 19:08             ` Alex Bennée
  0 siblings, 1 reply; 10+ messages in thread
From: Emilio G. Cota @ 2018-12-04 17:31 UTC (permalink / raw)
  To: Alex Bennée; +Cc: Richard Henderson, qemu-devel

On Tue, Dec 04, 2018 at 13:52:16 +0000, Alex Bennée wrote:
> > We could always
> >
> > #ifdef __FAST_MATH__
> > #error "Silliness like this will get you nowhere"
> > #endif
>
> Emilio, are you happy to add that guard with a suitable pithy comment?

Isn't it better to just disable hardfloat then?

--- a/fpu/softfloat.c
+++ b/fpu/softfloat.c
@@ -220,7 +220,7 @@ GEN_INPUT_FLUSH3(float64_input_flush3, float64)
  * the use of hardfloat, since hardfloat relies on the inexact flag being
  * already set.
  */
-#if defined(TARGET_PPC)
+#if defined(TARGET_PPC) || defined(__FAST_MATH__)
 # define QEMU_NO_HARDFLOAT 1
 # define QEMU_SOFTFLOAT_ATTR QEMU_FLATTEN
 #else

Or perhaps disable it, as well as issue a #warning?

		E.

^ permalink raw reply	[flat|nested] 10+ messages in thread
* Re: [Qemu-devel] [PATCH v6 07/13] fpu: introduce hardfloat
  2018-12-04 17:31           ` Emilio G. Cota
@ 2018-12-04 19:08             ` Alex Bennée
  0 siblings, 0 replies; 10+ messages in thread
From: Alex Bennée @ 2018-12-04 19:08 UTC (permalink / raw)
  To: Emilio G. Cota; +Cc: Richard Henderson, qemu-devel

Emilio G. Cota <cota@braap.org> writes:

> On Tue, Dec 04, 2018 at 13:52:16 +0000, Alex Bennée wrote:
>> > We could always
>> >
>> > #ifdef __FAST_MATH__
>> > #error "Silliness like this will get you nowhere"
>> > #endif
>>
>> Emilio, are you happy to add that guard with a suitable pithy comment?
>
> Isn't it better to just disable hardfloat then?
>
> --- a/fpu/softfloat.c
> +++ b/fpu/softfloat.c
> @@ -220,7 +220,7 @@ GEN_INPUT_FLUSH3(float64_input_flush3, float64)
>   * the use of hardfloat, since hardfloat relies on the inexact flag being
>   * already set.
>   */
> -#if defined(TARGET_PPC)
> +#if defined(TARGET_PPC) || defined(__FAST_MATH__)
>  # define QEMU_NO_HARDFLOAT 1
>  # define QEMU_SOFTFLOAT_ATTR QEMU_FLATTEN
>  #else
>
> Or perhaps disable it, as well as issue a #warning?

Issuing the warning is only to tell the user they are being stupid, but
yeah, certainly disable. Maybe we'll be around when someone comes asking
why maths didn't get faster ;-)

>
> E.

--
Alex Bennée

^ permalink raw reply	[flat|nested] 10+ messages in thread
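[Editor's illustration: the outcome discussed in these last two messages,
disabling hardfloat under -ffast-math and emitting a warning rather than a
hard error, could look roughly like the sketch below. It extends the hunk
quoted in the thread and is not necessarily the form that was finally
committed.]

```c
#include <assert.h>

/* Sketch combining the quoted patch hunk with the #warning idea
 * discussed above; illustrative, not the final committed form. */
#if defined(TARGET_PPC) || defined(__FAST_MATH__)
# if defined(__FAST_MATH__)
#  warning "-ffast-math breaks IEEE754 semantics: disabling hardfloat"
# endif
# define QEMU_NO_HARDFLOAT 1
#else
# define QEMU_NO_HARDFLOAT 0
#endif
```

[GCC and Clang predefine `__FAST_MATH__` when -ffast-math is in effect, so
the guard costs nothing for a normal build and only fires for the
non-compliant configuration.]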
end of thread, other threads:[~2018-12-05 16:09 UTC | newest]

Thread overview: 10+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
[not found] <mailman.26358.1543950647.1282.qemu-devel@nongnu.org>
2018-12-05 11:07 ` [Qemu-devel] [PATCH v6 07/13] fpu: introduce hardfloat Programmingkid
2018-12-05 16:08   ` Emilio G. Cota
2018-11-24 23:55 [Qemu-devel] [PATCH v6 00/13] hardfloat Emilio G. Cota
2018-11-24 23:55 ` [Qemu-devel] [PATCH v6 07/13] fpu: introduce hardfloat Emilio G. Cota
2018-11-25  0:25   ` Aleksandar Markovic
2018-11-25  1:25     ` Emilio G. Cota
2018-12-04 12:28   ` Alex Bennée
2018-12-04 13:33     ` Richard Henderson
2018-12-04 13:52       ` Alex Bennée
2018-12-04 17:31         ` Emilio G. Cota
2018-12-04 19:08           ` Alex Bennée