qemu-devel.nongnu.org archive mirror
 help / color / mirror / Atom feed
From: Richard Henderson <richard.henderson@linaro.org>
To: "BALATON Zoltan" <balaton@eik.bme.hu>,
	"Alex Bennée" <alex.bennee@linaro.org>
Cc: qemu-trivial@nongnu.org, qemu-devel@nongnu.org
Subject: Re: [PATCH] bitops.h: Compile out asserts without --enable-debug
Date: Mon, 22 May 2023 16:34:33 -0700	[thread overview]
Message-ID: <ec9cfe5a-d5f2-466d-34dc-c35817e7e010@linaro.org> (raw)
In-Reply-To: <49c4781a-1e91-85ef-d9cb-6996e9bbb10c@eik.bme.hu>

[-- Attachment #1: Type: text/plain, Size: 6588 bytes --]

On 5/22/23 15:26, BALATON Zoltan wrote:
> On Mon, 22 May 2023, Alex Bennée wrote:
>> (ajb: add Richard for his compiler-fu)
>> BALATON Zoltan <balaton@eik.bme.hu> writes:
>>> On Mon, 22 May 2023, Alex Bennée wrote:
>>>> BALATON Zoltan <balaton@eik.bme.hu> writes:
>>>>
>>>>> The low level extract and deposit funtions provided by bitops.h are
>>>>> used in performance critical places. It crept into target/ppc via
>>>>> FIELD_EX64 and also used by softfloat so PPC code using a lot of FPU
>>>>> where hardfloat is also disabled is doubly affected.
>>>>
>>>> Most of these asserts compile out to nothing if the compiler is able to
>>>> verify the constants are in the range. For example examining
>>>> the start of float64_add:
>>>>
>> <snip>
>>>>
>>>> I don't see any check and abort steps because all the shift and mask
>>>> values are known at compile time. The softfloat compilation certainly
>>>> does have some assert points though:
>>>>
>>>> readelf -s ./libqemu-ppc64-softmmu.fa.p/fpu_softfloat.c.o  |grep assert
>>>>   136: 0000000000000000     0 NOTYPE  GLOBAL DEFAULT  UND g_assertion_mess[...]
>>>>   138: 0000000000000000     0 NOTYPE  GLOBAL DEFAULT  UND __assert_fail
>>>>
>>>> but the references are for the ISRA segments so its tricky to know if
>>>> they get used or are just there for LTO purposes.
>>>>
>>>> If there are hot-paths that show up the extract/deposit functions I
>>>> suspect a better approach would be to implement _nocheck variants (or
>>>> maybe _noassert?) and use them where required rather than turning off
>>>> the assert checking for these utility functions.
>>>
>>> Just to clarify again, the asserts are still there when compiled with
>>> --enable-debug. The patch only turns them off for optimised release
>>> builds which I think makes sense if these asserts are to catch
>>> programming errors.
>>
>> Well as Peter said the general policy is to keep asserts in but I
>> appreciate this is a hotpath case.
>>
>>> I think I've also suggested adding noassert
>>> versions of these but that wasn't a popular idea and it may also not
>>> be easy to convert all places to use that like for example the
>>> register fields related usage in target/ppc as that would also affect
>>> other places.
>>
>> Is code generation or device emulation really on the hot-path. Generally
>> a well predicted assert is in the noise for those operations.
> 
> They aren't in code generation but in helpers as you can also see in the profile below and 
> so they can be on hot path. Also I've noticed that extract8 and extract16 just call 
> extract32 after adding another assert on their own in addition to the one in extract32 
> which is double overhead for really no reason. I'd delete all these asserts as the 
> likelhood of bugs these could catch is very low anyway (how often do you expect somebody 
> to call these with out of bound values that would not be obvious from the results 
> otherwise?) but leaving them in non-debug builds is totally useless in my opinion.
> 
>>> So this seems to be the simplest and most effective
>>> approach.
>>>
>>> The softfloat related usage in these tests I've done seem to mostly
>>> come from unpacking and repacking floats in softfloat which is done
>>> for every operation, e.g. muladd which mp3 encoding mostly uses does 3
>>> unpacks and 1 pack for each call and each unpack is 3 extracts so even
>>> small overheads add app quickly. Just 1 muladd will result in 9
>>> extracts and 2 deposits at least plus updating PPC flags for each FPU
>>> op adds a bunch more. I did some profiling with perf to find these.
>>
>> After some messing about trying to get lame to cross compile to a static
>> binary I was able to replicate what you've seen:
>>
>>  11.44%  qemu-ppc64  qemu-ppc64               [.] unpack_raw64.isra.0
>>  11.03%  qemu-ppc64  qemu-ppc64               [.] parts64_uncanon_normal
>>   8.26%  qemu-ppc64  qemu-ppc64               [.] helper_compute_fprf_float64
>>   6.75%  qemu-ppc64  qemu-ppc64               [.] do_float_check_status
>>   5.34%  qemu-ppc64  qemu-ppc64               [.] parts64_muladd
>>   4.75%  qemu-ppc64  qemu-ppc64               [.] pack_raw64.isra.0
>>   4.38%  qemu-ppc64  qemu-ppc64               [.] parts64_canonicalize
>>   3.62%  qemu-ppc64  qemu-ppc64               [.] float64r32_round_pack_canonical
>>   3.32%  qemu-ppc64  qemu-ppc64               [.] helper_todouble
>>   2.68%  qemu-ppc64  qemu-ppc64               [.] float64_add
>>   2.51%  qemu-ppc64  qemu-ppc64               [.] float64_hs_compare
>>   2.30%  qemu-ppc64  qemu-ppc64               [.] float64r32_muladd
>>   1.80%  qemu-ppc64  qemu-ppc64               [.] float64r32_mul
>>   1.40%  qemu-ppc64  qemu-ppc64               [.] float64r32_add
>>   1.34%  qemu-ppc64  qemu-ppc64               [.] parts64_mul
>>   1.16%  qemu-ppc64  qemu-ppc64               [.] parts64_addsub
>>   1.14%  qemu-ppc64  qemu-ppc64               [.] helper_reset_fpstatus
>>   1.06%  qemu-ppc64  qemu-ppc64               [.] helper_float_check_status
>>   1.04%  qemu-ppc64  qemu-ppc64               [.] float64_muladd
> 
> I've run 32 bit PPC version in qemu-system-ppc so the profile is a bit different (has more 
> system related overhead that I plan to look at separately) but this part is similar to the 
> above. I also wonder what makes helper_compute_fprf_float64 a bottleneck as that does not 
> seem to have much extract/deposit, only a call to clz but it's hard to tell what it really 
> does due to nested calls and macros. I've also seen this function among the top contenders 
> in my profiling.
> 
>> what I find confusing is the fact the parts extraction and packing
>> should all be known constants which should cause the asserts to
>> disappear. However it looks like the compiler decided to bring in a copy
>> of the whole inline function (ISRA = >interprocedural scalar replacement
>> of aggregates) which obviously can't fold the constants and eliminate
>> the assert.
> 
> Could it be related to that while the parts size and start are marked const but pulled out 
> of a struct field so the compiler may not know their actual value until run time?
> 
>> Richard,
>>
>> Any idea of why the compiler might decide to do something like this?

Try this.


r~

[-- Attachment #2: z.patch --]
[-- Type: text/x-patch, Size: 2697 bytes --]

diff --git a/fpu/softfloat.c b/fpu/softfloat.c
index 108f9cb224..42e6c188b4 100644
--- a/fpu/softfloat.c
+++ b/fpu/softfloat.c
@@ -593,27 +593,27 @@ static void unpack_raw64(FloatParts64 *r, const FloatFmt *fmt, uint64_t raw)
     };
 }
 
-static inline void float16_unpack_raw(FloatParts64 *p, float16 f)
+static void QEMU_FLATTEN float16_unpack_raw(FloatParts64 *p, float16 f)
 {
     unpack_raw64(p, &float16_params, f);
 }
 
-static inline void bfloat16_unpack_raw(FloatParts64 *p, bfloat16 f)
+static void QEMU_FLATTEN bfloat16_unpack_raw(FloatParts64 *p, bfloat16 f)
 {
     unpack_raw64(p, &bfloat16_params, f);
 }
 
-static inline void float32_unpack_raw(FloatParts64 *p, float32 f)
+static void QEMU_FLATTEN float32_unpack_raw(FloatParts64 *p, float32 f)
 {
     unpack_raw64(p, &float32_params, f);
 }
 
-static inline void float64_unpack_raw(FloatParts64 *p, float64 f)
+static void QEMU_FLATTEN float64_unpack_raw(FloatParts64 *p, float64 f)
 {
     unpack_raw64(p, &float64_params, f);
 }
 
-static void floatx80_unpack_raw(FloatParts128 *p, floatx80 f)
+static void QEMU_FLATTEN floatx80_unpack_raw(FloatParts128 *p, floatx80 f)
 {
     *p = (FloatParts128) {
         .cls = float_class_unclassified,
@@ -623,7 +623,7 @@ static void floatx80_unpack_raw(FloatParts128 *p, floatx80 f)
     };
 }
 
-static void float128_unpack_raw(FloatParts128 *p, float128 f)
+static void QEMU_FLATTEN float128_unpack_raw(FloatParts128 *p, float128 f)
 {
     const int f_size = float128_params.frac_size - 64;
     const int e_size = float128_params.exp_size;
@@ -650,27 +650,27 @@ static uint64_t pack_raw64(const FloatParts64 *p, const FloatFmt *fmt)
     return ret;
 }
 
-static inline float16 float16_pack_raw(const FloatParts64 *p)
+static float16 QEMU_FLATTEN float16_pack_raw(const FloatParts64 *p)
 {
     return make_float16(pack_raw64(p, &float16_params));
 }
 
-static inline bfloat16 bfloat16_pack_raw(const FloatParts64 *p)
+static bfloat16 QEMU_FLATTEN bfloat16_pack_raw(const FloatParts64 *p)
 {
     return pack_raw64(p, &bfloat16_params);
 }
 
-static inline float32 float32_pack_raw(const FloatParts64 *p)
+static float32 QEMU_FLATTEN float32_pack_raw(const FloatParts64 *p)
 {
     return make_float32(pack_raw64(p, &float32_params));
 }
 
-static inline float64 float64_pack_raw(const FloatParts64 *p)
+static float64 QEMU_FLATTEN float64_pack_raw(const FloatParts64 *p)
 {
     return make_float64(pack_raw64(p, &float64_params));
 }
 
-static float128 float128_pack_raw(const FloatParts128 *p)
+static float128 QEMU_FLATTEN float128_pack_raw(const FloatParts128 *p)
 {
     const int f_size = float128_params.frac_size - 64;
     const int e_size = float128_params.exp_size;

  reply	other threads:[~2023-05-22 23:35 UTC|newest]

Thread overview: 11+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2023-05-20 20:54 [PATCH] bitops.h: Compile out asserts without --enable-debug BALATON Zoltan
2023-05-22  9:16 ` Peter Maydell
2023-05-22 12:00   ` BALATON Zoltan
2023-05-22 12:05     ` Peter Maydell
2023-05-22 11:26 ` Alex Bennée
2023-05-22 13:25   ` BALATON Zoltan
2023-05-22 16:48     ` Alex Bennée
2023-05-22 22:26       ` BALATON Zoltan
2023-05-22 23:34         ` Richard Henderson [this message]
2023-05-23  6:43           ` Alex Bennée
2023-06-05 22:06             ` BALATON Zoltan

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=ec9cfe5a-d5f2-466d-34dc-c35817e7e010@linaro.org \
    --to=richard.henderson@linaro.org \
    --cc=alex.bennee@linaro.org \
    --cc=balaton@eik.bme.hu \
    --cc=qemu-devel@nongnu.org \
    --cc=qemu-trivial@nongnu.org \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line before the message body.
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).