All of lore.kernel.org
 help / color / mirror / Atom feed
From: "Alex Bennée" <alex.bennee@linaro.org>
To: Richard Henderson <richard.henderson@linaro.org>
Cc: BALATON Zoltan <balaton@eik.bme.hu>,
	qemu-trivial@nongnu.org, qemu-devel@nongnu.org
Subject: Re: [PATCH] bitops.h: Compile out asserts without --enable-debug
Date: Tue, 23 May 2023 07:43:41 +0100	[thread overview]
Message-ID: <87o7mb5xp9.fsf@linaro.org> (raw)
In-Reply-To: <ec9cfe5a-d5f2-466d-34dc-c35817e7e010@linaro.org>


Richard Henderson <richard.henderson@linaro.org> writes:

> On 5/22/23 15:26, BALATON Zoltan wrote:
>> On Mon, 22 May 2023, Alex Bennée wrote:
>>> (ajb: add Richard for his compiler-fu)
>>> BALATON Zoltan <balaton@eik.bme.hu> writes:
>>>> On Mon, 22 May 2023, Alex Bennée wrote:
>>>>> BALATON Zoltan <balaton@eik.bme.hu> writes:
>>>>>
>>>>>> The low level extract and deposit funtions provided by bitops.h are
>>>>>> used in performance critical places. It crept into target/ppc via
>>>>>> FIELD_EX64 and also used by softfloat so PPC code using a lot of FPU
>>>>>> where hardfloat is also disabled is doubly affected.
>>>>>
>>>>> Most of these asserts compile out to nothing if the compiler is able to
>>>>> verify the constants are in the range. For example examining
>>>>> the start of float64_add:
>>>>>
>>> <snip>
>>>>>
>>>>> I don't see any check and abort steps because all the shift and mask
>>>>> values are known at compile time. The softfloat compilation certainly
>>>>> does have some assert points though:
>>>>>
>>>>> readelf -s ./libqemu-ppc64-softmmu.fa.p/fpu_softfloat.c.o  |grep assert
>>>>>   136: 0000000000000000     0 NOTYPE  GLOBAL DEFAULT  UND g_assertion_mess[...]
>>>>>   138: 0000000000000000     0 NOTYPE  GLOBAL DEFAULT  UND __assert_fail
>>>>>
>>>>> but the references are for the ISRA segments so its tricky to know if
>>>>> they get used or are just there for LTO purposes.
>>>>>
>>>>> If there are hot-paths that show up the extract/deposit functions I
>>>>> suspect a better approach would be to implement _nocheck variants (or
>>>>> maybe _noassert?) and use them where required rather than turning off
>>>>> the assert checking for these utility functions.
>>>>
>>>> Just to clarify again, the asserts are still there when compiled with
>>>> --enable-debug. The patch only turns them off for optimised release
>>>> builds which I think makes sense if these asserts are to catch
>>>> programming errors.
>>>
>>> Well as Peter said the general policy is to keep asserts in but I
>>> appreciate this is a hotpath case.
>>>
>>>> I think I've also suggested adding noassert
>>>> versions of these but that wasn't a popular idea and it may also not
>>>> be easy to convert all places to use that like for example the
>>>> register fields related usage in target/ppc as that would also affect
>>>> other places.
>>>
>>> Is code generation or device emulation really on the hot-path. Generally
>>> a well predicted assert is in the noise for those operations.
>> They aren't in code generation but in helpers as you can also see in
>> the profile below and so they can be on hot path. Also I've noticed
>> that extract8 and extract16 just call extract32 after adding another
>> assert on their own in addition to the one in extract32 which is
>> double overhead for really no reason. I'd delete all these asserts
>> as the likelhood of bugs these could catch is very low anyway (how
>> often do you expect somebody to call these with out of bound values
>> that would not be obvious from the results otherwise?) but leaving
>> them in non-debug builds is totally useless in my opinion.
>> 
>>>> So this seems to be the simplest and most effective
>>>> approach.
>>>>
>>>> The softfloat related usage in these tests I've done seem to mostly
>>>> come from unpacking and repacking floats in softfloat which is done
>>>> for every operation, e.g. muladd which mp3 encoding mostly uses does 3
>>>> unpacks and 1 pack for each call and each unpack is 3 extracts so even
>>>> small overheads add app quickly. Just 1 muladd will result in 9
>>>> extracts and 2 deposits at least plus updating PPC flags for each FPU
>>>> op adds a bunch more. I did some profiling with perf to find these.
>>>
>>> After some messing about trying to get lame to cross compile to a static
>>> binary I was able to replicate what you've seen:
>>>
>>>  11.44%  qemu-ppc64  qemu-ppc64               [.] unpack_raw64.isra.0
>>>  11.03%  qemu-ppc64  qemu-ppc64               [.] parts64_uncanon_normal
>>>   8.26%  qemu-ppc64  qemu-ppc64               [.] helper_compute_fprf_float64
>>>   6.75%  qemu-ppc64  qemu-ppc64               [.] do_float_check_status
>>>   5.34%  qemu-ppc64  qemu-ppc64               [.] parts64_muladd
>>>   4.75%  qemu-ppc64  qemu-ppc64               [.] pack_raw64.isra.0
>>>   4.38%  qemu-ppc64  qemu-ppc64               [.] parts64_canonicalize
>>>   3.62%  qemu-ppc64  qemu-ppc64               [.] float64r32_round_pack_canonical
>>>   3.32%  qemu-ppc64  qemu-ppc64               [.] helper_todouble
>>>   2.68%  qemu-ppc64  qemu-ppc64               [.] float64_add
>>>   2.51%  qemu-ppc64  qemu-ppc64               [.] float64_hs_compare
>>>   2.30%  qemu-ppc64  qemu-ppc64               [.] float64r32_muladd
>>>   1.80%  qemu-ppc64  qemu-ppc64               [.] float64r32_mul
>>>   1.40%  qemu-ppc64  qemu-ppc64               [.] float64r32_add
>>>   1.34%  qemu-ppc64  qemu-ppc64               [.] parts64_mul
>>>   1.16%  qemu-ppc64  qemu-ppc64               [.] parts64_addsub
>>>   1.14%  qemu-ppc64  qemu-ppc64               [.] helper_reset_fpstatus
>>>   1.06%  qemu-ppc64  qemu-ppc64               [.] helper_float_check_status
>>>   1.04%  qemu-ppc64  qemu-ppc64               [.] float64_muladd
>> I've run 32 bit PPC version in qemu-system-ppc so the profile is a
>> bit different (has more system related overhead that I plan to look
>> at separately) but this part is similar to the above. I also wonder
>> what makes helper_compute_fprf_float64 a bottleneck as that does not
>> seem to have much extract/deposit, only a call to clz but it's hard
>> to tell what it really does due to nested calls and macros. I've
>> also seen this function among the top contenders in my profiling.
>> 
>>> what I find confusing is the fact the parts extraction and packing
>>> should all be known constants which should cause the asserts to
>>> disappear. However it looks like the compiler decided to bring in a copy
>>> of the whole inline function (ISRA = >interprocedural scalar replacement
>>> of aggregates) which obviously can't fold the constants and eliminate
>>> the assert.
>> Could it be related to that while the parts size and start are
>> marked const but pulled out of a struct field so the compiler may
>> not know their actual value until run time?
>> 
>>> Richard,
>>>
>>> Any idea of why the compiler might decide to do something like this?
>
> Try this.

That seems to have done the trick, translated code is now dominating the
profile:

+   14.12%     0.00%  qemu-ppc64  [unknown]                [.] 0x0000004000619420
+   13.30%     0.00%  qemu-ppc64  [unknown]                [.] 0x0000004000616850
+   12.58%    12.19%  qemu-ppc64  qemu-ppc64               [.] parts64_uncanon_normal
+   10.62%     0.00%  qemu-ppc64  [unknown]                [.] 0x000000400061bf70
+    9.91%     9.73%  qemu-ppc64  qemu-ppc64               [.] helper_compute_fprf_float64
+    7.84%     7.82%  qemu-ppc64  qemu-ppc64               [.] do_float_check_status
+    6.47%     5.78%  qemu-ppc64  qemu-ppc64               [.] parts64_canonicalize.constprop.0
+    6.46%     0.00%  qemu-ppc64  [unknown]                [.] 0x0000004000620130
+    6.42%     0.00%  qemu-ppc64  [unknown]                [.] 0x0000004000619400
+    6.17%     6.04%  qemu-ppc64  qemu-ppc64               [.] parts64_muladd
+    5.85%     0.00%  qemu-ppc64  [unknown]                [.] 0x00000040006167e0
+    5.74%     0.00%  qemu-ppc64  [unknown]                [.] 0x0000b693fcffffd3
+    5.45%     4.78%  qemu-ppc64  qemu-ppc64               [.] float64r32_round_pack_canonical
+    4.79%     0.00%  qemu-ppc64  [unknown]                [.] 0xfaff203940fe8240
+    4.29%     3.82%  qemu-ppc64  qemu-ppc64               [.] float64r32_muladd
+    4.20%     3.87%  qemu-ppc64  qemu-ppc64               [.] helper_todouble
+    3.51%     2.98%  qemu-ppc64  qemu-ppc64               [.] float64r32_mul
+    3.12%     2.97%  qemu-ppc64  qemu-ppc64               [.] float64_hs_compare
+    3.06%     2.95%  qemu-ppc64  qemu-ppc64               [.] float64_add
+    2.88%     2.35%  qemu-ppc64  qemu-ppc64               [.] float64r32_add
+    2.65%     0.00%  qemu-ppc64  [unknown]                [.] 0x6c555e9100004039
+    2.55%     1.30%  qemu-ppc64  qemu-ppc64               [.] helper_float_check_status

eyeballing the float functions and I can't see any asserts expressed.

Before:

Executed in  721.65 secs    fish           external
   usr time  721.38 secs  561.00 micros  721.38 secs
   sys time    0.20 secs  261.00 micros    0.20 secs

After:

Executed in  650.38 secs    fish           external
   usr time  650.11 secs    0.00 micros  650.11 secs
   sys time    0.20 secs  989.00 micros    0.20 secs



>
>
> r~
>
> [2. text/x-patch; z.patch]...


-- 
Alex Bennée
Virtualisation Tech Lead @ Linaro


  reply	other threads:[~2023-05-23  7:11 UTC|newest]

Thread overview: 11+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2023-05-20 20:54 [PATCH] bitops.h: Compile out asserts without --enable-debug BALATON Zoltan
2023-05-22  9:16 ` Peter Maydell
2023-05-22 12:00   ` BALATON Zoltan
2023-05-22 12:05     ` Peter Maydell
2023-05-22 11:26 ` Alex Bennée
2023-05-22 13:25   ` BALATON Zoltan
2023-05-22 16:48     ` Alex Bennée
2023-05-22 22:26       ` BALATON Zoltan
2023-05-22 23:34         ` Richard Henderson
2023-05-23  6:43           ` Alex Bennée [this message]
2023-06-05 22:06             ` BALATON Zoltan

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=87o7mb5xp9.fsf@linaro.org \
    --to=alex.bennee@linaro.org \
    --cc=balaton@eik.bme.hu \
    --cc=qemu-devel@nongnu.org \
    --cc=qemu-trivial@nongnu.org \
    --cc=richard.henderson@linaro.org \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line before the message body.
This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.