From: "Alex Bennée" <alex.bennee@linaro.org>
To: Richard Henderson <richard.henderson@linaro.org>
Cc: BALATON Zoltan <balaton@eik.bme.hu>,
qemu-trivial@nongnu.org, qemu-devel@nongnu.org
Subject: Re: [PATCH] bitops.h: Compile out asserts without --enable-debug
Date: Tue, 23 May 2023 07:43:41 +0100 [thread overview]
Message-ID: <87o7mb5xp9.fsf@linaro.org> (raw)
In-Reply-To: <ec9cfe5a-d5f2-466d-34dc-c35817e7e010@linaro.org>
Richard Henderson <richard.henderson@linaro.org> writes:
> On 5/22/23 15:26, BALATON Zoltan wrote:
>> On Mon, 22 May 2023, Alex Bennée wrote:
>>> (ajb: add Richard for his compiler-fu)
>>> BALATON Zoltan <balaton@eik.bme.hu> writes:
>>>> On Mon, 22 May 2023, Alex Bennée wrote:
>>>>> BALATON Zoltan <balaton@eik.bme.hu> writes:
>>>>>
>>>>>> The low level extract and deposit funtions provided by bitops.h are
>>>>>> used in performance critical places. It crept into target/ppc via
>>>>>> FIELD_EX64 and also used by softfloat so PPC code using a lot of FPU
>>>>>> where hardfloat is also disabled is doubly affected.
>>>>>
>>>>> Most of these asserts compile out to nothing if the compiler is able to
>>>>> verify the constants are in the range. For example examining
>>>>> the start of float64_add:
>>>>>
>>> <snip>
>>>>>
>>>>> I don't see any check and abort steps because all the shift and mask
>>>>> values are known at compile time. The softfloat compilation certainly
>>>>> does have some assert points though:
>>>>>
>>>>> readelf -s ./libqemu-ppc64-softmmu.fa.p/fpu_softfloat.c.o |grep assert
>>>>> 136: 0000000000000000 0 NOTYPE GLOBAL DEFAULT UND g_assertion_mess[...]
>>>>> 138: 0000000000000000 0 NOTYPE GLOBAL DEFAULT UND __assert_fail
>>>>>
>>>>> but the references are for the ISRA segments so its tricky to know if
>>>>> they get used or are just there for LTO purposes.
>>>>>
>>>>> If there are hot-paths that show up the extract/deposit functions I
>>>>> suspect a better approach would be to implement _nocheck variants (or
>>>>> maybe _noassert?) and use them where required rather than turning off
>>>>> the assert checking for these utility functions.
>>>>
>>>> Just to clarify again, the asserts are still there when compiled with
>>>> --enable-debug. The patch only turns them off for optimised release
>>>> builds which I think makes sense if these asserts are to catch
>>>> programming errors.
>>>
>>> Well as Peter said the general policy is to keep asserts in but I
>>> appreciate this is a hotpath case.
>>>
>>>> I think I've also suggested adding noassert
>>>> versions of these but that wasn't a popular idea and it may also not
>>>> be easy to convert all places to use that like for example the
>>>> register fields related usage in target/ppc as that would also affect
>>>> other places.
>>>
>>> Is code generation or device emulation really on the hot-path. Generally
>>> a well predicted assert is in the noise for those operations.
>> They aren't in code generation but in helpers as you can also see in
>> the profile below and so they can be on hot path. Also I've noticed
>> that extract8 and extract16 just call extract32 after adding another
>> assert on their own in addition to the one in extract32 which is
>> double overhead for really no reason. I'd delete all these asserts
>> as the likelhood of bugs these could catch is very low anyway (how
>> often do you expect somebody to call these with out of bound values
>> that would not be obvious from the results otherwise?) but leaving
>> them in non-debug builds is totally useless in my opinion.
>>
>>>> So this seems to be the simplest and most effective
>>>> approach.
>>>>
>>>> The softfloat related usage in these tests I've done seem to mostly
>>>> come from unpacking and repacking floats in softfloat which is done
>>>> for every operation, e.g. muladd which mp3 encoding mostly uses does 3
>>>> unpacks and 1 pack for each call and each unpack is 3 extracts so even
>>>> small overheads add app quickly. Just 1 muladd will result in 9
>>>> extracts and 2 deposits at least plus updating PPC flags for each FPU
>>>> op adds a bunch more. I did some profiling with perf to find these.
>>>
>>> After some messing about trying to get lame to cross compile to a static
>>> binary I was able to replicate what you've seen:
>>>
>>> 11.44% qemu-ppc64 qemu-ppc64 [.] unpack_raw64.isra.0
>>> 11.03% qemu-ppc64 qemu-ppc64 [.] parts64_uncanon_normal
>>> 8.26% qemu-ppc64 qemu-ppc64 [.] helper_compute_fprf_float64
>>> 6.75% qemu-ppc64 qemu-ppc64 [.] do_float_check_status
>>> 5.34% qemu-ppc64 qemu-ppc64 [.] parts64_muladd
>>> 4.75% qemu-ppc64 qemu-ppc64 [.] pack_raw64.isra.0
>>> 4.38% qemu-ppc64 qemu-ppc64 [.] parts64_canonicalize
>>> 3.62% qemu-ppc64 qemu-ppc64 [.] float64r32_round_pack_canonical
>>> 3.32% qemu-ppc64 qemu-ppc64 [.] helper_todouble
>>> 2.68% qemu-ppc64 qemu-ppc64 [.] float64_add
>>> 2.51% qemu-ppc64 qemu-ppc64 [.] float64_hs_compare
>>> 2.30% qemu-ppc64 qemu-ppc64 [.] float64r32_muladd
>>> 1.80% qemu-ppc64 qemu-ppc64 [.] float64r32_mul
>>> 1.40% qemu-ppc64 qemu-ppc64 [.] float64r32_add
>>> 1.34% qemu-ppc64 qemu-ppc64 [.] parts64_mul
>>> 1.16% qemu-ppc64 qemu-ppc64 [.] parts64_addsub
>>> 1.14% qemu-ppc64 qemu-ppc64 [.] helper_reset_fpstatus
>>> 1.06% qemu-ppc64 qemu-ppc64 [.] helper_float_check_status
>>> 1.04% qemu-ppc64 qemu-ppc64 [.] float64_muladd
>> I've run 32 bit PPC version in qemu-system-ppc so the profile is a
>> bit different (has more system related overhead that I plan to look
>> at separately) but this part is similar to the above. I also wonder
>> what makes helper_compute_fprf_float64 a bottleneck as that does not
>> seem to have much extract/deposit, only a call to clz but it's hard
>> to tell what it really does due to nested calls and macros. I've
>> also seen this function among the top contenders in my profiling.
>>
>>> what I find confusing is the fact the parts extraction and packing
>>> should all be known constants which should cause the asserts to
>>> disappear. However it looks like the compiler decided to bring in a copy
>>> of the whole inline function (ISRA = >interprocedural scalar replacement
>>> of aggregates) which obviously can't fold the constants and eliminate
>>> the assert.
>> Could it be related to that while the parts size and start are
>> marked const but pulled out of a struct field so the compiler may
>> not know their actual value until run time?
>>
>>> Richard,
>>>
>>> Any idea of why the compiler might decide to do something like this?
>
> Try this.
That seems to have done the trick, translated code is now dominating the
profile:
+ 14.12% 0.00% qemu-ppc64 [unknown] [.] 0x0000004000619420
+ 13.30% 0.00% qemu-ppc64 [unknown] [.] 0x0000004000616850
+ 12.58% 12.19% qemu-ppc64 qemu-ppc64 [.] parts64_uncanon_normal
+ 10.62% 0.00% qemu-ppc64 [unknown] [.] 0x000000400061bf70
+ 9.91% 9.73% qemu-ppc64 qemu-ppc64 [.] helper_compute_fprf_float64
+ 7.84% 7.82% qemu-ppc64 qemu-ppc64 [.] do_float_check_status
+ 6.47% 5.78% qemu-ppc64 qemu-ppc64 [.] parts64_canonicalize.constprop.0
+ 6.46% 0.00% qemu-ppc64 [unknown] [.] 0x0000004000620130
+ 6.42% 0.00% qemu-ppc64 [unknown] [.] 0x0000004000619400
+ 6.17% 6.04% qemu-ppc64 qemu-ppc64 [.] parts64_muladd
+ 5.85% 0.00% qemu-ppc64 [unknown] [.] 0x00000040006167e0
+ 5.74% 0.00% qemu-ppc64 [unknown] [.] 0x0000b693fcffffd3
+ 5.45% 4.78% qemu-ppc64 qemu-ppc64 [.] float64r32_round_pack_canonical
+ 4.79% 0.00% qemu-ppc64 [unknown] [.] 0xfaff203940fe8240
+ 4.29% 3.82% qemu-ppc64 qemu-ppc64 [.] float64r32_muladd
+ 4.20% 3.87% qemu-ppc64 qemu-ppc64 [.] helper_todouble
+ 3.51% 2.98% qemu-ppc64 qemu-ppc64 [.] float64r32_mul
+ 3.12% 2.97% qemu-ppc64 qemu-ppc64 [.] float64_hs_compare
+ 3.06% 2.95% qemu-ppc64 qemu-ppc64 [.] float64_add
+ 2.88% 2.35% qemu-ppc64 qemu-ppc64 [.] float64r32_add
+ 2.65% 0.00% qemu-ppc64 [unknown] [.] 0x6c555e9100004039
+ 2.55% 1.30% qemu-ppc64 qemu-ppc64 [.] helper_float_check_status
eyeballing the float functions and I can't see any asserts expressed.
Before:
Executed in 721.65 secs fish external
usr time 721.38 secs 561.00 micros 721.38 secs
sys time 0.20 secs 261.00 micros 0.20 secs
After:
Executed in 650.38 secs fish external
usr time 650.11 secs 0.00 micros 650.11 secs
sys time 0.20 secs 989.00 micros 0.20 secs
>
>
> r~
>
> [2. text/x-patch; z.patch]...
--
Alex Bennée
Virtualisation Tech Lead @ Linaro
next prev parent reply other threads:[~2023-05-23 7:12 UTC|newest]
Thread overview: 11+ messages / expand[flat|nested] mbox.gz Atom feed top
2023-05-20 20:54 [PATCH] bitops.h: Compile out asserts without --enable-debug BALATON Zoltan
2023-05-22 9:16 ` Peter Maydell
2023-05-22 12:00 ` BALATON Zoltan
2023-05-22 12:05 ` Peter Maydell
2023-05-22 11:26 ` Alex Bennée
2023-05-22 13:25 ` BALATON Zoltan
2023-05-22 16:48 ` Alex Bennée
2023-05-22 22:26 ` BALATON Zoltan
2023-05-22 23:34 ` Richard Henderson
2023-05-23 6:43 ` Alex Bennée [this message]
2023-06-05 22:06 ` BALATON Zoltan
Reply instructions:
You may reply publicly to this message via plain-text email
using any one of the following methods:
* Save the following mbox file, import it into your mail client,
and reply-to-all from there: mbox
Avoid top-posting and favor interleaved quoting:
https://en.wikipedia.org/wiki/Posting_style#Interleaved_style
* Reply using the --to, --cc, and --in-reply-to
switches of git-send-email(1):
git send-email \
--in-reply-to=87o7mb5xp9.fsf@linaro.org \
--to=alex.bennee@linaro.org \
--cc=balaton@eik.bme.hu \
--cc=qemu-devel@nongnu.org \
--cc=qemu-trivial@nongnu.org \
--cc=richard.henderson@linaro.org \
/path/to/YOUR_REPLY
https://kernel.org/pub/software/scm/git/docs/git-send-email.html
* If your mail client supports setting the In-Reply-To header
via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line
before the message body.
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).