Re: [Qemu-devel] [PATCH] include/fpu/softfloat: Fix compilation with Clang on s390x


From: "Alex Bennée" <alex.bennee@linaro.org>
To: Thomas Huth <thuth@redhat.com>
Cc: "Peter Maydell" <peter.maydell@linaro.org>,
	"Richard Henderson" <rth@twiddle.net>,
	"Philippe Mathieu-Daudé" <philmd@redhat.com>,
	"Aurelien Jarno" <aurelien@aurel32.net>,
	"Cornelia Huck" <cohuck@redhat.com>,
	"QEMU Developers" <qemu-devel@nongnu.org>,
	qemu-s390x <qemu-s390x@nongnu.org>
Subject: Re: [Qemu-devel] [PATCH] include/fpu/softfloat: Fix compilation with Clang on s390x
Date: Tue, 15 Jan 2019 16:01:32 +0000
Message-ID: <87va2poqoz.fsf@linaro.org>
In-Reply-To: <6cb80b50-0352-430e-0c46-85ed69f95c88@redhat.com>


Thomas Huth <thuth@redhat.com> writes:

> On 2019-01-15 15:46, Alex Bennée wrote:
>>
>> Peter Maydell <peter.maydell@linaro.org> writes:
>>
>>> On Mon, 14 Jan 2019 at 22:48, Alex Bennée <alex.bennee@linaro.org> wrote:
>>>>
>>>>
>>>> Richard Henderson <rth@twiddle.net> writes:
>>>>> But perhaps
>>>>>
>>>>>     unsigned __int128 n = (unsigned __int128)n1 << 64 | n0;
>>>>>     *r = n % d;
>>>>>     return n / d;
>>>>>
>>>>> will allow the compiler to do what the assembly does for some 64-bit
>>>>> hosts.
>>>>
>>>> I wonder how much cost is incurred by the jumping to the (libgcc?) div
>>>> helper? Anyone got an s390x about so we can benchmark the two
>>>> approaches?
>>>
>>> The project has an s390x system available; however it's usually
>>> running merge build tests so not so useful for benchmarking.
>>> (I can set up accounts on it but that requires me to faff about
>>> figuring out how to create new accounts :-))
>>
>> I'm happy to leave this up to those who care about s390x host
>> performance (Thomas, Cornelia?). I'm just keen to avoid the divide
>> helper getting too #ifdefy.
>
> Ok, I just did a quick'n'dirty "benchmark" on the s390x that I've got
> available:

Ahh I should have mentioned we already have the technology for this ;-)

If you build the fpu/next tree on an s390x you can then run:

  ./tests/fp/fp-bench f64_div

with and without the CONFIG_INT128 path. To get an idea of the real world
impact you can compile a foreign binary and run it on an s390x system
with:

  $QEMU ./tests/fp/fp-bench f64_div -t host

And that will give you the peak performance assuming your program is
doing nothing but f64_div operations. If the two QEMU builds are in
the same ballpark then the difference isn't worth worrying about. That said:

> #include <stdio.h>
> #include <time.h>
> #include <stdint.h>
>
> uint64_t udiv_qrnnd1(uint64_t *r, uint64_t n1, uint64_t n0, uint64_t d)
> {
>     unsigned __int128 n = (unsigned __int128)n1 << 64 | n0;
>     asm("dlgr %0, %1" : "+r"(n) : "r"(d));
>     *r = n >> 64;
>     return n;
> }
>
> uint64_t udiv_qrnnd2(uint64_t *r, uint64_t n1, uint64_t n0, uint64_t d)
> {
>     unsigned __int128 n = (unsigned __int128)n1 << 64 | n0;
>     *r = n % d;
>     return n / d;
> }
>
<snip>
>
> int main()
> {
> 	uint64_t r = 0, n1 = 0, n0 = 0, d = 0;
> 	uint64_t rs = 0, rn = 0;
> 	clock_t start, end;
> 	long i;
>
> 	start = clock();
> 	for (i=0; i<200000000L; i++) {
> 		n1 += 3;
> 		n0 += 987654321;
> 		d += 0x123456789;
> 		rs += udiv_qrnnd1(&r, n1, n0, d);
> 		rn += r;
> 	}
> 	end = clock();
> 	printf("test 1: time=%li\t, rs=%li , rn = %li\n", (end-start)/1000, rs, rn);
>
> 	r = n1 = n0 = d = rs = rn = 0;
>
> 	start = clock();
> 	for (i=0; i<200000000L; i++) {
> 		n1 += 3;
> 		n0 += 987654321;
> 		d += 0x123456789;
> 		rs += udiv_qrnnd2(&r, n1, n0, d);
> 		rn += r;
> 	}
> 	end = clock();
> 	printf("test 2: time=%li\t, rs=%li , rn = %li\n", (end-start)/1000, rs, rn);
>
> 	r = n1 = n0 = d = rs = rn = 0;
>
> 	start = clock();
> 	for (i=0; i<200000000L; i++) {
> 		n1 += 3;
> 		n0 += 987654321;
> 		d += 0x123456789;
> 		rs += udiv_qrnnd3(&r, n1, n0, d);
> 		rn += r;
> 	}
> 	end = clock();
> 	printf("test 3: time=%li\t, rs=%li , rn = %li\n", (end-start)/1000, rs, rn);
>
> 	return 0;
> }
>
> ... and results with GCC v8.2.1 are (using -O2):
>
> test 1: time=609	, rs=2264924160200000000 , rn = 6136218997527160832
> test 2: time=10127	, rs=2264924160200000000 , rn = 6136218997527160832
> test 3: time=2350	, rs=2264924183048928865 , rn = 4842822048162311089
>
> Thus the int128 version is the slowest!

I'd expect a little slowdown due to the indirection into libgcc, but
that seems pretty high.

>
> ... but at least it gives the same results as the DLGR instruction. The 64-bit
> version gives different results - do we have a bug here?
>
> Results with Clang v7.0.1 (using -O2, too) are these:
>
> test 2: time=5035	, rs=2264924160200000000 , rn = 6136218997527160832
> test 3: time=1970	, rs=2264924183048928865 , rn = 4842822048162311089

You can run:

  ./tests/fp/fp-test f64_div -l 2 -r all

For a proper comprehensive test.
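
For a quick standalone sanity check of just the divide helper, something
along these lines could be used. This is only a sketch: the names
udiv_qrnnd_ref, udiv_qrnnd_int128 and count_mismatches are made up for
illustration and are not QEMU functions. It assumes a GCC or Clang host
with unsigned __int128, and it assumes n1 < d so that the quotient fits
in 64 bits. It compares the int128 variant from the benchmark above
against a plain shift-and-subtract long division over the same input
sequence:

```c
#include <assert.h>
#include <stdint.h>

/*
 * Hypothetical reference: shift-and-subtract division of the 128-bit
 * value n1:n0 by d. Requires n1 < d so the quotient fits in 64 bits.
 */
static uint64_t udiv_qrnnd_ref(uint64_t *r, uint64_t n1, uint64_t n0,
                               uint64_t d)
{
    uint64_t q = 0, rem = n1;
    int i;

    assert(d != 0 && n1 < d);
    for (i = 63; i >= 0; i--) {
        /* Bit shifted out of the top of rem; if set, 2*rem+bit > d. */
        uint64_t carry = rem >> 63;

        rem = (rem << 1) | ((n0 >> i) & 1);
        if (carry || rem >= d) {
            rem -= d;                 /* wraps correctly when carry is set */
            q |= (uint64_t)1 << i;
        }
    }
    *r = rem;
    return q;
}

/* The int128 variant, as in the quoted benchmark (udiv_qrnnd2). */
static uint64_t udiv_qrnnd_int128(uint64_t *r, uint64_t n1, uint64_t n0,
                                  uint64_t d)
{
    unsigned __int128 n = (unsigned __int128)n1 << 64 | n0;

    *r = (uint64_t)(n % d);
    return (uint64_t)(n / d);
}

/*
 * Walk the same input sequence as the benchmark and count how often
 * the two implementations disagree on quotient or remainder.
 */
static long count_mismatches(long iters)
{
    uint64_t n1 = 0, n0 = 0, d = 0, r1, r2;
    long i, bad = 0;

    for (i = 0; i < iters; i++) {
        n1 += 3;
        n0 += 987654321;
        d += 0x123456789;
        if (d == 0 || n1 >= d) {
            continue;                 /* outside the assumed precondition */
        }
        if (udiv_qrnnd_ref(&r1, n1, n0, d) !=
            udiv_qrnnd_int128(&r2, n1, n0, d) || r1 != r2) {
            bad++;
        }
    }
    return bad;
}
```

Calling count_mismatches() from a small main() and checking that it
returns 0 gives a rough check of the int128 path only; it is no
substitute for the fp-test run above.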

--
Alex Bennée
