Re: [Qemu-devel] [PATCH] include/fpu/softfloat: Fix compilation with Clang on s390x

From: "Alex Bennée" <alex.bennee@linaro.org>
To: Thomas Huth <thuth@redhat.com>
Cc: "Peter Maydell" <peter.maydell@linaro.org>,
	"Richard Henderson" <rth@twiddle.net>,
	"Philippe Mathieu-Daudé" <philmd@redhat.com>,
	"Aurelien Jarno" <aurelien@aurel32.net>,
	"Cornelia Huck" <cohuck@redhat.com>,
	"QEMU Developers" <qemu-devel@nongnu.org>,
	qemu-s390x <qemu-s390x@nongnu.org>
Subject: Re: [Qemu-devel] [PATCH] include/fpu/softfloat: Fix compilation with Clang on s390x
Date: Tue, 15 Jan 2019 16:01:32 +0000
Message-ID: <87va2poqoz.fsf@linaro.org>
In-Reply-To: <6cb80b50-0352-430e-0c46-85ed69f95c88@redhat.com>


Thomas Huth <thuth@redhat.com> writes:

> On 2019-01-15 15:46, Alex Bennée wrote:
>>
>> Peter Maydell <peter.maydell@linaro.org> writes:
>>
>>> On Mon, 14 Jan 2019 at 22:48, Alex Bennée <alex.bennee@linaro.org> wrote:
>>>>
>>>>
>>>> Richard Henderson <rth@twiddle.net> writes:
>>>>> But perhaps
>>>>>
>>>>>     unsigned __int128 n = (unsigned __int128)n1 << 64 | n0;
>>>>>     *r = n % d;
>>>>>     return n / d;
>>>>>
>>>>> will allow the compiler to do what the assembly does for some 64-bit
>>>>> hosts.
>>>>
>>>> I wonder how much cost is incurred by the jumping to the (libgcc?) div
>>>> helper? Anyone got an s390x about so we can benchmark the two
>>>> approaches?
>>>
>>> The project has an s390x system available; however it's usually
>>> running merge build tests so not so useful for benchmarking.
>>> (I can set up accounts on it but that requires me to faff about
>>> figuring out how to create new accounts :-))
>>
>> I'm happy to leave this up to those who care about s390x host
>> performance (Thomas, Cornelia?). I'm just keen to avoid the divide
>> helper getting too #ifdefy.
>
> Ok, I just did a quick'n'dirty "benchmark" on the s390x that I've got
> available:

Ahh I should have mentioned we already have the technology for this ;-)

If you build the fpu/next tree on an s390x you can then run:

  ./tests/fp/fp-bench f64_div

with and without the CONFIG_128 path. To get an idea of the real world
impact you can compile a foreign binary and run it on an s390x system
with:

  $QEMU ./tests/fp/fp-bench f64_div -t host

And that will give you the peak performance assuming your program is
doing nothing but f64_div operations. If the two QEMUs are basically in
the same ballpark then the choice doesn't make enough of a difference to
matter. That said:

> #include <stdio.h>
> #include <time.h>
> #include <stdint.h>
>
> uint64_t udiv_qrnnd1(uint64_t *r, uint64_t n1, uint64_t n0, uint64_t d)
> {
>     unsigned __int128 n = (unsigned __int128)n1 << 64 | n0;
>     asm("dlgr %0, %1" : "+r"(n) : "r"(d));
>     *r = n >> 64;
>     return n;
> }
>
> uint64_t udiv_qrnnd2(uint64_t *r, uint64_t n1, uint64_t n0, uint64_t d)
> {
>     unsigned __int128 n = (unsigned __int128)n1 << 64 | n0;
>     *r = n % d;
>     return n / d;
> }
>
<snip>
>
> int main()
> {
> 	uint64_t r = 0, n1 = 0, n0 = 0, d = 0;
> 	uint64_t rs = 0, rn = 0;
> 	clock_t start, end;
> 	long i;
>
> 	start = clock();
> 	for (i=0; i<200000000L; i++) {
> 		n1 += 3;
> 		n0 += 987654321;
> 		d += 0x123456789;
> 		rs += udiv_qrnnd1(&r, n1, n0, d);
> 		rn += r;
> 	}
> 	end = clock();
> 	printf("test 1: time=%li\t, rs=%li , rn = %li\n", (end-start)/1000, rs, rn);
>
> 	r = n1 = n0 = d = rs = rn = 0;
>
> 	start = clock();
> 	for (i=0; i<200000000L; i++) {
> 		n1 += 3;
> 		n0 += 987654321;
> 		d += 0x123456789;
> 		rs += udiv_qrnnd2(&r, n1, n0, d);
> 		rn += r;
> 	}
> 	end = clock();
> 	printf("test 2: time=%li\t, rs=%li , rn = %li\n", (end-start)/1000, rs, rn);
>
> 	r = n1 = n0 = d = rs = rn = 0;
>
> 	start = clock();
> 	for (i=0; i<200000000L; i++) {
> 		n1 += 3;
> 		n0 += 987654321;
> 		d += 0x123456789;
> 		rs += udiv_qrnnd3(&r, n1, n0, d);
> 		rn += r;
> 	}
> 	end = clock();
> 	printf("test 3: time=%li\t, rs=%li , rn = %li\n", (end-start)/1000, rs, rn);
>
> 	return 0;
> }
>
> ... and results with GCC v8.2.1 are (using -O2):
>
> test 1: time=609	, rs=2264924160200000000 , rn = 6136218997527160832
> test 2: time=10127	, rs=2264924160200000000 , rn = 6136218997527160832
> test 3: time=2350	, rs=2264924183048928865 , rn = 4842822048162311089
>
> Thus the int128 version is the slowest!

I'd expect a little slowdown due to the indirection into libgcc, but
that seems pretty high.
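
To look at the helper cost in isolation, something like the following
could be used. It is only a rough sketch: div128_by_64() and
time_div128() are hypothetical names, the operand pattern is copied from
the quoted benchmark, and it assumes a GCC or Clang host with
unsigned __int128 support. The noinline attribute is there so the
division stays a real call inside the timed loop.

```c
#include <assert.h>
#include <stdint.h>
#include <time.h>

/* 128-by-64 division through the compiler's __int128 support.
 * noinline keeps the division out of the caller's loop body. */
__attribute__((noinline))
static uint64_t div128_by_64(uint64_t *r, uint64_t n1, uint64_t n0,
                             uint64_t d)
{
    unsigned __int128 n = ((unsigned __int128)n1 << 64) | n0;
    *r = (uint64_t)(n % d);
    return (uint64_t)(n / d);
}

/* Runs `iters` divisions with the operand pattern from the quoted
 * benchmark. Returns elapsed CPU time in seconds and stores a checksum
 * of all quotients and remainders so the work cannot be optimised away. */
static double time_div128(long iters, uint64_t *checksum)
{
    uint64_t r = 0, n1 = 0, n0 = 0, d = 0, sum = 0;
    clock_t start, end;

    assert(iters > 0);
    start = clock();
    for (long i = 0; i < iters; i++) {
        n1 += 3;
        n0 += 987654321;
        d += 0x123456789ULL;
        sum += div128_by_64(&r, n1, n0, d) + r;
    }
    end = clock();

    *checksum = sum;
    return (double)(end - start) / CLOCKS_PER_SEC;
}
```

Calling time_div128() with the same iteration count for each variant and
comparing the checksums gives both a timing and a cheap check that the
variants compute the same thing.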

>
> ... but at least it gives the same results as the DLGR instruction. The 64-bit
> version gives different results - do we have a bug here?
>
> Results with Clang v7.0.1 (using -O2, too) are these:
>
> test 2: time=5035	, rs=2264924160200000000 , rn = 6136218997527160832
> test 3: time=1970	, rs=2264924183048928865 , rn = 4842822048162311089

You can run:

  ./tests/fp/fp-test f64_div -l 2 -r all

for a proper, comprehensive test.
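
For a quick cross-check outside the QEMU tree, the __int128 variant
quoted above can be compared against a plain shift-and-subtract long
division. This is a minimal sketch, not the softfloat code:
udiv_qrnnd_int128(), udiv_qrnnd_ref() and udiv_qrnnd_agree() are
hypothetical names, and both functions assume d != 0 and n1 < d so that
the quotient fits in 64 bits.

```c
#include <assert.h>
#include <stdint.h>

/* 128-by-64 division via the compiler's __int128 support (a GCC/Clang
 * extension on 64-bit hosts). */
static uint64_t udiv_qrnnd_int128(uint64_t *r, uint64_t n1, uint64_t n0,
                                  uint64_t d)
{
    assert(d != 0 && n1 < d);
    unsigned __int128 n = ((unsigned __int128)n1 << 64) | n0;
    *r = (uint64_t)(n % d);
    return (uint64_t)(n / d);
}

/* Reference: restoring long division, one quotient bit per iteration.
 * Slow, but it only needs 64-bit arithmetic. */
static uint64_t udiv_qrnnd_ref(uint64_t *r, uint64_t n1, uint64_t n0,
                               uint64_t d)
{
    uint64_t rem = n1, q = 0;

    assert(d != 0 && n1 < d);
    for (int i = 63; i >= 0; i--) {
        /* Top bit of rem is shifted out when the next dividend bit
         * comes in; remember it as a carry. */
        uint64_t carry = rem >> 63;
        rem = (rem << 1) | ((n0 >> i) & 1);
        q <<= 1;
        if (carry || rem >= d) {
            rem -= d;   /* modular wrap gives the right value on carry */
            q |= 1;
        }
    }
    *r = rem;
    return q;
}

/* Returns 1 when both versions agree on quotient and remainder. */
static int udiv_qrnnd_agree(uint64_t n1, uint64_t n0, uint64_t d)
{
    uint64_t r1, r2;
    uint64_t q1 = udiv_qrnnd_int128(&r1, n1, n0, d);
    uint64_t q2 = udiv_qrnnd_ref(&r2, n1, n0, d);

    return q1 == q2 && r1 == r2;
}
```

Feeding udiv_qrnnd_agree() the same n1/n0/d sequence as the benchmark
loop would show whether a given variant diverges, and on which operands.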

--
Alex Bennée
