From mboxrd@z Thu Jan  1 00:00:00 1970
Sender: Paolo Bonzini <paolo.bonzini@gmail.com>
Message-ID: <518A072E.9070708@redhat.com>
Date: Wed, 08 May 2013 10:05:02 +0200
From: Paolo Bonzini <pbonzini@redhat.com>
MIME-Version: 1.0
References: <86bo8mcsax.fsf@shell.gmplib.org>
In-Reply-To: <86bo8mcsax.fsf@shell.gmplib.org>
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 7bit
Subject: Re: [Qemu-devel] Possible ppc comparision optimisation
To: Torbjorn Granlund <tg@gmplib.org>
Cc: qemu-devel@nongnu.org

On 08/05/2013 00:56, Torbjorn Granlund wrote:
> The current ppc gen_op_cmp generates a long sequence of instructions,
> using a plain series of three disjoint compares.
> 
> It is possible to compute the 3 result bits more cleverly.  Below is a
> possible replacement gen_op_cmp.  (It is tested by booting GNU/Linux
> ppc64, but not much more than that.)
> 
> Surely this should be faster than the old code?  OK, it is less
> readable, but cmp is pretty critical and should be made fast.
> 
> Should one truncate things using tcg_gen_trunc_tl_i32 and do the add,
> xori, addi as i32 variants?  (Why?)

I think that would be faster on 32-bit hosts; truncs are cheap.
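
On correctness of doing the truncation first: add, xori and addi all
work modulo 2^32 on the low half, so truncating the two setcond results
up front should give the same low 32 bits as truncating at the end.
Below is a minimal plain-C model of that claim. It is not TCG code; the
function names are made up for illustration, and int64_t stands in for
a 64-bit target_ulong.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* 64-bit arithmetic, truncate last (the order used in the quoted patch). */
uint32_t crbits_trunc_last(int64_t a, int64_t b)
{
    uint64_t t0 = (uint64_t)(a <= b);   /* setcond LE */
    uint64_t t1 = (uint64_t)(a < b);    /* setcond LT */

    t0 = ((t0 + t1) ^ 1) + 1;           /* add, xori, addi as 64-bit ops */
    return (uint32_t)t0 << 1;           /* trunc, then shli */
}

/* Truncate the setcond results first, then use 32-bit arithmetic only. */
uint32_t crbits_trunc_first(int64_t a, int64_t b)
{
    uint32_t s0 = (uint32_t)(a <= b);   /* setcond LE, trunc */
    uint32_t s1 = (uint32_t)(a < b);    /* setcond LT, trunc */

    s0 = ((s0 + s1) ^ 1) + 1;           /* add, xori, addi as 32-bit ops */
    return s0 << 1;                     /* shli */
}

/* Compare both orders on a few boundary values. */
void check_trunc_order(void)
{
    static const int64_t v[] = { INT64_MIN, -1, 0, 1, INT64_MAX };
    const size_t n = sizeof(v) / sizeof(v[0]);

    for (size_t i = 0; i < n; i++) {
        for (size_t j = 0; j < n; j++) {
            assert(crbits_trunc_last(v[i], v[j]) ==
                   crbits_trunc_first(v[i], v[j]));
        }
    }
}
```

Calling check_trunc_order() from a test main is enough to exercise it.
This only models the arithmetic; whether the i32 variant is actually
faster depends on the host backend.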

> There could be a disadvantage of this compared to the old code, since
> this has a chained algebraic dependency, while the old code's many
> instructions might have been more independent.

What about these alternatives?

setcond LT, t0, arg0, arg1
setcond EQ, t1, arg0, arg1
trunc  s0, t0
trunc  s1, t1
shli   s0, s0, 1                ; s0 = (arg0 < arg1) ? 2 : 0
subi   s1, s1, 2                ; s1 = (arg0 != arg1) ? -2 : -1
sub    s0, s0, s1               ; < 4       == 1      > 2
shli   s0, s0, 1                ; < 8       == 2      > 4
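
As a sanity check of the comments above, here is a plain-C model of
this first sequence. It is a sketch, not TCG code: cmp_alt1 and
check_alt1 are hypothetical names, only the signed compare is modelled,
and uint32_t wrap-around stands in for the i32 temporaries.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Model of the setcond LT / setcond EQ sequence; returns the LT/GT/EQ bits. */
uint32_t cmp_alt1(int64_t arg0, int64_t arg1)
{
    uint32_t s0 = (uint32_t)(arg0 < arg1);    /* setcond LT, trunc */
    uint32_t s1 = (uint32_t)(arg0 == arg1);   /* setcond EQ, trunc */

    s0 <<= 1;           /* shli: 2 if less, else 0 */
    s1 -= 2;            /* subi: -1 if equal, else -2 (mod 2^32) */
    s0 -= s1;           /* sub:  less 4, equal 1, greater 2 */
    return s0 << 1;     /* shli: less 8, equal 2, greater 4 */
}

/* Check against the obvious three-way reference on boundary values. */
void check_alt1(void)
{
    static const int64_t v[] = { INT64_MIN, -2, -1, 0, 1, 2, INT64_MAX };
    const size_t n = sizeof(v) / sizeof(v[0]);

    for (size_t i = 0; i < n; i++) {
        for (size_t j = 0; j < n; j++) {
            uint32_t want = v[i] < v[j] ? 8u : (v[i] > v[j] ? 4u : 2u);
            assert(cmp_alt1(v[i], v[j]) == want);
        }
    }
}
```

The SO bit is left out here; it would still have to be ORed in
separately.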

=======

setcond LT, t0, arg0, arg1
setcond NE, t1, arg0, arg1
trunc   s0, t0
trunc   s1, t1
add     s0, s0, s1              ; < 2       == 0      > 1
movi    s1, 1
add     s0, s0, s1              ; < 3       == 1      > 2
shl     s1, s1, s0              ; < 8       == 2      > 4
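
The same kind of plain-C model for this second sequence, again only a
sketch with hypothetical names (cmp_alt2, check_alt2) and the signed
compare only. Note that the final shl leaves the result in s1, not s0.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Model of the setcond LT / setcond NE sequence; returns the LT/GT/EQ bits. */
uint32_t cmp_alt2(int64_t arg0, int64_t arg1)
{
    uint32_t s0 = (uint32_t)(arg0 < arg1);    /* setcond LT, trunc */
    uint32_t s1 = (uint32_t)(arg0 != arg1);   /* setcond NE, trunc */

    s0 += s1;           /* add:  less 2, equal 0, greater 1 */
    s1 = 1;             /* movi */
    s0 += s1;           /* add:  less 3, equal 1, greater 2 */
    s1 <<= s0;          /* shl:  less 8, equal 2, greater 4 */
    return s1;          /* the result ends up in s1 */
}

/* Check against the obvious three-way reference on boundary values. */
void check_alt2(void)
{
    static const int64_t v[] = { INT64_MIN, -2, -1, 0, 1, 2, INT64_MAX };
    const size_t n = sizeof(v) / sizeof(v[0]);

    for (size_t i = 0; i < n; i++) {
        for (size_t j = 0; j < n; j++) {
            uint32_t want = v[i] < v[j] ? 8u : (v[i] > v[j] ? 4u : 2u);
            assert(cmp_alt2(v[i], v[j]) == want);
        }
    }
}
```

The shift count never exceeds 3, so the variable shift is well defined
in the model.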

Paolo

> static inline void gen_op_cmp(TCGv arg0, TCGv arg1, int s, int crf)
> {
>     TCGv t0 = tcg_temp_new();
>     TCGv t1 = tcg_temp_new();
>     TCGv_i32 s0 = tcg_temp_new_i32();
> 
>     tcg_gen_trunc_tl_i32(cpu_crf[crf], cpu_so);
> 
>     tcg_gen_setcond_tl((s ? TCG_COND_LE: TCG_COND_LEU), t0, arg0, arg1);
>     tcg_gen_setcond_tl((s ? TCG_COND_LT: TCG_COND_LTU), t1, arg0, arg1);
>     tcg_gen_add_tl(t0, t0, t1);
>     tcg_gen_xori_tl(t0, t0, 1);
>     tcg_gen_addi_tl(t0, t0, 1);
>     tcg_gen_trunc_tl_i32(s0, t0);
>     tcg_gen_shli_i32(s0, s0, 1);
>     tcg_gen_or_i32(cpu_crf[crf], cpu_crf[crf], s0);
> 
>     tcg_temp_free(t0);
>     tcg_temp_free(t1);
>     tcg_temp_free_i32(s0);
> }
>
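
For reference, the arithmetic in the quoted gen_op_cmp can be modelled
in plain C as below. This is a sketch under assumptions, not QEMU code:
cmp_quoted is a hypothetical name, the so argument stands in for cpu_so
and is assumed to hold 0 or 1, and the signed case assumes a
two's-complement conversion from uint64_t to int64_t.

```c
#include <assert.h>
#include <stdint.h>

/* Model of the quoted sequence: SO in bit 0, EQ/GT/LT in bits 1..3. */
uint32_t cmp_quoted(uint64_t arg0, uint64_t arg1, int s, uint32_t so)
{
    uint32_t crf = so;      /* trunc of cpu_so; assumed to be 0 or 1 */
    uint64_t t0, t1;

    if (s) {                /* signed: TCG_COND_LE / TCG_COND_LT */
        t0 = (uint64_t)((int64_t)arg0 <= (int64_t)arg1);
        t1 = (uint64_t)((int64_t)arg0 < (int64_t)arg1);
    } else {                /* unsigned: TCG_COND_LEU / TCG_COND_LTU */
        t0 = (uint64_t)(arg0 <= arg1);
        t1 = (uint64_t)(arg0 < arg1);
    }

    t0 += t1;               /* add:  less 2, equal 1, greater 0 */
    t0 ^= 1;                /* xori: less 3, equal 0, greater 1 */
    t0 += 1;                /* addi: less 4, equal 1, greater 2 */

    return crf | ((uint32_t)t0 << 1);   /* trunc, shli, or with SO */
}
```

For example, cmp_quoted(UINT64_MAX, 1, 1, 0) gives 8 (as signed, -1 is
less than 1), while the same operands with s = 0 give 4.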