* [Qemu-devel] Possible ppc comparision optimisation @ 2013-05-07 22:56 Torbjorn Granlund 2013-05-08 8:05 ` Paolo Bonzini 0 siblings, 1 reply; 4+ messages in thread From: Torbjorn Granlund @ 2013-05-07 22:56 UTC (permalink / raw) To: qemu-devel The current ppc gen_op_cmp generates a long sequence of instructions, using a plain series of three disjoint compares. It is possible to compute the 3 result bits more cleverly. Below is a possible replacement gen_op_cmp. (It is tested by booting GNU/Linux ppx64, but not much more than that.) Surely this should be faster than the old code? OK, it is less readable, but cmp is pretty critical and should be made fast. Should one truncate things using tcg_gen_trunc_tl_i32 and do the add, xori, addi as i32 variants? (Why?) There could be a disadvantage of this compared to the old code, since this has a chained algebraic dependency, while the old code's many instructions might have been more independent. static inline void gen_op_cmp(TCGv arg0, TCGv arg1, int s, int crf) { TCGv t0 = tcg_temp_new(); TCGv t1 = tcg_temp_new(); TCGv_i32 s0 = tcg_temp_new_i32(); tcg_gen_trunc_tl_i32(cpu_crf[crf], cpu_so); tcg_gen_setcond_tl((s ? TCG_COND_LE: TCG_COND_LEU), t0, arg0, arg1); tcg_gen_setcond_tl((s ? TCG_COND_LT: TCG_COND_LTU), t1, arg0, arg1); tcg_gen_add_tl(t0, t0, t1); tcg_gen_xori_tl(t0, t0, 1); tcg_gen_addi_tl(t0, t0, 1); tcg_gen_trunc_tl_i32(s0, t0); tcg_gen_shli_i32(s0, s0, 1); tcg_gen_or_i32(cpu_crf[crf], cpu_crf[crf], s0); tcg_temp_free(t0); tcg_temp_free(t1); tcg_temp_free_i32(s0); } -- Torbjörn ^ permalink raw reply [flat|nested] 4+ messages in thread
* Re: [Qemu-devel] Possible ppc comparision optimisation 2013-05-07 22:56 [Qemu-devel] Possible ppc comparision optimisation Torbjorn Granlund @ 2013-05-08 8:05 ` Paolo Bonzini 2013-05-08 15:44 ` Torbjorn Granlund 0 siblings, 1 reply; 4+ messages in thread From: Paolo Bonzini @ 2013-05-08 8:05 UTC (permalink / raw) To: Torbjorn Granlund; +Cc: qemu-devel Il 08/05/2013 00:56, Torbjorn Granlund ha scritto: > The current ppc gen_op_cmp generates a long sequence of instructions, > using a plain series of three disjoint compares. > > It is possible to compute the 3 result bits more cleverly. Below is a > possible replacement gen_op_cmp. (It is tested by booting GNU/Linux > ppx64, but not much more than that.) > > Surely this should be faster than the old code? OK, it is less > readable, but cmp is pretty critical and should be made fast. > > Should one truncate things using tcg_gen_trunc_tl_i32 and do the add, > xori, addi as i32 variants? (Why?) I think that would be faster on 32-bit hosts, truncs are cheap. > There could be a disadvantage of this compared to the old code, since > this has a chained algebraic dependency, while the old code's many > instructions might have been more independent. What about these alternatives: setcond LT, t0, arg0, arg1 setcond EQ, t1, arg0, arg1 trunc s0, t0 trunc s1, t1 shli s0, s0, 1 ; s0 = (arg0 < arg1) ? 2 : 0 subi s1, s1, 2 ; s1 = (arg0 != arg1) ? -2 : -1 sub s0, s0, s1 ; < 4 == 1 > 2 shli s0, s0, 1 ; < 8 == 2 > 4 ======= setcond LT, t0, arg0, arg1 setcond NE, t1, arg0, arg1 trunc s0, t0 trunc s1, t1 add s0, s0, s1 ; < 2 == 0 > 1 movi s1, 1 add s0, s0, s1 ; < 3 == 1 > 2 shl s1, s1, s0 ; < 8 == 2 > 4 Paolo > static inline void gen_op_cmp(TCGv arg0, TCGv arg1, int s, int crf) > { > TCGv t0 = tcg_temp_new(); > TCGv t1 = tcg_temp_new(); > TCGv_i32 s0 = tcg_temp_new_i32(); > > tcg_gen_trunc_tl_i32(cpu_crf[crf], cpu_so); > > tcg_gen_setcond_tl((s ? TCG_COND_LE: TCG_COND_LEU), t0, arg0, arg1); > tcg_gen_setcond_tl((s ? TCG_COND_LT: TCG_COND_LTU), t1, arg0, arg1); > tcg_gen_add_tl(t0, t0, t1); > tcg_gen_xori_tl(t0, t0, 1); > tcg_gen_addi_tl(t0, t0, 1); > tcg_gen_trunc_tl_i32(s0, t0); > tcg_gen_shli_i32(s0, s0, 1); > tcg_gen_or_i32(cpu_crf[crf], cpu_crf[crf], s0); > > tcg_temp_free(t0); > tcg_temp_free(t1); > tcg_temp_free_i32(s0); > } > ^ permalink raw reply [flat|nested] 4+ messages in thread
* Re: [Qemu-devel] Possible ppc comparision optimisation 2013-05-08 8:05 ` Paolo Bonzini @ 2013-05-08 15:44 ` Torbjorn Granlund 2013-05-08 16:16 ` Paolo Bonzini 0 siblings, 1 reply; 4+ messages in thread From: Torbjorn Granlund @ 2013-05-08 15:44 UTC (permalink / raw) To: Paolo Bonzini; +Cc: qemu-devel Paolo Bonzini <pbonzini@redhat.com> writes: I think that would be faster on 32-bit hosts, truncs are cheap. And slower perhaps on 64-bit hosts, at least for operations where additional explicit trunctation will be needed (such as before comparisions and after right shifts). > There could be a disadvantage of this compared to the old code, since > this has a chained algebraic dependency, while the old code's many > instructions might have been more independent. What about these alternatives: setcond LT, t0, arg0, arg1 setcond EQ, t1, arg0, arg1 trunc s0, t0 trunc s1, t1 shli s0, s0, 1 ; s0 = (arg0 < arg1) ? 2 : 0 subi s1, s1, 2 ; s1 = (arg0 != arg1) ? -2 : -1 sub s0, s0, s1 ; < 4 == 1 > 2 shli s0, s0, 1 ; < 8 == 2 > 4 ======= setcond LT, t0, arg0, arg1 setcond NE, t1, arg0, arg1 trunc s0, t0 trunc s1, t1 add s0, s0, s1 ; < 2 == 0 > 1 movi s1, 1 add s0, s0, s1 ; < 3 == 1 > 2 shl s1, s1, s0 ; < 8 == 2 > 4 Surely there are many alternative forms. Is your aim to add micro-parallelism? (Your sequences look a bit curious. Did you use a super-optimiser?) -- Torbjörn ^ permalink raw reply [flat|nested] 4+ messages in thread
* Re: [Qemu-devel] Possible ppc comparision optimisation 2013-05-08 15:44 ` Torbjorn Granlund @ 2013-05-08 16:16 ` Paolo Bonzini 0 siblings, 0 replies; 4+ messages in thread From: Paolo Bonzini @ 2013-05-08 16:16 UTC (permalink / raw) To: Torbjorn Granlund; +Cc: qemu-devel Il 08/05/2013 17:44, Torbjorn Granlund ha scritto: > Paolo Bonzini <pbonzini@redhat.com> writes: > > I think that would be faster on 32-bit hosts, truncs are cheap. > > And slower perhaps on 64-bit hosts, at least for operations where > additional explicit trunctation will be needed (such as before > comparisions and after right shifts). > > > There could be a disadvantage of this compared to the old code, since > > this has a chained algebraic dependency, while the old code's many > > instructions might have been more independent. > > What about these alternatives: > > setcond LT, t0, arg0, arg1 > setcond EQ, t1, arg0, arg1 > trunc s0, t0 > trunc s1, t1 > shli s0, s0, 1 ; s0 = (arg0 < arg1) ? 2 : 0 > subi s1, s1, 2 ; s1 = (arg0 != arg1) ? -2 : -1 > sub s0, s0, s1 ; < 4 == 1 > 2 > shli s0, s0, 1 ; < 8 == 2 > 4 > > ======= > > setcond LT, t0, arg0, arg1 > setcond NE, t1, arg0, arg1 > trunc s0, t0 > trunc s1, t1 > add s0, s0, s1 ; < 2 == 0 > 1 > movi s1, 1 > add s0, s0, s1 ; < 3 == 1 > 2 > shl s1, s1, s0 ; < 8 == 2 > 4 > > Surely there are many alternative forms. > Is your aim to add micro-parallelism? Yes, I think in this respect I think the first one is better. The second could be three instructions on machines that have a set-nth-bit instruction _and_ a zero register, but I'm not sure they exist... > (Your sequences look a bit curious. Did you use a super-optimiser?) No, but I am attracted to these curious sequences from my previous life working on compilers. :) I know your superoptimizer and, in fact, we both worked on some parts of GCC (optimization of conditional branches/stores), just 20 years apart. The second is actually not too curious after you look at it for a while, it is a variant of the usual (x > y) + (x >= y) trick used to generate a 0/1/2 result. The first I found by trial and error based on yours; it is basically (x < y) * 2 - (x == y) + 2, with some reordering to get parallelism and avoid the need for subfi-like instructions. Paolo ^ permalink raw reply [flat|nested] 4+ messages in thread
end of thread, other threads:[~2013-05-08 16:17 UTC | newest] Thread overview: 4+ messages (download: mbox.gz follow: Atom feed -- links below jump to the message on this page -- 2013-05-07 22:56 [Qemu-devel] Possible ppc comparision optimisation Torbjorn Granlund 2013-05-08 8:05 ` Paolo Bonzini 2013-05-08 15:44 ` Torbjorn Granlund 2013-05-08 16:16 ` Paolo Bonzini
This is a public inbox, see mirroring instructions for how to clone and mirror all data and code used for this inbox; as well as URLs for NNTP newsgroup(s).