* OT: TCG SSA, speed, misc (was Re: [Qemu-devel] Re: [PATCH 08/11] QMP: Port balloon command)
From: Filip Navara @ 2009-06-28 18:19 UTC
To: Blue Swirl
Cc: Anthony Liguori, ehabkost, jan.kiszka, dlaor, qemu-devel,
Luiz Capitulino, Avi Kivity
On Sun, Jun 28, 2009 at 7:51 PM, Blue Swirl<blauwirbel@gmail.com> wrote:
> On 6/28/09, Filip Navara <filip.navara@gmail.com> wrote:
>> On Sun, Jun 28, 2009 at 5:52 PM, Avi Kivity<avi@redhat.com> wrote:
>> > It really isn't very complicated, and
>> > the thread only got so long because the topic is relatively simple. Post an
>> > RFC and a mile-long patchset about changing TCG to SSA form, and see how you
>> > get no replies.
>>
>>
>> I wouldn't even dare to push the SSA patch... Mile-long doesn't
>> describe it precisely enough. Imagine it was applied to all the
>> targets.
Just to be perfectly clear, this was meant as a joke. I don't have any
working SSA patch, nor am I working on one right now, but I am
interested in the topic. The main reason for my interest is this:
http://www.info.uni-karlsruhe.de/lehre/2006SS/uebau2/folien/08-RA_v1_4.pdf
http://www.info.uni-karlsruhe.de/~hack/ra_ssa.pdf
I'd like to know whether register allocation can be improved. I don't
believe SSA would help much with anything else, since the input code to
the translator has already been compiled with an optimizing compiler,
so most of the SSA-based optimizations would be redundant.
Profiling several ARM demo programs showed that most of the generated
ops were loads and stores of the guest registers (in CPU_env). A sample
run of FreeRTOS looked like this (op counts):
movi_i32 1603
ld_i32 1305
st_i32 1174
add_i32 530
...
If something could be done to keep the guest registers in host
registers, even if only temporarily, it would certainly help the guests
that I'm dealing with.
Best regards,
Filip Navara
* Re: OT: TCG SSA, speed, misc (was Re: [Qemu-devel] Re: [PATCH 08/11] QMP: Port balloon command)
From: Laurent Desnogues @ 2009-06-28 21:24 UTC
To: Filip Navara
Cc: Anthony Liguori, ehabkost, jan.kiszka, dlaor, qemu-devel,
Luiz Capitulino, Blue Swirl, Avi Kivity

On Sun, Jun 28, 2009 at 8:19 PM, Filip Navara<filip.navara@gmail.com> wrote:
> Doing a profiling run on several ARM demo programs showed that most of
> the generated code was doing load/store operations to the machine
> registers (in CPU_env). Sample run of FreeRTOS looked like this (OP
> counts):
>
> movi_i32 1603
> ld_i32 1305
> st_i32 1174
> add_i32 530
> ...
>
> If there could be done something that would allow the guest registers
> to be stored in host registers, even if for a temporary amount of time
> it would certainly help the guests that I'm dealing with.

TCG does a good job for register allocation.

The problem you have here is that the ARM translator
isn't using tcg_global_mem_new_i32 for ARM registers.

Here's an example of number of ops I see when using
tcg_global_mem_new_i32:

exit_tb 4991
add_i32 7945
st_i32 8257
movi_i32 26812
mov_i32 38369

And with the trunk:

exit_tb 4957
add_i32 8165
st_i32 20281
ld_i32 21926
movi_i32 25083

Laurent
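
The shift from ld_i32/st_i32 to mov_i32 in the two profiles can be pictured with a toy model. This is only a sketch under stated assumptions, not TCG's allocator: the types `Access` and `OpCount` and the functions `count_explicit` and `count_cached` are hypothetical names invented for illustration. It counts memory operations for one translated block when every guest register access is an explicit load or store, versus when each guest register is cached after first use and written back once at the end of the block.

```c
#include <stdbool.h>
#include <stddef.h>

enum { NUM_REGS = 16 };

/* One guest register access inside a translated block (hypothetical type). */
typedef struct {
    int  reg;       /* guest register number, 0..15 */
    bool is_write;  /* true for a write, false for a read */
} Access;

/* Memory operation counts for one block (hypothetical type). */
typedef struct {
    int loads;
    int stores;
} OpCount;

/* Model 1: every access is an explicit load or store on the CPU state. */
OpCount count_explicit(const Access *acc, size_t n)
{
    OpCount c = { 0, 0 };
    for (size_t i = 0; i < n; i++) {
        if (acc[i].is_write) {
            c.stores++;
        } else {
            c.loads++;
        }
    }
    return c;
}

/*
 * Model 2: a register is loaded at most once per block, later accesses
 * reuse the cached copy, and modified registers are written back once
 * when the block ends.
 */
OpCount count_cached(const Access *acc, size_t n)
{
    bool valid[NUM_REGS] = { false };
    bool dirty[NUM_REGS] = { false };
    OpCount c = { 0, 0 };

    for (size_t i = 0; i < n; i++) {
        int r = acc[i].reg;
        if (acc[i].is_write) {
            valid[r] = true;    /* the cached copy now holds the new value */
            dirty[r] = true;
        } else if (!valid[r]) {
            c.loads++;          /* first read: fetch from memory once */
            valid[r] = true;
        }
    }
    for (int r = 0; r < NUM_REGS; r++) {
        if (dirty[r]) {
            c.stores++;         /* one write-back per modified register */
        }
    }
    return c;
}
```

For a block that reads r1 twice, writes r0, reads it back, writes it again and finally writes r15, the first model counts three loads and three stores, the second one load and two stores. The real code generator has additional synchronisation points (helper calls, for instance) that this model ignores.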
* Re: OT: TCG SSA, speed, misc (was Re: [Qemu-devel] Re: [PATCH 08/11] QMP: Port balloon command)
From: Filip Navara @ 2009-06-28 23:19 UTC
To: Laurent Desnogues
Cc: Blue Swirl, Anthony Liguori, qemu-devel, Avi Kivity

[-- Attachment #1: Type: text/plain, Size: 1623 bytes --]

On Sun, Jun 28, 2009 at 11:24 PM, Laurent Desnogues<laurent.desnogues@gmail.com> wrote:
> On Sun, Jun 28, 2009 at 8:19 PM, Filip Navara<filip.navara@gmail.com> wrote:
>> Doing a profiling run on several ARM demo programs showed that most of
>> the generated code was doing load/store operations to the machine
>> registers (in CPU_env). Sample run of FreeRTOS looked like this (OP
>> counts):
>>
>> movi_i32 1603
>> ld_i32 1305
>> st_i32 1174
>> add_i32 530
>> ...
>>
>> If there could be done something that would allow the guest registers
>> to be stored in host registers, even if for a temporary amount of time
>> it would certainly help the guests that I'm dealing with.
>
> TCG does a good job for register allocation.
>
> The problem you have here is that the ARM translator
> isn't using tcg_global_mem_new_i32 for ARM registers.

Interesting, thanks for the tip. I have been trying to achieve the same
effect using tcg_global_reg_new_i32, no wonder it felt so hard. :)

> Here's an example of number of ops I see when using
> tcg_global_mem_new_i32:
>
> exit_tb 4991
> add_i32 7945
> st_i32 8257
> movi_i32 26812
> mov_i32 38369
>
> And with the trunk:
>
> exit_tb 4957
> add_i32 8165
> st_i32 20281
> ld_i32 21926
> movi_i32 25083
>
>
> Laurent
>

Attached is a proof-of-concept ARM patch that uses
tcg_global_mem_new_i32. I haven't had much time to test it yet, but on a
synthetic benchmark it improved performance by 13 DMIPS, to a total of
216 DMIPS, which is roughly a 6% improvement.

On an x86 host the register allocation still looks very pathetic; I will
post a follow-up soon.

Best regards,
Filip Navara

[-- Attachment #2: 0001-First-try-at-using-tcg_global_mem_new_i32.patch.txt --]
[-- Type: text/plain, Size: 3869 bytes --]

From 4feddee0e7e02e1daab764dbbf9d694277b1e00a Mon Sep 17 00:00:00 2001
From: Filip Navara <filip.navara@gmail.com>
Date: Mon, 29 Jun 2009 01:13:42 +0200
Subject: [PATCH] First try at using tcg_global_mem_new_i32.

---
 target-arm/translate.c |   40 +++++++++++++++++++++++-----------------
 1 files changed, 23 insertions(+), 17 deletions(-)

diff --git a/target-arm/translate.c b/target-arm/translate.c
index 62c9eff..9a39536 100644
--- a/target-arm/translate.c
+++ b/target-arm/translate.c
@@ -77,6 +77,7 @@ typedef struct DisasContext {
 static TCGv_ptr cpu_env;
 /* We reuse the same 64-bit temporaries for efficiency.  */
 static TCGv_i64 cpu_V0, cpu_V1, cpu_M0;
+static TCGv_i32 cpu_R[16];
 
 /* FIXME:  These should be removed.  */
 static TCGv cpu_T[2];
@@ -86,14 +87,26 @@ static TCGv_i64 cpu_F0d, cpu_F1d;
 #define ICOUNT_TEMP cpu_T[0]
 #include "gen-icount.h"
 
+static const char *regnames[] =
+    { "r0", "r1", "r2", "r3", "r4", "r5", "r6", "r7",
+      "r8", "r9", "r10", "r11", "r12", "r13", "r14", "pc" };
+
 /* initialize TCG globals.  */
 void arm_translate_init(void)
 {
+    int i;
+
     cpu_env = tcg_global_reg_new_ptr(TCG_AREG0, "env");
 
     cpu_T[0] = tcg_global_reg_new_i32(TCG_AREG1, "T0");
     cpu_T[1] = tcg_global_reg_new_i32(TCG_AREG2, "T1");
 
+    for (i = 0; i < 16; i++) {
+        cpu_R[i] = tcg_global_mem_new_i32(TCG_AREG0,
+                                          offsetof(CPUState, regs[i]),
+                                          regnames[i]);
+    }
+
 #define GEN_HELPER 2
 #include "helpers.h"
 }
@@ -168,7 +181,7 @@ static void load_reg_var(DisasContext *s, TCGv var, int reg)
         addr = (long)s->pc + 4;
         tcg_gen_movi_i32(var, addr);
     } else {
-        tcg_gen_ld_i32(var, cpu_env, offsetof(CPUState, regs[reg]));
+        tcg_gen_mov_i32(var, cpu_R[reg]);
     }
 }
 
@@ -188,7 +201,7 @@ static void store_reg(DisasContext *s, int reg, TCGv var)
         tcg_gen_andi_i32(var, var, ~1);
         s->is_jmp = DISAS_JUMP;
     }
-    tcg_gen_st_i32(var, cpu_env, offsetof(CPUState, regs[reg]));
+    tcg_gen_mov_i32(cpu_R[reg], var);
     dead_tmp(var);
 }
 
@@ -790,27 +803,22 @@ static inline void gen_bx_im(DisasContext *s, uint32_t addr)
     TCGv tmp;
 
     s->is_jmp = DISAS_UPDATE;
-    tmp = new_tmp();
     if (s->thumb != (addr & 1)) {
+        tmp = new_tmp();
         tcg_gen_movi_i32(tmp, addr & 1);
         tcg_gen_st_i32(tmp, cpu_env, offsetof(CPUState, thumb));
+        dead_tmp(tmp);
     }
-    tcg_gen_movi_i32(tmp, addr & ~1);
-    tcg_gen_st_i32(tmp, cpu_env, offsetof(CPUState, regs[15]));
-    dead_tmp(tmp);
+    tcg_gen_movi_i32(cpu_R[15], addr & ~1);
 }
 
 /* Set PC and Thumb state from var.  var is marked as dead.  */
 static inline void gen_bx(DisasContext *s, TCGv var)
 {
-    TCGv tmp;
-
     s->is_jmp = DISAS_UPDATE;
-    tmp = new_tmp();
-    tcg_gen_andi_i32(tmp, var, 1);
-    store_cpu_field(tmp, thumb);
-    tcg_gen_andi_i32(var, var, ~1);
-    store_cpu_field(var, regs[15]);
+    tcg_gen_andi_i32(cpu_R[15], var, ~1);
+    tcg_gen_andi_i32(var, var, 1);
+    store_cpu_field(var, thumb);
 }
 
 /* Variant of store_reg which uses branch&exchange logic when storing
@@ -889,9 +897,7 @@ static inline void gen_movl_T2_reg(DisasContext *s, int reg)
 
 static inline void gen_set_pc_im(uint32_t val)
 {
-    TCGv tmp = new_tmp();
-    tcg_gen_movi_i32(tmp, val);
-    store_cpu_field(tmp, regs[15]);
+    tcg_gen_movi_i32(cpu_R[15], val);
 }
 
 static inline void gen_movl_reg_TN(DisasContext *s, int reg, int t)
@@ -903,7 +909,7 @@ static inline void gen_movl_reg_TN(DisasContext *s, int reg, int t)
     } else {
         tmp = cpu_T[t];
     }
-    tcg_gen_st_i32(tmp, cpu_env, offsetof(CPUState, regs[reg]));
+    tcg_gen_mov_i32(cpu_R[reg], tmp);
     if (reg == 15) {
         dead_tmp(tmp);
         s->is_jmp = DISAS_JUMP;
-- 
1.6.3.msysgit.0
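
For readers following the gen_bx hunk, the run-time effect of the two `andi` ops can be written as plain C. This is a minimal sketch, not QEMU code: `ToyCpu` and `toy_bx` are hypothetical names, with `pc` standing in for `regs[15]`. It mirrors the order used in the patch, where the masked target is written to the PC first and the low bit is extracted for the Thumb flag afterwards.

```c
#include <stdint.h>

/* Hypothetical stand-in for the two CPU state fields touched by BX. */
typedef struct {
    uint32_t pc;     /* plays the role of regs[15] */
    uint32_t thumb;  /* 1 when executing Thumb code, 0 for ARM code */
} ToyCpu;

/* Branch-and-exchange: bit 0 of the target selects the instruction set. */
void toy_bx(ToyCpu *cpu, uint32_t target)
{
    cpu->pc    = target & ~UINT32_C(1);  /* clear bit 0 to get the address */
    cpu->thumb = target & UINT32_C(1);   /* bit 0 becomes the Thumb flag */
}
```

Branching to 0x2000e9 therefore leaves the PC at 0x2000e8 with the Thumb flag set, while branching to 0x2000e8 clears the flag.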
* Re: OT: TCG SSA, speed, misc (was Re: [Qemu-devel] Re: [PATCH 08/11] QMP: Port balloon command)
From: Filip Navara @ 2009-06-28 23:35 UTC
To: Laurent Desnogues
Cc: Blue Swirl, Anthony Liguori, qemu-devel, Avi Kivity

On Mon, Jun 29, 2009 at 1:19 AM, Filip Navara<filip.navara@gmail.com> wrote:
> On x86 host the register allocation still looks very pathetic, I will post a follow-up
> soon.

Let's look at the code generated for the very first two guest
instructions:

----------------
IN:
0x00200070: ldr r0, [pc, #108] ; 0x2000e4
0x00200074: ldr pc, [pc, #108] ; 0x2000e8

OP:
movi_i32 tmp8,$0x200078
movi_i32 tmp9,$0x6c
add_i32 tmp8,tmp8,tmp9
qemu_ld32u tmp9,tmp8,$0x0
mov_i32 r0,tmp9
movi_i32 tmp9,$0x20007c
movi_i32 tmp10,$0x6c
add_i32 tmp9,tmp9,tmp10
qemu_ld32u tmp8,tmp9,$0x0
movi_i32 tmp10,$0xfffffffe
and_i32 tmp8,tmp8,tmp10
mov_i32 pc,tmp8
exit_tb $0x0

OUT: [size=128]
0x03230020: mov $0x200078,%eax
0x03230025: add $0x6c,%eax
0x03230028: mov %eax,%ecx
0x0323002a: mov %ecx,%edx
0x0323002c: mov %ecx,%eax

-- this instruction sets %eax to the value that it already has

0x0323002e: shr $0x6,%edx
0x03230031: and $0xfffffc03,%eax
0x03230037: and $0xff0,%edx
0x0323003d: lea 0x540(%edx,%ebp,1),%edx
0x03230044: cmp (%edx),%eax
0x03230046: mov %ecx,%eax
0x03230048: je 0x3230053
0x0323004a: xor %edx,%edx
0x0323004c: call 0x55cbc0
0x03230051: jmp 0x3230058
0x03230053: add 0xc(%edx),%eax
0x03230056: mov (%eax),%eax
0x03230058: mov $0x20007c,%edx
0x0323005d: add $0x6c,%edx
0x03230060: mov %edx,%ecx
0x03230062: mov %eax,0x0(%ebp)
0x03230065: mov %ecx,%edx

-- same here

0x03230067: mov %ecx,%eax
0x03230069: shr $0x6,%edx
0x0323006c: and $0xfffffc03,%eax
0x03230072: and $0xff0,%edx
0x03230078: lea 0x540(%edx,%ebp,1),%edx
0x0323007f: cmp (%edx),%eax
0x03230081: mov %ecx,%eax
0x03230083: je 0x323008e
0x03230085: xor %edx,%edx
0x03230087: call 0x55cbc0
0x0323008c: jmp 0x3230093
0x0323008e: add 0xc(%edx),%eax
0x03230091: mov (%eax),%eax
0x03230093: and $0xfffffffe,%eax
0x03230096: mov %eax,0x3c(%ebp)
0x03230099: xor %eax,%eax
0x0323009b: jmp 0x7ec928

If someone can explain to me why the redundant mov instructions are
generated, I'd be very happy. Thanks.

Best regards,
Filip Navara
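
The constants in the OP dump follow from the ARM rule that, in ARM state, reading the PC yields the address of the current instruction plus 8. A minimal sketch of that arithmetic, with `arm_literal_address` as a hypothetical helper name rather than anything from QEMU:

```c
#include <stdint.h>

/*
 * Address accessed by "ldr rX, [pc, #offset]" in ARM state.
 * The PC reads as the instruction address plus 8, which is why the
 * OP dump materialises 0x200078 for the instruction at 0x200070.
 */
uint32_t arm_literal_address(uint32_t insn_addr, uint32_t offset)
{
    return insn_addr + 8u + offset;
}
```

With the two instructions from the listing, 0x00200070 + 8 + 108 gives 0x2000e4 and 0x00200074 + 8 + 108 gives 0x2000e8, matching the disassembler comments. The listing also shows that both the `movi_i32` and the `add_i32` survive into the host code as a `mov` followed by an `add`, rather than being folded into one constant.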
* Re: OT: TCG SSA, speed, misc (was Re: [Qemu-devel] Re: [PATCH 08/11] QMP: Port balloon command)
From: Laurent Desnogues @ 2009-06-29 6:39 UTC
To: Filip Navara
Cc: qemu-devel

On Mon, Jun 29, 2009 at 1:35 AM, Filip Navara<filip.navara@gmail.com> wrote:
> On Mon, Jun 29, 2009 at 1:19 AM, Filip Navara<filip.navara@gmail.com> wrote:
>> On x86 host the register allocation still looks very pathetic, I will post a follow-up
>> soon.
>
> Let's look at the very first two instructions generated by the guest:
>
> ----------------
> IN:
> 0x00200070: ldr r0, [pc, #108] ; 0x2000e4
> 0x00200074: ldr pc, [pc, #108] ; 0x2000e8
>
> OP:
> movi_i32 tmp8,$0x200078
> movi_i32 tmp9,$0x6c
> add_i32 tmp8,tmp8,tmp9
> qemu_ld32u tmp9,tmp8,$0x0
> mov_i32 r0,tmp9
> movi_i32 tmp9,$0x20007c
> movi_i32 tmp10,$0x6c
> add_i32 tmp9,tmp9,tmp10
> qemu_ld32u tmp8,tmp9,$0x0
> movi_i32 tmp10,$0xfffffffe
> and_i32 tmp8,tmp8,tmp10
> mov_i32 pc,tmp8
> exit_tb $0x0
>
> OUT: [size=128]
> 0x03230020: mov $0x200078,%eax
> 0x03230025: add $0x6c,%eax
> 0x03230028: mov %eax,%ecx
> 0x0323002a: mov %ecx,%edx
> 0x0323002c: mov %ecx,%eax
>
> -- this instruction sets %eax to value that it already has
>
> 0x0323002e: shr $0x6,%edx
> 0x03230031: and $0xfffffc03,%eax
> 0x03230037: and $0xff0,%edx
> 0x0323003d: lea 0x540(%edx,%ebp,1),%edx
> 0x03230044: cmp (%edx),%eax
> 0x03230046: mov %ecx,%eax
> 0x03230048: je 0x3230053
> 0x0323004a: xor %edx,%edx
> 0x0323004c: call 0x55cbc0
> 0x03230051: jmp 0x3230058
> 0x03230053: add 0xc(%edx),%eax
> 0x03230056: mov (%eax),%eax
> 0x03230058: mov $0x20007c,%edx
> 0x0323005d: add $0x6c,%edx
> 0x03230060: mov %edx,%ecx
> 0x03230062: mov %eax,0x0(%ebp)
> 0x03230065: mov %ecx,%edx
>
> -- same here
>
> 0x03230067: mov %ecx,%eax
> 0x03230069: shr $0x6,%edx
> 0x0323006c: and $0xfffffc03,%eax
> 0x03230072: and $0xff0,%edx
> 0x03230078: lea 0x540(%edx,%ebp,1),%edx
> 0x0323007f: cmp (%edx),%eax
> 0x03230081: mov %ecx,%eax
> 0x03230083: je 0x323008e
> 0x03230085: xor %edx,%edx
> 0x03230087: call 0x55cbc0
> 0x0323008c: jmp 0x3230093
> 0x0323008e: add 0xc(%edx),%eax
> 0x03230091: mov (%eax),%eax
> 0x03230093: and $0xfffffffe,%eax
> 0x03230096: mov %eax,0x3c(%ebp)
> 0x03230099: xor %eax,%eax
> 0x0323009b: jmp 0x7ec928
>
> If someone can explain me why the redundant mov instructions are
> generated I'd be very happy. Thanks.

What you see here is due to hard-coded assembly instructions
used to make a load. cf tcg_out_qemu_ld in tcg-target.c

Laurent
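
The fixed instruction sequence around each `qemu_ld32u` in the quoted listing reads like a lookup in a software TLB followed by either a direct host load or a call to a helper. The sketch below restates that fast path in C. It is an interpretation of the listing, not the contents of tcg_out_qemu_ld: the constants are inferred from the immediates `$0x6`, `$0xff0` and `$0xfffffc03`, and `ToyTlbEntry`, `tlb_entry_offset`, `tlb_compare_value` and `toy_ld32` are hypothetical names.

```c
#include <stdint.h>
#include <string.h>

/* Values inferred from the listing, assumed rather than taken from QEMU. */
enum {
    PAGE_BITS  = 10,   /* 0xfffffc03 keeps address bits 10 and up */
    ENTRY_BITS = 4,    /* 0xff0 steps through the table in 16-byte units */
    TLB_SIZE   = 256   /* 0xff0 >> 4 == 0xff, so 256 entries */
};

#define PAGE_MASK (~((UINT32_C(1) << PAGE_BITS) - 1))

/* Hypothetical TLB entry; the real layout is not shown in the thread. */
typedef struct {
    uint32_t  addr_read;  /* guest page address accepted for loads */
    uintptr_t addend;     /* host address minus guest address */
} ToyTlbEntry;

/* Byte offset of the entry in the table: "shr $0x6" then "and $0xff0". */
uint32_t tlb_entry_offset(uint32_t addr)
{
    return (addr >> (PAGE_BITS - ENTRY_BITS)) & ((TLB_SIZE - 1) << ENTRY_BITS);
}

/*
 * Value compared with the entry: "and $0xfffffc03". Keeping the two low
 * bits makes a misaligned 32-bit access miss and take the slow path.
 */
uint32_t tlb_compare_value(uint32_t addr)
{
    return addr & (PAGE_MASK | 3u);
}

/* Fast path of a 32-bit guest load. Returns 1 on a hit, 0 on a miss. */
int toy_ld32(const ToyTlbEntry *tlb, uint32_t addr, uint32_t *out)
{
    const ToyTlbEntry *e = &tlb[tlb_entry_offset(addr) >> ENTRY_BITS];

    if (e->addr_read != tlb_compare_value(addr)) {
        return 0;  /* the generated code calls a helper at this point */
    }
    /* Hit: the host address is the guest address plus the entry's addend. */
    memcpy(out, (const void *)(e->addend + addr), sizeof *out);
    return 1;
}
```

In the listing the guest address has to be live in several registers at once: one copy is shifted into the table offset, one is masked into the compare value, and an unmodified copy is needed afterwards for the addend addition or the helper call. Because the sequence is emitted as a fixed template, it copies the address unconditionally, which is consistent with the extra `mov` instructions Filip pointed out.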