* [Qemu-devel] [PATCH RFC 00/11] AREG0 elimination @ 2011-05-14 19:35 Blue Swirl 2011-05-14 21:16 ` Aurelien Jarno 0 siblings, 1 reply; 31+ messages in thread From: Blue Swirl @ 2011-05-14 19:35 UTC (permalink / raw) To: qemu-devel Here's a RFC series for eliminating AREG0. Blue Swirl (11): Move user emulator stuff from cpu-exec.c to user-exec.c Delete unused tb_invalidate_page_range The above should be OK to commit. cpu_loop_exit: avoid using AREG0 Delegate setup of TCG temporaries to targets These two are not, unless the overall plan is OK. TCG: fix negative frame offset calculations TCG/x86: use stack for TCG temps TCG/Sparc64: use stack for TCG temps But these three should be OK. I've tested lightly x86_64 and Sparc64 hosts. Add CONFIG_TARGET_NEEDS_AREG0 Don't compile legacy qemu_ld/st functions if target doesn't need them Should be OK, though the latter patch only touches x86. Add new qemu_ld and qemu_st functions sparc: use new qemu_ld and qemu_st functions The last two compile but QEMU segfaults. I just made a naive conversion for getting comments. Makefile.target | 14 +- configure | 6 + cpu-exec.c | 647 +---------------------------------------- dyngen-exec.h | 2 + exec-all.h | 3 +- hw/alpha_palcode.c | 2 +- linux-user/main.c | 22 +-- target-alpha/op_helper.c | 4 +- target-arm/op_helper.c | 6 +- target-cris/op_helper.c | 4 +- target-i386/op_helper.c | 16 +- target-lm32/op_helper.c | 6 +- target-m68k/op_helper.c | 6 +- target-microblaze/op_helper.c | 4 +- target-mips/op_helper.c | 4 +- target-ppc/op_helper.c | 3 +- target-sh4/op_helper.c | 10 +- target-sparc/helper.h | 1 - target-sparc/op_helper.c | 12 +- target-sparc/translate.c | 5 +- target-unicore32/op_helper.c | 2 +- tcg/arm/tcg-target.c | 2 + tcg/arm/tcg-target.h | 4 + tcg/hppa/tcg-target.c | 2 + tcg/hppa/tcg-target.h | 4 + tcg/i386/tcg-target.c | 62 +++-- tcg/i386/tcg-target.h | 4 + tcg/ia64/tcg-target.c | 2 + tcg/ia64/tcg-target.h | 4 + tcg/mips/tcg-target.c | 2 + tcg/mips/tcg-target.h | 4 + tcg/ppc/tcg-target.c | 2 + tcg/ppc/tcg-target.h | 4 + tcg/ppc64/tcg-target.c | 2 + tcg/ppc64/tcg-target.h | 4 + tcg/s390/tcg-target.c | 2 + tcg/s390/tcg-target.h | 4 + tcg/sparc/tcg-target.c | 5 +- tcg/sparc/tcg-target.h | 4 + tcg/tcg-op.h | 49 +++ tcg/tcg-opc.h | 3 + tcg/tcg.c | 12 +- translate-all.c | 2 - user-exec.c | 645 ++++++++++++++++++++++++++++++++++++++++ 44 files changed, 866 insertions(+), 741 deletions(-) create mode 100644 user-exec.c ^ permalink raw reply [flat|nested] 31+ messages in thread
* Re: [Qemu-devel] [PATCH RFC 00/11] AREG0 elimination 2011-05-14 19:35 [Qemu-devel] [PATCH RFC 00/11] AREG0 elimination Blue Swirl @ 2011-05-14 21:16 ` Aurelien Jarno 2011-05-14 21:52 ` Blue Swirl 2011-05-14 23:31 ` Aurelien Jarno 0 siblings, 2 replies; 31+ messages in thread From: Aurelien Jarno @ 2011-05-14 21:16 UTC (permalink / raw) To: Blue Swirl; +Cc: qemu-devel On Sat, May 14, 2011 at 10:35:20PM +0300, Blue Swirl wrote: > Here's a RFC series for eliminating AREG0. > > Blue Swirl (11): > Move user emulator stuff from cpu-exec.c to user-exec.c > Delete unused tb_invalidate_page_range > > The above should be OK to commit. > > cpu_loop_exit: avoid using AREG0 > Delegate setup of TCG temporaries to targets > > These two are not, unless the overall plan is OK. > > TCG: fix negative frame offset calculations > TCG/x86: use stack for TCG temps > TCG/Sparc64: use stack for TCG temps > > But these three should be OK. I've tested lightly x86_64 and Sparc64 hosts. > > Add CONFIG_TARGET_NEEDS_AREG0 > Don't compile legacy qemu_ld/st functions if target doesn't need them > > Should be OK, though the latter patch only touches x86. > > Add new qemu_ld and qemu_st functions > sparc: use new qemu_ld and qemu_st functions > > The last two compile but QEMU segfaults. I just made a naive > conversion for getting comments. > What is the goal behing removing TCG_AREG0? If it is speed improvement, can you please provide some benchmarks? The env register is used very often (basically for every load/store, but also a lot of helpers), so it makes sense to reserve a register for it. For what I understand from your patch series, you prefer to pass this register explicitly to TCG functions. This basically means this TCG global will be loaded to host register as soon as it is used, but also regularly, as globals are saved back to their canonical location before an helper or a load/store. So it seems that this patch series will just allowing the "env register" to change over time, though it will not spare one more register for the TCG code, and it will emit longer TCG code to regularly reload the env global into a host register. In any case at then end benchmarks is what are need to decided, TCG has always shown that performance improvements doesn't match the improvement analysis. -- Aurelien Jarno GPG: 1024D/F1BCDB73 aurelien@aurel32.net http://www.aurel32.net ^ permalink raw reply [flat|nested] 31+ messages in thread
* Re: [Qemu-devel] [PATCH RFC 00/11] AREG0 elimination 2011-05-14 21:16 ` Aurelien Jarno @ 2011-05-14 21:52 ` Blue Swirl 2011-05-14 22:04 ` Aurelien Jarno 2011-05-14 23:31 ` Aurelien Jarno 1 sibling, 1 reply; 31+ messages in thread From: Blue Swirl @ 2011-05-14 21:52 UTC (permalink / raw) To: Aurelien Jarno; +Cc: qemu-devel On Sun, May 15, 2011 at 12:16 AM, Aurelien Jarno <aurelien@aurel32.net> wrote: > On Sat, May 14, 2011 at 10:35:20PM +0300, Blue Swirl wrote: >> Here's a RFC series for eliminating AREG0. >> >> Blue Swirl (11): >> Move user emulator stuff from cpu-exec.c to user-exec.c >> Delete unused tb_invalidate_page_range >> >> The above should be OK to commit. >> >> cpu_loop_exit: avoid using AREG0 >> Delegate setup of TCG temporaries to targets >> >> These two are not, unless the overall plan is OK. >> >> TCG: fix negative frame offset calculations >> TCG/x86: use stack for TCG temps >> TCG/Sparc64: use stack for TCG temps >> >> But these three should be OK. I've tested lightly x86_64 and Sparc64 hosts. >> >> Add CONFIG_TARGET_NEEDS_AREG0 >> Don't compile legacy qemu_ld/st functions if target doesn't need them >> >> Should be OK, though the latter patch only touches x86. >> >> Add new qemu_ld and qemu_st functions >> sparc: use new qemu_ld and qemu_st functions >> >> The last two compile but QEMU segfaults. I just made a naive >> conversion for getting comments. >> > > What is the goal behing removing TCG_AREG0? If it is speed improvement, > can you please provide some benchmarks? There was some discussion earlier about why this (or parts of the conversion) may be a speed improvement: http://article.gmane.org/gmane.comp.emulators.qemu/101826 http://article.gmane.org/gmane.comp.emulators.qemu/102156 There are no benchmarks yet. In fact, it may be difficult to make those without performing the removal completely. For example, patch 3/11 makes cpu_loop_exit take CPUState as a parameter instead of using global env, which would be available and the register is reserved anyway. This would only decrease performance at this stage, unless a complete conversion is done. I suspect the same would happen when moving all helpers from op_helper.c to helper.c. But after the whole conversion, this would be a neutral (no extra registers used) or even beneficial change (the code is free to use one more register). > The env register is used very often (basically for every load/store, but > also a lot of helpers), so it makes sense to reserve a register for it. > > For what I understand from your patch series, you prefer to pass this > register explicitly to TCG functions. This basically means this TCG > global will be loaded to host register as soon as it is used, but also > regularly, as globals are saved back to their canonical location before > an helper or a load/store. > > So it seems that this patch series will just allowing the "env register" > to change over time, though it will not spare one more register for the > TCG code, and it will emit longer TCG code to regularly reload the env > global into a host register. But there will be one more register available in some cases. In other cases, the number of registers used does not change. Moving the registers around is what worries me too. But there are other effects too, the helpers are now compiled so that the global env register is not used. Especially on hosts with low number of registers this is not optimal. > In any case at then end benchmarks is what are need to decided, TCG has > always shown that performance improvements doesn't match the improvement > analysis. If this turns out to be a bad idea, it means that the reverse conversion will be beneficial and we should convert helper.c code to op_helper.c and take advantage of the global env in a register. This was actually my standpoint when the discussion started, but I'm interested to see if this approach would work better. There are other opportunities as well, for example while refactoring, qemu_ld/st would be changed. Maybe your constant patches would be useful there or they could benefit from a rearranged interface. ^ permalink raw reply [flat|nested] 31+ messages in thread
* Re: [Qemu-devel] [PATCH RFC 00/11] AREG0 elimination 2011-05-14 21:52 ` Blue Swirl @ 2011-05-14 22:04 ` Aurelien Jarno 2011-05-15 7:15 ` Blue Swirl 0 siblings, 1 reply; 31+ messages in thread From: Aurelien Jarno @ 2011-05-14 22:04 UTC (permalink / raw) To: Blue Swirl; +Cc: qemu-devel On Sun, May 15, 2011 at 12:52:35AM +0300, Blue Swirl wrote: > On Sun, May 15, 2011 at 12:16 AM, Aurelien Jarno <aurelien@aurel32.net> wrote: > > On Sat, May 14, 2011 at 10:35:20PM +0300, Blue Swirl wrote: > >> Here's a RFC series for eliminating AREG0. > >> > >> Blue Swirl (11): > >> Move user emulator stuff from cpu-exec.c to user-exec.c > >> Delete unused tb_invalidate_page_range > >> > >> The above should be OK to commit. > >> > >> cpu_loop_exit: avoid using AREG0 > >> Delegate setup of TCG temporaries to targets > >> > >> These two are not, unless the overall plan is OK. > >> > >> TCG: fix negative frame offset calculations > >> TCG/x86: use stack for TCG temps > >> TCG/Sparc64: use stack for TCG temps > >> > >> But these three should be OK. I've tested lightly x86_64 and Sparc64 hosts. > >> > >> Add CONFIG_TARGET_NEEDS_AREG0 > >> Don't compile legacy qemu_ld/st functions if target doesn't need them > >> > >> Should be OK, though the latter patch only touches x86. > >> > >> Add new qemu_ld and qemu_st functions > >> sparc: use new qemu_ld and qemu_st functions > >> > >> The last two compile but QEMU segfaults. I just made a naive > >> conversion for getting comments. > >> > > > > What is the goal behing removing TCG_AREG0? If it is speed improvement, > > can you please provide some benchmarks? > > There was some discussion earlier about why this (or parts of the > conversion) may be a speed improvement: > > http://article.gmane.org/gmane.comp.emulators.qemu/101826 > http://article.gmane.org/gmane.comp.emulators.qemu/102156 Ok, looks like I have missed that. > There are no benchmarks yet. In fact, it may be difficult to make > those without performing the removal completely. > > For example, patch 3/11 makes cpu_loop_exit take CPUState as a > parameter instead of using global env, which would be available and > the register is reserved anyway. This would only decrease performance > at this stage, unless a complete conversion is done. I suspect the > same would happen when moving all helpers from op_helper.c to > helper.c. But after the whole conversion, this would be a neutral (no > extra registers used) or even beneficial change (the code is free to > use one more register). > > > The env register is used very often (basically for every load/store, but > > also a lot of helpers), so it makes sense to reserve a register for it. > > > > For what I understand from your patch series, you prefer to pass this > > register explicitly to TCG functions. This basically means this TCG > > global will be loaded to host register as soon as it is used, but also > > regularly, as globals are saved back to their canonical location before > > an helper or a load/store. > > > > So it seems that this patch series will just allowing the "env register" > > to change over time, though it will not spare one more register for the > > TCG code, and it will emit longer TCG code to regularly reload the env > > global into a host register. > > But there will be one more register available in some cases. In other Inside the TCG code, it will basically happens very rarely, given load/store are really the most used instructions, and they need to load the env register. > cases, the number of registers used does not change. Moving the > registers around is what worries me too. > > But there are other effects too, the helpers are now compiled so that > the global env register is not used. Especially on hosts with low > number of registers this is not optimal. Most helpers are very small functions, so I am not sure more registers will help. > > In any case at then end benchmarks is what are need to decided, TCG has > > always shown that performance improvements doesn't match the improvement > > analysis. > > If this turns out to be a bad idea, it means that the reverse > conversion will be beneficial and we should convert helper.c code to > op_helper.c and take advantage of the global env in a register. This > was actually my standpoint when the discussion started, but I'm > interested to see if this approach would work better. Does it mean that you plan to do the code changes in git without really benchmarking, and revert all the changes later if it was a bad idea? -- Aurelien Jarno GPG: 1024D/F1BCDB73 aurelien@aurel32.net http://www.aurel32.net ^ permalink raw reply [flat|nested] 31+ messages in thread
* Re: [Qemu-devel] [PATCH RFC 00/11] AREG0 elimination 2011-05-14 22:04 ` Aurelien Jarno @ 2011-05-15 7:15 ` Blue Swirl 2011-05-15 9:24 ` Aurelien Jarno 2011-05-15 9:27 ` Laurent Desnogues 0 siblings, 2 replies; 31+ messages in thread From: Blue Swirl @ 2011-05-15 7:15 UTC (permalink / raw) To: Aurelien Jarno; +Cc: qemu-devel On Sun, May 15, 2011 at 1:04 AM, Aurelien Jarno <aurelien@aurel32.net> wrote: > On Sun, May 15, 2011 at 12:52:35AM +0300, Blue Swirl wrote: >> On Sun, May 15, 2011 at 12:16 AM, Aurelien Jarno <aurelien@aurel32.net> wrote: >> > On Sat, May 14, 2011 at 10:35:20PM +0300, Blue Swirl wrote: >> >> Here's a RFC series for eliminating AREG0. >> >> >> >> Blue Swirl (11): >> >> Move user emulator stuff from cpu-exec.c to user-exec.c >> >> Delete unused tb_invalidate_page_range >> >> >> >> The above should be OK to commit. >> >> >> >> cpu_loop_exit: avoid using AREG0 >> >> Delegate setup of TCG temporaries to targets >> >> >> >> These two are not, unless the overall plan is OK. >> >> >> >> TCG: fix negative frame offset calculations >> >> TCG/x86: use stack for TCG temps >> >> TCG/Sparc64: use stack for TCG temps >> >> >> >> But these three should be OK. I've tested lightly x86_64 and Sparc64 hosts. >> >> >> >> Add CONFIG_TARGET_NEEDS_AREG0 >> >> Don't compile legacy qemu_ld/st functions if target doesn't need them >> >> >> >> Should be OK, though the latter patch only touches x86. >> >> >> >> Add new qemu_ld and qemu_st functions >> >> sparc: use new qemu_ld and qemu_st functions >> >> >> >> The last two compile but QEMU segfaults. I just made a naive >> >> conversion for getting comments. >> >> >> > >> > What is the goal behing removing TCG_AREG0? If it is speed improvement, >> > can you please provide some benchmarks? >> >> There was some discussion earlier about why this (or parts of the >> conversion) may be a speed improvement: >> >> http://article.gmane.org/gmane.comp.emulators.qemu/101826 >> http://article.gmane.org/gmane.comp.emulators.qemu/102156 > > Ok, looks like I have missed that. > >> There are no benchmarks yet. In fact, it may be difficult to make >> those without performing the removal completely. >> >> For example, patch 3/11 makes cpu_loop_exit take CPUState as a >> parameter instead of using global env, which would be available and >> the register is reserved anyway. This would only decrease performance >> at this stage, unless a complete conversion is done. I suspect the >> same would happen when moving all helpers from op_helper.c to >> helper.c. But after the whole conversion, this would be a neutral (no >> extra registers used) or even beneficial change (the code is free to >> use one more register). >> >> > The env register is used very often (basically for every load/store, but >> > also a lot of helpers), so it makes sense to reserve a register for it. >> > >> > For what I understand from your patch series, you prefer to pass this >> > register explicitly to TCG functions. This basically means this TCG >> > global will be loaded to host register as soon as it is used, but also >> > regularly, as globals are saved back to their canonical location before >> > an helper or a load/store. >> > >> > So it seems that this patch series will just allowing the "env register" >> > to change over time, though it will not spare one more register for the >> > TCG code, and it will emit longer TCG code to regularly reload the env >> > global into a host register. >> >> But there will be one more register available in some cases. In other > > Inside the TCG code, it will basically happens very rarely, given > load/store are really the most used instructions, and they need to load > the env register. Not exactly, from a sample run with -d op_opt: $ egrep -v -e '^$' -v -e 'OP after' -v -e ' end' -v -e 'Search PC' /tmp/qemu.log | awk '{print $1}' | sort | uniq -c|sort -rn 1673966 movi_i32 653931 ld_i32 607432 mov_i32 428684 st_i32 326878 movi_i64 308626 add_i32 283186 call 256817 exit_tb 207232 nopn 189388 goto_tb 122398 and_i32 117997 shr_i32 89107 qemu_ld32 82926 set_label 82713 brcond_i32 67169 qemu_st32 55109 or_i32 46536 ext32u_i64 44288 xor_i32 38103 sub_i32 26361 shl_i32 23218 shl_i64 23218 qemu_st64 23218 or_i64 20474 shr_i64 20445 qemu_ld64 11161 qemu_ld8u 10409 qemu_st8 5013 qemu_ld16u 3795 qemu_st16 2776 qemu_ld8s 1915 sar_i32 1414 qemu_ld16s 839 not_i32 579 setcond_i32 213 br 42 ext32s_i64 30 mul_i64 But most other ops probably don't need any additional registers. It could still be that with the extra register, some values could be kept there instead of flushing to storage. >> cases, the number of registers used does not change. Moving the >> registers around is what worries me too. >> >> But there are other effects too, the helpers are now compiled so that >> the global env register is not used. Especially on hosts with low >> number of registers this is not optimal. > > Most helpers are very small functions, so I am not sure more registers > will help. $ nm --print-size --defined-only --size-sort --reverse-sort obj-amd64/sparc-softmmu/op_helper.o |head -20 0000000000003a70 0000000000000c75 T helper_st_asi 00000000000046f0 000000000000086d T helper_ld_asi 0000000000002550 000000000000041b T __stw_mmu 00000000000012c0 000000000000040c T __stq_mmu 0000000000001d10 00000000000003c5 T __stl_mmu 0000000000001a90 0000000000000277 T __ldq_mmu 00000000000022f0 000000000000025f T __ldl_mmu 0000000000002b50 000000000000024f T __ldw_mmu 0000000000001870 0000000000000219 t slow_ldq_mmu 00000000000020e0 0000000000000208 t slow_ldl_mmu 00000000000033f0 0000000000000202 T helper_stqf 0000000000002970 00000000000001de t slow_ldw_mmu 0000000000005120 00000000000001dd T helper_fcmpeq 0000000000003720 00000000000001c7 T helper_ldqf 0000000000001120 000000000000019e t slow_stb_mmu 00000000000016d0 000000000000019c T __stb_mmu 0000000000002da0 000000000000018a t slow_ldb_mmu 0000000000005410 0000000000000172 T helper_fcmped 0000000000002f30 000000000000016d T __ldb_mmu 0000000000000a70 0000000000000164 T do_unassigned_access These are not so small, but they also don't look like frequently used ones, judging by the names. It would be more interesting to see the sizes of heavily used helpers. >> > In any case at then end benchmarks is what are need to decided, TCG has >> > always shown that performance improvements doesn't match the improvement >> > analysis. >> >> If this turns out to be a bad idea, it means that the reverse >> conversion will be beneficial and we should convert helper.c code to >> op_helper.c and take advantage of the global env in a register. This >> was actually my standpoint when the discussion started, but I'm >> interested to see if this approach would work better. > > Does it mean that you plan to do the code changes in git without really > benchmarking, and revert all the changes later if it was a bad idea? I don't think that that would be a good idea. There are actually several possible plans that should be considered, based on the performance: - don't do anything - full conversion - avoid AREG0 use outside of generated code: move helpers from op_helper.c to helper.c - avoid AREG0 in cpu-exec.c etc. - maximize AREG0 use: move helpers from helper.c etc. to op_helper.c It's also possible to refactor qemu_ld/st independently to the above so that they still use AREG0. The performance neutral changes (1, 2, 4 to 7) may have merits of their own, so they may be applied later if there are no objections. ^ permalink raw reply [flat|nested] 31+ messages in thread
* Re: [Qemu-devel] [PATCH RFC 00/11] AREG0 elimination 2011-05-15 7:15 ` Blue Swirl @ 2011-05-15 9:24 ` Aurelien Jarno 2011-05-15 9:58 ` Blue Swirl 2011-05-15 9:27 ` Laurent Desnogues 1 sibling, 1 reply; 31+ messages in thread From: Aurelien Jarno @ 2011-05-15 9:24 UTC (permalink / raw) To: Blue Swirl; +Cc: qemu-devel On Sun, May 15, 2011 at 10:15:52AM +0300, Blue Swirl wrote: > On Sun, May 15, 2011 at 1:04 AM, Aurelien Jarno <aurelien@aurel32.net> wrote: > > On Sun, May 15, 2011 at 12:52:35AM +0300, Blue Swirl wrote: > >> On Sun, May 15, 2011 at 12:16 AM, Aurelien Jarno <aurelien@aurel32.net> wrote: > >> > On Sat, May 14, 2011 at 10:35:20PM +0300, Blue Swirl wrote: > >> >> Here's a RFC series for eliminating AREG0. > >> >> > >> >> Blue Swirl (11): > >> >> Move user emulator stuff from cpu-exec.c to user-exec.c > >> >> Delete unused tb_invalidate_page_range > >> >> > >> >> The above should be OK to commit. > >> >> > >> >> cpu_loop_exit: avoid using AREG0 > >> >> Delegate setup of TCG temporaries to targets > >> >> > >> >> These two are not, unless the overall plan is OK. > >> >> > >> >> TCG: fix negative frame offset calculations > >> >> TCG/x86: use stack for TCG temps > >> >> TCG/Sparc64: use stack for TCG temps > >> >> > >> >> But these three should be OK. I've tested lightly x86_64 and Sparc64 hosts. > >> >> > >> >> Add CONFIG_TARGET_NEEDS_AREG0 > >> >> Don't compile legacy qemu_ld/st functions if target doesn't need them > >> >> > >> >> Should be OK, though the latter patch only touches x86. > >> >> > >> >> Add new qemu_ld and qemu_st functions > >> >> sparc: use new qemu_ld and qemu_st functions > >> >> > >> >> The last two compile but QEMU segfaults. I just made a naive > >> >> conversion for getting comments. > >> >> > >> > > >> > What is the goal behing removing TCG_AREG0? If it is speed improvement, > >> > can you please provide some benchmarks? > >> > >> There was some discussion earlier about why this (or parts of the > >> conversion) may be a speed improvement: > >> > >> http://article.gmane.org/gmane.comp.emulators.qemu/101826 > >> http://article.gmane.org/gmane.comp.emulators.qemu/102156 > > > > Ok, looks like I have missed that. > > > >> There are no benchmarks yet. In fact, it may be difficult to make > >> those without performing the removal completely. > >> > >> For example, patch 3/11 makes cpu_loop_exit take CPUState as a > >> parameter instead of using global env, which would be available and > >> the register is reserved anyway. This would only decrease performance > >> at this stage, unless a complete conversion is done. I suspect the > >> same would happen when moving all helpers from op_helper.c to > >> helper.c. But after the whole conversion, this would be a neutral (no > >> extra registers used) or even beneficial change (the code is free to > >> use one more register). > >> > >> > The env register is used very often (basically for every load/store, but > >> > also a lot of helpers), so it makes sense to reserve a register for it. > >> > > >> > For what I understand from your patch series, you prefer to pass this > >> > register explicitly to TCG functions. This basically means this TCG > >> > global will be loaded to host register as soon as it is used, but also > >> > regularly, as globals are saved back to their canonical location before > >> > an helper or a load/store. > >> > > >> > So it seems that this patch series will just allowing the "env register" > >> > to change over time, though it will not spare one more register for the > >> > TCG code, and it will emit longer TCG code to regularly reload the env > >> > global into a host register. > >> > >> But there will be one more register available in some cases. In other > > > > Inside the TCG code, it will basically happens very rarely, given > > load/store are really the most used instructions, and they need to load > > the env register. > > Not exactly, from a sample run with -d op_opt: > $ egrep -v -e '^$' -v -e 'OP after' -v -e ' end' -v -e 'Search PC' > /tmp/qemu.log | awk '{print $1}' | sort | uniq -c|sort -rn > 1673966 movi_i32 > 653931 ld_i32 On which target is that? It looks like that some variables could be mapped directly as a global to avoid so many load/stores. Anyway, even this simple load instruction, needs a base register, and it mosts cases this register is TCG_AREG0. I agree that it can be replaced by another base register, but in anycase, a register has to be used for that. It means after such an instruction is encountered, one register is reserved for this global up to the next helper/memory load store, so there isn't any register spared, but more instructions generated to explicitly load this register. I also just realized all globals registered through tcg_global_mem_new() needs a base register and currently it is TCG_AREG0, that means even a mov_i32 might need access to TCG_AREG0. How do you plan to solve that part? > 607432 mov_i32 > 428684 st_i32 > 326878 movi_i64 > 308626 add_i32 > 283186 call > 256817 exit_tb > 207232 nopn > 189388 goto_tb > 122398 and_i32 > 117997 shr_i32 > 89107 qemu_ld32 > 82926 set_label > 82713 brcond_i32 > 67169 qemu_st32 > 55109 or_i32 > 46536 ext32u_i64 > 44288 xor_i32 > 38103 sub_i32 > 26361 shl_i32 > 23218 shl_i64 > 23218 qemu_st64 > 23218 or_i64 > 20474 shr_i64 > 20445 qemu_ld64 > 11161 qemu_ld8u > 10409 qemu_st8 > 5013 qemu_ld16u > 3795 qemu_st16 > 2776 qemu_ld8s > 1915 sar_i32 > 1414 qemu_ld16s > 839 not_i32 > 579 setcond_i32 > 213 br > 42 ext32s_i64 > 30 mul_i64 > > But most other ops probably don't need any additional registers. It > could still be that with the extra register, some values could be kept > there instead of flushing to storage. > > >> cases, the number of registers used does not change. Moving the > >> registers around is what worries me too. > >> > >> But there are other effects too, the helpers are now compiled so that > >> the global env register is not used. Especially on hosts with low > >> number of registers this is not optimal. > > > > Most helpers are very small functions, so I am not sure more registers > > will help. > > $ nm --print-size --defined-only --size-sort --reverse-sort > obj-amd64/sparc-softmmu/op_helper.o |head -20 > 0000000000003a70 0000000000000c75 T helper_st_asi > 00000000000046f0 000000000000086d T helper_ld_asi > 0000000000002550 000000000000041b T __stw_mmu > 00000000000012c0 000000000000040c T __stq_mmu > 0000000000001d10 00000000000003c5 T __stl_mmu > 0000000000001a90 0000000000000277 T __ldq_mmu > 00000000000022f0 000000000000025f T __ldl_mmu > 0000000000002b50 000000000000024f T __ldw_mmu > 0000000000001870 0000000000000219 t slow_ldq_mmu > 00000000000020e0 0000000000000208 t slow_ldl_mmu > 00000000000033f0 0000000000000202 T helper_stqf > 0000000000002970 00000000000001de t slow_ldw_mmu > 0000000000005120 00000000000001dd T helper_fcmpeq > 0000000000003720 00000000000001c7 T helper_ldqf > 0000000000001120 000000000000019e t slow_stb_mmu > 00000000000016d0 000000000000019c T __stb_mmu > 0000000000002da0 000000000000018a t slow_ldb_mmu > 0000000000005410 0000000000000172 T helper_fcmped > 0000000000002f30 000000000000016d T __ldb_mmu > 0000000000000a70 0000000000000164 T do_unassigned_access > > These are not so small, but they also don't look like frequently used > ones, judging by the names. It would be more interesting to see the > sizes of heavily used helpers. Agreed. Also most of these helpers need access to env anyway, so it means one more argument that might be kept in a register in some parts of the helper. It might be a good idea to recompile this file without reserving a register for the env pointer, and passing it as an argument instead just to see the difference. > >> > In any case at then end benchmarks is what are need to decided, TCG has > >> > always shown that performance improvements doesn't match the improvement > >> > analysis. > >> > >> If this turns out to be a bad idea, it means that the reverse > >> conversion will be beneficial and we should convert helper.c code to > >> op_helper.c and take advantage of the global env in a register. This > >> was actually my standpoint when the discussion started, but I'm > >> interested to see if this approach would work better. > > > > Does it mean that you plan to do the code changes in git without really > > benchmarking, and revert all the changes later if it was a bad idea? > > I don't think that that would be a good idea. There are actually > several possible plans that should be considered, based on the > performance: > - don't do anything > - full conversion > - avoid AREG0 use outside of generated code: move helpers from > op_helper.c to helper.c > - avoid AREG0 in cpu-exec.c etc. > - maximize AREG0 use: move helpers from helper.c etc. to op_helper.c > > It's also possible to refactor qemu_ld/st independently to the above > so that they still use AREG0. > > The performance neutral changes (1, 2, 4 to 7) may have merits of > their own, so they may be applied later if there are no objections. > It seems the choices there don't match the different possible plans above (5 vs 7). That said I am not really sure that we agree on what is performance neutral or not. I have no objection to remove uses of TCG_AREG0 in GCC generated code, ie in the helpers, though I doubt it will make a real change on the performances: slightly bigger intermediate code, one more register move to call the helpers, on more free register for the helpers. That should not make a huge difference, and I agree it makes the code cleaner. On the other hand, I have objections to remove uses of TCG_AREG0 from the TCG part. This register is part of the TCG design and thus used very often. In any case we have to keep it for accesses to the globals, so we can keep it for softmmu load/stores too. If we go for the removal of TCG_AREG0 from the GCC side, it probably means loading it in the prologue instead. -- Aurelien Jarno GPG: 1024D/F1BCDB73 aurelien@aurel32.net http://www.aurel32.net ^ permalink raw reply [flat|nested] 31+ messages in thread
* Re: [Qemu-devel] [PATCH RFC 00/11] AREG0 elimination 2011-05-15 9:24 ` Aurelien Jarno @ 2011-05-15 9:58 ` Blue Swirl 2011-05-15 10:19 ` Aurelien Jarno 0 siblings, 1 reply; 31+ messages in thread From: Blue Swirl @ 2011-05-15 9:58 UTC (permalink / raw) To: Aurelien Jarno; +Cc: qemu-devel On Sun, May 15, 2011 at 12:24 PM, Aurelien Jarno <aurelien@aurel32.net> wrote: > On Sun, May 15, 2011 at 10:15:52AM +0300, Blue Swirl wrote: >> On Sun, May 15, 2011 at 1:04 AM, Aurelien Jarno <aurelien@aurel32.net> wrote: >> > On Sun, May 15, 2011 at 12:52:35AM +0300, Blue Swirl wrote: >> >> On Sun, May 15, 2011 at 12:16 AM, Aurelien Jarno <aurelien@aurel32.net> wrote: >> >> > On Sat, May 14, 2011 at 10:35:20PM +0300, Blue Swirl wrote: >> >> >> Here's a RFC series for eliminating AREG0. >> >> >> >> >> >> Blue Swirl (11): >> >> >> Move user emulator stuff from cpu-exec.c to user-exec.c >> >> >> Delete unused tb_invalidate_page_range >> >> >> >> >> >> The above should be OK to commit. >> >> >> >> >> >> cpu_loop_exit: avoid using AREG0 >> >> >> Delegate setup of TCG temporaries to targets >> >> >> >> >> >> These two are not, unless the overall plan is OK. >> >> >> >> >> >> TCG: fix negative frame offset calculations >> >> >> TCG/x86: use stack for TCG temps >> >> >> TCG/Sparc64: use stack for TCG temps >> >> >> >> >> >> But these three should be OK. I've tested lightly x86_64 and Sparc64 hosts. >> >> >> >> >> >> Add CONFIG_TARGET_NEEDS_AREG0 >> >> >> Don't compile legacy qemu_ld/st functions if target doesn't need them >> >> >> >> >> >> Should be OK, though the latter patch only touches x86. >> >> >> >> >> >> Add new qemu_ld and qemu_st functions >> >> >> sparc: use new qemu_ld and qemu_st functions >> >> >> >> >> >> The last two compile but QEMU segfaults. I just made a naive >> >> >> conversion for getting comments. >> >> >> >> >> > >> >> > What is the goal behing removing TCG_AREG0? If it is speed improvement, >> >> > can you please provide some benchmarks? >> >> >> >> There was some discussion earlier about why this (or parts of the >> >> conversion) may be a speed improvement: >> >> >> >> http://article.gmane.org/gmane.comp.emulators.qemu/101826 >> >> http://article.gmane.org/gmane.comp.emulators.qemu/102156 >> > >> > Ok, looks like I have missed that. >> > >> >> There are no benchmarks yet. In fact, it may be difficult to make >> >> those without performing the removal completely. >> >> >> >> For example, patch 3/11 makes cpu_loop_exit take CPUState as a >> >> parameter instead of using global env, which would be available and >> >> the register is reserved anyway. This would only decrease performance >> >> at this stage, unless a complete conversion is done. I suspect the >> >> same would happen when moving all helpers from op_helper.c to >> >> helper.c. But after the whole conversion, this would be a neutral (no >> >> extra registers used) or even beneficial change (the code is free to >> >> use one more register). >> >> >> >> > The env register is used very often (basically for every load/store, but >> >> > also a lot of helpers), so it makes sense to reserve a register for it. >> >> > >> >> > For what I understand from your patch series, you prefer to pass this >> >> > register explicitly to TCG functions. This basically means this TCG >> >> > global will be loaded to host register as soon as it is used, but also >> >> > regularly, as globals are saved back to their canonical location before >> >> > an helper or a load/store. >> >> > >> >> > So it seems that this patch series will just allowing the "env register" >> >> > to change over time, though it will not spare one more register for the >> >> > TCG code, and it will emit longer TCG code to regularly reload the env >> >> > global into a host register. >> >> >> >> But there will be one more register available in some cases. In other >> > >> > Inside the TCG code, it will basically happens very rarely, given >> > load/store are really the most used instructions, and they need to load >> > the env register. >> >> Not exactly, from a sample run with -d op_opt: >> $ egrep -v -e '^$' -v -e 'OP after' -v -e ' end' -v -e 'Search PC' >> /tmp/qemu.log | awk '{print $1}' | sort | uniq -c|sort -rn >> 1673966 movi_i32 >> 653931 ld_i32 > > On which target is that? It looks like that some variables could be > mapped directly as a global to avoid so many load/stores. Sparc-softmmu, boot of Aurora-1.0 up to first choices screen in X. > Anyway, even this simple load instruction, needs a base register, and it > mosts cases this register is TCG_AREG0. I agree that it can be replaced by > another base register, but in anycase, a register has to be used for that. > > It means after such an instruction is encountered, one register is > reserved for this global up to the next helper/memory load store, so > there isn't any register spared, but more instructions generated to > explicitly load this register. > > I also just realized all globals registered through tcg_global_mem_new() > needs a base register and currently it is TCG_AREG0, that means even a > mov_i32 might need access to TCG_AREG0. How do you plan to solve that > part? In most cases the backing storage is CPUState, so cpu_env would be a logical choice. >> 607432 mov_i32 >> 428684 st_i32 >> 326878 movi_i64 >> 308626 add_i32 >> 283186 call >> 256817 exit_tb >> 207232 nopn >> 189388 goto_tb >> 122398 and_i32 >> 117997 shr_i32 >> 89107 qemu_ld32 >> 82926 set_label >> 82713 brcond_i32 >> 67169 qemu_st32 >> 55109 or_i32 >> 46536 ext32u_i64 >> 44288 xor_i32 >> 38103 sub_i32 >> 26361 shl_i32 >> 23218 shl_i64 >> 23218 qemu_st64 >> 23218 or_i64 >> 20474 shr_i64 >> 20445 qemu_ld64 >> 11161 qemu_ld8u >> 10409 qemu_st8 >> 5013 qemu_ld16u >> 3795 qemu_st16 >> 2776 qemu_ld8s >> 1915 sar_i32 >> 1414 qemu_ld16s >> 839 not_i32 >> 579 setcond_i32 >> 213 br >> 42 ext32s_i64 >> 30 mul_i64 >> >> But most other ops probably don't need any additional registers. It >> could still be that with the extra register, some values could be kept >> there instead of flushing to storage. >> >> >> cases, the number of registers used does not change. Moving the >> >> registers around is what worries me too. >> >> >> >> But there are other effects too, the helpers are now compiled so that >> >> the global env register is not used. Especially on hosts with low >> >> number of registers this is not optimal. >> > >> > Most helpers are very small functions, so I am not sure more registers >> > will help. >> >> $ nm --print-size --defined-only --size-sort --reverse-sort >> obj-amd64/sparc-softmmu/op_helper.o |head -20 >> 0000000000003a70 0000000000000c75 T helper_st_asi >> 00000000000046f0 000000000000086d T helper_ld_asi >> 0000000000002550 000000000000041b T __stw_mmu >> 00000000000012c0 000000000000040c T __stq_mmu >> 0000000000001d10 00000000000003c5 T __stl_mmu >> 0000000000001a90 0000000000000277 T __ldq_mmu >> 00000000000022f0 000000000000025f T __ldl_mmu >> 0000000000002b50 000000000000024f T __ldw_mmu >> 0000000000001870 0000000000000219 t slow_ldq_mmu >> 00000000000020e0 0000000000000208 t slow_ldl_mmu >> 00000000000033f0 0000000000000202 T helper_stqf >> 0000000000002970 00000000000001de t slow_ldw_mmu >> 0000000000005120 00000000000001dd T helper_fcmpeq >> 0000000000003720 00000000000001c7 T helper_ldqf >> 0000000000001120 000000000000019e t slow_stb_mmu >> 00000000000016d0 000000000000019c T __stb_mmu >> 0000000000002da0 000000000000018a t slow_ldb_mmu >> 0000000000005410 0000000000000172 T helper_fcmped >> 0000000000002f30 000000000000016d T __ldb_mmu >> 0000000000000a70 0000000000000164 T do_unassigned_access >> >> These are not so small, but they also don't look like frequently used >> ones, judging by the names. It would be more interesting to see the >> sizes of heavily used helpers. > > Agreed. Also most of these helpers need access to env anyway, so it > means one more argument that might be kept in a register in some parts > of the helper. It might be a good idea to recompile this file without > reserving a register for the env pointer, and passing it as an argument > instead just to see the difference. Which boils down to the full conversion. Maybe that is the only way to get meaningful performance statistics. >> >> > In any case at then end benchmarks is what are need to decided, TCG has >> >> > always shown that performance improvements doesn't match the improvement >> >> > analysis. >> >> >> >> If this turns out to be a bad idea, it means that the reverse >> >> conversion will be beneficial and we should convert helper.c code to >> >> op_helper.c and take advantage of the global env in a register. This >> >> was actually my standpoint when the discussion started, but I'm >> >> interested to see if this approach would work better. >> > >> > Does it mean that you plan to do the code changes in git without really >> > benchmarking, and revert all the changes later if it was a bad idea? >> >> I don't think that that would be a good idea. There are actually >> several possible plans that should be considered, based on the >> performance: >> - don't do anything >> - full conversion >> - avoid AREG0 use outside of generated code: move helpers from >> op_helper.c to helper.c >> - avoid AREG0 in cpu-exec.c etc. >> - maximize AREG0 use: move helpers from helper.c etc. to op_helper.c >> >> It's also possible to refactor qemu_ld/st independently to the above >> so that they still use AREG0. >> >> The performance neutral changes (1, 2, 4 to 7) may have merits of >> their own, so they may be applied later if there are no objections. >> > > It seems the choices there don't match the different possible plans > above (5 vs 7). That said I am not really sure that we agree on what > is performance neutral or not. 4 to 7 use regular stack in place of temp_buf array in CPUState. I'd suppose this does not change anything performance wise, both the top of stack and CPUState are frequently accessed. The generated instructions are the same except for change of base register. Even if we'd go for maximizing AREG0 use, I think this makes sense since we can't use stack pointer for anything else. With some effort, it may be possible to calculate the amount of TCG temps needed and allocate just the right amount of stack. This could be slightly beneficial for performance, now the size of temp_buf array is fixed so it may have negative cache effects. I doubt this would be worthwhile. > I have no objection to remove uses of TCG_AREG0 in GCC generated code, > ie in the helpers, though I doubt it will make a real change on the > performances: slightly bigger intermediate code, one more register move > to call the helpers, on more free register for the helpers. That should > not make a huge difference, and I agree it makes the code cleaner. Maybe I'm missing something, butI don't think this would be very beneficial if AREG0 is not eliminated. The cleanup is a bonus, likewise for improved portability. > On the other hand, I have objections to remove uses of TCG_AREG0 from > the TCG part. This register is part of the TCG design and thus used very > often. In any case we have to keep it for accesses to the globals, so we > can keep it for softmmu load/stores too. Right, any changes would need to be backed by performance figures. > If we go for the removal of > TCG_AREG0 from the GCC side, it probably means loading it in the prologue > instead. Do you mean cpu-exec.c side with this? I think this is a very similar register tradeoff case to helpers. ^ permalink raw reply [flat|nested] 31+ messages in thread
* Re: [Qemu-devel] [PATCH RFC 00/11] AREG0 elimination 2011-05-15 9:58 ` Blue Swirl @ 2011-05-15 10:19 ` Aurelien Jarno 2011-05-15 11:12 ` Blue Swirl 0 siblings, 1 reply; 31+ messages in thread From: Aurelien Jarno @ 2011-05-15 10:19 UTC (permalink / raw) To: Blue Swirl; +Cc: qemu-devel On Sun, May 15, 2011 at 12:58:49PM +0300, Blue Swirl wrote: > On Sun, May 15, 2011 at 12:24 PM, Aurelien Jarno <aurelien@aurel32.net> wrote: > > On Sun, May 15, 2011 at 10:15:52AM +0300, Blue Swirl wrote: > >> On Sun, May 15, 2011 at 1:04 AM, Aurelien Jarno <aurelien@aurel32.net> wrote: > >> > On Sun, May 15, 2011 at 12:52:35AM +0300, Blue Swirl wrote: > >> >> On Sun, May 15, 2011 at 12:16 AM, Aurelien Jarno <aurelien@aurel32.net> wrote: > >> >> > On Sat, May 14, 2011 at 10:35:20PM +0300, Blue Swirl wrote: > >> >> >> Here's a RFC series for eliminating AREG0. > >> >> >> > >> >> >> Blue Swirl (11): > >> >> >> Move user emulator stuff from cpu-exec.c to user-exec.c > >> >> >> Delete unused tb_invalidate_page_range > >> >> >> > >> >> >> The above should be OK to commit. > >> >> >> > >> >> >> cpu_loop_exit: avoid using AREG0 > >> >> >> Delegate setup of TCG temporaries to targets > >> >> >> > >> >> >> These two are not, unless the overall plan is OK. > >> >> >> > >> >> >> TCG: fix negative frame offset calculations > >> >> >> TCG/x86: use stack for TCG temps > >> >> >> TCG/Sparc64: use stack for TCG temps > >> >> >> > >> >> >> But these three should be OK. I've tested lightly x86_64 and Sparc64 hosts. > >> >> >> > >> >> >> Add CONFIG_TARGET_NEEDS_AREG0 > >> >> >> Don't compile legacy qemu_ld/st functions if target doesn't need them > >> >> >> > >> >> >> Should be OK, though the latter patch only touches x86. > >> >> >> > >> >> >> Add new qemu_ld and qemu_st functions > >> >> >> sparc: use new qemu_ld and qemu_st functions > >> >> >> > >> >> >> The last two compile but QEMU segfaults. I just made a naive > >> >> >> conversion for getting comments. > >> >> >> > >> >> > > >> >> > What is the goal behing removing TCG_AREG0? If it is speed improvement, > >> >> > can you please provide some benchmarks? > >> >> > >> >> There was some discussion earlier about why this (or parts of the > >> >> conversion) may be a speed improvement: > >> >> > >> >> http://article.gmane.org/gmane.comp.emulators.qemu/101826 > >> >> http://article.gmane.org/gmane.comp.emulators.qemu/102156 > >> > > >> > Ok, looks like I have missed that. > >> > > >> >> There are no benchmarks yet. In fact, it may be difficult to make > >> >> those without performing the removal completely. > >> >> > >> >> For example, patch 3/11 makes cpu_loop_exit take CPUState as a > >> >> parameter instead of using global env, which would be available and > >> >> the register is reserved anyway. This would only decrease performance > >> >> at this stage, unless a complete conversion is done. I suspect the > >> >> same would happen when moving all helpers from op_helper.c to > >> >> helper.c. But after the whole conversion, this would be a neutral (no > >> >> extra registers used) or even beneficial change (the code is free to > >> >> use one more register). > >> >> > >> >> > The env register is used very often (basically for every load/store, but > >> >> > also a lot of helpers), so it makes sense to reserve a register for it. > >> >> > > >> >> > For what I understand from your patch series, you prefer to pass this > >> >> > register explicitly to TCG functions. This basically means this TCG > >> >> > global will be loaded to host register as soon as it is used, but also > >> >> > regularly, as globals are saved back to their canonical location before > >> >> > an helper or a load/store. > >> >> > > >> >> > So it seems that this patch series will just allowing the "env register" > >> >> > to change over time, though it will not spare one more register for the > >> >> > TCG code, and it will emit longer TCG code to regularly reload the env > >> >> > global into a host register. > >> >> > >> >> But there will be one more register available in some cases. In other > >> > > >> > Inside the TCG code, it will basically happens very rarely, given > >> > load/store are really the most used instructions, and they need to load > >> > the env register. > >> > >> Not exactly, from a sample run with -d op_opt: > >> $ egrep -v -e '^$' -v -e 'OP after' -v -e ' end' -v -e 'Search PC' > >> /tmp/qemu.log | awk '{print $1}' | sort | uniq -c|sort -rn > >> 1673966 movi_i32 > >> 653931 ld_i32 > > > > On which target is that? It looks like that some variables could be > > mapped directly as a global to avoid so many load/stores. > > Sparc-softmmu, boot of Aurora-1.0 up to first choices screen in X. > > > Anyway, even this simple load instruction, needs a base register, and it > > mosts cases this register is TCG_AREG0. I agree that it can be replaced by > > another base register, but in anycase, a register has to be used for that. > > > > It means after such an instruction is encountered, one register is > > reserved for this global up to the next helper/memory load store, so > > there isn't any register spared, but more instructions generated to > > explicitly load this register. > > > > I also just realized all globals registered through tcg_global_mem_new() > > needs a base register and currently it is TCG_AREG0, that means even a > > mov_i32 might need access to TCG_AREG0. How do you plan to solve that > > part? > > In most cases the backing storage is CPUState, so cpu_env would be a > logical choice. > > >> 607432 mov_i32 > >> 428684 st_i32 > >> 326878 movi_i64 > >> 308626 add_i32 > >> 283186 call > >> 256817 exit_tb > >> 207232 nopn > >> 189388 goto_tb > >> 122398 and_i32 > >> 117997 shr_i32 > >> 89107 qemu_ld32 > >> 82926 set_label > >> 82713 brcond_i32 > >> 67169 qemu_st32 > >> 55109 or_i32 > >> 46536 ext32u_i64 > >> 44288 xor_i32 > >> 38103 sub_i32 > >> 26361 shl_i32 > >> 23218 shl_i64 > >> 23218 qemu_st64 > >> 23218 or_i64 > >> 20474 shr_i64 > >> 20445 qemu_ld64 > >> 11161 qemu_ld8u > >> 10409 qemu_st8 > >> 5013 qemu_ld16u > >> 3795 qemu_st16 > >> 2776 qemu_ld8s > >> 1915 sar_i32 > >> 1414 qemu_ld16s > >> 839 not_i32 > >> 579 setcond_i32 > >> 213 br > >> 42 ext32s_i64 > >> 30 mul_i64 > >> > >> But most other ops probably don't need any additional registers. It > >> could still be that with the extra register, some values could be kept > >> there instead of flushing to storage. > >> > >> >> cases, the number of registers used does not change. Moving the > >> >> registers around is what worries me too. > >> >> > >> >> But there are other effects too, the helpers are now compiled so that > >> >> the global env register is not used. Especially on hosts with low > >> >> number of registers this is not optimal. > >> > > >> > Most helpers are very small functions, so I am not sure more registers > >> > will help. > >> > >> $ nm --print-size --defined-only --size-sort --reverse-sort > >> obj-amd64/sparc-softmmu/op_helper.o |head -20 > >> 0000000000003a70 0000000000000c75 T helper_st_asi > >> 00000000000046f0 000000000000086d T helper_ld_asi > >> 0000000000002550 000000000000041b T __stw_mmu > >> 00000000000012c0 000000000000040c T __stq_mmu > >> 0000000000001d10 00000000000003c5 T __stl_mmu > >> 0000000000001a90 0000000000000277 T __ldq_mmu > >> 00000000000022f0 000000000000025f T __ldl_mmu > >> 0000000000002b50 000000000000024f T __ldw_mmu > >> 0000000000001870 0000000000000219 t slow_ldq_mmu > >> 00000000000020e0 0000000000000208 t slow_ldl_mmu > >> 00000000000033f0 0000000000000202 T helper_stqf > >> 0000000000002970 00000000000001de t slow_ldw_mmu > >> 0000000000005120 00000000000001dd T helper_fcmpeq > >> 0000000000003720 00000000000001c7 T helper_ldqf > >> 0000000000001120 000000000000019e t slow_stb_mmu > >> 00000000000016d0 000000000000019c T __stb_mmu > >> 0000000000002da0 000000000000018a t slow_ldb_mmu > >> 0000000000005410 0000000000000172 T helper_fcmped > >> 0000000000002f30 000000000000016d T __ldb_mmu > >> 0000000000000a70 0000000000000164 T do_unassigned_access > >> > >> These are not so small, but they also don't look like frequently used > >> ones, judging by the names. It would be more interesting to see the > >> sizes of heavily used helpers. > > > > Agreed. Also most of these helpers need access to env anyway, so it > > means one more argument that might be kept in a register in some parts > > of the helper. It might be a good idea to recompile this file without > > reserving a register for the env pointer, and passing it as an argument > > instead just to see the difference. > > Which boils down to the full conversion. Maybe that is the only way to > get meaningful performance statistics. > > >> >> > In any case at then end benchmarks is what are need to decided, TCG has > >> >> > always shown that performance improvements doesn't match the improvement > >> >> > analysis. > >> >> > >> >> If this turns out to be a bad idea, it means that the reverse > >> >> conversion will be beneficial and we should convert helper.c code to > >> >> op_helper.c and take advantage of the global env in a register. This > >> >> was actually my standpoint when the discussion started, but I'm > >> >> interested to see if this approach would work better. > >> > > >> > Does it mean that you plan to do the code changes in git without really > >> > benchmarking, and revert all the changes later if it was a bad idea? > >> > >> I don't think that that would be a good idea. There are actually > >> several possible plans that should be considered, based on the > >> performance: > >> - don't do anything > >> - full conversion > >> - avoid AREG0 use outside of generated code: move helpers from > >> op_helper.c to helper.c > >> - avoid AREG0 in cpu-exec.c etc. > >> - maximize AREG0 use: move helpers from helper.c etc. to op_helper.c > >> > >> It's also possible to refactor qemu_ld/st independently to the above > >> so that they still use AREG0. > >> > >> The performance neutral changes (1, 2, 4 to 7) may have merits of > >> their own, so they may be applied later if there are no objections. > >> > > > > It seems the choices there don't match the different possible plans > > above (5 vs 7). That said I am not really sure that we agree on what > > is performance neutral or not. > > 4 to 7 use regular stack in place of temp_buf array in CPUState. I'd I still don't get where are this list of possible changes? Did I miss something in another thread? > suppose this does not change anything performance wise, both the top > of stack and CPUState are frequently accessed. The generated > instructions are the same except for change of base register. Even if > we'd go for maximizing AREG0 use, I think this makes sense since we > can't use stack pointer for anything else. I agree that saving temp_local in the stack instead of register shouldn't have any impact. > With some effort, it may be possible to calculate the amount of TCG > temps needed and allocate just the right amount of stack. This could > be slightly beneficial for performance, now the size of temp_buf array > is fixed so it may have negative cache effects. I doubt this would be > worthwhile. Yes, it is something doable, and I agree it probably won't have much impact on performance. > > I have no objection to remove uses of TCG_AREG0 in GCC generated code, > > ie in the helpers, though I doubt it will make a real change on the > > performances: slightly bigger intermediate code, one more register move > > to call the helpers, on more free register for the helpers. That should > > not make a huge difference, and I agree it makes the code cleaner. > > Maybe I'm missing something, butI don't think this would be very > beneficial if AREG0 is not eliminated. The cleanup is a bonus, > likewise for improved portability. My point is having AREG0 from GCC generated code, so we don't have to ask GCC to explicitly not use the AREG0 register. It basically means TCG_ARGE0 disappearing from everywhere, except tcg/* . On the TCG generated code, the env structure is used almost for every op, so it really makes sense to keep it in a register instead of having to reload the address of env regularly from memory. Given it only affects TCG generated code, I don't see the point of portability here. > > On the other hand, I have objections to remove uses of TCG_AREG0 from > > the TCG part. This register is part of the TCG design and thus used very > > often. In any case we have to keep it for accesses to the globals, so we > > can keep it for softmmu load/stores too. > > Right, any changes would need to be backed by performance figures. > > > If we go for the removal of > > TCG_AREG0 from the GCC side, it probably means loading it in the prologue > > instead. > > Do you mean cpu-exec.c side with this? I think this is a very similar > register tradeoff case to helpers. I mean calling tcg_qemu_tb_exec(tb_ptr, env) instead of tcg_qemu_tb_exec(tb_ptr), and modification in the prologue to save the env pointer into the TCG_AREG0 register (which is kept internal to TCG generated code). -- Aurelien Jarno GPG: 1024D/F1BCDB73 aurelien@aurel32.net http://www.aurel32.net ^ permalink raw reply [flat|nested] 31+ messages in thread
* Re: [Qemu-devel] [PATCH RFC 00/11] AREG0 elimination 2011-05-15 10:19 ` Aurelien Jarno @ 2011-05-15 11:12 ` Blue Swirl 2011-05-15 11:36 ` Aurelien Jarno 0 siblings, 1 reply; 31+ messages in thread From: Blue Swirl @ 2011-05-15 11:12 UTC (permalink / raw) To: Aurelien Jarno; +Cc: qemu-devel On Sun, May 15, 2011 at 1:19 PM, Aurelien Jarno <aurelien@aurel32.net> wrote: > On Sun, May 15, 2011 at 12:58:49PM +0300, Blue Swirl wrote: >> On Sun, May 15, 2011 at 12:24 PM, Aurelien Jarno <aurelien@aurel32.net> wrote: >> > On Sun, May 15, 2011 at 10:15:52AM +0300, Blue Swirl wrote: >> >> On Sun, May 15, 2011 at 1:04 AM, Aurelien Jarno <aurelien@aurel32.net> wrote: >> >> > On Sun, May 15, 2011 at 12:52:35AM +0300, Blue Swirl wrote: >> >> >> On Sun, May 15, 2011 at 12:16 AM, Aurelien Jarno <aurelien@aurel32.net> wrote: >> >> >> > On Sat, May 14, 2011 at 10:35:20PM +0300, Blue Swirl wrote: >> >> >> >> Here's a RFC series for eliminating AREG0. >> >> >> >> >> >> >> >> Blue Swirl (11): >> >> >> >> Move user emulator stuff from cpu-exec.c to user-exec.c >> >> >> >> Delete unused tb_invalidate_page_range >> >> >> >> >> >> >> >> The above should be OK to commit. >> >> >> >> >> >> >> >> cpu_loop_exit: avoid using AREG0 >> >> >> >> Delegate setup of TCG temporaries to targets >> >> >> >> >> >> >> >> These two are not, unless the overall plan is OK. >> >> >> >> >> >> >> >> TCG: fix negative frame offset calculations >> >> >> >> TCG/x86: use stack for TCG temps >> >> >> >> TCG/Sparc64: use stack for TCG temps >> >> >> >> >> >> >> >> But these three should be OK. I've tested lightly x86_64 and Sparc64 hosts. >> >> >> >> >> >> >> >> Add CONFIG_TARGET_NEEDS_AREG0 >> >> >> >> Don't compile legacy qemu_ld/st functions if target doesn't need them >> >> >> >> >> >> >> >> Should be OK, though the latter patch only touches x86. >> >> >> >> >> >> >> >> Add new qemu_ld and qemu_st functions >> >> >> >> sparc: use new qemu_ld and qemu_st functions >> >> >> >> >> >> >> >> The last two compile but QEMU segfaults. I just made a naive >> >> >> >> conversion for getting comments. >> >> >> >> >> >> >> > >> >> >> > What is the goal behing removing TCG_AREG0? If it is speed improvement, >> >> >> > can you please provide some benchmarks? >> >> >> >> >> >> There was some discussion earlier about why this (or parts of the >> >> >> conversion) may be a speed improvement: >> >> >> >> >> >> http://article.gmane.org/gmane.comp.emulators.qemu/101826 >> >> >> http://article.gmane.org/gmane.comp.emulators.qemu/102156 >> >> > >> >> > Ok, looks like I have missed that. >> >> > >> >> >> There are no benchmarks yet. In fact, it may be difficult to make >> >> >> those without performing the removal completely. >> >> >> >> >> >> For example, patch 3/11 makes cpu_loop_exit take CPUState as a >> >> >> parameter instead of using global env, which would be available and >> >> >> the register is reserved anyway. This would only decrease performance >> >> >> at this stage, unless a complete conversion is done. I suspect the >> >> >> same would happen when moving all helpers from op_helper.c to >> >> >> helper.c. But after the whole conversion, this would be a neutral (no >> >> >> extra registers used) or even beneficial change (the code is free to >> >> >> use one more register). >> >> >> >> >> >> > The env register is used very often (basically for every load/store, but >> >> >> > also a lot of helpers), so it makes sense to reserve a register for it. >> >> >> > >> >> >> > For what I understand from your patch series, you prefer to pass this >> >> >> > register explicitly to TCG functions. This basically means this TCG >> >> >> > global will be loaded to host register as soon as it is used, but also >> >> >> > regularly, as globals are saved back to their canonical location before >> >> >> > an helper or a load/store. >> >> >> > >> >> >> > So it seems that this patch series will just allowing the "env register" >> >> >> > to change over time, though it will not spare one more register for the >> >> >> > TCG code, and it will emit longer TCG code to regularly reload the env >> >> >> > global into a host register. >> >> >> >> >> >> But there will be one more register available in some cases. In other >> >> > >> >> > Inside the TCG code, it will basically happens very rarely, given >> >> > load/store are really the most used instructions, and they need to load >> >> > the env register. >> >> >> >> Not exactly, from a sample run with -d op_opt: >> >> $ egrep -v -e '^$' -v -e 'OP after' -v -e ' end' -v -e 'Search PC' >> >> /tmp/qemu.log | awk '{print $1}' | sort | uniq -c|sort -rn >> >> 1673966 movi_i32 >> >> 653931 ld_i32 >> > >> > On which target is that? It looks like that some variables could be >> > mapped directly as a global to avoid so many load/stores. >> >> Sparc-softmmu, boot of Aurora-1.0 up to first choices screen in X. >> >> > Anyway, even this simple load instruction, needs a base register, and it >> > mosts cases this register is TCG_AREG0. I agree that it can be replaced by >> > another base register, but in anycase, a register has to be used for that. >> > >> > It means after such an instruction is encountered, one register is >> > reserved for this global up to the next helper/memory load store, so >> > there isn't any register spared, but more instructions generated to >> > explicitly load this register. >> > >> > I also just realized all globals registered through tcg_global_mem_new() >> > needs a base register and currently it is TCG_AREG0, that means even a >> > mov_i32 might need access to TCG_AREG0. How do you plan to solve that >> > part? >> >> In most cases the backing storage is CPUState, so cpu_env would be a >> logical choice. >> >> >> 607432 mov_i32 >> >> 428684 st_i32 >> >> 326878 movi_i64 >> >> 308626 add_i32 >> >> 283186 call >> >> 256817 exit_tb >> >> 207232 nopn >> >> 189388 goto_tb >> >> 122398 and_i32 >> >> 117997 shr_i32 >> >> 89107 qemu_ld32 >> >> 82926 set_label >> >> 82713 brcond_i32 >> >> 67169 qemu_st32 >> >> 55109 or_i32 >> >> 46536 ext32u_i64 >> >> 44288 xor_i32 >> >> 38103 sub_i32 >> >> 26361 shl_i32 >> >> 23218 shl_i64 >> >> 23218 qemu_st64 >> >> 23218 or_i64 >> >> 20474 shr_i64 >> >> 20445 qemu_ld64 >> >> 11161 qemu_ld8u >> >> 10409 qemu_st8 >> >> 5013 qemu_ld16u >> >> 3795 qemu_st16 >> >> 2776 qemu_ld8s >> >> 1915 sar_i32 >> >> 1414 qemu_ld16s >> >> 839 not_i32 >> >> 579 setcond_i32 >> >> 213 br >> >> 42 ext32s_i64 >> >> 30 mul_i64 >> >> >> >> But most other ops probably don't need any additional registers. It >> >> could still be that with the extra register, some values could be kept >> >> there instead of flushing to storage. >> >> >> >> >> cases, the number of registers used does not change. Moving the >> >> >> registers around is what worries me too. >> >> >> >> >> >> But there are other effects too, the helpers are now compiled so that >> >> >> the global env register is not used. Especially on hosts with low >> >> >> number of registers this is not optimal. >> >> > >> >> > Most helpers are very small functions, so I am not sure more registers >> >> > will help. >> >> >> >> $ nm --print-size --defined-only --size-sort --reverse-sort >> >> obj-amd64/sparc-softmmu/op_helper.o |head -20 >> >> 0000000000003a70 0000000000000c75 T helper_st_asi >> >> 00000000000046f0 000000000000086d T helper_ld_asi >> >> 0000000000002550 000000000000041b T __stw_mmu >> >> 00000000000012c0 000000000000040c T __stq_mmu >> >> 0000000000001d10 00000000000003c5 T __stl_mmu >> >> 0000000000001a90 0000000000000277 T __ldq_mmu >> >> 00000000000022f0 000000000000025f T __ldl_mmu >> >> 0000000000002b50 000000000000024f T __ldw_mmu >> >> 0000000000001870 0000000000000219 t slow_ldq_mmu >> >> 00000000000020e0 0000000000000208 t slow_ldl_mmu >> >> 00000000000033f0 0000000000000202 T helper_stqf >> >> 0000000000002970 00000000000001de t slow_ldw_mmu >> >> 0000000000005120 00000000000001dd T helper_fcmpeq >> >> 0000000000003720 00000000000001c7 T helper_ldqf >> >> 0000000000001120 000000000000019e t slow_stb_mmu >> >> 00000000000016d0 000000000000019c T __stb_mmu >> >> 0000000000002da0 000000000000018a t slow_ldb_mmu >> >> 0000000000005410 0000000000000172 T helper_fcmped >> >> 0000000000002f30 000000000000016d T __ldb_mmu >> >> 0000000000000a70 0000000000000164 T do_unassigned_access >> >> >> >> These are not so small, but they also don't look like frequently used >> >> ones, judging by the names. It would be more interesting to see the >> >> sizes of heavily used helpers. >> > >> > Agreed. Also most of these helpers need access to env anyway, so it >> > means one more argument that might be kept in a register in some parts >> > of the helper. It might be a good idea to recompile this file without >> > reserving a register for the env pointer, and passing it as an argument >> > instead just to see the difference. >> >> Which boils down to the full conversion. Maybe that is the only way to >> get meaningful performance statistics. >> >> >> >> > In any case at then end benchmarks is what are need to decided, TCG has >> >> >> > always shown that performance improvements doesn't match the improvement >> >> >> > analysis. >> >> >> >> >> >> If this turns out to be a bad idea, it means that the reverse >> >> >> conversion will be beneficial and we should convert helper.c code to >> >> >> op_helper.c and take advantage of the global env in a register. This >> >> >> was actually my standpoint when the discussion started, but I'm >> >> >> interested to see if this approach would work better. >> >> > >> >> > Does it mean that you plan to do the code changes in git without really >> >> > benchmarking, and revert all the changes later if it was a bad idea? >> >> >> >> I don't think that that would be a good idea. There are actually >> >> several possible plans that should be considered, based on the >> >> performance: >> >> - don't do anything >> >> - full conversion >> >> - avoid AREG0 use outside of generated code: move helpers from >> >> op_helper.c to helper.c >> >> - avoid AREG0 in cpu-exec.c etc. >> >> - maximize AREG0 use: move helpers from helper.c etc. to op_helper.c >> >> >> >> It's also possible to refactor qemu_ld/st independently to the above >> >> so that they still use AREG0. >> >> >> >> The performance neutral changes (1, 2, 4 to 7) may have merits of >> >> their own, so they may be applied later if there are no objections. >> >> >> > >> > It seems the choices there don't match the different possible plans >> > above (5 vs 7). That said I am not really sure that we agree on what >> > is performance neutral or not. >> >> 4 to 7 use regular stack in place of temp_buf array in CPUState. I'd > > I still don't get where are this list of possible changes? Did I miss > something in another thread? I'm referring to the patches I sent. >> suppose this does not change anything performance wise, both the top >> of stack and CPUState are frequently accessed. The generated >> instructions are the same except for change of base register. Even if >> we'd go for maximizing AREG0 use, I think this makes sense since we >> can't use stack pointer for anything else. > > I agree that saving temp_local in the stack instead of register > shouldn't have any impact. > >> With some effort, it may be possible to calculate the amount of TCG >> temps needed and allocate just the right amount of stack. This could >> be slightly beneficial for performance, now the size of temp_buf array >> is fixed so it may have negative cache effects. I doubt this would be >> worthwhile. > > Yes, it is something doable, and I agree it probably won't have much > impact on performance. > >> > I have no objection to remove uses of TCG_AREG0 in GCC generated code, >> > ie in the helpers, though I doubt it will make a real change on the >> > performances: slightly bigger intermediate code, one more register move >> > to call the helpers, on more free register for the helpers. That should >> > not make a huge difference, and I agree it makes the code cleaner. >> >> Maybe I'm missing something, butI don't think this would be very >> beneficial if AREG0 is not eliminated. The cleanup is a bonus, >> likewise for improved portability. > > My point is having AREG0 from GCC generated code, so we don't have to ask > GCC to explicitly not use the AREG0 register. It basically means > TCG_ARGE0 disappearing from everywhere, except tcg/* . > > On the TCG generated code, the env structure is used almost for every > op, so it really makes sense to keep it in a register instead of having to > reload the address of env regularly from memory. Given it only affects > TCG generated code, I don't see the point of portability here. For example, maybe the bugs in Sparc glibc could be avoided by using one of %i set of registers (not accessible from helpers) for AREG0 within generated code instead of %g registers which seem to be fragile. >> > On the other hand, I have objections to remove uses of TCG_AREG0 from >> > the TCG part. This register is part of the TCG design and thus used very >> > often. In any case we have to keep it for accesses to the globals, so we >> > can keep it for softmmu load/stores too. >> >> Right, any changes would need to be backed by performance figures. >> >> > If we go for the removal of >> > TCG_AREG0 from the GCC side, it probably means loading it in the prologue >> > instead. >> >> Do you mean cpu-exec.c side with this? I think this is a very similar >> register tradeoff case to helpers. > > I mean calling tcg_qemu_tb_exec(tb_ptr, env) instead of > tcg_qemu_tb_exec(tb_ptr), and modification in the prologue to save the > env pointer into the TCG_AREG0 register (which is kept internal to TCG > generated code). Maybe it would be easy to test and benchmark this change, it doesn't affect generated code or helpers. ^ permalink raw reply [flat|nested] 31+ messages in thread
* Re: [Qemu-devel] [PATCH RFC 00/11] AREG0 elimination 2011-05-15 11:12 ` Blue Swirl @ 2011-05-15 11:36 ` Aurelien Jarno 2011-05-15 12:30 ` Blue Swirl 0 siblings, 1 reply; 31+ messages in thread From: Aurelien Jarno @ 2011-05-15 11:36 UTC (permalink / raw) To: Blue Swirl; +Cc: qemu-devel On Sun, May 15, 2011 at 02:12:27PM +0300, Blue Swirl wrote: > On Sun, May 15, 2011 at 1:19 PM, Aurelien Jarno <aurelien@aurel32.net> wrote: > > I still don't get where are this list of possible changes? Did I miss > > something in another thread? > > I'm referring to the patches I sent. Ok, patches 1, 2 and 4 to 7 looks ok at a first glance, though I think patches 6 and 7 should be done for all hosts or none of them. > > On the TCG generated code, the env structure is used almost for every > > op, so it really makes sense to keep it in a register instead of having to > > reload the address of env regularly from memory. Given it only affects > > TCG generated code, I don't see the point of portability here. > > For example, maybe the bugs in Sparc glibc could be avoided by using > one of %i set of registers (not accessible from helpers) for AREG0 > within generated code instead of %g registers which seem to be > fragile. First of all, but it's a different subject, I am not sure there are sparc glibc bugs, I'd rather says QEMU mis-uses some register. For example the following code is probably wrong: /* Note: must be synced with dyngen-exec.h */ #ifdef CONFIG_SOLARIS #define TCG_AREG0 TCG_REG_G2 #elif defined(__sparc_v9__) #define TCG_AREG0 TCG_REG_G5 #else #define TCG_AREG0 TCG_REG_G6 #endif __sparc_v9__ can set on the 32-bit ABI, when the compiler targets V8+, so the condition is probably wrong there. Secondly the SPARC ABI [1] on page 23 that %g5 to %g7 are reserved for system. I don't think QEMU has the right to use this registers. Anyway, I don't see why keeping TCG_AREG0 inside the TCG generated code would prevent you to use a register from the %i set for it. > >> > On the other hand, I have objections to remove uses of TCG_AREG0 from > >> > the TCG part. This register is part of the TCG design and thus used very > >> > often. In any case we have to keep it for accesses to the globals, so we > >> > can keep it for softmmu load/stores too. > >> > >> Right, any changes would need to be backed by performance figures. > >> > >> > If we go for the removal of > >> > TCG_AREG0 from the GCC side, it probably means loading it in the prologue > >> > instead. > >> > >> Do you mean cpu-exec.c side with this? I think this is a very similar > >> register tradeoff case to helpers. > > > > I mean calling tcg_qemu_tb_exec(tb_ptr, env) instead of > > tcg_qemu_tb_exec(tb_ptr), and modification in the prologue to save the > > env pointer into the TCG_AREG0 register (which is kept internal to TCG > > generated code). > > Maybe it would be easy to test and benchmark this change, it doesn't > affect generated code or helpers. > Yes it's something that can be benchmarked. From experience it won't make any significant performance change, it's basically two register moves (one in the caller, one in the prologue), for each TB which is not chained. In any case, it will introduce a slight overhead that has to be compensated by allowing the helpers to use an additional register. [1] http://www.sparc.com/standards/psABI3rd.pdf -- Aurelien Jarno GPG: 1024D/F1BCDB73 aurelien@aurel32.net http://www.aurel32.net ^ permalink raw reply [flat|nested] 31+ messages in thread
* Re: [Qemu-devel] [PATCH RFC 00/11] AREG0 elimination 2011-05-15 11:36 ` Aurelien Jarno @ 2011-05-15 12:30 ` Blue Swirl 2011-05-15 12:49 ` Aurelien Jarno 0 siblings, 1 reply; 31+ messages in thread From: Blue Swirl @ 2011-05-15 12:30 UTC (permalink / raw) To: Aurelien Jarno; +Cc: qemu-devel On Sun, May 15, 2011 at 2:36 PM, Aurelien Jarno <aurelien@aurel32.net> wrote: > On Sun, May 15, 2011 at 02:12:27PM +0300, Blue Swirl wrote: >> On Sun, May 15, 2011 at 1:19 PM, Aurelien Jarno <aurelien@aurel32.net> wrote: >> > I still don't get where are this list of possible changes? Did I miss >> > something in another thread? >> >> I'm referring to the patches I sent. > > Ok, patches 1, 2 and 4 to 7 looks ok at a first glance, though I think > patches 6 and 7 should be done for all hosts or none of them. The changes can be done in steps, but of course removing temp_buf from CPUState would need all targets to be converted first. >> > On the TCG generated code, the env structure is used almost for every >> > op, so it really makes sense to keep it in a register instead of having to >> > reload the address of env regularly from memory. Given it only affects >> > TCG generated code, I don't see the point of portability here. >> >> For example, maybe the bugs in Sparc glibc could be avoided by using >> one of %i set of registers (not accessible from helpers) for AREG0 >> within generated code instead of %g registers which seem to be >> fragile. > > First of all, but it's a different subject, I am not sure there are > sparc glibc bugs, I'd rather says QEMU mis-uses some register. For > example the following code is probably wrong: > > /* Note: must be synced with dyngen-exec.h */ > #ifdef CONFIG_SOLARIS > #define TCG_AREG0 TCG_REG_G2 > #elif defined(__sparc_v9__) > #define TCG_AREG0 TCG_REG_G5 > #else > #define TCG_AREG0 TCG_REG_G6 > #endif > > __sparc_v9__ can set on the 32-bit ABI, when the compiler targets V8+, > so the condition is probably wrong there. Secondly the SPARC ABI [1] on > page 23 that %g5 to %g7 are reserved for system. I don't think QEMU has > the right to use this registers. Yes, but the situation is not so nice. Please see this post for status as of 2010: http://article.gmane.org/gmane.comp.emulators.qemu/63610 This is from Debian glibc 2.11.2-10: $ file /lib/libc-2.11.2.so /lib/libc-2.11.2.so: ELF 32-bit MSB shared object, SPARC32PLUS, version 1 (SYSV), dynamically linked (uses shared libs), for GNU/Linux 2.6.18, stripped $ objdump -d /lib/libc.so.6 |grep %g1|wc -l 69648 $ objdump -d /lib/libc.so.6 |grep %g2|wc -l 37299 $ objdump -d /lib/libc.so.6 |grep %g3|wc -l 20635 $ objdump -d /lib/libc.so.6 |grep %g4|wc -l 11603 $ objdump -d /lib/libc.so.6 |grep %g5|wc -l 448 $ objdump -d /lib/libc.so.6 |grep %g6|wc -l 150 $ objdump -d /lib/libc.so.6 |grep %g7|wc -l 3052 Glibc is compiled for Sparc32plus, so it should only use %g6 and %g7, or %g1 and %g5 for scratch purposes. However, it is the application registers %g2 to %g4 that are used heaviest. Looking inside the objdump it's easy to see that the uses are not for example saving or restoring, but actually using them without saving the previous value first: 000211e0 <__divdi3>: 211e0: 9d e3 bf a0 save %sp, -96, %sp 211e4: 90 10 00 18 mov %i0, %o0 211e8: 92 10 00 19 mov %i1, %o1 211ec: 94 10 00 1a mov %i2, %o2 211f0: 96 10 00 1b mov %i3, %o3 211f4: 80 a6 20 00 cmp %i0, 0 211f8: 06 40 00 10 bl,pn %icc, 21238 <__divdi3+0x58> 211fc: a0 10 20 00 clr %l0 21200: 80 a2 a0 00 cmp %o2, 0 21204: 26 40 00 13 bl,a,pn %icc, 21250 <__divdi3+0x70> 21208: a0 38 00 10 xnor %g0, %l0, %l0 2120c: 7f ff fe ed call 20dc0 <__ashldi3+0x40> 21210: 98 10 20 00 clr %o4 21214: 84 10 00 08 mov %o0, %g2 ...whoops... 21218: 80 a4 20 00 cmp %l0, 0 2121c: 02 40 00 04 be,pn %icc, 2122c <__divdi3+0x4c> 21220: 86 10 00 09 mov %o1, %g3 ...whoops... 21224: 86 a0 00 09 subcc %g0, %o1, %g3 21228: 84 60 00 02 subc %g0, %g2, %g2 2122c: b2 10 00 03 mov %g3, %i1 21230: 81 cf e0 08 rett %i7 + 8 21234: 90 10 00 02 mov %g2, %o0 21238: 92 a0 00 19 subcc %g0, %i1, %o1 2123c: 90 60 00 18 subc %g0, %i0, %o0 21240: 80 a2 a0 00 cmp %o2, 0 21244: 16 4f ff f2 bge %icc, 2120c <__divdi3+0x2c> 21248: a0 10 3f ff mov -1, %l0 2124c: a0 38 00 10 xnor %g0, %l0, %l0 21250: 96 a0 00 0b subcc %g0, %o3, %o3 21254: 10 6f ff ee b %xcc, 2120c <__divdi3+0x2c> 21258: 94 60 00 0a subc %g0, %o2, %o2 2125c: 01 00 00 00 nop This is libc from OpenBSD/Sparc64 4.9: $ objdump -d /usr/lib/libc.so.58.0 |grep %g1|wc -l 40562 $ objdump -d /usr/lib/libc.so.58.0 |grep %g2|wc -l 20384 $ objdump -d /usr/lib/libc.so.58.0 |grep %g3|wc -l 10240 $ objdump -d /usr/lib/libc.so.58.0 |grep %g4|wc -l 6606 $ objdump -d /usr/lib/libc.so.58.0 |grep %g5|wc -l 3811 $ objdump -d /usr/lib/libc.so.58.0 |grep %g6|wc -l 4 $ objdump -d /usr/lib/libc.so.58.0 |grep %g7|wc -l 20 Not so great there either. > Anyway, I don't see why keeping TCG_AREG0 inside the TCG generated code > would prevent you to use a register from the %i set for it. The helpers currently use global env register, but %i registers can't be accessed from the next level of function call nesting hierarchy so they can't be used for global env. >> >> > On the other hand, I have objections to remove uses of TCG_AREG0 from >> >> > the TCG part. This register is part of the TCG design and thus used very >> >> > often. In any case we have to keep it for accesses to the globals, so we >> >> > can keep it for softmmu load/stores too. >> >> >> >> Right, any changes would need to be backed by performance figures. >> >> >> >> > If we go for the removal of >> >> > TCG_AREG0 from the GCC side, it probably means loading it in the prologue >> >> > instead. >> >> >> >> Do you mean cpu-exec.c side with this? I think this is a very similar >> >> register tradeoff case to helpers. >> > >> > I mean calling tcg_qemu_tb_exec(tb_ptr, env) instead of >> > tcg_qemu_tb_exec(tb_ptr), and modification in the prologue to save the >> > env pointer into the TCG_AREG0 register (which is kept internal to TCG >> > generated code). >> >> Maybe it would be easy to test and benchmark this change, it doesn't >> affect generated code or helpers. >> > > Yes it's something that can be benchmarked. From experience it won't > make any significant performance change, it's basically two register > moves (one in the caller, one in the prologue), for each TB which is > not chained. In any case, it will introduce a slight overhead that has > to be compensated by allowing the helpers to use an additional > register. But like in the helper case, cpu-exec.c can then be compiled without HELPER_CFLAGS or global env register, which would make that register available for GCC. This may or may not compensate for the extra moves. ^ permalink raw reply [flat|nested] 31+ messages in thread
* Re: [Qemu-devel] [PATCH RFC 00/11] AREG0 elimination 2011-05-15 12:30 ` Blue Swirl @ 2011-05-15 12:49 ` Aurelien Jarno 2011-05-15 13:16 ` Blue Swirl 0 siblings, 1 reply; 31+ messages in thread From: Aurelien Jarno @ 2011-05-15 12:49 UTC (permalink / raw) To: Blue Swirl; +Cc: qemu-devel On Sun, May 15, 2011 at 03:30:17PM +0300, Blue Swirl wrote: > On Sun, May 15, 2011 at 2:36 PM, Aurelien Jarno <aurelien@aurel32.net> wrote: > > On Sun, May 15, 2011 at 02:12:27PM +0300, Blue Swirl wrote: > >> On Sun, May 15, 2011 at 1:19 PM, Aurelien Jarno <aurelien@aurel32.net> wrote: > >> > I still don't get where are this list of possible changes? Did I miss > >> > something in another thread? > >> > >> I'm referring to the patches I sent. > > > > Ok, patches 1, 2 and 4 to 7 looks ok at a first glance, though I think > > patches 6 and 7 should be done for all hosts or none of them. > > The changes can be done in steps, but of course removing temp_buf from > CPUState would need all targets to be converted first. > > >> > On the TCG generated code, the env structure is used almost for every > >> > op, so it really makes sense to keep it in a register instead of having to > >> > reload the address of env regularly from memory. Given it only affects > >> > TCG generated code, I don't see the point of portability here. > >> > >> For example, maybe the bugs in Sparc glibc could be avoided by using > >> one of %i set of registers (not accessible from helpers) for AREG0 > >> within generated code instead of %g registers which seem to be > >> fragile. > > > > First of all, but it's a different subject, I am not sure there are > > sparc glibc bugs, I'd rather says QEMU mis-uses some register. For > > example the following code is probably wrong: > > > > /* Note: must be synced with dyngen-exec.h */ > > #ifdef CONFIG_SOLARIS > > #define TCG_AREG0 TCG_REG_G2 > > #elif defined(__sparc_v9__) > > #define TCG_AREG0 TCG_REG_G5 > > #else > > #define TCG_AREG0 TCG_REG_G6 > > #endif > > > > __sparc_v9__ can set on the 32-bit ABI, when the compiler targets V8+, > > so the condition is probably wrong there. Secondly the SPARC ABI [1] on > > page 23 that %g5 to %g7 are reserved for system. I don't think QEMU has > > the right to use this registers. > > Yes, but the situation is not so nice. Please see this post for status > as of 2010: > http://article.gmane.org/gmane.comp.emulators.qemu/63610 > > This is from Debian glibc 2.11.2-10: > $ file /lib/libc-2.11.2.so > /lib/libc-2.11.2.so: ELF 32-bit MSB shared object, SPARC32PLUS, > version 1 (SYSV), dynamically linked (uses shared libs), for GNU/Linux > 2.6.18, stripped > $ objdump -d /lib/libc.so.6 |grep %g1|wc -l > 69648 > $ objdump -d /lib/libc.so.6 |grep %g2|wc -l > 37299 > $ objdump -d /lib/libc.so.6 |grep %g3|wc -l > 20635 > $ objdump -d /lib/libc.so.6 |grep %g4|wc -l > 11603 > $ objdump -d /lib/libc.so.6 |grep %g5|wc -l > 448 > $ objdump -d /lib/libc.so.6 |grep %g6|wc -l > 150 > $ objdump -d /lib/libc.so.6 |grep %g7|wc -l > 3052 > > Glibc is compiled for Sparc32plus, so it should only use %g6 and %g7, >From the calling convention point of view, sparc32 and sparc32plus are the same ABI, so %g5 is also reserved for system use. > or %g1 and %g5 for scratch purposes. However, it is the application > registers %g2 to %g4 that are used heaviest. Looking inside the > objdump it's easy to see that the uses are not for example saving or > restoring, but actually using them without saving the previous value > first: Well, we have to define system and application. System is defined as library in Chapter 6, and I don't see the libc there, and is probably considered as part of the application. > 000211e0 <__divdi3>: > 211e0: 9d e3 bf a0 save %sp, -96, %sp > 211e4: 90 10 00 18 mov %i0, %o0 > 211e8: 92 10 00 19 mov %i1, %o1 > 211ec: 94 10 00 1a mov %i2, %o2 > 211f0: 96 10 00 1b mov %i3, %o3 > 211f4: 80 a6 20 00 cmp %i0, 0 > 211f8: 06 40 00 10 bl,pn %icc, 21238 <__divdi3+0x58> > 211fc: a0 10 20 00 clr %l0 > 21200: 80 a2 a0 00 cmp %o2, 0 > 21204: 26 40 00 13 bl,a,pn %icc, 21250 <__divdi3+0x70> > 21208: a0 38 00 10 xnor %g0, %l0, %l0 > 2120c: 7f ff fe ed call 20dc0 <__ashldi3+0x40> > 21210: 98 10 20 00 clr %o4 > 21214: 84 10 00 08 mov %o0, %g2 > > ...whoops... > > 21218: 80 a4 20 00 cmp %l0, 0 > 2121c: 02 40 00 04 be,pn %icc, 2122c <__divdi3+0x4c> > 21220: 86 10 00 09 mov %o1, %g3 > > ...whoops... > > 21224: 86 a0 00 09 subcc %g0, %o1, %g3 > 21228: 84 60 00 02 subc %g0, %g2, %g2 > 2122c: b2 10 00 03 mov %g3, %i1 > 21230: 81 cf e0 08 rett %i7 + 8 > 21234: 90 10 00 02 mov %g2, %o0 > 21238: 92 a0 00 19 subcc %g0, %i1, %o1 > 2123c: 90 60 00 18 subc %g0, %i0, %o0 > 21240: 80 a2 a0 00 cmp %o2, 0 > 21244: 16 4f ff f2 bge %icc, 2120c <__divdi3+0x2c> > 21248: a0 10 3f ff mov -1, %l0 > 2124c: a0 38 00 10 xnor %g0, %l0, %l0 > 21250: 96 a0 00 0b subcc %g0, %o3, %o3 > 21254: 10 6f ff ee b %xcc, 2120c <__divdi3+0x2c> > 21258: 94 60 00 0a subc %g0, %o2, %o2 > 2125c: 01 00 00 00 nop > > This is libc from OpenBSD/Sparc64 4.9: > $ objdump -d /usr/lib/libc.so.58.0 |grep %g1|wc -l > 40562 > $ objdump -d /usr/lib/libc.so.58.0 |grep %g2|wc -l > 20384 > $ objdump -d /usr/lib/libc.so.58.0 |grep %g3|wc -l > 10240 > $ objdump -d /usr/lib/libc.so.58.0 |grep %g4|wc -l > 6606 > $ objdump -d /usr/lib/libc.so.58.0 |grep %g5|wc -l > 3811 > $ objdump -d /usr/lib/libc.so.58.0 |grep %g6|wc -l > 4 > $ objdump -d /usr/lib/libc.so.58.0 |grep %g7|wc -l > 20 > > Not so great there either. > > > Anyway, I don't see why keeping TCG_AREG0 inside the TCG generated code > > would prevent you to use a register from the %i set for it. > > The helpers currently use global env register, but %i registers can't > be accessed from the next level of function call nesting hierarchy so > they can't be used for global env. > That's the current situation yes. Using %i registers for TCG_AREG0 does mean you can't use a global env register in the helpers, but it doesn't mean that internal TCG code can't use them for TCG_AREG0. What I am telling you since the beginning is that: - I have no objection that we stop using a fixed register in GCC generated code (that is completely removing HELPER_CFLAGS). However I don't really see the point of doing that, though the Sparc issue might be an argument. - I do have objection to remove TCG_AREG0 from inside the TCG generated code, this register is used for almost every TCG op, and I don't see any real argument for not keeping it. -- Aurelien Jarno GPG: 1024D/F1BCDB73 aurelien@aurel32.net http://www.aurel32.net ^ permalink raw reply [flat|nested] 31+ messages in thread
* Re: [Qemu-devel] [PATCH RFC 00/11] AREG0 elimination 2011-05-15 12:49 ` Aurelien Jarno @ 2011-05-15 13:16 ` Blue Swirl 2011-05-15 13:43 ` Aurelien Jarno 0 siblings, 1 reply; 31+ messages in thread From: Blue Swirl @ 2011-05-15 13:16 UTC (permalink / raw) To: Aurelien Jarno; +Cc: qemu-devel On Sun, May 15, 2011 at 3:49 PM, Aurelien Jarno <aurelien@aurel32.net> wrote: > On Sun, May 15, 2011 at 03:30:17PM +0300, Blue Swirl wrote: >> On Sun, May 15, 2011 at 2:36 PM, Aurelien Jarno <aurelien@aurel32.net> wrote: >> > On Sun, May 15, 2011 at 02:12:27PM +0300, Blue Swirl wrote: >> >> On Sun, May 15, 2011 at 1:19 PM, Aurelien Jarno <aurelien@aurel32.net> wrote: >> >> > I still don't get where are this list of possible changes? Did I miss >> >> > something in another thread? >> >> >> >> I'm referring to the patches I sent. >> > >> > Ok, patches 1, 2 and 4 to 7 looks ok at a first glance, though I think >> > patches 6 and 7 should be done for all hosts or none of them. >> >> The changes can be done in steps, but of course removing temp_buf from >> CPUState would need all targets to be converted first. >> >> >> > On the TCG generated code, the env structure is used almost for every >> >> > op, so it really makes sense to keep it in a register instead of having to >> >> > reload the address of env regularly from memory. Given it only affects >> >> > TCG generated code, I don't see the point of portability here. >> >> >> >> For example, maybe the bugs in Sparc glibc could be avoided by using >> >> one of %i set of registers (not accessible from helpers) for AREG0 >> >> within generated code instead of %g registers which seem to be >> >> fragile. >> > >> > First of all, but it's a different subject, I am not sure there are >> > sparc glibc bugs, I'd rather says QEMU mis-uses some register. For >> > example the following code is probably wrong: >> > >> > /* Note: must be synced with dyngen-exec.h */ >> > #ifdef CONFIG_SOLARIS >> > #define TCG_AREG0 TCG_REG_G2 >> > #elif defined(__sparc_v9__) >> > #define TCG_AREG0 TCG_REG_G5 >> > #else >> > #define TCG_AREG0 TCG_REG_G6 >> > #endif >> > >> > __sparc_v9__ can set on the 32-bit ABI, when the compiler targets V8+, >> > so the condition is probably wrong there. Secondly the SPARC ABI [1] on >> > page 23 that %g5 to %g7 are reserved for system. I don't think QEMU has >> > the right to use this registers. >> >> Yes, but the situation is not so nice. Please see this post for status >> as of 2010: >> http://article.gmane.org/gmane.comp.emulators.qemu/63610 >> >> This is from Debian glibc 2.11.2-10: >> $ file /lib/libc-2.11.2.so >> /lib/libc-2.11.2.so: ELF 32-bit MSB shared object, SPARC32PLUS, >> version 1 (SYSV), dynamically linked (uses shared libs), for GNU/Linux >> 2.6.18, stripped >> $ objdump -d /lib/libc.so.6 |grep %g1|wc -l >> 69648 >> $ objdump -d /lib/libc.so.6 |grep %g2|wc -l >> 37299 >> $ objdump -d /lib/libc.so.6 |grep %g3|wc -l >> 20635 >> $ objdump -d /lib/libc.so.6 |grep %g4|wc -l >> 11603 >> $ objdump -d /lib/libc.so.6 |grep %g5|wc -l >> 448 >> $ objdump -d /lib/libc.so.6 |grep %g6|wc -l >> 150 >> $ objdump -d /lib/libc.so.6 |grep %g7|wc -l >> 3052 >> >> Glibc is compiled for Sparc32plus, so it should only use %g6 and %g7, > > From the calling convention point of view, sparc32 and sparc32plus are > the same ABI, so %g5 is also reserved for system use. > >> or %g1 and %g5 for scratch purposes. However, it is the application >> registers %g2 to %g4 that are used heaviest. Looking inside the >> objdump it's easy to see that the uses are not for example saving or >> restoring, but actually using them without saving the previous value >> first: > > Well, we have to define system and application. System is defined as > library in Chapter 6, and I don't see the libc there, and is probably > considered as part of the application. No, for example unistd.h is described and even X11. GCC also says that libraries should be compiled without using the registers: http://gcc.gnu.org/onlinedocs/gcc-4.6.0/gcc/SPARC-Options.html#SPARC-Options >> 000211e0 <__divdi3>: >> 211e0: 9d e3 bf a0 save %sp, -96, %sp >> 211e4: 90 10 00 18 mov %i0, %o0 >> 211e8: 92 10 00 19 mov %i1, %o1 >> 211ec: 94 10 00 1a mov %i2, %o2 >> 211f0: 96 10 00 1b mov %i3, %o3 >> 211f4: 80 a6 20 00 cmp %i0, 0 >> 211f8: 06 40 00 10 bl,pn %icc, 21238 <__divdi3+0x58> >> 211fc: a0 10 20 00 clr %l0 >> 21200: 80 a2 a0 00 cmp %o2, 0 >> 21204: 26 40 00 13 bl,a,pn %icc, 21250 <__divdi3+0x70> >> 21208: a0 38 00 10 xnor %g0, %l0, %l0 >> 2120c: 7f ff fe ed call 20dc0 <__ashldi3+0x40> >> 21210: 98 10 20 00 clr %o4 >> 21214: 84 10 00 08 mov %o0, %g2 >> >> ...whoops... >> >> 21218: 80 a4 20 00 cmp %l0, 0 >> 2121c: 02 40 00 04 be,pn %icc, 2122c <__divdi3+0x4c> >> 21220: 86 10 00 09 mov %o1, %g3 >> >> ...whoops... >> >> 21224: 86 a0 00 09 subcc %g0, %o1, %g3 >> 21228: 84 60 00 02 subc %g0, %g2, %g2 >> 2122c: b2 10 00 03 mov %g3, %i1 >> 21230: 81 cf e0 08 rett %i7 + 8 >> 21234: 90 10 00 02 mov %g2, %o0 >> 21238: 92 a0 00 19 subcc %g0, %i1, %o1 >> 2123c: 90 60 00 18 subc %g0, %i0, %o0 >> 21240: 80 a2 a0 00 cmp %o2, 0 >> 21244: 16 4f ff f2 bge %icc, 2120c <__divdi3+0x2c> >> 21248: a0 10 3f ff mov -1, %l0 >> 2124c: a0 38 00 10 xnor %g0, %l0, %l0 >> 21250: 96 a0 00 0b subcc %g0, %o3, %o3 >> 21254: 10 6f ff ee b %xcc, 2120c <__divdi3+0x2c> >> 21258: 94 60 00 0a subc %g0, %o2, %o2 >> 2125c: 01 00 00 00 nop >> >> This is libc from OpenBSD/Sparc64 4.9: >> $ objdump -d /usr/lib/libc.so.58.0 |grep %g1|wc -l >> 40562 >> $ objdump -d /usr/lib/libc.so.58.0 |grep %g2|wc -l >> 20384 >> $ objdump -d /usr/lib/libc.so.58.0 |grep %g3|wc -l >> 10240 >> $ objdump -d /usr/lib/libc.so.58.0 |grep %g4|wc -l >> 6606 >> $ objdump -d /usr/lib/libc.so.58.0 |grep %g5|wc -l >> 3811 >> $ objdump -d /usr/lib/libc.so.58.0 |grep %g6|wc -l >> 4 >> $ objdump -d /usr/lib/libc.so.58.0 |grep %g7|wc -l >> 20 >> >> Not so great there either. >> >> > Anyway, I don't see why keeping TCG_AREG0 inside the TCG generated code >> > would prevent you to use a register from the %i set for it. >> >> The helpers currently use global env register, but %i registers can't >> be accessed from the next level of function call nesting hierarchy so >> they can't be used for global env. >> > > That's the current situation yes. Using %i registers for TCG_AREG0 does > mean you can't use a global env register in the helpers, but it doesn't > mean that internal TCG code can't use them for TCG_AREG0. Exactly. > What I am telling you since the beginning is that: > - I have no objection that we stop using a fixed register in GCC > generated code (that is completely removing HELPER_CFLAGS). However I > don't really see the point of doing that, though the Sparc issue might > be an argument. > - I do have objection to remove TCG_AREG0 from inside the TCG generated > code, this register is used for almost every TCG op, and I don't see > any real argument for not keeping it. I'm pretty much open at this point for all alternatives. ^ permalink raw reply [flat|nested] 31+ messages in thread
* Re: [Qemu-devel] [PATCH RFC 00/11] AREG0 elimination 2011-05-15 13:16 ` Blue Swirl @ 2011-05-15 13:43 ` Aurelien Jarno 2011-05-15 14:02 ` Blue Swirl 0 siblings, 1 reply; 31+ messages in thread From: Aurelien Jarno @ 2011-05-15 13:43 UTC (permalink / raw) To: Blue Swirl; +Cc: qemu-devel On Sun, May 15, 2011 at 04:16:20PM +0300, Blue Swirl wrote: > On Sun, May 15, 2011 at 3:49 PM, Aurelien Jarno <aurelien@aurel32.net> wrote: > > On Sun, May 15, 2011 at 03:30:17PM +0300, Blue Swirl wrote: > >> On Sun, May 15, 2011 at 2:36 PM, Aurelien Jarno <aurelien@aurel32.net> wrote: > >> > On Sun, May 15, 2011 at 02:12:27PM +0300, Blue Swirl wrote: > >> >> On Sun, May 15, 2011 at 1:19 PM, Aurelien Jarno <aurelien@aurel32.net> wrote: > >> >> > I still don't get where are this list of possible changes? Did I miss > >> >> > something in another thread? > >> >> > >> >> I'm referring to the patches I sent. > >> > > >> > Ok, patches 1, 2 and 4 to 7 looks ok at a first glance, though I think > >> > patches 6 and 7 should be done for all hosts or none of them. > >> > >> The changes can be done in steps, but of course removing temp_buf from > >> CPUState would need all targets to be converted first. > >> > >> >> > On the TCG generated code, the env structure is used almost for every > >> >> > op, so it really makes sense to keep it in a register instead of having to > >> >> > reload the address of env regularly from memory. Given it only affects > >> >> > TCG generated code, I don't see the point of portability here. > >> >> > >> >> For example, maybe the bugs in Sparc glibc could be avoided by using > >> >> one of %i set of registers (not accessible from helpers) for AREG0 > >> >> within generated code instead of %g registers which seem to be > >> >> fragile. > >> > > >> > First of all, but it's a different subject, I am not sure there are > >> > sparc glibc bugs, I'd rather says QEMU mis-uses some register. For > >> > example the following code is probably wrong: > >> > > >> > /* Note: must be synced with dyngen-exec.h */ > >> > #ifdef CONFIG_SOLARIS > >> > #define TCG_AREG0 TCG_REG_G2 > >> > #elif defined(__sparc_v9__) > >> > #define TCG_AREG0 TCG_REG_G5 > >> > #else > >> > #define TCG_AREG0 TCG_REG_G6 > >> > #endif > >> > > >> > __sparc_v9__ can set on the 32-bit ABI, when the compiler targets V8+, > >> > so the condition is probably wrong there. Secondly the SPARC ABI [1] on > >> > page 23 that %g5 to %g7 are reserved for system. I don't think QEMU has > >> > the right to use this registers. > >> > >> Yes, but the situation is not so nice. Please see this post for status > >> as of 2010: > >> http://article.gmane.org/gmane.comp.emulators.qemu/63610 > >> > >> This is from Debian glibc 2.11.2-10: > >> $ file /lib/libc-2.11.2.so > >> /lib/libc-2.11.2.so: ELF 32-bit MSB shared object, SPARC32PLUS, > >> version 1 (SYSV), dynamically linked (uses shared libs), for GNU/Linux > >> 2.6.18, stripped > >> $ objdump -d /lib/libc.so.6 |grep %g1|wc -l > >> 69648 > >> $ objdump -d /lib/libc.so.6 |grep %g2|wc -l > >> 37299 > >> $ objdump -d /lib/libc.so.6 |grep %g3|wc -l > >> 20635 > >> $ objdump -d /lib/libc.so.6 |grep %g4|wc -l > >> 11603 > >> $ objdump -d /lib/libc.so.6 |grep %g5|wc -l > >> 448 > >> $ objdump -d /lib/libc.so.6 |grep %g6|wc -l > >> 150 > >> $ objdump -d /lib/libc.so.6 |grep %g7|wc -l > >> 3052 > >> > >> Glibc is compiled for Sparc32plus, so it should only use %g6 and %g7, > > > > From the calling convention point of view, sparc32 and sparc32plus are > > the same ABI, so %g5 is also reserved for system use. > > > >> or %g1 and %g5 for scratch purposes. However, it is the application > >> registers %g2 to %g4 that are used heaviest. Looking inside the > >> objdump it's easy to see that the uses are not for example saving or > >> restoring, but actually using them without saving the previous value > >> first: > > > > Well, we have to define system and application. System is defined as > > library in Chapter 6, and I don't see the libc there, and is probably > > considered as part of the application. > > No, for example unistd.h is described and even X11. GCC also says that > libraries should be compiled without using the registers: > > http://gcc.gnu.org/onlinedocs/gcc-4.6.0/gcc/SPARC-Options.html#SPARC-Options > > >> 000211e0 <__divdi3>: > >> 211e0: 9d e3 bf a0 save %sp, -96, %sp > >> 211e4: 90 10 00 18 mov %i0, %o0 > >> 211e8: 92 10 00 19 mov %i1, %o1 > >> 211ec: 94 10 00 1a mov %i2, %o2 > >> 211f0: 96 10 00 1b mov %i3, %o3 > >> 211f4: 80 a6 20 00 cmp %i0, 0 > >> 211f8: 06 40 00 10 bl,pn %icc, 21238 <__divdi3+0x58> > >> 211fc: a0 10 20 00 clr %l0 > >> 21200: 80 a2 a0 00 cmp %o2, 0 > >> 21204: 26 40 00 13 bl,a,pn %icc, 21250 <__divdi3+0x70> > >> 21208: a0 38 00 10 xnor %g0, %l0, %l0 > >> 2120c: 7f ff fe ed call 20dc0 <__ashldi3+0x40> > >> 21210: 98 10 20 00 clr %o4 > >> 21214: 84 10 00 08 mov %o0, %g2 > >> > >> ...whoops... > >> > >> 21218: 80 a4 20 00 cmp %l0, 0 > >> 2121c: 02 40 00 04 be,pn %icc, 2122c <__divdi3+0x4c> > >> 21220: 86 10 00 09 mov %o1, %g3 > >> > >> ...whoops... > >> > >> 21224: 86 a0 00 09 subcc %g0, %o1, %g3 > >> 21228: 84 60 00 02 subc %g0, %g2, %g2 > >> 2122c: b2 10 00 03 mov %g3, %i1 > >> 21230: 81 cf e0 08 rett %i7 + 8 > >> 21234: 90 10 00 02 mov %g2, %o0 > >> 21238: 92 a0 00 19 subcc %g0, %i1, %o1 > >> 2123c: 90 60 00 18 subc %g0, %i0, %o0 > >> 21240: 80 a2 a0 00 cmp %o2, 0 > >> 21244: 16 4f ff f2 bge %icc, 2120c <__divdi3+0x2c> > >> 21248: a0 10 3f ff mov -1, %l0 > >> 2124c: a0 38 00 10 xnor %g0, %l0, %l0 > >> 21250: 96 a0 00 0b subcc %g0, %o3, %o3 > >> 21254: 10 6f ff ee b %xcc, 2120c <__divdi3+0x2c> > >> 21258: 94 60 00 0a subc %g0, %o2, %o2 > >> 2125c: 01 00 00 00 nop > >> > >> This is libc from OpenBSD/Sparc64 4.9: > >> $ objdump -d /usr/lib/libc.so.58.0 |grep %g1|wc -l > >> 40562 > >> $ objdump -d /usr/lib/libc.so.58.0 |grep %g2|wc -l > >> 20384 > >> $ objdump -d /usr/lib/libc.so.58.0 |grep %g3|wc -l > >> 10240 > >> $ objdump -d /usr/lib/libc.so.58.0 |grep %g4|wc -l > >> 6606 > >> $ objdump -d /usr/lib/libc.so.58.0 |grep %g5|wc -l > >> 3811 > >> $ objdump -d /usr/lib/libc.so.58.0 |grep %g6|wc -l > >> 4 > >> $ objdump -d /usr/lib/libc.so.58.0 |grep %g7|wc -l > >> 20 > >> > >> Not so great there either. > >> > >> > Anyway, I don't see why keeping TCG_AREG0 inside the TCG generated code > >> > would prevent you to use a register from the %i set for it. > >> > >> The helpers currently use global env register, but %i registers can't > >> be accessed from the next level of function call nesting hierarchy so > >> they can't be used for global env. > >> > > > > That's the current situation yes. Using %i registers for TCG_AREG0 does > > mean you can't use a global env register in the helpers, but it doesn't > > mean that internal TCG code can't use them for TCG_AREG0. > > Exactly. > > > What I am telling you since the beginning is that: > > - I have no objection that we stop using a fixed register in GCC > > generated code (that is completely removing HELPER_CFLAGS). However I > > don't really see the point of doing that, though the Sparc issue might > > be an argument. > > - I do have objection to remove TCG_AREG0 from inside the TCG generated > > code, this register is used for almost every TCG op, and I don't see > > any real argument for not keeping it. > > I'm pretty much open at this point for all alternatives. > So what about getting rid of TCG_AREG0 for GCC generated code only, at least as a first step? So what about the following changes: - Change TCG_AREG0 of all targets to a callee saved register (if possible, e.g. sparc) - Change the prologue of all TCG targets to take env as an argument, and save it into TCG_AREG0. - Change all helpers to explicitly take an env pointer instead of using the fixed register. Note that it also includes softmmu helpers, but the TCG load/store instructions should be kept unchanged. - Remove HELPER_CFLAGS from makefiles when all helpers have been changed. - TCG_AREG0 can then be changed to another register if needed. And later we can do more steps to get a complete removal of TCG_AREG0, including in TCG code, though I still think it is a really bad idea. -- Aurelien Jarno GPG: 1024D/F1BCDB73 aurelien@aurel32.net http://www.aurel32.net ^ permalink raw reply [flat|nested] 31+ messages in thread
* Re: [Qemu-devel] [PATCH RFC 00/11] AREG0 elimination 2011-05-15 13:43 ` Aurelien Jarno @ 2011-05-15 14:02 ` Blue Swirl 2011-05-15 14:06 ` Aurelien Jarno 0 siblings, 1 reply; 31+ messages in thread From: Blue Swirl @ 2011-05-15 14:02 UTC (permalink / raw) To: Aurelien Jarno; +Cc: qemu-devel On Sun, May 15, 2011 at 4:43 PM, Aurelien Jarno <aurelien@aurel32.net> wrote: > On Sun, May 15, 2011 at 04:16:20PM +0300, Blue Swirl wrote: >> On Sun, May 15, 2011 at 3:49 PM, Aurelien Jarno <aurelien@aurel32.net> wrote: >> > On Sun, May 15, 2011 at 03:30:17PM +0300, Blue Swirl wrote: >> >> On Sun, May 15, 2011 at 2:36 PM, Aurelien Jarno <aurelien@aurel32.net> wrote: >> >> > On Sun, May 15, 2011 at 02:12:27PM +0300, Blue Swirl wrote: >> >> >> On Sun, May 15, 2011 at 1:19 PM, Aurelien Jarno <aurelien@aurel32.net> wrote: >> >> >> > I still don't get where are this list of possible changes? Did I miss >> >> >> > something in another thread? >> >> >> >> >> >> I'm referring to the patches I sent. >> >> > >> >> > Ok, patches 1, 2 and 4 to 7 looks ok at a first glance, though I think >> >> > patches 6 and 7 should be done for all hosts or none of them. >> >> >> >> The changes can be done in steps, but of course removing temp_buf from >> >> CPUState would need all targets to be converted first. >> >> >> >> >> > On the TCG generated code, the env structure is used almost for every >> >> >> > op, so it really makes sense to keep it in a register instead of having to >> >> >> > reload the address of env regularly from memory. Given it only affects >> >> >> > TCG generated code, I don't see the point of portability here. >> >> >> >> >> >> For example, maybe the bugs in Sparc glibc could be avoided by using >> >> >> one of %i set of registers (not accessible from helpers) for AREG0 >> >> >> within generated code instead of %g registers which seem to be >> >> >> fragile. >> >> > >> >> > First of all, but it's a different subject, I am not sure there are >> >> > sparc glibc bugs, I'd rather says QEMU mis-uses some register. For >> >> > example the following code is probably wrong: >> >> > >> >> > /* Note: must be synced with dyngen-exec.h */ >> >> > #ifdef CONFIG_SOLARIS >> >> > #define TCG_AREG0 TCG_REG_G2 >> >> > #elif defined(__sparc_v9__) >> >> > #define TCG_AREG0 TCG_REG_G5 >> >> > #else >> >> > #define TCG_AREG0 TCG_REG_G6 >> >> > #endif >> >> > >> >> > __sparc_v9__ can set on the 32-bit ABI, when the compiler targets V8+, >> >> > so the condition is probably wrong there. Secondly the SPARC ABI [1] on >> >> > page 23 that %g5 to %g7 are reserved for system. I don't think QEMU has >> >> > the right to use this registers. >> >> >> >> Yes, but the situation is not so nice. Please see this post for status >> >> as of 2010: >> >> http://article.gmane.org/gmane.comp.emulators.qemu/63610 >> >> >> >> This is from Debian glibc 2.11.2-10: >> >> $ file /lib/libc-2.11.2.so >> >> /lib/libc-2.11.2.so: ELF 32-bit MSB shared object, SPARC32PLUS, >> >> version 1 (SYSV), dynamically linked (uses shared libs), for GNU/Linux >> >> 2.6.18, stripped >> >> $ objdump -d /lib/libc.so.6 |grep %g1|wc -l >> >> 69648 >> >> $ objdump -d /lib/libc.so.6 |grep %g2|wc -l >> >> 37299 >> >> $ objdump -d /lib/libc.so.6 |grep %g3|wc -l >> >> 20635 >> >> $ objdump -d /lib/libc.so.6 |grep %g4|wc -l >> >> 11603 >> >> $ objdump -d /lib/libc.so.6 |grep %g5|wc -l >> >> 448 >> >> $ objdump -d /lib/libc.so.6 |grep %g6|wc -l >> >> 150 >> >> $ objdump -d /lib/libc.so.6 |grep %g7|wc -l >> >> 3052 >> >> >> >> Glibc is compiled for Sparc32plus, so it should only use %g6 and %g7, >> > >> > From the calling convention point of view, sparc32 and sparc32plus are >> > the same ABI, so %g5 is also reserved for system use. >> > >> >> or %g1 and %g5 for scratch purposes. However, it is the application >> >> registers %g2 to %g4 that are used heaviest. Looking inside the >> >> objdump it's easy to see that the uses are not for example saving or >> >> restoring, but actually using them without saving the previous value >> >> first: >> > >> > Well, we have to define system and application. System is defined as >> > library in Chapter 6, and I don't see the libc there, and is probably >> > considered as part of the application. >> >> No, for example unistd.h is described and even X11. GCC also says that >> libraries should be compiled without using the registers: >> >> http://gcc.gnu.org/onlinedocs/gcc-4.6.0/gcc/SPARC-Options.html#SPARC-Options >> >> >> 000211e0 <__divdi3>: >> >> 211e0: 9d e3 bf a0 save %sp, -96, %sp >> >> 211e4: 90 10 00 18 mov %i0, %o0 >> >> 211e8: 92 10 00 19 mov %i1, %o1 >> >> 211ec: 94 10 00 1a mov %i2, %o2 >> >> 211f0: 96 10 00 1b mov %i3, %o3 >> >> 211f4: 80 a6 20 00 cmp %i0, 0 >> >> 211f8: 06 40 00 10 bl,pn %icc, 21238 <__divdi3+0x58> >> >> 211fc: a0 10 20 00 clr %l0 >> >> 21200: 80 a2 a0 00 cmp %o2, 0 >> >> 21204: 26 40 00 13 bl,a,pn %icc, 21250 <__divdi3+0x70> >> >> 21208: a0 38 00 10 xnor %g0, %l0, %l0 >> >> 2120c: 7f ff fe ed call 20dc0 <__ashldi3+0x40> >> >> 21210: 98 10 20 00 clr %o4 >> >> 21214: 84 10 00 08 mov %o0, %g2 >> >> >> >> ...whoops... >> >> >> >> 21218: 80 a4 20 00 cmp %l0, 0 >> >> 2121c: 02 40 00 04 be,pn %icc, 2122c <__divdi3+0x4c> >> >> 21220: 86 10 00 09 mov %o1, %g3 >> >> >> >> ...whoops... >> >> >> >> 21224: 86 a0 00 09 subcc %g0, %o1, %g3 >> >> 21228: 84 60 00 02 subc %g0, %g2, %g2 >> >> 2122c: b2 10 00 03 mov %g3, %i1 >> >> 21230: 81 cf e0 08 rett %i7 + 8 >> >> 21234: 90 10 00 02 mov %g2, %o0 >> >> 21238: 92 a0 00 19 subcc %g0, %i1, %o1 >> >> 2123c: 90 60 00 18 subc %g0, %i0, %o0 >> >> 21240: 80 a2 a0 00 cmp %o2, 0 >> >> 21244: 16 4f ff f2 bge %icc, 2120c <__divdi3+0x2c> >> >> 21248: a0 10 3f ff mov -1, %l0 >> >> 2124c: a0 38 00 10 xnor %g0, %l0, %l0 >> >> 21250: 96 a0 00 0b subcc %g0, %o3, %o3 >> >> 21254: 10 6f ff ee b %xcc, 2120c <__divdi3+0x2c> >> >> 21258: 94 60 00 0a subc %g0, %o2, %o2 >> >> 2125c: 01 00 00 00 nop >> >> >> >> This is libc from OpenBSD/Sparc64 4.9: >> >> $ objdump -d /usr/lib/libc.so.58.0 |grep %g1|wc -l >> >> 40562 >> >> $ objdump -d /usr/lib/libc.so.58.0 |grep %g2|wc -l >> >> 20384 >> >> $ objdump -d /usr/lib/libc.so.58.0 |grep %g3|wc -l >> >> 10240 >> >> $ objdump -d /usr/lib/libc.so.58.0 |grep %g4|wc -l >> >> 6606 >> >> $ objdump -d /usr/lib/libc.so.58.0 |grep %g5|wc -l >> >> 3811 >> >> $ objdump -d /usr/lib/libc.so.58.0 |grep %g6|wc -l >> >> 4 >> >> $ objdump -d /usr/lib/libc.so.58.0 |grep %g7|wc -l >> >> 20 >> >> >> >> Not so great there either. >> >> >> >> > Anyway, I don't see why keeping TCG_AREG0 inside the TCG generated code >> >> > would prevent you to use a register from the %i set for it. >> >> >> >> The helpers currently use global env register, but %i registers can't >> >> be accessed from the next level of function call nesting hierarchy so >> >> they can't be used for global env. >> >> >> > >> > That's the current situation yes. Using %i registers for TCG_AREG0 does >> > mean you can't use a global env register in the helpers, but it doesn't >> > mean that internal TCG code can't use them for TCG_AREG0. >> >> Exactly. >> >> > What I am telling you since the beginning is that: >> > - I have no objection that we stop using a fixed register in GCC >> > generated code (that is completely removing HELPER_CFLAGS). However I >> > don't really see the point of doing that, though the Sparc issue might >> > be an argument. >> > - I do have objection to remove TCG_AREG0 from inside the TCG generated >> > code, this register is used for almost every TCG op, and I don't see >> > any real argument for not keeping it. >> >> I'm pretty much open at this point for all alternatives. >> > > So what about getting rid of TCG_AREG0 for GCC generated code only, at > least as a first step? > > So what about the following changes: > - Change TCG_AREG0 of all targets to a callee saved register (if > possible, e.g. sparc) > - Change the prologue of all TCG targets to take env as an argument, > and save it into TCG_AREG0. This can be the first step. > - Change all helpers to explicitly take an env pointer instead of using > the fixed register. Note that it also includes softmmu helpers, but > the TCG load/store instructions should be kept unchanged. I think this step will lose performance slightly if TCG is not changed. > - Remove HELPER_CFLAGS from makefiles when all helpers have been > changed. This should restore most of the performance loss from previous step, maybe even improve. > - TCG_AREG0 can then be changed to another register if needed. I'd combine this with callee saved register change. Anyway, at this point there should be a lot of flexibility with the register choice. > And later we can do more steps to get a complete removal of TCG_AREG0, > including in TCG code, though I still think it is a really bad idea. Maybe. At this point most of the hard work has been done, so it's possible to make experiments. ^ permalink raw reply [flat|nested] 31+ messages in thread
* Re: [Qemu-devel] [PATCH RFC 00/11] AREG0 elimination 2011-05-15 14:02 ` Blue Swirl @ 2011-05-15 14:06 ` Aurelien Jarno 2011-05-15 14:10 ` Blue Swirl 0 siblings, 1 reply; 31+ messages in thread From: Aurelien Jarno @ 2011-05-15 14:06 UTC (permalink / raw) To: Blue Swirl; +Cc: qemu-devel On Sun, May 15, 2011 at 05:02:37PM +0300, Blue Swirl wrote: > On Sun, May 15, 2011 at 4:43 PM, Aurelien Jarno <aurelien@aurel32.net> wrote: > > On Sun, May 15, 2011 at 04:16:20PM +0300, Blue Swirl wrote: > >> On Sun, May 15, 2011 at 3:49 PM, Aurelien Jarno <aurelien@aurel32.net> wrote: > >> > On Sun, May 15, 2011 at 03:30:17PM +0300, Blue Swirl wrote: > >> >> On Sun, May 15, 2011 at 2:36 PM, Aurelien Jarno <aurelien@aurel32.net> wrote: > >> >> > On Sun, May 15, 2011 at 02:12:27PM +0300, Blue Swirl wrote: > >> >> >> On Sun, May 15, 2011 at 1:19 PM, Aurelien Jarno <aurelien@aurel32.net> wrote: > >> >> >> > I still don't get where are this list of possible changes? Did I miss > >> >> >> > something in another thread? > >> >> >> > >> >> >> I'm referring to the patches I sent. > >> >> > > >> >> > Ok, patches 1, 2 and 4 to 7 looks ok at a first glance, though I think > >> >> > patches 6 and 7 should be done for all hosts or none of them. > >> >> > >> >> The changes can be done in steps, but of course removing temp_buf from > >> >> CPUState would need all targets to be converted first. > >> >> > >> >> >> > On the TCG generated code, the env structure is used almost for every > >> >> >> > op, so it really makes sense to keep it in a register instead of having to > >> >> >> > reload the address of env regularly from memory. Given it only affects > >> >> >> > TCG generated code, I don't see the point of portability here. > >> >> >> > >> >> >> For example, maybe the bugs in Sparc glibc could be avoided by using > >> >> >> one of %i set of registers (not accessible from helpers) for AREG0 > >> >> >> within generated code instead of %g registers which seem to be > >> >> >> fragile. > >> >> > > >> >> > First of all, but it's a different subject, I am not sure there are > >> >> > sparc glibc bugs, I'd rather says QEMU mis-uses some register. For > >> >> > example the following code is probably wrong: > >> >> > > >> >> > /* Note: must be synced with dyngen-exec.h */ > >> >> > #ifdef CONFIG_SOLARIS > >> >> > #define TCG_AREG0 TCG_REG_G2 > >> >> > #elif defined(__sparc_v9__) > >> >> > #define TCG_AREG0 TCG_REG_G5 > >> >> > #else > >> >> > #define TCG_AREG0 TCG_REG_G6 > >> >> > #endif > >> >> > > >> >> > __sparc_v9__ can set on the 32-bit ABI, when the compiler targets V8+, > >> >> > so the condition is probably wrong there. Secondly the SPARC ABI [1] on > >> >> > page 23 that %g5 to %g7 are reserved for system. I don't think QEMU has > >> >> > the right to use this registers. > >> >> > >> >> Yes, but the situation is not so nice. Please see this post for status > >> >> as of 2010: > >> >> http://article.gmane.org/gmane.comp.emulators.qemu/63610 > >> >> > >> >> This is from Debian glibc 2.11.2-10: > >> >> $ file /lib/libc-2.11.2.so > >> >> /lib/libc-2.11.2.so: ELF 32-bit MSB shared object, SPARC32PLUS, > >> >> version 1 (SYSV), dynamically linked (uses shared libs), for GNU/Linux > >> >> 2.6.18, stripped > >> >> $ objdump -d /lib/libc.so.6 |grep %g1|wc -l > >> >> 69648 > >> >> $ objdump -d /lib/libc.so.6 |grep %g2|wc -l > >> >> 37299 > >> >> $ objdump -d /lib/libc.so.6 |grep %g3|wc -l > >> >> 20635 > >> >> $ objdump -d /lib/libc.so.6 |grep %g4|wc -l > >> >> 11603 > >> >> $ objdump -d /lib/libc.so.6 |grep %g5|wc -l > >> >> 448 > >> >> $ objdump -d /lib/libc.so.6 |grep %g6|wc -l > >> >> 150 > >> >> $ objdump -d /lib/libc.so.6 |grep %g7|wc -l > >> >> 3052 > >> >> > >> >> Glibc is compiled for Sparc32plus, so it should only use %g6 and %g7, > >> > > >> > From the calling convention point of view, sparc32 and sparc32plus are > >> > the same ABI, so %g5 is also reserved for system use. > >> > > >> >> or %g1 and %g5 for scratch purposes. However, it is the application > >> >> registers %g2 to %g4 that are used heaviest. Looking inside the > >> >> objdump it's easy to see that the uses are not for example saving or > >> >> restoring, but actually using them without saving the previous value > >> >> first: > >> > > >> > Well, we have to define system and application. System is defined as > >> > library in Chapter 6, and I don't see the libc there, and is probably > >> > considered as part of the application. > >> > >> No, for example unistd.h is described and even X11. GCC also says that > >> libraries should be compiled without using the registers: > >> > >> http://gcc.gnu.org/onlinedocs/gcc-4.6.0/gcc/SPARC-Options.html#SPARC-Options > >> > >> >> 000211e0 <__divdi3>: > >> >> 211e0: 9d e3 bf a0 save %sp, -96, %sp > >> >> 211e4: 90 10 00 18 mov %i0, %o0 > >> >> 211e8: 92 10 00 19 mov %i1, %o1 > >> >> 211ec: 94 10 00 1a mov %i2, %o2 > >> >> 211f0: 96 10 00 1b mov %i3, %o3 > >> >> 211f4: 80 a6 20 00 cmp %i0, 0 > >> >> 211f8: 06 40 00 10 bl,pn %icc, 21238 <__divdi3+0x58> > >> >> 211fc: a0 10 20 00 clr %l0 > >> >> 21200: 80 a2 a0 00 cmp %o2, 0 > >> >> 21204: 26 40 00 13 bl,a,pn %icc, 21250 <__divdi3+0x70> > >> >> 21208: a0 38 00 10 xnor %g0, %l0, %l0 > >> >> 2120c: 7f ff fe ed call 20dc0 <__ashldi3+0x40> > >> >> 21210: 98 10 20 00 clr %o4 > >> >> 21214: 84 10 00 08 mov %o0, %g2 > >> >> > >> >> ...whoops... > >> >> > >> >> 21218: 80 a4 20 00 cmp %l0, 0 > >> >> 2121c: 02 40 00 04 be,pn %icc, 2122c <__divdi3+0x4c> > >> >> 21220: 86 10 00 09 mov %o1, %g3 > >> >> > >> >> ...whoops... > >> >> > >> >> 21224: 86 a0 00 09 subcc %g0, %o1, %g3 > >> >> 21228: 84 60 00 02 subc %g0, %g2, %g2 > >> >> 2122c: b2 10 00 03 mov %g3, %i1 > >> >> 21230: 81 cf e0 08 rett %i7 + 8 > >> >> 21234: 90 10 00 02 mov %g2, %o0 > >> >> 21238: 92 a0 00 19 subcc %g0, %i1, %o1 > >> >> 2123c: 90 60 00 18 subc %g0, %i0, %o0 > >> >> 21240: 80 a2 a0 00 cmp %o2, 0 > >> >> 21244: 16 4f ff f2 bge %icc, 2120c <__divdi3+0x2c> > >> >> 21248: a0 10 3f ff mov -1, %l0 > >> >> 2124c: a0 38 00 10 xnor %g0, %l0, %l0 > >> >> 21250: 96 a0 00 0b subcc %g0, %o3, %o3 > >> >> 21254: 10 6f ff ee b %xcc, 2120c <__divdi3+0x2c> > >> >> 21258: 94 60 00 0a subc %g0, %o2, %o2 > >> >> 2125c: 01 00 00 00 nop > >> >> > >> >> This is libc from OpenBSD/Sparc64 4.9: > >> >> $ objdump -d /usr/lib/libc.so.58.0 |grep %g1|wc -l > >> >> 40562 > >> >> $ objdump -d /usr/lib/libc.so.58.0 |grep %g2|wc -l > >> >> 20384 > >> >> $ objdump -d /usr/lib/libc.so.58.0 |grep %g3|wc -l > >> >> 10240 > >> >> $ objdump -d /usr/lib/libc.so.58.0 |grep %g4|wc -l > >> >> 6606 > >> >> $ objdump -d /usr/lib/libc.so.58.0 |grep %g5|wc -l > >> >> 3811 > >> >> $ objdump -d /usr/lib/libc.so.58.0 |grep %g6|wc -l > >> >> 4 > >> >> $ objdump -d /usr/lib/libc.so.58.0 |grep %g7|wc -l > >> >> 20 > >> >> > >> >> Not so great there either. > >> >> > >> >> > Anyway, I don't see why keeping TCG_AREG0 inside the TCG generated code > >> >> > would prevent you to use a register from the %i set for it. > >> >> > >> >> The helpers currently use global env register, but %i registers can't > >> >> be accessed from the next level of function call nesting hierarchy so > >> >> they can't be used for global env. > >> >> > >> > > >> > That's the current situation yes. Using %i registers for TCG_AREG0 does > >> > mean you can't use a global env register in the helpers, but it doesn't > >> > mean that internal TCG code can't use them for TCG_AREG0. > >> > >> Exactly. > >> > >> > What I am telling you since the beginning is that: > >> > - I have no objection that we stop using a fixed register in GCC > >> > generated code (that is completely removing HELPER_CFLAGS). However I > >> > don't really see the point of doing that, though the Sparc issue might > >> > be an argument. > >> > - I do have objection to remove TCG_AREG0 from inside the TCG generated > >> > code, this register is used for almost every TCG op, and I don't see > >> > any real argument for not keeping it. > >> > >> I'm pretty much open at this point for all alternatives. > >> > > > > So what about getting rid of TCG_AREG0 for GCC generated code only, at > > least as a first step? > > > > So what about the following changes: > > - Change TCG_AREG0 of all targets to a callee saved register (if > > possible, e.g. sparc) > > - Change the prologue of all TCG targets to take env as an argument, > > and save it into TCG_AREG0. > > This can be the first step. > > > - Change all helpers to explicitly take an env pointer instead of using > > the fixed register. Note that it also includes softmmu helpers, but > > the TCG load/store instructions should be kept unchanged. > > I think this step will lose performance slightly if TCG is not changed. Agreed. What do you have in mind by "if TCG is not changed"? > > - Remove HELPER_CFLAGS from makefiles when all helpers have been > > changed. > > This should restore most of the performance loss from previous step, > maybe even improve. > > > - TCG_AREG0 can then be changed to another register if needed. > > I'd combine this with callee saved register change. Anyway, at this > point there should be a lot of flexibility with the register choice. > > > And later we can do more steps to get a complete removal of TCG_AREG0, > > including in TCG code, though I still think it is a really bad idea. > > Maybe. At this point most of the hard work has been done, so it's > possible to make experiments. That's exactly my point. We more or less agree on previous step, not on this one, so we will need to experiment for that. -- Aurelien Jarno GPG: 1024D/F1BCDB73 aurelien@aurel32.net http://www.aurel32.net ^ permalink raw reply [flat|nested] 31+ messages in thread
* Re: [Qemu-devel] [PATCH RFC 00/11] AREG0 elimination 2011-05-15 14:06 ` Aurelien Jarno @ 2011-05-15 14:10 ` Blue Swirl 0 siblings, 0 replies; 31+ messages in thread From: Blue Swirl @ 2011-05-15 14:10 UTC (permalink / raw) To: Aurelien Jarno; +Cc: qemu-devel On Sun, May 15, 2011 at 5:06 PM, Aurelien Jarno <aurelien@aurel32.net> wrote: > On Sun, May 15, 2011 at 05:02:37PM +0300, Blue Swirl wrote: >> On Sun, May 15, 2011 at 4:43 PM, Aurelien Jarno <aurelien@aurel32.net> wrote: >> > On Sun, May 15, 2011 at 04:16:20PM +0300, Blue Swirl wrote: >> >> On Sun, May 15, 2011 at 3:49 PM, Aurelien Jarno <aurelien@aurel32.net> wrote: >> >> > On Sun, May 15, 2011 at 03:30:17PM +0300, Blue Swirl wrote: >> >> >> On Sun, May 15, 2011 at 2:36 PM, Aurelien Jarno <aurelien@aurel32.net> wrote: >> >> >> > On Sun, May 15, 2011 at 02:12:27PM +0300, Blue Swirl wrote: >> >> >> >> On Sun, May 15, 2011 at 1:19 PM, Aurelien Jarno <aurelien@aurel32.net> wrote: >> >> >> >> > I still don't get where are this list of possible changes? Did I miss >> >> >> >> > something in another thread? >> >> >> >> >> >> >> >> I'm referring to the patches I sent. >> >> >> > >> >> >> > Ok, patches 1, 2 and 4 to 7 looks ok at a first glance, though I think >> >> >> > patches 6 and 7 should be done for all hosts or none of them. >> >> >> >> >> >> The changes can be done in steps, but of course removing temp_buf from >> >> >> CPUState would need all targets to be converted first. >> >> >> >> >> >> >> > On the TCG generated code, the env structure is used almost for every >> >> >> >> > op, so it really makes sense to keep it in a register instead of having to >> >> >> >> > reload the address of env regularly from memory. Given it only affects >> >> >> >> > TCG generated code, I don't see the point of portability here. >> >> >> >> >> >> >> >> For example, maybe the bugs in Sparc glibc could be avoided by using >> >> >> >> one of %i set of registers (not accessible from helpers) for AREG0 >> >> >> >> within generated code instead of %g registers which seem to be >> >> >> >> fragile. >> >> >> > >> >> >> > First of all, but it's a different subject, I am not sure there are >> >> >> > sparc glibc bugs, I'd rather says QEMU mis-uses some register. For >> >> >> > example the following code is probably wrong: >> >> >> > >> >> >> > /* Note: must be synced with dyngen-exec.h */ >> >> >> > #ifdef CONFIG_SOLARIS >> >> >> > #define TCG_AREG0 TCG_REG_G2 >> >> >> > #elif defined(__sparc_v9__) >> >> >> > #define TCG_AREG0 TCG_REG_G5 >> >> >> > #else >> >> >> > #define TCG_AREG0 TCG_REG_G6 >> >> >> > #endif >> >> >> > >> >> >> > __sparc_v9__ can set on the 32-bit ABI, when the compiler targets V8+, >> >> >> > so the condition is probably wrong there. Secondly the SPARC ABI [1] on >> >> >> > page 23 that %g5 to %g7 are reserved for system. I don't think QEMU has >> >> >> > the right to use this registers. >> >> >> >> >> >> Yes, but the situation is not so nice. Please see this post for status >> >> >> as of 2010: >> >> >> http://article.gmane.org/gmane.comp.emulators.qemu/63610 >> >> >> >> >> >> This is from Debian glibc 2.11.2-10: >> >> >> $ file /lib/libc-2.11.2.so >> >> >> /lib/libc-2.11.2.so: ELF 32-bit MSB shared object, SPARC32PLUS, >> >> >> version 1 (SYSV), dynamically linked (uses shared libs), for GNU/Linux >> >> >> 2.6.18, stripped >> >> >> $ objdump -d /lib/libc.so.6 |grep %g1|wc -l >> >> >> 69648 >> >> >> $ objdump -d /lib/libc.so.6 |grep %g2|wc -l >> >> >> 37299 >> >> >> $ objdump -d /lib/libc.so.6 |grep %g3|wc -l >> >> >> 20635 >> >> >> $ objdump -d /lib/libc.so.6 |grep %g4|wc -l >> >> >> 11603 >> >> >> $ objdump -d /lib/libc.so.6 |grep %g5|wc -l >> >> >> 448 >> >> >> $ objdump -d /lib/libc.so.6 |grep %g6|wc -l >> >> >> 150 >> >> >> $ objdump -d /lib/libc.so.6 |grep %g7|wc -l >> >> >> 3052 >> >> >> >> >> >> Glibc is compiled for Sparc32plus, so it should only use %g6 and %g7, >> >> > >> >> > From the calling convention point of view, sparc32 and sparc32plus are >> >> > the same ABI, so %g5 is also reserved for system use. >> >> > >> >> >> or %g1 and %g5 for scratch purposes. However, it is the application >> >> >> registers %g2 to %g4 that are used heaviest. Looking inside the >> >> >> objdump it's easy to see that the uses are not for example saving or >> >> >> restoring, but actually using them without saving the previous value >> >> >> first: >> >> > >> >> > Well, we have to define system and application. System is defined as >> >> > library in Chapter 6, and I don't see the libc there, and is probably >> >> > considered as part of the application. >> >> >> >> No, for example unistd.h is described and even X11. GCC also says that >> >> libraries should be compiled without using the registers: >> >> >> >> http://gcc.gnu.org/onlinedocs/gcc-4.6.0/gcc/SPARC-Options.html#SPARC-Options >> >> >> >> >> 000211e0 <__divdi3>: >> >> >> 211e0: 9d e3 bf a0 save %sp, -96, %sp >> >> >> 211e4: 90 10 00 18 mov %i0, %o0 >> >> >> 211e8: 92 10 00 19 mov %i1, %o1 >> >> >> 211ec: 94 10 00 1a mov %i2, %o2 >> >> >> 211f0: 96 10 00 1b mov %i3, %o3 >> >> >> 211f4: 80 a6 20 00 cmp %i0, 0 >> >> >> 211f8: 06 40 00 10 bl,pn %icc, 21238 <__divdi3+0x58> >> >> >> 211fc: a0 10 20 00 clr %l0 >> >> >> 21200: 80 a2 a0 00 cmp %o2, 0 >> >> >> 21204: 26 40 00 13 bl,a,pn %icc, 21250 <__divdi3+0x70> >> >> >> 21208: a0 38 00 10 xnor %g0, %l0, %l0 >> >> >> 2120c: 7f ff fe ed call 20dc0 <__ashldi3+0x40> >> >> >> 21210: 98 10 20 00 clr %o4 >> >> >> 21214: 84 10 00 08 mov %o0, %g2 >> >> >> >> >> >> ...whoops... >> >> >> >> >> >> 21218: 80 a4 20 00 cmp %l0, 0 >> >> >> 2121c: 02 40 00 04 be,pn %icc, 2122c <__divdi3+0x4c> >> >> >> 21220: 86 10 00 09 mov %o1, %g3 >> >> >> >> >> >> ...whoops... >> >> >> >> >> >> 21224: 86 a0 00 09 subcc %g0, %o1, %g3 >> >> >> 21228: 84 60 00 02 subc %g0, %g2, %g2 >> >> >> 2122c: b2 10 00 03 mov %g3, %i1 >> >> >> 21230: 81 cf e0 08 rett %i7 + 8 >> >> >> 21234: 90 10 00 02 mov %g2, %o0 >> >> >> 21238: 92 a0 00 19 subcc %g0, %i1, %o1 >> >> >> 2123c: 90 60 00 18 subc %g0, %i0, %o0 >> >> >> 21240: 80 a2 a0 00 cmp %o2, 0 >> >> >> 21244: 16 4f ff f2 bge %icc, 2120c <__divdi3+0x2c> >> >> >> 21248: a0 10 3f ff mov -1, %l0 >> >> >> 2124c: a0 38 00 10 xnor %g0, %l0, %l0 >> >> >> 21250: 96 a0 00 0b subcc %g0, %o3, %o3 >> >> >> 21254: 10 6f ff ee b %xcc, 2120c <__divdi3+0x2c> >> >> >> 21258: 94 60 00 0a subc %g0, %o2, %o2 >> >> >> 2125c: 01 00 00 00 nop >> >> >> >> >> >> This is libc from OpenBSD/Sparc64 4.9: >> >> >> $ objdump -d /usr/lib/libc.so.58.0 |grep %g1|wc -l >> >> >> 40562 >> >> >> $ objdump -d /usr/lib/libc.so.58.0 |grep %g2|wc -l >> >> >> 20384 >> >> >> $ objdump -d /usr/lib/libc.so.58.0 |grep %g3|wc -l >> >> >> 10240 >> >> >> $ objdump -d /usr/lib/libc.so.58.0 |grep %g4|wc -l >> >> >> 6606 >> >> >> $ objdump -d /usr/lib/libc.so.58.0 |grep %g5|wc -l >> >> >> 3811 >> >> >> $ objdump -d /usr/lib/libc.so.58.0 |grep %g6|wc -l >> >> >> 4 >> >> >> $ objdump -d /usr/lib/libc.so.58.0 |grep %g7|wc -l >> >> >> 20 >> >> >> >> >> >> Not so great there either. >> >> >> >> >> >> > Anyway, I don't see why keeping TCG_AREG0 inside the TCG generated code >> >> >> > would prevent you to use a register from the %i set for it. >> >> >> >> >> >> The helpers currently use global env register, but %i registers can't >> >> >> be accessed from the next level of function call nesting hierarchy so >> >> >> they can't be used for global env. >> >> >> >> >> > >> >> > That's the current situation yes. Using %i registers for TCG_AREG0 does >> >> > mean you can't use a global env register in the helpers, but it doesn't >> >> > mean that internal TCG code can't use them for TCG_AREG0. >> >> >> >> Exactly. >> >> >> >> > What I am telling you since the beginning is that: >> >> > - I have no objection that we stop using a fixed register in GCC >> >> > generated code (that is completely removing HELPER_CFLAGS). However I >> >> > don't really see the point of doing that, though the Sparc issue might >> >> > be an argument. >> >> > - I do have objection to remove TCG_AREG0 from inside the TCG generated >> >> > code, this register is used for almost every TCG op, and I don't see >> >> > any real argument for not keeping it. >> >> >> >> I'm pretty much open at this point for all alternatives. >> >> >> > >> > So what about getting rid of TCG_AREG0 for GCC generated code only, at >> > least as a first step? >> > >> > So what about the following changes: >> > - Change TCG_AREG0 of all targets to a callee saved register (if >> > possible, e.g. sparc) >> > - Change the prologue of all TCG targets to take env as an argument, >> > and save it into TCG_AREG0. >> >> This can be the first step. >> >> > - Change all helpers to explicitly take an env pointer instead of using >> > the fixed register. Note that it also includes softmmu helpers, but >> > the TCG load/store instructions should be kept unchanged. >> >> I think this step will lose performance slightly if TCG is not changed. > > Agreed. What do you have in mind by "if TCG is not changed"? If the register is still fixed in generated code, it would be easily available for the helpers as well in for of global env. Not taking advantage of this should be a minor loss, which is offset with one register more for GCC at the next step. >> > - Remove HELPER_CFLAGS from makefiles when all helpers have been >> > changed. >> >> This should restore most of the performance loss from previous step, >> maybe even improve. >> >> > - TCG_AREG0 can then be changed to another register if needed. >> >> I'd combine this with callee saved register change. Anyway, at this >> point there should be a lot of flexibility with the register choice. >> >> > And later we can do more steps to get a complete removal of TCG_AREG0, >> > including in TCG code, though I still think it is a really bad idea. >> >> Maybe. At this point most of the hard work has been done, so it's >> possible to make experiments. > > That's exactly my point. We more or less agree on previous step, not on > this one, so we will need to experiment for that. OK. Also, most of the cleanup benefits will have been achieved by this point. ^ permalink raw reply [flat|nested] 31+ messages in thread
* Re: [Qemu-devel] [PATCH RFC 00/11] AREG0 elimination 2011-05-15 7:15 ` Blue Swirl 2011-05-15 9:24 ` Aurelien Jarno @ 2011-05-15 9:27 ` Laurent Desnogues 2011-05-15 9:49 ` Aurelien Jarno 2011-05-15 11:03 ` Blue Swirl 1 sibling, 2 replies; 31+ messages in thread From: Laurent Desnogues @ 2011-05-15 9:27 UTC (permalink / raw) To: Blue Swirl; +Cc: qemu-devel, Aurelien Jarno On Sun, May 15, 2011 at 9:15 AM, Blue Swirl <blauwirbel@gmail.com> wrote: > On Sun, May 15, 2011 at 1:04 AM, Aurelien Jarno <aurelien@aurel32.net> wrote: >> On Sun, May 15, 2011 at 12:52:35AM +0300, Blue Swirl wrote: >>> On Sun, May 15, 2011 at 12:16 AM, Aurelien Jarno <aurelien@aurel32.net> wrote: >>> > On Sat, May 14, 2011 at 10:35:20PM +0300, Blue Swirl wrote: [...] >>> > The env register is used very often (basically for every load/store, but >>> > also a lot of helpers), so it makes sense to reserve a register for it. >>> > >>> > For what I understand from your patch series, you prefer to pass this >>> > register explicitly to TCG functions. This basically means this TCG >>> > global will be loaded to host register as soon as it is used, but also >>> > regularly, as globals are saved back to their canonical location before >>> > an helper or a load/store. >>> > >>> > So it seems that this patch series will just allowing the "env register" >>> > to change over time, though it will not spare one more register for the >>> > TCG code, and it will emit longer TCG code to regularly reload the env >>> > global into a host register. >>> >>> But there will be one more register available in some cases. In other >> >> Inside the TCG code, it will basically happens very rarely, given >> load/store are really the most used instructions, and they need to load >> the env register. > > Not exactly, from a sample run with -d op_opt: > $ egrep -v -e '^$' -v -e 'OP after' -v -e ' end' -v -e 'Search PC' > /tmp/qemu.log | awk '{print $1}' | sort | uniq -c|sort -rn > 1673966 movi_i32 > 653931 ld_i32 > 607432 mov_i32 > 428684 st_i32 > 326878 movi_i64 > 308626 add_i32 > 283186 call > 256817 exit_tb > 207232 nopn > 189388 goto_tb > 122398 and_i32 > 117997 shr_i32 > 89107 qemu_ld32 > 82926 set_label > 82713 brcond_i32 > 67169 qemu_st32 > 55109 or_i32 > 46536 ext32u_i64 > 44288 xor_i32 > 38103 sub_i32 > 26361 shl_i32 > 23218 shl_i64 > 23218 qemu_st64 > 23218 or_i64 > 20474 shr_i64 > 20445 qemu_ld64 > 11161 qemu_ld8u > 10409 qemu_st8 > 5013 qemu_ld16u > 3795 qemu_st16 > 2776 qemu_ld8s > 1915 sar_i32 > 1414 qemu_ld16s > 839 not_i32 > 579 setcond_i32 > 213 br > 42 ext32s_i64 > 30 mul_i64 Unless I missed something, this doesn't show the usage of ld/st per TB, which is what Aurélien was looking for if I understood correctly. All I can say is that you had at most 256817 TB's and 234507 qemu_ld/st, so about one per TB. Anyway I must be thick, because I fail to see how generated code could access guest CPU registers without a pointer to the CPU env :-) IIUC the SPARC translator uses ld_i32/st_i32 mainly for accessing the guest CPU registers, which due to register windows is held in a dedicated global temp. Is that correct? If so this is kind of hiding accesses to the CPU env; all other targets read/write registers by using CPU env (through the use global temps in most cases). So I think most (if not almost all) TB will need a pointer to CPU env, which is why I think Aurélien's proposal to keep a dedicated register that'd be loaded in the prologue is the only way to not degrade performance of the generated code (I'd add that this dedicated register should be the one defined by the ABI as holding the first parameter value, if that's possible; I'm afraid this is not necessarily a good idea). Laurent ^ permalink raw reply [flat|nested] 31+ messages in thread
* Re: [Qemu-devel] [PATCH RFC 00/11] AREG0 elimination 2011-05-15 9:27 ` Laurent Desnogues @ 2011-05-15 9:49 ` Aurelien Jarno 2011-05-15 11:03 ` Blue Swirl 1 sibling, 0 replies; 31+ messages in thread From: Aurelien Jarno @ 2011-05-15 9:49 UTC (permalink / raw) To: Laurent Desnogues; +Cc: Blue Swirl, qemu-devel On Sun, May 15, 2011 at 11:27:17AM +0200, Laurent Desnogues wrote: > On Sun, May 15, 2011 at 9:15 AM, Blue Swirl <blauwirbel@gmail.com> wrote: > > On Sun, May 15, 2011 at 1:04 AM, Aurelien Jarno <aurelien@aurel32.net> wrote: > >> On Sun, May 15, 2011 at 12:52:35AM +0300, Blue Swirl wrote: > >>> On Sun, May 15, 2011 at 12:16 AM, Aurelien Jarno <aurelien@aurel32.net> wrote: > >>> > On Sat, May 14, 2011 at 10:35:20PM +0300, Blue Swirl wrote: > [...] > >>> > The env register is used very often (basically for every load/store, but > >>> > also a lot of helpers), so it makes sense to reserve a register for it. > >>> > > >>> > For what I understand from your patch series, you prefer to pass this > >>> > register explicitly to TCG functions. This basically means this TCG > >>> > global will be loaded to host register as soon as it is used, but also > >>> > regularly, as globals are saved back to their canonical location before > >>> > an helper or a load/store. > >>> > > >>> > So it seems that this patch series will just allowing the "env register" > >>> > to change over time, though it will not spare one more register for the > >>> > TCG code, and it will emit longer TCG code to regularly reload the env > >>> > global into a host register. > >>> > >>> But there will be one more register available in some cases. In other > >> > >> Inside the TCG code, it will basically happens very rarely, given > >> load/store are really the most used instructions, and they need to load > >> the env register. > > > > Not exactly, from a sample run with -d op_opt: > > $ egrep -v -e '^$' -v -e 'OP after' -v -e ' end' -v -e 'Search PC' > > /tmp/qemu.log | awk '{print $1}' | sort | uniq -c|sort -rn > > 1673966 movi_i32 > > 653931 ld_i32 > > 607432 mov_i32 > > 428684 st_i32 > > 326878 movi_i64 > > 308626 add_i32 > > 283186 call > > 256817 exit_tb > > 207232 nopn > > 189388 goto_tb > > 122398 and_i32 > > 117997 shr_i32 > > 89107 qemu_ld32 > > 82926 set_label > > 82713 brcond_i32 > > 67169 qemu_st32 > > 55109 or_i32 > > 46536 ext32u_i64 > > 44288 xor_i32 > > 38103 sub_i32 > > 26361 shl_i32 > > 23218 shl_i64 > > 23218 qemu_st64 > > 23218 or_i64 > > 20474 shr_i64 > > 20445 qemu_ld64 > > 11161 qemu_ld8u > > 10409 qemu_st8 > > 5013 qemu_ld16u > > 3795 qemu_st16 > > 2776 qemu_ld8s > > 1915 sar_i32 > > 1414 qemu_ld16s > > 839 not_i32 > > 579 setcond_i32 > > 213 br > > 42 ext32s_i64 > > 30 mul_i64 > > Unless I missed something, this doesn't show the usage of > ld/st per TB, which is what Aurélien was looking for if I > understood correctly. All I can say is that you had at > most 256817 TB's and 234507 qemu_ld/st, so about one per > TB. > > Anyway I must be thick, because I fail to see how > generated code could access guest CPU registers without a > pointer to the CPU env :-) > > IIUC the SPARC translator uses ld_i32/st_i32 mainly for > accessing the guest CPU registers, which due to register > windows is held in a dedicated global temp. Is that > correct? If so this is kind of hiding accesses to the > CPU env; all other targets read/write registers by using > CPU env (through the use global temps in most cases). > > So I think most (if not almost all) TB will need a pointer > to CPU env, which is why I think Aurélien's proposal to > keep a dedicated register that'd be loaded in the prologue > is the only way to not degrade performance of the > generated code (I'd add that this dedicated register > should be the one defined by the ABI as holding the first > parameter value, if that's possible; I'm afraid this is > not necessarily a good idea). > I also thought about making it the one defined by the ABI as holding the first parameter value, to avoid some register moves, but I am also afraid this is not necessarily a good idea. It is usually a caller saved register (because the caller is anyway overriding its value), which means that it has to be saved before calling every helper, and it probably means more register move in that case. -- Aurelien Jarno GPG: 1024D/F1BCDB73 aurelien@aurel32.net http://www.aurel32.net ^ permalink raw reply [flat|nested] 31+ messages in thread
* Re: [Qemu-devel] [PATCH RFC 00/11] AREG0 elimination 2011-05-15 9:27 ` Laurent Desnogues 2011-05-15 9:49 ` Aurelien Jarno @ 2011-05-15 11:03 ` Blue Swirl 2011-05-15 11:14 ` Aurelien Jarno 1 sibling, 1 reply; 31+ messages in thread From: Blue Swirl @ 2011-05-15 11:03 UTC (permalink / raw) To: Laurent Desnogues; +Cc: qemu-devel, Aurelien Jarno On Sun, May 15, 2011 at 12:27 PM, Laurent Desnogues <laurent.desnogues@gmail.com> wrote: > On Sun, May 15, 2011 at 9:15 AM, Blue Swirl <blauwirbel@gmail.com> wrote: >> On Sun, May 15, 2011 at 1:04 AM, Aurelien Jarno <aurelien@aurel32.net> wrote: >>> On Sun, May 15, 2011 at 12:52:35AM +0300, Blue Swirl wrote: >>>> On Sun, May 15, 2011 at 12:16 AM, Aurelien Jarno <aurelien@aurel32.net> wrote: >>>> > On Sat, May 14, 2011 at 10:35:20PM +0300, Blue Swirl wrote: > [...] >>>> > The env register is used very often (basically for every load/store, but >>>> > also a lot of helpers), so it makes sense to reserve a register for it. >>>> > >>>> > For what I understand from your patch series, you prefer to pass this >>>> > register explicitly to TCG functions. This basically means this TCG >>>> > global will be loaded to host register as soon as it is used, but also >>>> > regularly, as globals are saved back to their canonical location before >>>> > an helper or a load/store. >>>> > >>>> > So it seems that this patch series will just allowing the "env register" >>>> > to change over time, though it will not spare one more register for the >>>> > TCG code, and it will emit longer TCG code to regularly reload the env >>>> > global into a host register. >>>> >>>> But there will be one more register available in some cases. In other >>> >>> Inside the TCG code, it will basically happens very rarely, given >>> load/store are really the most used instructions, and they need to load >>> the env register. >> >> Not exactly, from a sample run with -d op_opt: >> $ egrep -v -e '^$' -v -e 'OP after' -v -e ' end' -v -e 'Search PC' >> /tmp/qemu.log | awk '{print $1}' | sort | uniq -c|sort -rn >> 1673966 movi_i32 >> 653931 ld_i32 >> 607432 mov_i32 >> 428684 st_i32 >> 326878 movi_i64 >> 308626 add_i32 >> 283186 call >> 256817 exit_tb >> 207232 nopn >> 189388 goto_tb >> 122398 and_i32 >> 117997 shr_i32 >> 89107 qemu_ld32 >> 82926 set_label >> 82713 brcond_i32 >> 67169 qemu_st32 >> 55109 or_i32 >> 46536 ext32u_i64 >> 44288 xor_i32 >> 38103 sub_i32 >> 26361 shl_i32 >> 23218 shl_i64 >> 23218 qemu_st64 >> 23218 or_i64 >> 20474 shr_i64 >> 20445 qemu_ld64 >> 11161 qemu_ld8u >> 10409 qemu_st8 >> 5013 qemu_ld16u >> 3795 qemu_st16 >> 2776 qemu_ld8s >> 1915 sar_i32 >> 1414 qemu_ld16s >> 839 not_i32 >> 579 setcond_i32 >> 213 br >> 42 ext32s_i64 >> 30 mul_i64 > > Unless I missed something, this doesn't show the usage of > ld/st per TB, which is what Aurélien was looking for if I > understood correctly. All I can say is that you had at > most 256817 TB's and 234507 qemu_ld/st, so about one per > TB. The question was ratio of loads/stores to other instructions. The statistics are not per TB. There were about 174880 TBs. > Anyway I must be thick, because I fail to see how > generated code could access guest CPU registers without a > pointer to the CPU env :-) > > IIUC the SPARC translator uses ld_i32/st_i32 mainly for > accessing the guest CPU registers, which due to register > windows is held in a dedicated global temp. Is that > correct? If so this is kind of hiding accesses to the > CPU env; all other targets read/write registers by using > CPU env (through the use global temps in most cases). > > So I think most (if not almost all) TB will need a pointer > to CPU env, which is why I think Aurélien's proposal to > keep a dedicated register that'd be loaded in the prologue > is the only way to not degrade performance of the > generated code (I'd add that this dedicated register > should be the one defined by the ABI as holding the first > parameter value, if that's possible; I'm afraid this is > not necessarily a good idea). CPU env will be used, but the register could be made available for other uses too. ^ permalink raw reply [flat|nested] 31+ messages in thread
* Re: [Qemu-devel] [PATCH RFC 00/11] AREG0 elimination 2011-05-15 11:03 ` Blue Swirl @ 2011-05-15 11:14 ` Aurelien Jarno 2011-05-15 11:33 ` Blue Swirl 0 siblings, 1 reply; 31+ messages in thread From: Aurelien Jarno @ 2011-05-15 11:14 UTC (permalink / raw) To: Blue Swirl; +Cc: Laurent Desnogues, qemu-devel On Sun, May 15, 2011 at 02:03:40PM +0300, Blue Swirl wrote: > On Sun, May 15, 2011 at 12:27 PM, Laurent Desnogues > <laurent.desnogues@gmail.com> wrote: > > On Sun, May 15, 2011 at 9:15 AM, Blue Swirl <blauwirbel@gmail.com> wrote: > >> On Sun, May 15, 2011 at 1:04 AM, Aurelien Jarno <aurelien@aurel32.net> wrote: > >>> On Sun, May 15, 2011 at 12:52:35AM +0300, Blue Swirl wrote: > >>>> On Sun, May 15, 2011 at 12:16 AM, Aurelien Jarno <aurelien@aurel32.net> wrote: > >>>> > On Sat, May 14, 2011 at 10:35:20PM +0300, Blue Swirl wrote: > > [...] > >>>> > The env register is used very often (basically for every load/store, but > >>>> > also a lot of helpers), so it makes sense to reserve a register for it. > >>>> > > >>>> > For what I understand from your patch series, you prefer to pass this > >>>> > register explicitly to TCG functions. This basically means this TCG > >>>> > global will be loaded to host register as soon as it is used, but also > >>>> > regularly, as globals are saved back to their canonical location before > >>>> > an helper or a load/store. > >>>> > > >>>> > So it seems that this patch series will just allowing the "env register" > >>>> > to change over time, though it will not spare one more register for the > >>>> > TCG code, and it will emit longer TCG code to regularly reload the env > >>>> > global into a host register. > >>>> > >>>> But there will be one more register available in some cases. In other > >>> > >>> Inside the TCG code, it will basically happens very rarely, given > >>> load/store are really the most used instructions, and they need to load > >>> the env register. > >> > >> Not exactly, from a sample run with -d op_opt: > >> $ egrep -v -e '^$' -v -e 'OP after' -v -e ' end' -v -e 'Search PC' > >> /tmp/qemu.log | awk '{print $1}' | sort | uniq -c|sort -rn > >> 1673966 movi_i32 > >> 653931 ld_i32 > >> 607432 mov_i32 > >> 428684 st_i32 > >> 326878 movi_i64 > >> 308626 add_i32 > >> 283186 call > >> 256817 exit_tb > >> 207232 nopn > >> 189388 goto_tb > >> 122398 and_i32 > >> 117997 shr_i32 > >> 89107 qemu_ld32 > >> 82926 set_label > >> 82713 brcond_i32 > >> 67169 qemu_st32 > >> 55109 or_i32 > >> 46536 ext32u_i64 > >> 44288 xor_i32 > >> 38103 sub_i32 > >> 26361 shl_i32 > >> 23218 shl_i64 > >> 23218 qemu_st64 > >> 23218 or_i64 > >> 20474 shr_i64 > >> 20445 qemu_ld64 > >> 11161 qemu_ld8u > >> 10409 qemu_st8 > >> 5013 qemu_ld16u > >> 3795 qemu_st16 > >> 2776 qemu_ld8s > >> 1915 sar_i32 > >> 1414 qemu_ld16s > >> 839 not_i32 > >> 579 setcond_i32 > >> 213 br > >> 42 ext32s_i64 > >> 30 mul_i64 > > > > Unless I missed something, this doesn't show the usage of > > ld/st per TB, which is what Aurélien was looking for if I > > understood correctly. All I can say is that you had at > > most 256817 TB's and 234507 qemu_ld/st, so about one per > > TB. > > The question was ratio of loads/stores to other instructions. The > statistics are not per TB. There were about 174880 TBs. > > > Anyway I must be thick, because I fail to see how > > generated code could access guest CPU registers without a > > pointer to the CPU env :-) > > > > IIUC the SPARC translator uses ld_i32/st_i32 mainly for > > accessing the guest CPU registers, which due to register > > windows is held in a dedicated global temp. Is that > > correct? If so this is kind of hiding accesses to the > > CPU env; all other targets read/write registers by using > > CPU env (through the use global temps in most cases). > > > > So I think most (if not almost all) TB will need a pointer > > to CPU env, which is why I think Aurélien's proposal to > > keep a dedicated register that'd be loaded in the prologue > > is the only way to not degrade performance of the > > generated code (I'd add that this dedicated register > > should be the one defined by the ABI as holding the first > > parameter value, if that's possible; I'm afraid this is > > not necessarily a good idea). > > CPU env will be used, but the register could be made available for > other uses too. > What other uses? It is needed for almost every TB, so there is not other possible use. Let's take an example, a simple TB from mips on x86_64: | IN: start_kernel | 0x80510958: jal 0x8051d50c | 0x8051095c: nop | | OP after liveness analysis: | movi_i32 ra,$0x80510960 | movi_i32 PC,$0x8051d50c | exit_tb $0x0 | end | OUT: [size=28] | 0x40f22960: mov $0x80510960,%ebp | 0x40f22965: mov %ebp,0x7c(%r14) | 0x40f22969: mov $0x8051d50c,%ebp | 0x40f2296e: mov %ebp,0x80(%r14) | 0x40f22975: xor %eax,%eax | 0x40f22977: jmpq 0x115042e x86_64 uses r14 as TCG_AREG0. Despite the instructions being quite simple (only 2 movi_i32), the resulting code makes 2 access to env to save the two registers. Having to reload the env pointer each time to a register would clearly increase the size of this TB. Another question, imagine env is not in a register anymore. What kind of code do you plan to use to load the env pointer into a register? This is something we need to know to evaluate the cost of not having a fixed register for env withing the TCG code. -- Aurelien Jarno GPG: 1024D/F1BCDB73 aurelien@aurel32.net http://www.aurel32.net ^ permalink raw reply [flat|nested] 31+ messages in thread
* Re: [Qemu-devel] [PATCH RFC 00/11] AREG0 elimination 2011-05-15 11:14 ` Aurelien Jarno @ 2011-05-15 11:33 ` Blue Swirl 2011-05-15 11:37 ` Aurelien Jarno 2011-05-15 12:14 ` Laurent Desnogues 0 siblings, 2 replies; 31+ messages in thread From: Blue Swirl @ 2011-05-15 11:33 UTC (permalink / raw) To: Aurelien Jarno; +Cc: Laurent Desnogues, qemu-devel On Sun, May 15, 2011 at 2:14 PM, Aurelien Jarno <aurelien@aurel32.net> wrote: > On Sun, May 15, 2011 at 02:03:40PM +0300, Blue Swirl wrote: >> On Sun, May 15, 2011 at 12:27 PM, Laurent Desnogues >> <laurent.desnogues@gmail.com> wrote: >> > On Sun, May 15, 2011 at 9:15 AM, Blue Swirl <blauwirbel@gmail.com> wrote: >> >> On Sun, May 15, 2011 at 1:04 AM, Aurelien Jarno <aurelien@aurel32.net> wrote: >> >>> On Sun, May 15, 2011 at 12:52:35AM +0300, Blue Swirl wrote: >> >>>> On Sun, May 15, 2011 at 12:16 AM, Aurelien Jarno <aurelien@aurel32.net> wrote: >> >>>> > On Sat, May 14, 2011 at 10:35:20PM +0300, Blue Swirl wrote: >> > [...] >> >>>> > The env register is used very often (basically for every load/store, but >> >>>> > also a lot of helpers), so it makes sense to reserve a register for it. >> >>>> > >> >>>> > For what I understand from your patch series, you prefer to pass this >> >>>> > register explicitly to TCG functions. This basically means this TCG >> >>>> > global will be loaded to host register as soon as it is used, but also >> >>>> > regularly, as globals are saved back to their canonical location before >> >>>> > an helper or a load/store. >> >>>> > >> >>>> > So it seems that this patch series will just allowing the "env register" >> >>>> > to change over time, though it will not spare one more register for the >> >>>> > TCG code, and it will emit longer TCG code to regularly reload the env >> >>>> > global into a host register. >> >>>> >> >>>> But there will be one more register available in some cases. In other >> >>> >> >>> Inside the TCG code, it will basically happens very rarely, given >> >>> load/store are really the most used instructions, and they need to load >> >>> the env register. >> >> >> >> Not exactly, from a sample run with -d op_opt: >> >> $ egrep -v -e '^$' -v -e 'OP after' -v -e ' end' -v -e 'Search PC' >> >> /tmp/qemu.log | awk '{print $1}' | sort | uniq -c|sort -rn >> >> 1673966 movi_i32 >> >> 653931 ld_i32 >> >> 607432 mov_i32 >> >> 428684 st_i32 >> >> 326878 movi_i64 >> >> 308626 add_i32 >> >> 283186 call >> >> 256817 exit_tb >> >> 207232 nopn >> >> 189388 goto_tb >> >> 122398 and_i32 >> >> 117997 shr_i32 >> >> 89107 qemu_ld32 >> >> 82926 set_label >> >> 82713 brcond_i32 >> >> 67169 qemu_st32 >> >> 55109 or_i32 >> >> 46536 ext32u_i64 >> >> 44288 xor_i32 >> >> 38103 sub_i32 >> >> 26361 shl_i32 >> >> 23218 shl_i64 >> >> 23218 qemu_st64 >> >> 23218 or_i64 >> >> 20474 shr_i64 >> >> 20445 qemu_ld64 >> >> 11161 qemu_ld8u >> >> 10409 qemu_st8 >> >> 5013 qemu_ld16u >> >> 3795 qemu_st16 >> >> 2776 qemu_ld8s >> >> 1915 sar_i32 >> >> 1414 qemu_ld16s >> >> 839 not_i32 >> >> 579 setcond_i32 >> >> 213 br >> >> 42 ext32s_i64 >> >> 30 mul_i64 >> > >> > Unless I missed something, this doesn't show the usage of >> > ld/st per TB, which is what Aurélien was looking for if I >> > understood correctly. All I can say is that you had at >> > most 256817 TB's and 234507 qemu_ld/st, so about one per >> > TB. >> >> The question was ratio of loads/stores to other instructions. The >> statistics are not per TB. There were about 174880 TBs. >> >> > Anyway I must be thick, because I fail to see how >> > generated code could access guest CPU registers without a >> > pointer to the CPU env :-) >> > >> > IIUC the SPARC translator uses ld_i32/st_i32 mainly for >> > accessing the guest CPU registers, which due to register >> > windows is held in a dedicated global temp. Is that >> > correct? If so this is kind of hiding accesses to the >> > CPU env; all other targets read/write registers by using >> > CPU env (through the use global temps in most cases). >> > >> > So I think most (if not almost all) TB will need a pointer >> > to CPU env, which is why I think Aurélien's proposal to >> > keep a dedicated register that'd be loaded in the prologue >> > is the only way to not degrade performance of the >> > generated code (I'd add that this dedicated register >> > should be the one defined by the ABI as holding the first >> > parameter value, if that's possible; I'm afraid this is >> > not necessarily a good idea). >> >> CPU env will be used, but the register could be made available for >> other uses too. >> > > What other uses? It is needed for almost every TB, so there is not other > possible use. Let's take an example, a simple TB from mips on x86_64: > > | IN: start_kernel > | 0x80510958: jal 0x8051d50c > | 0x8051095c: nop > | > | OP after liveness analysis: > | movi_i32 ra,$0x80510960 > | movi_i32 PC,$0x8051d50c > | exit_tb $0x0 > | end > > | OUT: [size=28] > | 0x40f22960: mov $0x80510960,%ebp > | 0x40f22965: mov %ebp,0x7c(%r14) > | 0x40f22969: mov $0x8051d50c,%ebp > | 0x40f2296e: mov %ebp,0x80(%r14) > | 0x40f22975: xor %eax,%eax > | 0x40f22977: jmpq 0x115042e > > x86_64 uses r14 as TCG_AREG0. Despite the instructions being quite > simple (only 2 movi_i32), the resulting code makes 2 access to env to > save the two registers. Having to reload the env pointer each time to a > register would clearly increase the size of this TB. I don't think TCG would be that simple, instead the pointer would be loaded only once in this case. > Another question, imagine env is not in a register anymore. What kind of > code do you plan to use to load the env pointer into a register? This is > something we need to know to evaluate the cost of not having a fixed > register for env withing the TCG code. The prologue could save it to the first temp dedicated for this. ^ permalink raw reply [flat|nested] 31+ messages in thread
* Re: [Qemu-devel] [PATCH RFC 00/11] AREG0 elimination 2011-05-15 11:33 ` Blue Swirl @ 2011-05-15 11:37 ` Aurelien Jarno 2011-05-15 12:33 ` Blue Swirl 2011-05-15 12:14 ` Laurent Desnogues 1 sibling, 1 reply; 31+ messages in thread From: Aurelien Jarno @ 2011-05-15 11:37 UTC (permalink / raw) To: Blue Swirl; +Cc: Laurent Desnogues, qemu-devel On Sun, May 15, 2011 at 02:33:55PM +0300, Blue Swirl wrote: > On Sun, May 15, 2011 at 2:14 PM, Aurelien Jarno <aurelien@aurel32.net> wrote: > > On Sun, May 15, 2011 at 02:03:40PM +0300, Blue Swirl wrote: > >> On Sun, May 15, 2011 at 12:27 PM, Laurent Desnogues > >> <laurent.desnogues@gmail.com> wrote: > >> > On Sun, May 15, 2011 at 9:15 AM, Blue Swirl <blauwirbel@gmail.com> wrote: > >> >> On Sun, May 15, 2011 at 1:04 AM, Aurelien Jarno <aurelien@aurel32.net> wrote: > >> >>> On Sun, May 15, 2011 at 12:52:35AM +0300, Blue Swirl wrote: > >> >>>> On Sun, May 15, 2011 at 12:16 AM, Aurelien Jarno <aurelien@aurel32.net> wrote: > >> >>>> > On Sat, May 14, 2011 at 10:35:20PM +0300, Blue Swirl wrote: > >> > [...] > >> >>>> > The env register is used very often (basically for every load/store, but > >> >>>> > also a lot of helpers), so it makes sense to reserve a register for it. > >> >>>> > > >> >>>> > For what I understand from your patch series, you prefer to pass this > >> >>>> > register explicitly to TCG functions. This basically means this TCG > >> >>>> > global will be loaded to host register as soon as it is used, but also > >> >>>> > regularly, as globals are saved back to their canonical location before > >> >>>> > an helper or a load/store. > >> >>>> > > >> >>>> > So it seems that this patch series will just allowing the "env register" > >> >>>> > to change over time, though it will not spare one more register for the > >> >>>> > TCG code, and it will emit longer TCG code to regularly reload the env > >> >>>> > global into a host register. > >> >>>> > >> >>>> But there will be one more register available in some cases. In other > >> >>> > >> >>> Inside the TCG code, it will basically happens very rarely, given > >> >>> load/store are really the most used instructions, and they need to load > >> >>> the env register. > >> >> > >> >> Not exactly, from a sample run with -d op_opt: > >> >> $ egrep -v -e '^$' -v -e 'OP after' -v -e ' end' -v -e 'Search PC' > >> >> /tmp/qemu.log | awk '{print $1}' | sort | uniq -c|sort -rn > >> >> 1673966 movi_i32 > >> >> 653931 ld_i32 > >> >> 607432 mov_i32 > >> >> 428684 st_i32 > >> >> 326878 movi_i64 > >> >> 308626 add_i32 > >> >> 283186 call > >> >> 256817 exit_tb > >> >> 207232 nopn > >> >> 189388 goto_tb > >> >> 122398 and_i32 > >> >> 117997 shr_i32 > >> >> 89107 qemu_ld32 > >> >> 82926 set_label > >> >> 82713 brcond_i32 > >> >> 67169 qemu_st32 > >> >> 55109 or_i32 > >> >> 46536 ext32u_i64 > >> >> 44288 xor_i32 > >> >> 38103 sub_i32 > >> >> 26361 shl_i32 > >> >> 23218 shl_i64 > >> >> 23218 qemu_st64 > >> >> 23218 or_i64 > >> >> 20474 shr_i64 > >> >> 20445 qemu_ld64 > >> >> 11161 qemu_ld8u > >> >> 10409 qemu_st8 > >> >> 5013 qemu_ld16u > >> >> 3795 qemu_st16 > >> >> 2776 qemu_ld8s > >> >> 1915 sar_i32 > >> >> 1414 qemu_ld16s > >> >> 839 not_i32 > >> >> 579 setcond_i32 > >> >> 213 br > >> >> 42 ext32s_i64 > >> >> 30 mul_i64 > >> > > >> > Unless I missed something, this doesn't show the usage of > >> > ld/st per TB, which is what Aurélien was looking for if I > >> > understood correctly. All I can say is that you had at > >> > most 256817 TB's and 234507 qemu_ld/st, so about one per > >> > TB. > >> > >> The question was ratio of loads/stores to other instructions. The > >> statistics are not per TB. There were about 174880 TBs. > >> > >> > Anyway I must be thick, because I fail to see how > >> > generated code could access guest CPU registers without a > >> > pointer to the CPU env :-) > >> > > >> > IIUC the SPARC translator uses ld_i32/st_i32 mainly for > >> > accessing the guest CPU registers, which due to register > >> > windows is held in a dedicated global temp. Is that > >> > correct? If so this is kind of hiding accesses to the > >> > CPU env; all other targets read/write registers by using > >> > CPU env (through the use global temps in most cases). > >> > > >> > So I think most (if not almost all) TB will need a pointer > >> > to CPU env, which is why I think Aurélien's proposal to > >> > keep a dedicated register that'd be loaded in the prologue > >> > is the only way to not degrade performance of the > >> > generated code (I'd add that this dedicated register > >> > should be the one defined by the ABI as holding the first > >> > parameter value, if that's possible; I'm afraid this is > >> > not necessarily a good idea). > >> > >> CPU env will be used, but the register could be made available for > >> other uses too. > >> > > > > What other uses? It is needed for almost every TB, so there is not other > > possible use. Let's take an example, a simple TB from mips on x86_64: > > > > | IN: start_kernel > > | 0x80510958: jal 0x8051d50c > > | 0x8051095c: nop > > | > > | OP after liveness analysis: > > | movi_i32 ra,$0x80510960 > > | movi_i32 PC,$0x8051d50c > > | exit_tb $0x0 > > | end > > > > | OUT: [size=28] > > | 0x40f22960: mov $0x80510960,%ebp > > | 0x40f22965: mov %ebp,0x7c(%r14) > > | 0x40f22969: mov $0x8051d50c,%ebp > > | 0x40f2296e: mov %ebp,0x80(%r14) > > | 0x40f22975: xor %eax,%eax > > | 0x40f22977: jmpq 0x115042e > > > > x86_64 uses r14 as TCG_AREG0. Despite the instructions being quite > > simple (only 2 movi_i32), the resulting code makes 2 access to env to > > save the two registers. Having to reload the env pointer each time to a > > register would clearly increase the size of this TB. > > I don't think TCG would be that simple, instead the pointer would be > loaded only once in this case. > > > Another question, imagine env is not in a register anymore. What kind of > > code do you plan to use to load the env pointer into a register? This is > > something we need to know to evaluate the cost of not having a fixed > > register for env withing the TCG code. > > The prologue could save it to the first temp dedicated for this. > So basically you mean you want to keep a dedicated temp for env in the TCG generated code??? It seems to go in the opposite of your previous mails stating a full elimination of TCG_AREG0. -- Aurelien Jarno GPG: 1024D/F1BCDB73 aurelien@aurel32.net http://www.aurel32.net ^ permalink raw reply [flat|nested] 31+ messages in thread
* Re: [Qemu-devel] [PATCH RFC 00/11] AREG0 elimination 2011-05-15 11:37 ` Aurelien Jarno @ 2011-05-15 12:33 ` Blue Swirl 0 siblings, 0 replies; 31+ messages in thread From: Blue Swirl @ 2011-05-15 12:33 UTC (permalink / raw) To: Aurelien Jarno; +Cc: Laurent Desnogues, qemu-devel On Sun, May 15, 2011 at 2:37 PM, Aurelien Jarno <aurelien@aurel32.net> wrote: > On Sun, May 15, 2011 at 02:33:55PM +0300, Blue Swirl wrote: >> On Sun, May 15, 2011 at 2:14 PM, Aurelien Jarno <aurelien@aurel32.net> wrote: >> > On Sun, May 15, 2011 at 02:03:40PM +0300, Blue Swirl wrote: >> >> On Sun, May 15, 2011 at 12:27 PM, Laurent Desnogues >> >> <laurent.desnogues@gmail.com> wrote: >> >> > On Sun, May 15, 2011 at 9:15 AM, Blue Swirl <blauwirbel@gmail.com> wrote: >> >> >> On Sun, May 15, 2011 at 1:04 AM, Aurelien Jarno <aurelien@aurel32.net> wrote: >> >> >>> On Sun, May 15, 2011 at 12:52:35AM +0300, Blue Swirl wrote: >> >> >>>> On Sun, May 15, 2011 at 12:16 AM, Aurelien Jarno <aurelien@aurel32.net> wrote: >> >> >>>> > On Sat, May 14, 2011 at 10:35:20PM +0300, Blue Swirl wrote: >> >> > [...] >> >> >>>> > The env register is used very often (basically for every load/store, but >> >> >>>> > also a lot of helpers), so it makes sense to reserve a register for it. >> >> >>>> > >> >> >>>> > For what I understand from your patch series, you prefer to pass this >> >> >>>> > register explicitly to TCG functions. This basically means this TCG >> >> >>>> > global will be loaded to host register as soon as it is used, but also >> >> >>>> > regularly, as globals are saved back to their canonical location before >> >> >>>> > an helper or a load/store. >> >> >>>> > >> >> >>>> > So it seems that this patch series will just allowing the "env register" >> >> >>>> > to change over time, though it will not spare one more register for the >> >> >>>> > TCG code, and it will emit longer TCG code to regularly reload the env >> >> >>>> > global into a host register. >> >> >>>> >> >> >>>> But there will be one more register available in some cases. In other >> >> >>> >> >> >>> Inside the TCG code, it will basically happens very rarely, given >> >> >>> load/store are really the most used instructions, and they need to load >> >> >>> the env register. >> >> >> >> >> >> Not exactly, from a sample run with -d op_opt: >> >> >> $ egrep -v -e '^$' -v -e 'OP after' -v -e ' end' -v -e 'Search PC' >> >> >> /tmp/qemu.log | awk '{print $1}' | sort | uniq -c|sort -rn >> >> >> 1673966 movi_i32 >> >> >> 653931 ld_i32 >> >> >> 607432 mov_i32 >> >> >> 428684 st_i32 >> >> >> 326878 movi_i64 >> >> >> 308626 add_i32 >> >> >> 283186 call >> >> >> 256817 exit_tb >> >> >> 207232 nopn >> >> >> 189388 goto_tb >> >> >> 122398 and_i32 >> >> >> 117997 shr_i32 >> >> >> 89107 qemu_ld32 >> >> >> 82926 set_label >> >> >> 82713 brcond_i32 >> >> >> 67169 qemu_st32 >> >> >> 55109 or_i32 >> >> >> 46536 ext32u_i64 >> >> >> 44288 xor_i32 >> >> >> 38103 sub_i32 >> >> >> 26361 shl_i32 >> >> >> 23218 shl_i64 >> >> >> 23218 qemu_st64 >> >> >> 23218 or_i64 >> >> >> 20474 shr_i64 >> >> >> 20445 qemu_ld64 >> >> >> 11161 qemu_ld8u >> >> >> 10409 qemu_st8 >> >> >> 5013 qemu_ld16u >> >> >> 3795 qemu_st16 >> >> >> 2776 qemu_ld8s >> >> >> 1915 sar_i32 >> >> >> 1414 qemu_ld16s >> >> >> 839 not_i32 >> >> >> 579 setcond_i32 >> >> >> 213 br >> >> >> 42 ext32s_i64 >> >> >> 30 mul_i64 >> >> > >> >> > Unless I missed something, this doesn't show the usage of >> >> > ld/st per TB, which is what Aurélien was looking for if I >> >> > understood correctly. All I can say is that you had at >> >> > most 256817 TB's and 234507 qemu_ld/st, so about one per >> >> > TB. >> >> >> >> The question was ratio of loads/stores to other instructions. The >> >> statistics are not per TB. There were about 174880 TBs. >> >> >> >> > Anyway I must be thick, because I fail to see how >> >> > generated code could access guest CPU registers without a >> >> > pointer to the CPU env :-) >> >> > >> >> > IIUC the SPARC translator uses ld_i32/st_i32 mainly for >> >> > accessing the guest CPU registers, which due to register >> >> > windows is held in a dedicated global temp. Is that >> >> > correct? If so this is kind of hiding accesses to the >> >> > CPU env; all other targets read/write registers by using >> >> > CPU env (through the use global temps in most cases). >> >> > >> >> > So I think most (if not almost all) TB will need a pointer >> >> > to CPU env, which is why I think Aurélien's proposal to >> >> > keep a dedicated register that'd be loaded in the prologue >> >> > is the only way to not degrade performance of the >> >> > generated code (I'd add that this dedicated register >> >> > should be the one defined by the ABI as holding the first >> >> > parameter value, if that's possible; I'm afraid this is >> >> > not necessarily a good idea). >> >> >> >> CPU env will be used, but the register could be made available for >> >> other uses too. >> >> >> > >> > What other uses? It is needed for almost every TB, so there is not other >> > possible use. Let's take an example, a simple TB from mips on x86_64: >> > >> > | IN: start_kernel >> > | 0x80510958: jal 0x8051d50c >> > | 0x8051095c: nop >> > | >> > | OP after liveness analysis: >> > | movi_i32 ra,$0x80510960 >> > | movi_i32 PC,$0x8051d50c >> > | exit_tb $0x0 >> > | end >> > >> > | OUT: [size=28] >> > | 0x40f22960: mov $0x80510960,%ebp >> > | 0x40f22965: mov %ebp,0x7c(%r14) >> > | 0x40f22969: mov $0x8051d50c,%ebp >> > | 0x40f2296e: mov %ebp,0x80(%r14) >> > | 0x40f22975: xor %eax,%eax >> > | 0x40f22977: jmpq 0x115042e >> > >> > x86_64 uses r14 as TCG_AREG0. Despite the instructions being quite >> > simple (only 2 movi_i32), the resulting code makes 2 access to env to >> > save the two registers. Having to reload the env pointer each time to a >> > register would clearly increase the size of this TB. >> >> I don't think TCG would be that simple, instead the pointer would be >> loaded only once in this case. >> >> > Another question, imagine env is not in a register anymore. What kind of >> > code do you plan to use to load the env pointer into a register? This is >> > something we need to know to evaluate the cost of not having a fixed >> > register for env withing the TCG code. >> >> The prologue could save it to the first temp dedicated for this. >> > > So basically you mean you want to keep a dedicated temp for env in the > TCG generated code??? It seems to go in the opposite of your previous > mails stating a full elimination of TCG_AREG0. I meant the first stack/temp_buf temp. ^ permalink raw reply [flat|nested] 31+ messages in thread
* Re: [Qemu-devel] [PATCH RFC 00/11] AREG0 elimination 2011-05-15 11:33 ` Blue Swirl 2011-05-15 11:37 ` Aurelien Jarno @ 2011-05-15 12:14 ` Laurent Desnogues 2011-05-15 12:37 ` Blue Swirl 1 sibling, 1 reply; 31+ messages in thread From: Laurent Desnogues @ 2011-05-15 12:14 UTC (permalink / raw) To: Blue Swirl; +Cc: qemu-devel, Aurelien Jarno On Sun, May 15, 2011 at 1:33 PM, Blue Swirl <blauwirbel@gmail.com> wrote: [...] >> x86_64 uses r14 as TCG_AREG0. Despite the instructions being quite >> simple (only 2 movi_i32), the resulting code makes 2 access to env to >> save the two registers. Having to reload the env pointer each time to a >> register would clearly increase the size of this TB. > > I don't think TCG would be that simple, instead the pointer would be > loaded only once in this case. Assuming TCG was able to allocate a register for that, it would be live at most for one TB, so you'd have to load it at least once per TB, and with block chaining that wouldn't be efficient as you'd keep on reloading it. Laurent ^ permalink raw reply [flat|nested] 31+ messages in thread
* Re: [Qemu-devel] [PATCH RFC 00/11] AREG0 elimination 2011-05-15 12:14 ` Laurent Desnogues @ 2011-05-15 12:37 ` Blue Swirl 2011-05-15 13:02 ` Aurelien Jarno 0 siblings, 1 reply; 31+ messages in thread From: Blue Swirl @ 2011-05-15 12:37 UTC (permalink / raw) To: Laurent Desnogues; +Cc: qemu-devel, Aurelien Jarno On Sun, May 15, 2011 at 3:14 PM, Laurent Desnogues <laurent.desnogues@gmail.com> wrote: > On Sun, May 15, 2011 at 1:33 PM, Blue Swirl <blauwirbel@gmail.com> wrote: > [...] >>> x86_64 uses r14 as TCG_AREG0. Despite the instructions being quite >>> simple (only 2 movi_i32), the resulting code makes 2 access to env to >>> save the two registers. Having to reload the env pointer each time to a >>> register would clearly increase the size of this TB. >> >> I don't think TCG would be that simple, instead the pointer would be >> loaded only once in this case. > > Assuming TCG was able to allocate a register for that, > it would be live at most for one TB, so you'd have to > load it at least once per TB, and with block chaining > that wouldn't be efficient as you'd keep on reloading it. Yes, but if there are better uses, the register can be flushed. Now this is not possible since the register is always unavailable. ^ permalink raw reply [flat|nested] 31+ messages in thread
* Re: [Qemu-devel] [PATCH RFC 00/11] AREG0 elimination 2011-05-15 12:37 ` Blue Swirl @ 2011-05-15 13:02 ` Aurelien Jarno 2011-05-15 13:42 ` Blue Swirl 0 siblings, 1 reply; 31+ messages in thread From: Aurelien Jarno @ 2011-05-15 13:02 UTC (permalink / raw) To: Blue Swirl; +Cc: Laurent Desnogues, qemu-devel On Sun, May 15, 2011 at 03:37:00PM +0300, Blue Swirl wrote: > On Sun, May 15, 2011 at 3:14 PM, Laurent Desnogues > <laurent.desnogues@gmail.com> wrote: > > On Sun, May 15, 2011 at 1:33 PM, Blue Swirl <blauwirbel@gmail.com> wrote: > > [...] > >>> x86_64 uses r14 as TCG_AREG0. Despite the instructions being quite > >>> simple (only 2 movi_i32), the resulting code makes 2 access to env to > >>> save the two registers. Having to reload the env pointer each time to a > >>> register would clearly increase the size of this TB. > >> > >> I don't think TCG would be that simple, instead the pointer would be > >> loaded only once in this case. > > > > Assuming TCG was able to allocate a register for that, > > it would be live at most for one TB, so you'd have to > > load it at least once per TB, and with block chaining > > that wouldn't be efficient as you'd keep on reloading it. > > Yes, but if there are better uses, the register can be flushed. Now > this is not possible since the register is always unavailable. > What are the better uses, that justify to flush a register that is going to be used three or four host asm later? In the current generated code, roughly one every four instruction reference TCG_AREG0, so this register is really needed very often. If you think TCG will be faster by having one more register in between I suggest you to first optimize tcg_reg_alloc(), which simply spill a random register, even if they are some allocated register that won't be used until the end of the TB. You should also should check how often TCG spills a register (in which case it would have benefit from one more register). It happens less than 2000 times when booting an emulated mips system on x86_64, while more than 160000 TB are generated. -- Aurelien Jarno GPG: 1024D/F1BCDB73 aurelien@aurel32.net http://www.aurel32.net ^ permalink raw reply [flat|nested] 31+ messages in thread
* Re: [Qemu-devel] [PATCH RFC 00/11] AREG0 elimination 2011-05-15 13:02 ` Aurelien Jarno @ 2011-05-15 13:42 ` Blue Swirl 2011-05-15 14:03 ` Aurelien Jarno 0 siblings, 1 reply; 31+ messages in thread From: Blue Swirl @ 2011-05-15 13:42 UTC (permalink / raw) To: Aurelien Jarno; +Cc: Laurent Desnogues, qemu-devel On Sun, May 15, 2011 at 4:02 PM, Aurelien Jarno <aurelien@aurel32.net> wrote: > On Sun, May 15, 2011 at 03:37:00PM +0300, Blue Swirl wrote: >> On Sun, May 15, 2011 at 3:14 PM, Laurent Desnogues >> <laurent.desnogues@gmail.com> wrote: >> > On Sun, May 15, 2011 at 1:33 PM, Blue Swirl <blauwirbel@gmail.com> wrote: >> > [...] >> >>> x86_64 uses r14 as TCG_AREG0. Despite the instructions being quite >> >>> simple (only 2 movi_i32), the resulting code makes 2 access to env to >> >>> save the two registers. Having to reload the env pointer each time to a >> >>> register would clearly increase the size of this TB. >> >> >> >> I don't think TCG would be that simple, instead the pointer would be >> >> loaded only once in this case. >> > >> > Assuming TCG was able to allocate a register for that, >> > it would be live at most for one TB, so you'd have to >> > load it at least once per TB, and with block chaining >> > that wouldn't be efficient as you'd keep on reloading it. >> >> Yes, but if there are better uses, the register can be flushed. Now >> this is not possible since the register is always unavailable. >> > > What are the better uses, that justify to flush a register that is going > to be used three or four host asm later? It would obviously replace something else determined by TCG. > In the current generated code, roughly one every four instruction > reference TCG_AREG0, so this register is really needed very often. > > If you think TCG will be faster by having one more register in between > I suggest you to first optimize tcg_reg_alloc(), which simply spill > a random register, even if they are some allocated register that won't > be used until the end of the TB. You should also should check how often > TCG spills a register (in which case it would have benefit from one more > register). It happens less than 2000 times when booting an emulated mips > system on x86_64, while more than 160000 TB are generated. Right, on a modern CPU with lots of registers, one additional register won't be helpful, but on i386 the situation should be very different, there are very few registers. ^ permalink raw reply [flat|nested] 31+ messages in thread
* Re: [Qemu-devel] [PATCH RFC 00/11] AREG0 elimination 2011-05-15 13:42 ` Blue Swirl @ 2011-05-15 14:03 ` Aurelien Jarno 2011-05-15 14:06 ` Blue Swirl 0 siblings, 1 reply; 31+ messages in thread From: Aurelien Jarno @ 2011-05-15 14:03 UTC (permalink / raw) To: Blue Swirl; +Cc: Laurent Desnogues, qemu-devel On Sun, May 15, 2011 at 04:42:05PM +0300, Blue Swirl wrote: > On Sun, May 15, 2011 at 4:02 PM, Aurelien Jarno <aurelien@aurel32.net> wrote: > > On Sun, May 15, 2011 at 03:37:00PM +0300, Blue Swirl wrote: > >> On Sun, May 15, 2011 at 3:14 PM, Laurent Desnogues > >> <laurent.desnogues@gmail.com> wrote: > >> > On Sun, May 15, 2011 at 1:33 PM, Blue Swirl <blauwirbel@gmail.com> wrote: > >> > [...] > >> >>> x86_64 uses r14 as TCG_AREG0. Despite the instructions being quite > >> >>> simple (only 2 movi_i32), the resulting code makes 2 access to env to > >> >>> save the two registers. Having to reload the env pointer each time to a > >> >>> register would clearly increase the size of this TB. > >> >> > >> >> I don't think TCG would be that simple, instead the pointer would be > >> >> loaded only once in this case. > >> > > >> > Assuming TCG was able to allocate a register for that, > >> > it would be live at most for one TB, so you'd have to > >> > load it at least once per TB, and with block chaining > >> > that wouldn't be efficient as you'd keep on reloading it. > >> > >> Yes, but if there are better uses, the register can be flushed. Now > >> this is not possible since the register is always unavailable. > >> > > > > What are the better uses, that justify to flush a register that is going > > to be used three or four host asm later? > > It would obviously replace something else determined by TCG. The register will be free only for a few host instructions. Could you please give more concrete example about such a usage? > > In the current generated code, roughly one every four instruction > > reference TCG_AREG0, so this register is really needed very often. > > > > If you think TCG will be faster by having one more register in between > > I suggest you to first optimize tcg_reg_alloc(), which simply spill > > a random register, even if they are some allocated register that won't > > be used until the end of the TB. You should also should check how often > > TCG spills a register (in which case it would have benefit from one more > > register). It happens less than 2000 times when booting an emulated mips > > system on x86_64, while more than 160000 TB are generated. > > Right, on a modern CPU with lots of registers, one additional register > won't be helpful, but on i386 the situation should be very different, > there are very few registers. > On i386, I indeed get a lot more of spilled registers, that is 340000. Still that number is not that high, it's less than two times per TB. If we consider that these register spills are pure loss (which is not always the case, sometime the spilled register is actually never used later, so it's just an anticipated save), that's 4 load/store per TB. It means to compensate, the env register should not be loaded more than 4 times in a TB, which looks like quite difficult to achieve given how often this register is used. Please also note that spilling globals currently need access to the env pointer, which might not be loaded, so another register spill is need to load it. This will make the code a lot more complex than now to avoid a deadlock (probably by spilling local temps first). -- Aurelien Jarno GPG: 1024D/F1BCDB73 aurelien@aurel32.net http://www.aurel32.net ^ permalink raw reply [flat|nested] 31+ messages in thread
* Re: [Qemu-devel] [PATCH RFC 00/11] AREG0 elimination 2011-05-15 14:03 ` Aurelien Jarno @ 2011-05-15 14:06 ` Blue Swirl 0 siblings, 0 replies; 31+ messages in thread From: Blue Swirl @ 2011-05-15 14:06 UTC (permalink / raw) To: Aurelien Jarno; +Cc: Laurent Desnogues, qemu-devel On Sun, May 15, 2011 at 5:03 PM, Aurelien Jarno <aurelien@aurel32.net> wrote: > On Sun, May 15, 2011 at 04:42:05PM +0300, Blue Swirl wrote: >> On Sun, May 15, 2011 at 4:02 PM, Aurelien Jarno <aurelien@aurel32.net> wrote: >> > On Sun, May 15, 2011 at 03:37:00PM +0300, Blue Swirl wrote: >> >> On Sun, May 15, 2011 at 3:14 PM, Laurent Desnogues >> >> <laurent.desnogues@gmail.com> wrote: >> >> > On Sun, May 15, 2011 at 1:33 PM, Blue Swirl <blauwirbel@gmail.com> wrote: >> >> > [...] >> >> >>> x86_64 uses r14 as TCG_AREG0. Despite the instructions being quite >> >> >>> simple (only 2 movi_i32), the resulting code makes 2 access to env to >> >> >>> save the two registers. Having to reload the env pointer each time to a >> >> >>> register would clearly increase the size of this TB. >> >> >> >> >> >> I don't think TCG would be that simple, instead the pointer would be >> >> >> loaded only once in this case. >> >> > >> >> > Assuming TCG was able to allocate a register for that, >> >> > it would be live at most for one TB, so you'd have to >> >> > load it at least once per TB, and with block chaining >> >> > that wouldn't be efficient as you'd keep on reloading it. >> >> >> >> Yes, but if there are better uses, the register can be flushed. Now >> >> this is not possible since the register is always unavailable. >> >> >> > >> > What are the better uses, that justify to flush a register that is going >> > to be used three or four host asm later? >> >> It would obviously replace something else determined by TCG. > > The register will be free only for a few host instructions. Could you > please give more concrete example about such a usage? > >> > In the current generated code, roughly one every four instruction >> > reference TCG_AREG0, so this register is really needed very often. >> > >> > If you think TCG will be faster by having one more register in between >> > I suggest you to first optimize tcg_reg_alloc(), which simply spill >> > a random register, even if they are some allocated register that won't >> > be used until the end of the TB. You should also should check how often >> > TCG spills a register (in which case it would have benefit from one more >> > register). It happens less than 2000 times when booting an emulated mips >> > system on x86_64, while more than 160000 TB are generated. >> >> Right, on a modern CPU with lots of registers, one additional register >> won't be helpful, but on i386 the situation should be very different, >> there are very few registers. >> > > On i386, I indeed get a lot more of spilled registers, that is 340000. Still > that number is not that high, it's less than two times per TB. If we > consider that these register spills are pure loss (which is not always > the case, sometime the spilled register is actually never used later, so > it's just an anticipated save), that's 4 load/store per TB. > > It means to compensate, the env register should not be loaded more than > 4 times in a TB, which looks like quite difficult to achieve given how > often this register is used. > > Please also note that spilling globals currently need access to the env > pointer, which might not be loaded, so another register spill is need to > load it. This will make the code a lot more complex than now to avoid a > deadlock (probably by spilling local temps first). OK, this doesn't look so attractive after all. ^ permalink raw reply [flat|nested] 31+ messages in thread
* Re: [Qemu-devel] [PATCH RFC 00/11] AREG0 elimination 2011-05-14 21:16 ` Aurelien Jarno 2011-05-14 21:52 ` Blue Swirl @ 2011-05-14 23:31 ` Aurelien Jarno 1 sibling, 0 replies; 31+ messages in thread From: Aurelien Jarno @ 2011-05-14 23:31 UTC (permalink / raw) To: Blue Swirl; +Cc: qemu-devel On Sat, May 14, 2011 at 11:16:16PM +0200, Aurelien Jarno wrote: > On Sat, May 14, 2011 at 10:35:20PM +0300, Blue Swirl wrote: > > Here's a RFC series for eliminating AREG0. > > > > Blue Swirl (11): > > Move user emulator stuff from cpu-exec.c to user-exec.c > > Delete unused tb_invalidate_page_range > > > > The above should be OK to commit. > > > > cpu_loop_exit: avoid using AREG0 > > Delegate setup of TCG temporaries to targets > > > > These two are not, unless the overall plan is OK. > > > > TCG: fix negative frame offset calculations > > TCG/x86: use stack for TCG temps > > TCG/Sparc64: use stack for TCG temps > > > > But these three should be OK. I've tested lightly x86_64 and Sparc64 hosts. > > > > Add CONFIG_TARGET_NEEDS_AREG0 > > Don't compile legacy qemu_ld/st functions if target doesn't need them > > > > Should be OK, though the latter patch only touches x86. > > > > Add new qemu_ld and qemu_st functions > > sparc: use new qemu_ld and qemu_st functions > > > > The last two compile but QEMU segfaults. I just made a naive > > conversion for getting comments. > > > > What is the goal behing removing TCG_AREG0? If it is speed improvement, > can you please provide some benchmarks? > > The env register is used very often (basically for every load/store, but > also a lot of helpers), so it makes sense to reserve a register for it. > > For what I understand from your patch series, you prefer to pass this > register explicitly to TCG functions. This basically means this TCG > global will be loaded to host register as soon as it is used, but also > regularly, as globals are saved back to their canonical location before > an helper or a load/store. > > So it seems that this patch series will just allowing the "env register" > to change over time, though it will not spare one more register for the > TCG code, and it will emit longer TCG code to regularly reload the env > global into a host register. > One way to solve that would be to use a env register only at the TCG level, but not at the GCC level. That means loading the value of env into TCG_AREG0 in the prologue. That way, all the TCG code will have direct access to env, and the GCC generated code (which includes the helper) don't need to have this register reserved. I doubt the latter will increase the performance, but I understand the cleanliness argument. This also means minimal code changes, especially as most (if not all) TCG targets seems to have TCG_AREG0 as a callee saved arguments. Basically the path to do the changes could be: - Make sure all targets have TCG_AREG0 as a callee saved register - Load env into TCG_AREG0 (and rename it as a TCG_ENV?) in the prologue (could be by passing it as argument to the prologue code) - Change all helpers to not use env directly - Change softmmu code to not use env directly - Remove GCC hack to reserve a register for env. -- Aurelien Jarno GPG: 1024D/F1BCDB73 aurelien@aurel32.net http://www.aurel32.net ^ permalink raw reply [flat|nested] 31+ messages in thread
end of thread, other threads:[~2011-05-15 14:11 UTC | newest] Thread overview: 31+ messages (download: mbox.gz follow: Atom feed -- links below jump to the message on this page -- 2011-05-14 19:35 [Qemu-devel] [PATCH RFC 00/11] AREG0 elimination Blue Swirl 2011-05-14 21:16 ` Aurelien Jarno 2011-05-14 21:52 ` Blue Swirl 2011-05-14 22:04 ` Aurelien Jarno 2011-05-15 7:15 ` Blue Swirl 2011-05-15 9:24 ` Aurelien Jarno 2011-05-15 9:58 ` Blue Swirl 2011-05-15 10:19 ` Aurelien Jarno 2011-05-15 11:12 ` Blue Swirl 2011-05-15 11:36 ` Aurelien Jarno 2011-05-15 12:30 ` Blue Swirl 2011-05-15 12:49 ` Aurelien Jarno 2011-05-15 13:16 ` Blue Swirl 2011-05-15 13:43 ` Aurelien Jarno 2011-05-15 14:02 ` Blue Swirl 2011-05-15 14:06 ` Aurelien Jarno 2011-05-15 14:10 ` Blue Swirl 2011-05-15 9:27 ` Laurent Desnogues 2011-05-15 9:49 ` Aurelien Jarno 2011-05-15 11:03 ` Blue Swirl 2011-05-15 11:14 ` Aurelien Jarno 2011-05-15 11:33 ` Blue Swirl 2011-05-15 11:37 ` Aurelien Jarno 2011-05-15 12:33 ` Blue Swirl 2011-05-15 12:14 ` Laurent Desnogues 2011-05-15 12:37 ` Blue Swirl 2011-05-15 13:02 ` Aurelien Jarno 2011-05-15 13:42 ` Blue Swirl 2011-05-15 14:03 ` Aurelien Jarno 2011-05-15 14:06 ` Blue Swirl 2011-05-14 23:31 ` Aurelien Jarno
This is a public inbox, see mirroring instructions for how to clone and mirror all data and code used for this inbox; as well as URLs for NNTP newsgroup(s).