* Re: [Qemu-trivial] [Qemu-devel] [PATCH] tcg: optimise memory layout of TCGTemp
@ 2015-03-29 21:52 ` Richard Henderson
0 siblings, 0 replies; 17+ messages in thread
From: Richard Henderson @ 2015-03-29 21:52 UTC (permalink / raw)
To: Emilio G. Cota; +Cc: qemu-trivial, Stefan Weil, Alex Bennée, qemu-devel
On Mar 27, 2015 14:09, "Emilio G. Cota" <cota@braap.org> wrote:
>
> On Fri, Mar 27, 2015 at 09:55:03 +0000, Alex Bennée wrote:
> > Have you been able to measure any performance improvement with these new
> > structures? In theory, if aligned with cache lines, performance should
> > improve but real numbers would be nice.
>
> I haven't benchmarked anything, which makes me very uneasy. All
> I've checked is that the system boots, and FWIW I appreciate no
> difference in boot time.
No decrease in boot time is good. We /know/ we're saving memory, after all.
>
> Is there a benchmark suite to test TCG changes?
No, sorry.
r~
^ permalink raw reply [flat|nested] 17+ messages in thread
* Re: [Qemu-trivial] [Qemu-devel] [PATCH] tcg: optimise memory layout of TCGTemp
2015-03-29 21:52 ` Richard Henderson
@ 2015-03-30 5:33 ` Stefan Weil
-1 siblings, 0 replies; 17+ messages in thread
From: Stefan Weil @ 2015-03-30 5:33 UTC (permalink / raw)
To: Richard Henderson, Emilio G. Cota
Cc: qemu-trivial, Alex Bennée, qemu-devel
Am 29.03.2015 um 23:52 schrieb Richard Henderson:
> On Mar 27, 2015 14:09, "Emilio G. Cota" <cota@braap.org> wrote:
>> On Fri, Mar 27, 2015 at 09:55:03 +0000, Alex Bennée wrote:
>>> Have you been able to measure any performance improvement with these new
>>> structures? In theory, if aligned with cache lines, performance should
>>> improve but real numbers would be nice.
>> I haven't benchmarked anything, which makes me very uneasy. All
>> I've checked is that the system boots, and FWIW I appreciate no
>> difference in boot time.
> No decrease in boot time is good. We /know/ we're saving memory, after all.
>
>> Is there a benchmark suite to test TCG changes?
> No, sorry.
>
>
> r~
Benchmarking TCG with QEMU's system emulation is nearly impossible
because operating systems usually contain lots of timer-based operations.
The TCG interpreter, for example, is really slow, but a BIOS will boot
faster than expected with it.
User-mode emulation is much better for benchmarks.
Run some command-line Linux application which mainly does
computations (not file I/O) using user-mode emulation on Linux.
The OpenSSL package contains bntest, which can be used
as a benchmark for TCG. Redirect all output to /dev/null when
you run it.
Binaries for i386 and x86_64 are available from
http://qemu.weilnetz.de/test/user/.
Stefan
^ permalink raw reply [flat|nested] 17+ messages in thread
* Re: [Qemu-trivial] [Qemu-devel] [PATCH] tcg: optimise memory layout of TCGTemp
2015-03-29 21:52 ` Richard Henderson
@ 2015-03-30 5:43 ` Stefan Weil
-1 siblings, 0 replies; 17+ messages in thread
From: Stefan Weil @ 2015-03-30 5:43 UTC (permalink / raw)
To: Richard Henderson, Emilio G. Cota
Cc: qemu-trivial, Alex Bennée, qemu-devel
Am 29.03.2015 um 23:52 schrieb Richard Henderson:
> No decrease in boot time is good. We /know/ we're saving memory, after all.
Well, I would not mind a decrease in boot time, either.
The more it decreases, the better. :-)
To be honest: in my version I only used 1-bit bitfield entries for
boolean values, but 8-bit values (aligned on byte boundaries)
for other values because, as far as I know, most (all?) CPU
architectures need more time to extract some bits from
a machine word than to extract a byte.
I have no idea whether this makes a difference in performance
as I did not run any runtime benchmark.
Stefan
^ permalink raw reply [flat|nested] 17+ messages in thread
* [Qemu-trivial] [PATCH v2] tcg: optimise memory layout of TCGTemp
2015-03-30 5:43 ` Stefan Weil
@ 2015-04-03 0:07 ` Emilio G. Cota
-1 siblings, 0 replies; 17+ messages in thread
From: Emilio G. Cota @ 2015-04-03 0:07 UTC (permalink / raw)
To: Stefan Weil
Cc: qemu-trivial, Laurent Desnogues, Alex Bennée, qemu-devel,
Richard Henderson
This brings down the size of the struct from 56 to 32 bytes on 64-bit,
and to 20 bytes on 32-bit. This leads to memory savings:
Before:
$ find . -name 'tcg.o' | xargs size
   text    data     bss     dec     hex filename
  41131   29800      88   71019   1156b ./aarch64-softmmu/tcg/tcg.o
  37969   29416      96   67481   10799 ./x86_64-linux-user/tcg/tcg.o
  39354   28816      96   68266   10aaa ./arm-linux-user/tcg/tcg.o
  40802   29096      88   69986   11162 ./arm-softmmu/tcg/tcg.o
  39417   29672      88   69177   10e39 ./x86_64-softmmu/tcg/tcg.o
After:
$ find . -name 'tcg.o' | xargs size
   text    data     bss     dec     hex filename
  40883   29800      88   70771   11473 ./aarch64-softmmu/tcg/tcg.o
  37473   29416      96   66985   105a9 ./x86_64-linux-user/tcg/tcg.o
  38858   28816      96   67770   108ba ./arm-linux-user/tcg/tcg.o
  40554   29096      88   69738   1106a ./arm-softmmu/tcg/tcg.o
  39169   29672      88   68929   10d41 ./x86_64-softmmu/tcg/tcg.o
Note that using an entire byte for some enums that need less than
that wastes a few bits (noticeable in 32 bits, where we use
20 bytes instead of 16) but avoids extraction code, which overall
is a win--I've tested several variations of the patch, and the appended
is the best performer for OpenSSL's bntest by a very small margin:
Before:
$ taskset -c 0 perf stat -r 15 -- x86_64-linux-user/qemu-x86_64 img/bntest-x86_64 >/dev/null
[...]
 Performance counter stats for 'x86_64-linux-user/qemu-x86_64 img/bntest-x86_64' (15 runs):
      10538.479833 task-clock (msec)   #    0.999 CPUs utilized   ( +-  0.38% )
               772 context-switches    #    0.073 K/sec           ( +-  2.03% )
                 0 cpu-migrations      #    0.000 K/sec           ( +-100.00% )
             2,207 page-faults         #    0.209 K/sec           ( +-  0.08% )
      10.552871687 seconds time elapsed                           ( +-  0.39% )
After:
$ taskset -c 0 perf stat -r 15 -- x86_64-linux-user/qemu-x86_64 img/bntest-x86_64 >/dev/null
 Performance counter stats for 'x86_64-linux-user/qemu-x86_64 img/bntest-x86_64' (15 runs):
      10459.968847 task-clock (msec)   #    0.999 CPUs utilized   ( +-  0.30% )
               739 context-switches    #    0.071 K/sec           ( +-  1.71% )
                 0 cpu-migrations      #    0.000 K/sec           ( +- 68.14% )
             2,204 page-faults         #    0.211 K/sec           ( +-  0.10% )
      10.473900411 seconds time elapsed                           ( +-  0.30% )
Suggested-by: Stefan Weil <sw@weilnetz.de>
Suggested-by: Richard Henderson <rth@twiddle.net>
Signed-off-by: Emilio G. Cota <cota@braap.org>
---
tcg/tcg.h | 26 ++++++++++++++------------
1 file changed, 14 insertions(+), 12 deletions(-)
diff --git a/tcg/tcg.h b/tcg/tcg.h
index add7f75..7f95132 100644
--- a/tcg/tcg.h
+++ b/tcg/tcg.h
@@ -417,20 +417,19 @@ static inline TCGCond tcg_high_cond(TCGCond c)
}
}
-#define TEMP_VAL_DEAD 0
-#define TEMP_VAL_REG 1
-#define TEMP_VAL_MEM 2
-#define TEMP_VAL_CONST 3
+typedef enum TCGTempVal {
+ TEMP_VAL_DEAD,
+ TEMP_VAL_REG,
+ TEMP_VAL_MEM,
+ TEMP_VAL_CONST,
+} TCGTempVal;
-/* XXX: optimize memory layout */
typedef struct TCGTemp {
- TCGType base_type;
- TCGType type;
- int val_type;
- int reg;
- tcg_target_long val;
- int mem_reg;
- intptr_t mem_offset;
+ unsigned int reg:8;
+ unsigned int mem_reg:8;
+ TCGTempVal val_type:8;
+ TCGType base_type:8;
+ TCGType type:8;
unsigned int fixed_reg:1;
unsigned int mem_coherent:1;
unsigned int mem_allocated:1;
@@ -438,6 +437,9 @@ typedef struct TCGTemp {
basic blocks. Otherwise, it is not
preserved across basic blocks. */
unsigned int temp_allocated:1; /* never used for code gen */
+
+ tcg_target_long val;
+ intptr_t mem_offset;
const char *name;
} TCGTemp;
--
1.9.1
^ permalink raw reply related [flat|nested] 17+ messages in thread
* Re: [Qemu-trivial] [PATCH v2] tcg: optimise memory layout of TCGTemp
2015-04-03 0:07 ` [Qemu-devel] " Emilio G. Cota
@ 2015-04-03 8:13 ` Stefan Weil
-1 siblings, 0 replies; 17+ messages in thread
From: Stefan Weil @ 2015-04-03 8:13 UTC (permalink / raw)
To: Emilio G. Cota
Cc: qemu-trivial, Laurent Desnogues, Alex Bennée, qemu-devel,
Richard Henderson
Am 03.04.2015 um 02:07 schrieb Emilio G. Cota:
> This brings down the size of the struct from 56 to 32 bytes on 64-bit,
> and to 20 bytes on 32-bit. This leads to memory savings:
>
> Before:
> $ find . -name 'tcg.o' | xargs size
> text data bss dec hex filename
> 41131 29800 88 71019 1156b ./aarch64-softmmu/tcg/tcg.o
> 37969 29416 96 67481 10799 ./x86_64-linux-user/tcg/tcg.o
> 39354 28816 96 68266 10aaa ./arm-linux-user/tcg/tcg.o
> 40802 29096 88 69986 11162 ./arm-softmmu/tcg/tcg.o
> 39417 29672 88 69177 10e39 ./x86_64-softmmu/tcg/tcg.o
>
> After:
> $ find . -name 'tcg.o' | xargs size
> text data bss dec hex filename
> 40883 29800 88 70771 11473 ./aarch64-softmmu/tcg/tcg.o
> 37473 29416 96 66985 105a9 ./x86_64-linux-user/tcg/tcg.o
> 38858 28816 96 67770 108ba ./arm-linux-user/tcg/tcg.o
> 40554 29096 88 69738 1106a ./arm-softmmu/tcg/tcg.o
> 39169 29672 88 68929 10d41 ./x86_64-softmmu/tcg/tcg.o
>
> Note that using an entire byte for some enums that need less than
> that wastes a few bits (noticeable in 32 bits, where we use
> 20 bytes instead of 16) but avoids extraction code, which overall
> is a win--I've tested several variations of the patch, and the appended
> is the best performer for OpenSSL's bntest by a very small margin:
>
> Before:
> $ taskset -c 0 perf stat -r 15 -- x86_64-linux-user/qemu-x86_64 img/bntest-x86_64 >/dev/null
> [...]
> Performance counter stats for 'x86_64-linux-user/qemu-x86_64 img/bntest-x86_64' (15 runs):
>
> 10538.479833 task-clock (msec) # 0.999 CPUs utilized ( +- 0.38% )
> 772 context-switches # 0.073 K/sec ( +- 2.03% )
> 0 cpu-migrations # 0.000 K/sec ( +-100.00% )
> 2,207 page-faults # 0.209 K/sec ( +- 0.08% )
> 10.552871687 seconds time elapsed ( +- 0.39% )
>
> After:
> $ taskset -c 0 perf stat -r 15 -- x86_64-linux-user/qemu-x86_64 img/bntest-x86_64 >/dev/null
> Performance counter stats for 'x86_64-linux-user/qemu-x86_64 img/bntest-x86_64' (15 runs):
>
> 10459.968847 task-clock (msec) # 0.999 CPUs utilized ( +- 0.30% )
> 739 context-switches # 0.071 K/sec ( +- 1.71% )
> 0 cpu-migrations # 0.000 K/sec ( +- 68.14% )
> 2,204 page-faults # 0.211 K/sec ( +- 0.10% )
> 10.473900411 seconds time elapsed ( +- 0.30% )
>
> Suggested-by: Stefan Weil <sw@weilnetz.de>
> Suggested-by: Richard Henderson <rth@twiddle.net>
> Signed-off-by: Emilio G. Cota <cota@braap.org>
> ---
> tcg/tcg.h | 26 ++++++++++++++------------
> 1 file changed, 14 insertions(+), 12 deletions(-)
>
> diff --git a/tcg/tcg.h b/tcg/tcg.h
> index add7f75..7f95132 100644
> --- a/tcg/tcg.h
> +++ b/tcg/tcg.h
> @@ -417,20 +417,19 @@ static inline TCGCond tcg_high_cond(TCGCond c)
> }
> }
>
> -#define TEMP_VAL_DEAD 0
> -#define TEMP_VAL_REG 1
> -#define TEMP_VAL_MEM 2
> -#define TEMP_VAL_CONST 3
> +typedef enum TCGTempVal {
> + TEMP_VAL_DEAD,
> + TEMP_VAL_REG,
> + TEMP_VAL_MEM,
> + TEMP_VAL_CONST,
> +} TCGTempVal;
>
> -/* XXX: optimize memory layout */
> typedef struct TCGTemp {
> - TCGType base_type;
> - TCGType type;
> - int val_type;
> - int reg;
> - tcg_target_long val;
> - int mem_reg;
> - intptr_t mem_offset;
> + unsigned int reg:8;
> + unsigned int mem_reg:8;
> + TCGTempVal val_type:8;
> + TCGType base_type:8;
> + TCGType type:8;
> unsigned int fixed_reg:1;
> unsigned int mem_coherent:1;
> unsigned int mem_allocated:1;
> @@ -438,6 +437,9 @@ typedef struct TCGTemp {
> basic blocks. Otherwise, it is not
> preserved across basic blocks. */
> unsigned int temp_allocated:1; /* never used for code gen */
> +
> + tcg_target_long val;
> + intptr_t mem_offset;
> const char *name;
> } TCGTemp;
Thanks for doing those tests. There are some minor cosmetic changes
which could still be made (uint8_t instead of 8-bit unsigned int
bitfields, bool for the 1-bit boolean values), but I think your patch
is a real gain.
Reviewed-by: Stefan Weil <sw@weilnetz.de>
^ permalink raw reply [flat|nested] 17+ messages in thread
* Re: [Qemu-trivial] [PATCH v2] tcg: optimise memory layout of TCGTemp
2015-04-03 0:07 ` [Qemu-devel] " Emilio G. Cota
@ 2015-04-03 14:17 ` Richard Henderson
-1 siblings, 0 replies; 17+ messages in thread
From: Richard Henderson @ 2015-04-03 14:17 UTC (permalink / raw)
To: Emilio G. Cota, Stefan Weil
Cc: qemu-trivial, Laurent Desnogues, Alex Bennée, qemu-devel
On 04/02/2015 05:07 PM, Emilio G. Cota wrote:
> After:
> $ taskset -c 0 perf stat -r 15 -- x86_64-linux-user/qemu-x86_64 img/bntest-x86_64 >/dev/null
> Performance counter stats for 'x86_64-linux-user/qemu-x86_64 img/bntest-x86_64' (15 runs):
>
> 10459.968847 task-clock (msec) # 0.999 CPUs utilized ( +- 0.30% )
> 739 context-switches # 0.071 K/sec ( +- 1.71% )
> 0 cpu-migrations # 0.000 K/sec ( +- 68.14% )
> 2,204 page-faults # 0.211 K/sec ( +- 0.10% )
> 10.473900411 seconds time elapsed ( +- 0.30% )
>
> Suggested-by: Stefan Weil <sw@weilnetz.de>
> Suggested-by: Richard Henderson <rth@twiddle.net>
> Signed-off-by: Emilio G. Cota <cota@braap.org>
> ---
> tcg/tcg.h | 26 ++++++++++++++------------
> 1 file changed, 14 insertions(+), 12 deletions(-)
Reviewed-by: Richard Henderson <rth@twiddle.net>
I'll put this in a queue for 2.4.
r~
^ permalink raw reply [flat|nested] 17+ messages in thread
* Re: [Qemu-trivial] [PATCH v2] tcg: optimise memory layout of TCGTemp
2015-04-03 0:07 ` [Qemu-devel] " Emilio G. Cota
@ 2015-04-07 14:59 ` Alex Bennée
-1 siblings, 0 replies; 17+ messages in thread
From: Alex Bennée @ 2015-04-07 14:59 UTC (permalink / raw)
To: Emilio G. Cota
Cc: qemu-trivial, Stefan Weil, Laurent Desnogues, qemu-devel,
Richard Henderson
Emilio G. Cota <cota@braap.org> writes:
> This brings down the size of the struct from 56 to 32 bytes on 64-bit,
> and to 20 bytes on 32-bit. This leads to memory savings:
>
> Before:
> $ find . -name 'tcg.o' | xargs size
> text data bss dec hex filename
> 41131 29800 88 71019 1156b ./aarch64-softmmu/tcg/tcg.o
> 37969 29416 96 67481 10799 ./x86_64-linux-user/tcg/tcg.o
> 39354 28816 96 68266 10aaa ./arm-linux-user/tcg/tcg.o
> 40802 29096 88 69986 11162 ./arm-softmmu/tcg/tcg.o
> 39417 29672 88 69177 10e39 ./x86_64-softmmu/tcg/tcg.o
>
> After:
> $ find . -name 'tcg.o' | xargs size
> text data bss dec hex filename
> 40883 29800 88 70771 11473 ./aarch64-softmmu/tcg/tcg.o
> 37473 29416 96 66985 105a9 ./x86_64-linux-user/tcg/tcg.o
> 38858 28816 96 67770 108ba ./arm-linux-user/tcg/tcg.o
> 40554 29096 88 69738 1106a ./arm-softmmu/tcg/tcg.o
> 39169 29672 88 68929 10d41 ./x86_64-softmmu/tcg/tcg.o
>
> Note that using an entire byte for some enums that need less than
> that wastes a few bits (noticeable in 32 bits, where we use
> 20 bytes instead of 16) but avoids extraction code, which overall
> is a win--I've tested several variations of the patch, and the appended
> is the best performer for OpenSSL's bntest by a very small margin:
>
> Before:
> $ taskset -c 0 perf stat -r 15 -- x86_64-linux-user/qemu-x86_64 img/bntest-x86_64 >/dev/null
> [...]
> Performance counter stats for 'x86_64-linux-user/qemu-x86_64 img/bntest-x86_64' (15 runs):
>
> 10538.479833 task-clock (msec) # 0.999 CPUs utilized ( +- 0.38% )
> 772 context-switches # 0.073 K/sec ( +- 2.03% )
> 0 cpu-migrations # 0.000 K/sec ( +-100.00% )
> 2,207 page-faults # 0.209 K/sec ( +- 0.08% )
> 10.552871687 seconds time elapsed ( +- 0.39% )
>
> After:
> $ taskset -c 0 perf stat -r 15 -- x86_64-linux-user/qemu-x86_64 img/bntest-x86_64 >/dev/null
> Performance counter stats for 'x86_64-linux-user/qemu-x86_64 img/bntest-x86_64' (15 runs):
>
> 10459.968847 task-clock (msec) # 0.999 CPUs utilized ( +- 0.30% )
> 739 context-switches # 0.071 K/sec ( +- 1.71% )
> 0 cpu-migrations # 0.000 K/sec ( +- 68.14% )
> 2,204 page-faults # 0.211 K/sec ( +- 0.10% )
> 10.473900411 seconds time elapsed ( +- 0.30% )
I'll take that as a win condition ;-)
Reviewed-by: Alex Bennée <alex.bennee@linaro.org>
>
> Suggested-by: Stefan Weil <sw@weilnetz.de>
> Suggested-by: Richard Henderson <rth@twiddle.net>
> Signed-off-by: Emilio G. Cota <cota@braap.org>
> ---
> tcg/tcg.h | 26 ++++++++++++++------------
> 1 file changed, 14 insertions(+), 12 deletions(-)
>
> diff --git a/tcg/tcg.h b/tcg/tcg.h
> index add7f75..7f95132 100644
> --- a/tcg/tcg.h
> +++ b/tcg/tcg.h
> @@ -417,20 +417,19 @@ static inline TCGCond tcg_high_cond(TCGCond c)
> }
> }
>
> -#define TEMP_VAL_DEAD 0
> -#define TEMP_VAL_REG 1
> -#define TEMP_VAL_MEM 2
> -#define TEMP_VAL_CONST 3
> +typedef enum TCGTempVal {
> + TEMP_VAL_DEAD,
> + TEMP_VAL_REG,
> + TEMP_VAL_MEM,
> + TEMP_VAL_CONST,
> +} TCGTempVal;
>
> -/* XXX: optimize memory layout */
> typedef struct TCGTemp {
> - TCGType base_type;
> - TCGType type;
> - int val_type;
> - int reg;
> - tcg_target_long val;
> - int mem_reg;
> - intptr_t mem_offset;
> + unsigned int reg:8;
> + unsigned int mem_reg:8;
> + TCGTempVal val_type:8;
> + TCGType base_type:8;
> + TCGType type:8;
> unsigned int fixed_reg:1;
> unsigned int mem_coherent:1;
> unsigned int mem_allocated:1;
> @@ -438,6 +437,9 @@ typedef struct TCGTemp {
> basic blocks. Otherwise, it is not
> preserved across basic blocks. */
> unsigned int temp_allocated:1; /* never used for code gen */
> +
> + tcg_target_long val;
> + intptr_t mem_offset;
> const char *name;
> } TCGTemp;
--
Alex Bennée
^ permalink raw reply [flat|nested] 17+ messages in thread* Re: [Qemu-devel] [PATCH v2] tcg: optimise memory layout of TCGTemp
@ 2015-04-07 14:59 ` Alex Bennée
0 siblings, 0 replies; 17+ messages in thread
From: Alex Bennée @ 2015-04-07 14:59 UTC (permalink / raw)
To: Emilio G. Cota
Cc: qemu-trivial, Stefan Weil, Laurent Desnogues, qemu-devel,
Richard Henderson
Emilio G. Cota <cota@braap.org> writes:
> This brings down the size of the struct from 56 to 32 bytes on 64-bit,
> and to 20 bytes on 32-bit. This leads to memory savings:
>
> Before:
> $ find . -name 'tcg.o' | xargs size
> text data bss dec hex filename
> 41131 29800 88 71019 1156b ./aarch64-softmmu/tcg/tcg.o
> 37969 29416 96 67481 10799 ./x86_64-linux-user/tcg/tcg.o
> 39354 28816 96 68266 10aaa ./arm-linux-user/tcg/tcg.o
> 40802 29096 88 69986 11162 ./arm-softmmu/tcg/tcg.o
> 39417 29672 88 69177 10e39 ./x86_64-softmmu/tcg/tcg.o
>
> After:
> $ find . -name 'tcg.o' | xargs size
> text data bss dec hex filename
> 40883 29800 88 70771 11473 ./aarch64-softmmu/tcg/tcg.o
> 37473 29416 96 66985 105a9 ./x86_64-linux-user/tcg/tcg.o
> 38858 28816 96 67770 108ba ./arm-linux-user/tcg/tcg.o
> 40554 29096 88 69738 1106a ./arm-softmmu/tcg/tcg.o
> 39169 29672 88 68929 10d41 ./x86_64-softmmu/tcg/tcg.o
>
> Note that using an entire byte for some enums that need less than
> that wastes a few bits (noticeable in 32 bits, where we use
> 20 bytes instead of 16) but avoids extraction code, which overall
> is a win--I've tested several variations of the patch, and the appended
> is the best performer for OpenSSL's bntest by a very small margin:
>
> Before:
> $ taskset -c 0 perf stat -r 15 -- x86_64-linux-user/qemu-x86_64 img/bntest-x86_64 >/dev/null
> [...]
> Performance counter stats for 'x86_64-linux-user/qemu-x86_64 img/bntest-x86_64' (15 runs):
>
> 10538.479833 task-clock (msec) # 0.999 CPUs utilized ( +- 0.38% )
> 772 context-switches # 0.073 K/sec ( +- 2.03% )
> 0 cpu-migrations # 0.000 K/sec ( +-100.00% )
> 2,207 page-faults # 0.209 K/sec ( +- 0.08% )
> 10.552871687 seconds time elapsed ( +- 0.39% )
>
> After:
> $ taskset -c 0 perf stat -r 15 -- x86_64-linux-user/qemu-x86_64 img/bntest-x86_64 >/dev/null
> Performance counter stats for 'x86_64-linux-user/qemu-x86_64 img/bntest-x86_64' (15 runs):
>
> 10459.968847 task-clock (msec) # 0.999 CPUs utilized ( +- 0.30% )
> 739 context-switches # 0.071 K/sec ( +- 1.71% )
> 0 cpu-migrations # 0.000 K/sec ( +- 68.14% )
> 2,204 page-faults # 0.211 K/sec ( +- 0.10% )
> 10.473900411 seconds time elapsed ( +- 0.30% )
I'll take that as a win condition ;-)
Reviewed-by: Alex Bennée <alex.bennee@linaro.org>
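To put the "very small margin" in numbers, here is a minimal sketch that computes the relative reduction between the two elapsed times reported above. The function name is hypothetical and only serves this illustration; the inputs are the figures from the two perf stat runs.

```c
#include <assert.h>

/* Relative reduction of `after` with respect to `before`, in percent.
 * Hypothetical helper, not part of QEMU or perf. */
double relative_saving_percent(double before, double after)
{
    assert(before > 0.0);   /* a zero baseline would make the ratio meaningless */
    return (before - after) / before * 100.0;
}
```

Calling relative_saving_percent(10.552871687, 10.473900411) gives roughly 0.75%, which is of the same order as the reported run-to-run variation (+- 0.39% and +- 0.30%), so the gain is best read as small.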
>
> Suggested-by: Stefan Weil <sw@weilnetz.de>
> Suggested-by: Richard Henderson <rth@twiddle.net>
> Signed-off-by: Emilio G. Cota <cota@braap.org>
> ---
> tcg/tcg.h | 26 ++++++++++++++------------
> 1 file changed, 14 insertions(+), 12 deletions(-)
>
> diff --git a/tcg/tcg.h b/tcg/tcg.h
> index add7f75..7f95132 100644
> --- a/tcg/tcg.h
> +++ b/tcg/tcg.h
> @@ -417,20 +417,19 @@ static inline TCGCond tcg_high_cond(TCGCond c)
> }
> }
>
> -#define TEMP_VAL_DEAD 0
> -#define TEMP_VAL_REG 1
> -#define TEMP_VAL_MEM 2
> -#define TEMP_VAL_CONST 3
> +typedef enum TCGTempVal {
> + TEMP_VAL_DEAD,
> + TEMP_VAL_REG,
> + TEMP_VAL_MEM,
> + TEMP_VAL_CONST,
> +} TCGTempVal;
>
> -/* XXX: optimize memory layout */
> typedef struct TCGTemp {
> - TCGType base_type;
> - TCGType type;
> - int val_type;
> - int reg;
> - tcg_target_long val;
> - int mem_reg;
> - intptr_t mem_offset;
> + unsigned int reg:8;
> + unsigned int mem_reg:8;
> + TCGTempVal val_type:8;
> + TCGType base_type:8;
> + TCGType type:8;
> unsigned int fixed_reg:1;
> unsigned int mem_coherent:1;
> unsigned int mem_allocated:1;
> @@ -438,6 +437,9 @@ typedef struct TCGTemp {
> basic blocks. Otherwise, it is not
> preserved across basic blocks. */
> unsigned int temp_allocated:1; /* never used for code gen */
> +
> + tcg_target_long val;
> + intptr_t mem_offset;
> const char *name;
> } TCGTemp;
--
Alex Bennée
^ permalink raw reply [flat|nested] 17+ messages in thread
* Re: [Qemu-trivial] [PATCH] tcg: optimise memory layout of TCGTemp
2015-03-25 19:50 ` [Qemu-trivial] [PATCH] tcg: optimise memory layout of TCGTemp Emilio G. Cota
@ 2015-03-27 14:58 Richard Henderson
0 siblings, 0 replies; 17+ messages in thread
From: Richard Henderson @ 2015-03-27 14:58 UTC (permalink / raw)
To: Emilio G. Cota, Stefan Weil; +Cc: qemu-trivial, qemu-devel
On 03/25/2015 12:50 PM, Emilio G. Cota wrote:
> This brings down the size of the struct from 56 to 32 bytes on 64-bit,
> and to 16 bytes on 32-bit.
>
> The appended adds macros to prevent us from mistakenly overflowing
> the bitfields when more elements are added to the corresponding
> enums/macros.
>
> Note that reg/mem_reg need only 6 bits (for ia64) but for performance
> is probably better to align them to a byte address.
>
> Given that TCGTemp is used in large arrays this leads to a few KBs
> of savings. However, unpacking the bits takes additional code, so
> the net effect depends on the target (host is x86_64):
>
> Before:
> $ find . -name 'tcg.o' | xargs size
> text data bss dec hex filename
> 41131 29800 88 71019 1156b ./aarch64-softmmu/tcg/tcg.o
> 37969 29416 96 67481 10799 ./x86_64-linux-user/tcg/tcg.o
> 39354 28816 96 68266 10aaa ./arm-linux-user/tcg/tcg.o
> 40802 29096 88 69986 11162 ./arm-softmmu/tcg/tcg.o
> 39417 29672 88 69177 10e39 ./x86_64-softmmu/tcg/tcg.o
>
> After:
> $ find . -name 'tcg.o' | xargs size
> text data bss dec hex filename
> 41187 29800 88 71075 115a3 ./aarch64-softmmu/tcg/tcg.o
> 37777 29416 96 67289 106d9 ./x86_64-linux-user/tcg/tcg.o
> 39162 28816 96 68074 109ea ./arm-linux-user/tcg/tcg.o
> 40858 29096 88 70042 1119a ./arm-softmmu/tcg/tcg.o
> 39473 29672 88 69233 10e71 ./x86_64-softmmu/tcg/tcg.o
>
> Suggested-by: Stefan Weil <sw@weilnetz.de>
> Suggested-by: Richard Henderson <rth@twiddle.net>
> Signed-off-by: Emilio G. Cota <cota@braap.org>
> ---
> tcg/tcg.h | 22 +++++++++++++---------
> 1 file changed, 13 insertions(+), 9 deletions(-)
>
> diff --git a/tcg/tcg.h b/tcg/tcg.h
> index add7f75..71ae7b2 100644
> --- a/tcg/tcg.h
> +++ b/tcg/tcg.h
> @@ -193,7 +193,7 @@ typedef struct TCGPool {
> typedef enum TCGType {
> TCG_TYPE_I32,
> TCG_TYPE_I64,
> - TCG_TYPE_COUNT, /* number of different types */
> + TCG_TYPE_COUNT, /* number of different types, see TCG_TYPE_NR_BITS */
>
> /* An alias for the size of the host register. */
> #if TCG_TARGET_REG_BITS == 32
> @@ -217,6 +217,9 @@ typedef enum TCGType {
> #endif
> } TCGType;
>
> +/* used for bitfield packing to save space */
> +#define TCG_TYPE_NR_BITS 1
I'd rather you moved TCG_TYPE_COUNT out of the enum and into a define. Perhaps
even as (1 << TCG_TYPE_NR_BITS).
> @@ -421,16 +424,14 @@ static inline TCGCond tcg_high_cond(TCGCond c)
> #define TEMP_VAL_REG 1
> #define TEMP_VAL_MEM 2
> #define TEMP_VAL_CONST 3
> +#define TEMP_VAL_NR_BITS 2
And make this an enumeration.
> typedef struct TCGTemp {
> - TCGType base_type;
> - TCGType type;
> - int val_type;
> - int reg;
> - tcg_target_long val;
> - int mem_reg;
> - intptr_t mem_offset;
> + unsigned int reg:8;
> + unsigned int mem_reg:8;
> + unsigned int val_type:TEMP_VAL_NR_BITS;
> + unsigned int base_type:TCG_TYPE_NR_BITS;
> + unsigned int type:TCG_TYPE_NR_BITS;
And do *not* change these from the enumeration to an unsigned int.
I know why you did this -- to keep the compiler from warning that the TCGType
enum didn't fit in the bitfield, because of TCG_TYPE_COUNT being an enumerator,
rather than an unrelated number. Except that's exactly the warning we want to
keep, on the off-chance that someone modifies the enums without modifying the
_NR_BITS defines.
r~
^ permalink raw reply [flat|nested] 17+ messages in thread
* Re: [Qemu-trivial] [Qemu-devel] [PATCH] tcg: pack TCGTemp to reduce size by 8 bytes
@ 2015-03-24 1:07 Richard Henderson
2015-03-25 19:50 ` [Qemu-trivial] [PATCH] tcg: optimise memory layout of TCGTemp Emilio G. Cota
0 siblings, 1 reply; 17+ messages in thread
From: Richard Henderson @ 2015-03-24 1:07 UTC (permalink / raw)
To: Stefan Weil, Emilio G. Cota, qemu-devel; +Cc: qemu-trivial
On 03/23/2015 02:42 PM, Stefan Weil wrote:
> Further optimizations are possible. TCGTemp can be reduced to 32 bytes as the
> output of pahole shows:
>
> struct TCGTemp {
> TCGTempVal val_type:8; /* 0:24 4 */
Need only be 2 bits.
> unsigned int reg:8; /* 0:16 4 */
> unsigned int mem_reg:8; /* 0: 8 4 */
Need only be 6 (ia64) bits, but an aligned 8-bit slot probably performs best.
>
> /* Bitfield combined with next fields */
>
> _Bool fixed_reg:1; /* 3: 7 1 */
> _Bool mem_coherent:1; /* 3: 6 1 */
> _Bool mem_allocated:1; /* 3: 5 1 */
> _Bool temp_local:1; /* 3: 4 1 */
> _Bool temp_allocated:1; /* 3: 3 1 */
>
> /* XXX 3 bits hole, try to pack */
>
> TCGType base_type:16; /* 4:16 4 */
> TCGType type:16; /* 4: 0 4 */
Need only be 1 bit, honestly, but 2 bits might be easier to arrange. Anyway,
you're down to 23 bits from the word, or 16 bytes on a 32-bit host. It's no
better than the 32 bytes you got for a 64-bit host though.
> tcg_target_long val; /* 8 8 */
> intptr_t mem_offset; /* 16 8 */
> const char * name; /* 24 8 */
>
> /* size: 32, cachelines: 1, members: 13 */
> /* bit holes: 1, sum bit holes: 3 bits */
> /* last cacheline: 32 bytes */
> };
>
> Here I used a new enum type for val_type and reduced some values to 8 or 16 bit.
> I also put the two most often used values at the beginning, so they can be
> addressed without or with a small offset ("often" in the code, no runtime
> data available).
>
> Are such optimizations useful?
Yes, I think so. Especially because of the rather large arrays we build.
r~
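The layout discussed above can be checked with a small standalone sketch. The type names below are stand-ins chosen for this illustration (they are not the QEMU definitions), intptr_t is assumed as a substitute for tcg_target_long, and the exact sizes depend on the compiler's bitfield rules, so only the relative comparison is asserted.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Illustrative stand-ins for TCGType and the proposed TCGTempVal. */
typedef enum { TYPE_I32, TYPE_I64 } Type;
typedef enum { VAL_DEAD, VAL_REG, VAL_MEM, VAL_CONST } TempVal;

/* Original ordering: int-sized members interleaved with pointer-sized
 * ones, which forces padding on a 64-bit host. */
typedef struct TempOriginal {
    Type base_type;
    Type type;
    int val_type;
    int reg;
    intptr_t val;           /* assumption: stands in for tcg_target_long */
    int mem_reg;
    intptr_t mem_offset;
    unsigned int fixed_reg:1;
    unsigned int mem_coherent:1;
    unsigned int mem_allocated:1;
    unsigned int temp_local:1;
    unsigned int temp_allocated:1;
    const char *name;
} TempOriginal;

/* Packed ordering following the pahole listing: narrow fields first,
 * pointer-sized members last. Enum-typed bitfields are accepted by
 * GCC and Clang; other compilers may differ. */
typedef struct TempPacked {
    TempVal val_type:8;
    unsigned int reg:8;
    unsigned int mem_reg:8;
    _Bool fixed_reg:1;
    _Bool mem_coherent:1;
    _Bool mem_allocated:1;
    _Bool temp_local:1;
    _Bool temp_allocated:1;
    Type base_type:16;
    Type type:16;
    intptr_t val;
    intptr_t mem_offset;
    const char *name;
} TempPacked;

/* Fails the build if the reordering stops paying off. */
_Static_assert(sizeof(TempPacked) < sizeof(TempOriginal),
               "packed layout should be smaller than the original");

/* Bytes saved per array element on the host this is compiled for. */
size_t bytes_saved_per_temp(void)
{
    return sizeof(TempOriginal) - sizeof(TempPacked);
}
```

On a typical x86_64 GCC build one would expect the 56 and 32 byte figures quoted in the thread, but that is an expectation about one ABI rather than something the sketch guarantees.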
^ permalink raw reply [flat|nested] 17+ messages in thread
* [Qemu-trivial] [PATCH] tcg: optimise memory layout of TCGTemp
2015-03-24 1:07 [Qemu-trivial] [Qemu-devel] [PATCH] tcg: pack TCGTemp to reduce size by 8 bytes Richard Henderson
@ 2015-03-25 19:50 ` Emilio G. Cota
2015-03-27 9:55 ` [Qemu-trivial] [Qemu-devel] " Alex Bennée
0 siblings, 1 reply; 17+ messages in thread
From: Emilio G. Cota @ 2015-03-25 19:50 UTC (permalink / raw)
To: Stefan Weil, Richard Henderson; +Cc: qemu-trivial, qemu-devel
This brings down the size of the struct from 56 to 32 bytes on 64-bit,
and to 16 bytes on 32-bit.
The appended patch adds macros to prevent us from mistakenly overflowing
the bitfields when more elements are added to the corresponding
enums/macros.
Note that reg/mem_reg need only 6 bits (for ia64), but for performance
it is probably better to align them to a byte address.
Given that TCGTemp is used in large arrays this leads to a few KBs
of savings. However, unpacking the bits takes additional code, so
the net effect depends on the target (host is x86_64):
Before:
$ find . -name 'tcg.o' | xargs size
text data bss dec hex filename
41131 29800 88 71019 1156b ./aarch64-softmmu/tcg/tcg.o
37969 29416 96 67481 10799 ./x86_64-linux-user/tcg/tcg.o
39354 28816 96 68266 10aaa ./arm-linux-user/tcg/tcg.o
40802 29096 88 69986 11162 ./arm-softmmu/tcg/tcg.o
39417 29672 88 69177 10e39 ./x86_64-softmmu/tcg/tcg.o
After:
$ find . -name 'tcg.o' | xargs size
text data bss dec hex filename
41187 29800 88 71075 115a3 ./aarch64-softmmu/tcg/tcg.o
37777 29416 96 67289 106d9 ./x86_64-linux-user/tcg/tcg.o
39162 28816 96 68074 109ea ./arm-linux-user/tcg/tcg.o
40858 29096 88 70042 1119a ./arm-softmmu/tcg/tcg.o
39473 29672 88 69233 10e71 ./x86_64-softmmu/tcg/tcg.o
Suggested-by: Stefan Weil <sw@weilnetz.de>
Suggested-by: Richard Henderson <rth@twiddle.net>
Signed-off-by: Emilio G. Cota <cota@braap.org>
---
tcg/tcg.h | 22 +++++++++++++---------
1 file changed, 13 insertions(+), 9 deletions(-)
diff --git a/tcg/tcg.h b/tcg/tcg.h
index add7f75..71ae7b2 100644
--- a/tcg/tcg.h
+++ b/tcg/tcg.h
@@ -193,7 +193,7 @@ typedef struct TCGPool {
typedef enum TCGType {
TCG_TYPE_I32,
TCG_TYPE_I64,
- TCG_TYPE_COUNT, /* number of different types */
+ TCG_TYPE_COUNT, /* number of different types, see TCG_TYPE_NR_BITS */
/* An alias for the size of the host register. */
#if TCG_TARGET_REG_BITS == 32
@@ -217,6 +217,9 @@ typedef enum TCGType {
#endif
} TCGType;
+/* used for bitfield packing to save space */
+#define TCG_TYPE_NR_BITS 1
+
/* Constants for qemu_ld and qemu_st for the Memory Operation field. */
typedef enum TCGMemOp {
MO_8 = 0,
@@ -421,16 +424,14 @@ static inline TCGCond tcg_high_cond(TCGCond c)
#define TEMP_VAL_REG 1
#define TEMP_VAL_MEM 2
#define TEMP_VAL_CONST 3
+#define TEMP_VAL_NR_BITS 2
-/* XXX: optimize memory layout */
typedef struct TCGTemp {
- TCGType base_type;
- TCGType type;
- int val_type;
- int reg;
- tcg_target_long val;
- int mem_reg;
- intptr_t mem_offset;
+ unsigned int reg:8;
+ unsigned int mem_reg:8;
+ unsigned int val_type:TEMP_VAL_NR_BITS;
+ unsigned int base_type:TCG_TYPE_NR_BITS;
+ unsigned int type:TCG_TYPE_NR_BITS;
unsigned int fixed_reg:1;
unsigned int mem_coherent:1;
unsigned int mem_allocated:1;
@@ -438,6 +439,9 @@ typedef struct TCGTemp {
basic blocks. Otherwise, it is not
preserved across basic blocks. */
unsigned int temp_allocated:1; /* never used for code gen */
+
+ tcg_target_long val;
+ intptr_t mem_offset;
const char *name;
} TCGTemp;
--
1.9.1
^ permalink raw reply related [flat|nested] 17+ messages in thread
* Re: [Qemu-trivial] [Qemu-devel] [PATCH] tcg: optimise memory layout of TCGTemp
2015-03-25 19:50 ` [Qemu-trivial] [PATCH] tcg: optimise memory layout of TCGTemp Emilio G. Cota
@ 2015-03-27 9:55 ` Alex Bennée
2015-03-27 21:09 ` Emilio G. Cota
0 siblings, 1 reply; 17+ messages in thread
From: Alex Bennée @ 2015-03-27 9:55 UTC (permalink / raw)
To: Emilio G. Cota; +Cc: qemu-trivial, Stefan Weil, qemu-devel, Richard Henderson
Emilio G. Cota <cota@braap.org> writes:
> This brings down the size of the struct from 56 to 32 bytes on 64-bit,
> and to 16 bytes on 32-bit.
Have you been able to measure any performance improvement with these new
structures? In theory, if aligned with cache lines, performance should
improve but real numbers would be nice.
>
> The appended adds macros to prevent us from mistakenly overflowing
> the bitfields when more elements are added to the corresponding
> enums/macros.
I can see the defines but I can't see any checks. Should we be able to
do a compile time check if TCG_TYPE_COUNT doesn't fit into
TCG_TYPE_NR_BITS?
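One way such a compile-time check could look is sketched below with C11 _Static_assert. All names are stand-ins for this illustration, and the *_LAST enumerators are hypothetical markers that do not exist in tcg.h; the patch under review does not contain these checks.

```c
#include <assert.h>

#define TYPE_NR_BITS 1                      /* width of the bitfield */
#define TYPE_COUNT   (1 << TYPE_NR_BITS)    /* values the field can hold */

typedef enum Type {
    TYPE_I32,
    TYPE_I64,
    TYPE_LAST = TYPE_I64,   /* hypothetical marker for the largest value */
} Type;

/* Breaks the build if an enumerator is added without widening the field. */
_Static_assert(TYPE_LAST < TYPE_COUNT,
               "Type does not fit in TYPE_NR_BITS bits");

#define VAL_NR_BITS 2

typedef enum TempVal {
    VAL_DEAD,
    VAL_REG,
    VAL_MEM,
    VAL_CONST,
    VAL_LAST = VAL_CONST,   /* hypothetical marker for the largest value */
} TempVal;

_Static_assert(VAL_LAST < (1 << VAL_NR_BITS),
               "TempVal does not fit in VAL_NR_BITS bits");

/* Runtime mirror of the same condition, convenient in unit tests.
 * Assumes nr_bits is smaller than the width of unsigned int. */
int fits_in_bits(unsigned int value, unsigned int nr_bits)
{
    return value < (1u << nr_bits);
}
```

The design choice here is to tie the check to the largest enumerator, so that adding a value to either enum without bumping the matching _NR_BITS define is caught at build time rather than by a silently truncated field.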
>
> Note that reg/mem_reg need only 6 bits (for ia64) but for performance
> is probably better to align them to a byte address.
>
> Given that TCGTemp is used in large arrays this leads to a few KBs
> of savings. However, unpacking the bits takes additional code, so
> the net effect depends on the target (host is x86_64):
>
> Before:
> $ find . -name 'tcg.o' | xargs size
> text data bss dec hex filename
> 41131 29800 88 71019 1156b ./aarch64-softmmu/tcg/tcg.o
> 37969 29416 96 67481 10799 ./x86_64-linux-user/tcg/tcg.o
> 39354 28816 96 68266 10aaa ./arm-linux-user/tcg/tcg.o
> 40802 29096 88 69986 11162 ./arm-softmmu/tcg/tcg.o
> 39417 29672 88 69177 10e39 ./x86_64-softmmu/tcg/tcg.o
>
> After:
> $ find . -name 'tcg.o' | xargs size
> text data bss dec hex filename
> 41187 29800 88 71075 115a3 ./aarch64-softmmu/tcg/tcg.o
> 37777 29416 96 67289 106d9 ./x86_64-linux-user/tcg/tcg.o
> 39162 28816 96 68074 109ea ./arm-linux-user/tcg/tcg.o
> 40858 29096 88 70042 1119a ./arm-softmmu/tcg/tcg.o
> 39473 29672 88 69233 10e71 ./x86_64-softmmu/tcg/tcg.o
>
> Suggested-by: Stefan Weil <sw@weilnetz.de>
> Suggested-by: Richard Henderson <rth@twiddle.net>
> Signed-off-by: Emilio G. Cota <cota@braap.org>
> ---
> tcg/tcg.h | 22 +++++++++++++---------
> 1 file changed, 13 insertions(+), 9 deletions(-)
>
> diff --git a/tcg/tcg.h b/tcg/tcg.h
> index add7f75..71ae7b2 100644
> --- a/tcg/tcg.h
> +++ b/tcg/tcg.h
> @@ -193,7 +193,7 @@ typedef struct TCGPool {
> typedef enum TCGType {
> TCG_TYPE_I32,
> TCG_TYPE_I64,
> - TCG_TYPE_COUNT, /* number of different types */
> + TCG_TYPE_COUNT, /* number of different types, see TCG_TYPE_NR_BITS */
>
> /* An alias for the size of the host register. */
> #if TCG_TARGET_REG_BITS == 32
> @@ -217,6 +217,9 @@ typedef enum TCGType {
> #endif
> } TCGType;
>
> +/* used for bitfield packing to save space */
> +#define TCG_TYPE_NR_BITS 1
> +
> /* Constants for qemu_ld and qemu_st for the Memory Operation field. */
> typedef enum TCGMemOp {
> MO_8 = 0,
> @@ -421,16 +424,14 @@ static inline TCGCond tcg_high_cond(TCGCond c)
> #define TEMP_VAL_REG 1
> #define TEMP_VAL_MEM 2
> #define TEMP_VAL_CONST 3
> +#define TEMP_VAL_NR_BITS 2
A similar compile time check could be added here.
>
> -/* XXX: optimize memory layout */
> typedef struct TCGTemp {
> - TCGType base_type;
> - TCGType type;
> - int val_type;
> - int reg;
> - tcg_target_long val;
> - int mem_reg;
> - intptr_t mem_offset;
> + unsigned int reg:8;
> + unsigned int mem_reg:8;
> + unsigned int val_type:TEMP_VAL_NR_BITS;
> + unsigned int base_type:TCG_TYPE_NR_BITS;
> + unsigned int type:TCG_TYPE_NR_BITS;
> unsigned int fixed_reg:1;
> unsigned int mem_coherent:1;
> unsigned int mem_allocated:1;
> @@ -438,6 +439,9 @@ typedef struct TCGTemp {
> basic blocks. Otherwise, it is not
> preserved across basic blocks. */
> unsigned int temp_allocated:1; /* never used for code gen */
> +
> + tcg_target_long val;
> + intptr_t mem_offset;
> const char *name;
> } TCGTemp;
--
Alex Bennée
^ permalink raw reply [flat|nested] 17+ messages in thread
* Re: [Qemu-trivial] [Qemu-devel] [PATCH] tcg: optimise memory layout of TCGTemp
2015-03-27 9:55 ` [Qemu-trivial] [Qemu-devel] " Alex Bennée
@ 2015-03-27 21:09 ` Emilio G. Cota
2015-03-30 9:55 ` Laurent Desnogues
0 siblings, 1 reply; 17+ messages in thread
From: Emilio G. Cota @ 2015-03-27 21:09 UTC (permalink / raw)
To: Alex Bennée, Richard Henderson; +Cc: qemu-trivial, Stefan Weil, qemu-devel
On Fri, Mar 27, 2015 at 09:55:03 +0000, Alex Bennée wrote:
> Have you been able to measure any performance improvement with these new
> structures? In theory, if aligned with cache lines, performance should
> improve but real numbers would be nice.
I haven't benchmarked anything, which makes me very uneasy. All
I've checked is that the system boots, and FWIW I see no
difference in boot time.
Is there a benchmark suite to test TCG changes?
Until proper benchmarking I wouldn't want to see this merged. For now I
propose to merge the initial change (remove 8-byte hole in 64-bit),
which is uncontroversial.
> > The appended adds macros to prevent us from mistakenly overflowing
> > the bitfields when more elements are added to the corresponding
> > enums/macros.
>
> I can see the defines but I can't see any checks. Should we be able to
> do a compile time check if TCG_TYPE_COUNT doesn't fit into
> TCG_TYPE_NR_BITS?
> > +#define TEMP_VAL_NR_BITS 2
>
> A similar compile time check could be added here.
Ack, addressed below.
On Fri, Mar 27, 2015 at 07:58:06 -0700, Richard Henderson wrote:
> On 03/25/2015 12:50 PM, Emilio G. Cota wrote:
> > +#define TCG_TYPE_NR_BITS 1
>
> I'd rather you moved TCG_TYPE_COUNT out of the enum and into a define. Perhaps
> even as (1 << TCG_TYPE_NR_BITS).
(snip)
> > +#define TEMP_VAL_NR_BITS 2
>
> And make this an enumeration.
>
> > typedef struct TCGTemp {
(snip)
> > + unsigned int base_type:TCG_TYPE_NR_BITS;
> > + unsigned int type:TCG_TYPE_NR_BITS;
>
> And do *not* change these from the enumeration to an unsigned int.
>
> I know why you did this -- to keep the compiler from warning that the TCGType
> enum didn't fit in the bitfield, because of TCG_TYPE_COUNT being an enumerator,
> rather than an unrelated number. Except that's exactly the warning we want to
> keep, on the off-chance that someone modifies the enums without modifying the
> _NR_BITS defines.
Agreed, please see below.
Thanks,
E.
[No signoff due to lack of provable perf improvement, see above.]
diff --git a/tcg/tcg.h b/tcg/tcg.h
index add7f75..afd3f94 100644
--- a/tcg/tcg.h
+++ b/tcg/tcg.h
@@ -193,7 +193,6 @@ typedef struct TCGPool {
typedef enum TCGType {
TCG_TYPE_I32,
TCG_TYPE_I64,
- TCG_TYPE_COUNT, /* number of different types */
/* An alias for the size of the host register. */
#if TCG_TARGET_REG_BITS == 32
@@ -217,6 +216,10 @@ typedef enum TCGType {
#endif
} TCGType;
+/* used for bitfield packing to save space */
+#define TCG_TYPE_NR_BITS 1
+#define TCG_TYPE_COUNT BIT(TCG_TYPE_NR_BITS)
+
/* Constants for qemu_ld and qemu_st for the Memory Operation field. */
typedef enum TCGMemOp {
MO_8 = 0,
@@ -417,20 +420,21 @@ static inline TCGCond tcg_high_cond(TCGCond c)
}
}
-#define TEMP_VAL_DEAD 0
-#define TEMP_VAL_REG 1
-#define TEMP_VAL_MEM 2
-#define TEMP_VAL_CONST 3
+typedef enum TCGTempVal {
+ TEMP_VAL_DEAD,
+ TEMP_VAL_REG,
+ TEMP_VAL_MEM,
+ TEMP_VAL_CONST,
+} TCGTempVal;
+
+#define TEMP_VAL_NR_BITS 2
-/* XXX: optimize memory layout */
typedef struct TCGTemp {
- TCGType base_type;
- TCGType type;
- int val_type;
- int reg;
- tcg_target_long val;
- int mem_reg;
- intptr_t mem_offset;
+ unsigned int reg:8;
+ unsigned int mem_reg:8;
+ TCGTempVal val_type:TEMP_VAL_NR_BITS;
+ TCGType base_type:TCG_TYPE_NR_BITS;
+ TCGType type:TCG_TYPE_NR_BITS;
unsigned int fixed_reg:1;
unsigned int mem_coherent:1;
unsigned int mem_allocated:1;
@@ -438,6 +442,9 @@ typedef struct TCGTemp {
basic blocks. Otherwise, it is not
preserved across basic blocks. */
unsigned int temp_allocated:1; /* never used for code gen */
+
+ tcg_target_long val;
+ intptr_t mem_offset;
const char *name;
} TCGTemp;
^ permalink raw reply related [flat|nested] 17+ messages in thread
* Re: [Qemu-trivial] [Qemu-devel] [PATCH] tcg: optimise memory layout of TCGTemp
2015-03-27 21:09 ` Emilio G. Cota
@ 2015-03-30 9:55 ` Laurent Desnogues
0 siblings, 0 replies; 17+ messages in thread
From: Laurent Desnogues @ 2015-03-30 9:55 UTC (permalink / raw)
To: Emilio G. Cota
Cc: qemu-trivial, Stefan Weil, Alex Bennée,
qemu-devel@nongnu.org, Richard Henderson
Hello,
On Fri, Mar 27, 2015 at 10:09 PM, Emilio G. Cota <cota@braap.org> wrote:
> On Fri, Mar 27, 2015 at 09:55:03 +0000, Alex Bennée wrote:
>> Have you been able to measure any performance improvement with these new
>> structures? In theory, if aligned with cache lines, performance should
>> improve but real numbers would be nice.
>
> I haven't benchmarked anything, which makes me very uneasy. All
> I've checked is that the system boots, and FWIW I appreciate no
> difference in boot time.
>
> Is there a benchmark suite to test TCG changes?
>
> Until proper benchmarking I wouldn't want to see this merged. For now I
> propose to merge the initial change (remove 8-byte hole in 64-bit),
> which is uncontroversial.
I tested the patch attached to your mail and saw no performance
difference on an ARM image booting Linux and then running
Sunspider with Google v8. I also tested on one of the 176.gcc
inputs with QEMU ARM user mode and again saw no difference.
Thanks,
Laurent
>> > The appended adds macros to prevent us from mistakenly overflowing
>> > the bitfields when more elements are added to the corresponding
>> > enums/macros.
>>
>> I can see the defines but I can't see any checks. Should we be able to
>> do a compile time check if TCG_TYPE_COUNT doesn't fit into
>> TCG_TYPE_NR_BITS?
>
>> > +#define TEMP_VAL_NR_BITS 2
>>
>> A similar compile time check could be added here.
>
> Ack, addressed below.
>
> On Fri, Mar 27, 2015 at 07:58:06 -0700, Richard Henderson wrote:
>> On 03/25/2015 12:50 PM, Emilio G. Cota wrote:
>> > +#define TCG_TYPE_NR_BITS 1
>>
>> I'd rather you moved TCG_TYPE_COUNT out of the enum and into a define. Perhaps
>> even as (1 << TCG_TYPE_NR_BITS).
> (snip)
>> > +#define TEMP_VAL_NR_BITS 2
>>
>> And make this an enumeration.
>>
>> > typedef struct TCGTemp {
> (snip)
>> > + unsigned int base_type:TCG_TYPE_NR_BITS;
>> > + unsigned int type:TCG_TYPE_NR_BITS;
>>
>> And do *not* change these from the enumeration to an unsigned int.
>>
>> I know why you did this -- to keep the compiler from warning that the TCGType
>> enum didn't fit in the bitfield, because of TCG_TYPE_COUNT being an enumerator,
>> rather than an unrelated number. Except that's exactly the warning we want to
>> keep, on the off-chance that someone modifies the enums without modifying the
>> _NR_BITS defines.
>
> Agreed, please see below.
>
> Thanks,
>
> E.
>
> [No signoff due to lack of provable perf improvement, see above.]
>
> diff --git a/tcg/tcg.h b/tcg/tcg.h
> index add7f75..afd3f94 100644
> --- a/tcg/tcg.h
> +++ b/tcg/tcg.h
> @@ -193,7 +193,6 @@ typedef struct TCGPool {
> typedef enum TCGType {
> TCG_TYPE_I32,
> TCG_TYPE_I64,
> - TCG_TYPE_COUNT, /* number of different types */
>
> /* An alias for the size of the host register. */
> #if TCG_TARGET_REG_BITS == 32
> @@ -217,6 +216,10 @@ typedef enum TCGType {
> #endif
> } TCGType;
>
> +/* used for bitfield packing to save space */
> +#define TCG_TYPE_NR_BITS 1
> +#define TCG_TYPE_COUNT BIT(TCG_TYPE_NR_BITS)
> +
> /* Constants for qemu_ld and qemu_st for the Memory Operation field. */
> typedef enum TCGMemOp {
> MO_8 = 0,
> @@ -417,20 +420,21 @@ static inline TCGCond tcg_high_cond(TCGCond c)
> }
> }
>
> -#define TEMP_VAL_DEAD 0
> -#define TEMP_VAL_REG 1
> -#define TEMP_VAL_MEM 2
> -#define TEMP_VAL_CONST 3
> +typedef enum TCGTempVal {
> + TEMP_VAL_DEAD,
> + TEMP_VAL_REG,
> + TEMP_VAL_MEM,
> + TEMP_VAL_CONST,
> +} TCGTempVal;
> +
> +#define TEMP_VAL_NR_BITS 2
>
> -/* XXX: optimize memory layout */
> typedef struct TCGTemp {
> - TCGType base_type;
> - TCGType type;
> - int val_type;
> - int reg;
> - tcg_target_long val;
> - int mem_reg;
> - intptr_t mem_offset;
> + unsigned int reg:8;
> + unsigned int mem_reg:8;
> + TCGTempVal val_type:TEMP_VAL_NR_BITS;
> + TCGType base_type:TCG_TYPE_NR_BITS;
> + TCGType type:TCG_TYPE_NR_BITS;
> unsigned int fixed_reg:1;
> unsigned int mem_coherent:1;
> unsigned int mem_allocated:1;
> @@ -438,6 +442,9 @@ typedef struct TCGTemp {
> basic blocks. Otherwise, it is not
> preserved across basic blocks. */
> unsigned int temp_allocated:1; /* never used for code gen */
> +
> + tcg_target_long val;
> + intptr_t mem_offset;
> const char *name;
> } TCGTemp;
>
>
^ permalink raw reply [flat|nested] 17+ messages in thread
end of thread, other threads:[~2015-04-07 14:59 UTC | newest]
Thread overview: 17+ messages
2015-03-29 21:52 [Qemu-trivial] [Qemu-devel] [PATCH] tcg: optimise memory layout of TCGTemp Richard Henderson
2015-03-29 21:52 ` Richard Henderson
2015-03-30 5:33 ` [Qemu-trivial] " Stefan Weil
2015-03-30 5:33 ` Stefan Weil
2015-03-30 5:43 ` [Qemu-trivial] " Stefan Weil
2015-03-30 5:43 ` Stefan Weil
2015-04-03 0:07 ` [Qemu-trivial] [PATCH v2] " Emilio G. Cota
2015-04-03 0:07 ` [Qemu-devel] " Emilio G. Cota
2015-04-03 8:13 ` [Qemu-trivial] " Stefan Weil
2015-04-03 8:13 ` [Qemu-devel] " Stefan Weil
2015-04-03 14:17 ` [Qemu-trivial] " Richard Henderson
2015-04-03 14:17 ` [Qemu-devel] " Richard Henderson
2015-04-07 14:59 ` [Qemu-trivial] " Alex Bennée
2015-04-07 14:59 ` [Qemu-devel] " Alex Bennée
-- strict thread matches above, loose matches on Subject: below --
2015-03-27 14:58 [Qemu-trivial] [PATCH] " Richard Henderson
2015-03-24 1:07 [Qemu-trivial] [Qemu-devel] [PATCH] tcg: pack TCGTemp to reduce size by 8 bytes Richard Henderson
2015-03-25 19:50 ` [Qemu-trivial] [PATCH] tcg: optimise memory layout of TCGTemp Emilio G. Cota
2015-03-27 9:55 ` [Qemu-trivial] [Qemu-devel] " Alex Bennée
2015-03-27 21:09 ` Emilio G. Cota
2015-03-30 9:55 ` Laurent Desnogues