* [Qemu-devel] [RFC v4 00/28] Base enabling patches for MTTCG
From: Alex Bennée @ 2016-08-11 15:23 UTC (permalink / raw)
To: mttcg, qemu-devel, fred.konrad, a.rigo, cota, bobby.prani, nikunj
Cc: mark.burton, pbonzini, jan.kiszka, serge.fdrv, rth, peter.maydell,
claudio.fontana, Alex Bennée
This is the fourth iteration of the RFC patch set which aims to
provide the basic framework for MTTCG. I hope this will provide a good
base for discussion at KVM Forum later this month.
Prerequisites
=============
This tree has been built on top of two other series of patches:
- Reduce lock contention on TCG hot-path (v5, in Paolo's tree)
- cpu-exec: Safe work in quiescent state (v5, in my tree)
You can find the base tree (based off -rc0) at:
https://github.com/stsquad/qemu/tree/mttcg/async-safe-work-v5
Changes
=======
Since the last posting there have been a number of updates to the
original patches:
- more updates to docs/multi-thread-tcg.txt design document
- clean ups of sleep handling (and safe work integration)
- split the big enable-multi-thread patch
- split some re-factoring movement stuff into individual patches
As usual, the patches themselves have a revision summary under the --- line.
In addition I've brought forward a number of changes from the original
ARM enabling patches to support the various cputlb operations which
are basically generic anyway. These include:
- making cross-vCPU tlb_flush operations use async_run_on_cpu
- making tlb_reset_dirty_range atomically apply the TLB_NOTDIRTY flag
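As a rough illustration of the first point: the idea behind deferring a cross-vCPU flush is that the requesting thread only queues the work, and the target vCPU's own thread performs it, so no other thread touches that vCPU's TLB directly. The following is a toy Python model of that pattern only. `ToyVCPU`, `toy_tlb_flush` and `flush_on` are hypothetical names invented for this sketch; it is not QEMU's C implementation of async_run_on_cpu.

```python
import queue
import threading


class ToyVCPU:
    """Hypothetical stand-in for a vCPU thread (not QEMU's CPUState)."""

    def __init__(self, index):
        self.index = index
        self.tlb = {0x1000: 0x8000}      # toy guest-page -> host-page map
        self.work = queue.Queue()        # pending work items for this vCPU
        self.thread = threading.Thread(target=self._run)

    def _run(self):
        # Only the owning thread ever reads or writes self.tlb.
        while True:
            item = self.work.get()
            if item is None:             # sentinel: stop the thread
                break
            item(self)

    def start(self):
        self.thread.start()

    def stop(self):
        self.work.put(None)
        self.thread.join()


def toy_tlb_flush(cpu):
    """Runs in the context of cpu's own thread."""
    cpu.tlb.clear()


def flush_on(target):
    """Cross-vCPU request: queue the flush instead of clearing target.tlb."""
    target.work.put(toy_tlb_flush)


cpus = [ToyVCPU(i) for i in range(2)]
for c in cpus:
    c.start()
flush_on(cpus[1])                        # e.g. requested on behalf of vCPU 0
for c in cpus:
    c.stop()                             # queued work runs before the sentinel
```

Because the queue is FIFO, the flush is guaranteed to have run by the time `stop()` returns; the requester never needs a lock on the target's TLB.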
A copy of the tree can be found at:
https://github.com/stsquad/qemu/tree/mttcg/base-patches-v4
The series includes all the generic work needed and in theory just
needs MTTCG aware atomics and memory barriers for the various
host/guest combinations to be enabled by default.
In practice the memory barrier problems don't show up with an x86
host. In fact I have created a tree which merges in Emilio's
cmpxchg atomics which happily boots ARMv7 Debian systems without any
additional changes. You can find that at:
https://github.com/stsquad/qemu/tree/mttcg/base-patches-v4-with-cmpxchg-atomics-v2
Testing
=======
I've tested that this boots ARMv7 Debian and runs all the ARMv7 and ARMv8 kvm-unit-tests with:
-accel tcg,thread=single
In addition I've tested ARMv7 and ARMv8 kvm-unit-tests of the tcg and
tlbflush group with:
-accel tcg,thread=multi
These tests are safe as they don't rely on atomics to work but do
exercise the parallel execution, invalidation and flushing of code.
The full invocation of all the tests is:
echo "Running all tests in Single Thread Mode"
./run_tests.sh -t -o "-accel tcg,thread=single -name debug-threads=on"
echo "Running tlbflush in Multi Thread Mode"
./run_tests.sh -t -g tlbflush -o "-accel tcg,thread=multi -name debug-threads=on"
echo "Running TCG in Multi Thread Mode"
./run_tests.sh -t -g tcg -o "-accel tcg,thread=multi -name debug-threads=on"
Performance
===========
You can't do full work-load testing on this tree due to the lack of
atomic support (but I will run some numbers on
mttcg/base-patches-v4-with-cmpxchg-atomics-v2). However you certainly
see a run time improvement with the kvm-unit-tests TCG group.
retry.py called with ['./run_tests.sh', '-t', '-g', 'tcg', '-o', '-accel tcg,thread=single']
run 1: ret=0 (PASS), time=1047.147924 (1/1)
run 2: ret=0 (PASS), time=1071.921204 (2/2)
run 3: ret=0 (PASS), time=1048.141600 (3/3)
Results summary:
0: 3 times (100.00%), avg time 1055.737 (196.70 varience/14.02 deviation)
Ran command 3 times, 3 passes
retry.py called with ['./run_tests.sh', '-t', '-g', 'tcg', '-o', '-accel tcg,thread=multi']
run 1: ret=0 (PASS), time=303.074210 (1/1)
run 2: ret=0 (PASS), time=304.574991 (2/2)
run 3: ret=0 (PASS), time=303.327408 (3/3)
Results summary:
0: 3 times (100.00%), avg time 303.659 (0.65 varience/0.80 deviation)
Ran command 3 times, 3 passes
The TCG tests run with -smp 4 on my system. While the TCG tests are
purely CPU bound they do exercise the hot and cold paths of TCG
execution (especially when triggering SMC detection). However there is
still a benefit even with a roughly 15% overhead compared to the ideal 263
second elapsed time (the single-threaded average divided by four).
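As a quick check of these figures, here is a minimal sketch that derives the speedup, the ideal time and the overhead from the two average times in the logs above. The helper name `parallel_overhead` is made up for illustration, and "ideal" is assumed to mean perfect scaling across the 4 vCPUs.

```python
def parallel_overhead(single_avg, multi_avg, n_vcpus):
    """Return (speedup, ideal_time, overhead_fraction) for a multi-threaded run."""
    speedup = single_avg / multi_avg
    ideal = single_avg / n_vcpus          # perfect scaling across n_vcpus
    overhead = multi_avg / ideal - 1.0    # extra wall time relative to ideal
    return speedup, ideal, overhead


# Average times (seconds) taken from the two "Results summary" lines above.
speedup, ideal, overhead = parallel_overhead(1055.737, 303.659, 4)
print(f"speedup {speedup:.2f}x, ideal {ideal:.1f}s, overhead {overhead:.0%}")
```

On these averages the sketch gives roughly a 3.5x speedup and about 15% more wall time than the ideal.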
Alex
Alex Bennée (23):
cpus: make all_vcpus_paused() return bool
translate_all: DEBUG_FLUSH -> DEBUG_TB_FLUSH
translate-all: add DEBUG_LOCKING asserts
cpu-exec: include cpu_index in CPU_LOG_EXEC messages
docs: new design document multi-thread-tcg.txt (DRAFTING)
linux-user/elfload: ensure mmap_lock() held while setting up
translate-all: Add assert_(memory|tb)_lock annotations
target-arm/arm-powerctl: wake up sleeping CPUs
tcg: move tcg_exec_all and helpers above thread fn
tcg: cpus rm tcg_exec_all()
tcg: add kick timer for single-threaded vCPU emulation
tcg: rename tcg_current_cpu to tcg_current_rr_cpu
cpus: re-factor out handle_icount_deadline
tcg: remove global exit_request
tcg: move locking for tb_invalidate_phys_page_range up
cpus: tweak sleeping and safe_work rules for MTTCG
tcg: enable tb_lock() for SoftMMU
tcg: enable thread-per-vCPU
atomic: introduce cmpxchg_bool
cputlb: add assert_cpu_is_self checks
cputlb: tweak qemu_ram_addr_from_host_nofail reporting
cputlb: make tlb_reset_dirty safe for MTTCG
cputlb: make tlb_flush_by_mmuidx safe for MTTCG
Jan Kiszka (1):
tcg: drop global lock during TCG code execution
KONRAD Frederic (3):
tcg: protect TBContext with tb_lock.
tcg: add options for enabling MTTCG
cputlb: introduce tlb_flush_* async work.
Paolo Bonzini (1):
tcg: comment on which functions have to be called with tb_lock held
bsd-user/mmap.c | 5 +
cpu-exec-common.c | 19 +-
cpu-exec.c | 41 ++--
cpus.c | 510 +++++++++++++++++++++++++++++-----------------
cputlb.c | 279 ++++++++++++++++++-------
docs/multi-thread-tcg.txt | 310 ++++++++++++++++++++++++++++
exec.c | 28 +++
hw/i386/kvmvapic.c | 4 +
include/exec/cputlb.h | 2 -
include/exec/exec-all.h | 5 +-
include/qemu/atomic.h | 9 +
include/qom/cpu.h | 27 +++
include/sysemu/cpus.h | 2 +
linux-user/elfload.c | 4 +
linux-user/mmap.c | 5 +
memory.c | 2 +
qemu-options.hx | 20 ++
qom/cpu.c | 10 +
softmmu_template.h | 17 ++
target-arm/Makefile.objs | 2 +-
target-arm/arm-powerctl.c | 2 +
target-i386/smm_helper.c | 7 +
tcg/tcg.h | 2 +
translate-all.c | 175 +++++++++++++---
vl.c | 48 ++++-
25 files changed, 1227 insertions(+), 308 deletions(-)
create mode 100644 docs/multi-thread-tcg.txt
--
2.7.4
* Re: [Qemu-devel] [RFC v4 00/28] Base enabling patches for MTTCG
From: G 3 @ 2016-08-11 16:43 UTC (permalink / raw)
To: alex.bennee, qemu-devel
On Aug 11, 2016, at 11:24 AM, qemu-devel-request@nongnu.org wrote:
>
> Performance
> ===========
>
> You can't do full work-load testing on this tree due to the lack of
> atomic support (but I will run some numbers on
> mttcg/base-patches-v4-with-cmpxchg-atomics-v2). However you certainly
> see a run time improvement with the kvm-unit-tests TCG group.
>
> retry.py called with ['./run_tests.sh', '-t', '-g', 'tcg', '-o',
> '-accel tcg,thread=single']
> run 1: ret=0 (PASS), time=1047.147924 (1/1)
> run 2: ret=0 (PASS), time=1071.921204 (2/2)
> run 3: ret=0 (PASS), time=1048.141600 (3/3)
> Results summary:
> 0: 3 times (100.00%), avg time 1055.737 (196.70 varience/14.02
> deviation)
> Ran command 3 times, 3 passes
> retry.py called with ['./run_tests.sh', '-t', '-g', 'tcg', '-o',
> '-accel tcg,thread=multi']
> run 1: ret=0 (PASS), time=303.074210 (1/1)
> run 2: ret=0 (PASS), time=304.574991 (2/2)
> run 3: ret=0 (PASS), time=303.327408 (3/3)
> Results summary:
> 0: 3 times (100.00%), avg time 303.659 (0.65 varience/0.80
> deviation)
> Ran command 3 times, 3 passes
>
> The TCG tests run with -smp 4 on my system. While the TCG tests are
> purely CPU bound they do exercise the hot and cold paths of TCG
> execution (especially when triggering SMC detection). However there is
> still a benefit even with a roughly 15% overhead compared to the ideal 263
> second elapsed time (the single-threaded average divided by four).
>
> Alex
>
Your test results look very promising. It looks like you saw a 3x
speed improvement over single threading. Excellent. I wonder what the
numbers would be for a 22 core Xeon or 72 core Xeon Phi...
Do you think you could run some tests with an x86 guest like Windows XP?
There are plenty of benchmark tests for this platform. Video
encoding, YouTube video playback, and number crunching programs'
results would be very interesting to see.
Thanks.
* Re: [Qemu-devel] [RFC v4 00/28] Base enabling patches for MTTCG
From: Alex Bennée @ 2016-08-11 17:22 UTC (permalink / raw)
To: mttcg, qemu-devel, fred.konrad, a.rigo, cota, bobby.prani, nikunj
Cc: mark.burton, pbonzini, jan.kiszka, serge.fdrv, rth, peter.maydell,
claudio.fontana
Alex Bennée <alex.bennee@linaro.org> writes:
> This is the fourth iteration of the RFC patch set which aims to
> provide the basic framework for MTTCG. I hope this will provide a good
> base for discussion at KVM Forum later this month.
>
<snip>
>
> In practice the memory barrier problems don't show up with an x86
> host. In fact I have created a tree which merges in Emilio's
> cmpxchg atomics which happily boots ARMv7 Debian systems without any
> additional changes. You can find that at:
>
> https://github.com/stsquad/qemu/tree/mttcg/base-patches-v4-with-cmpxchg-atomics-v2
>
<snip>
> Performance
> ===========
>
> You can't do full work-load testing on this tree due to the lack of
> atomic support (but I will run some numbers on
> mttcg/base-patches-v4-with-cmpxchg-atomics-v2).
So here is a more real-world workload run:
retry.py called with ['/home/alex/lsrc/qemu/qemu.git/arm-softmmu/qemu-system-arm', '-machine', 'type=virt', '-display', 'none', '-smp', '1', '-m', '4096', '-cpu', 'cortex-a15', '-serial', 'telnet:127.0.0.1:4444', '-monitor', 'stdio', '-netdev', 'user,id=unet,hostfwd=tcp::2222-:22', '-device', 'virtio-net-device,netdev=unet', '-drive', 'file=/home/alex/lsrc/qemu/images/jessie-arm32.qcow2,id=myblock,index=0,if=none', '-device', 'virtio-blk-device,drive=myblock', '-append', 'console=ttyAMA0 systemd.unit=benchmark-build.service root=/dev/vda1', '-kernel', '/home/alex/lsrc/qemu/images/aarch32-current-linux-kernel-only.img', '-smp', '4', '-name', 'debug-threads=on', '-accel', 'tcg,thread=single']
run 1: ret=0 (PASS), time=261.794911 (1/1)
run 2: ret=0 (PASS), time=257.290045 (2/2)
run 3: ret=0 (PASS), time=256.536991 (3/3)
run 4: ret=0 (PASS), time=254.036260 (4/4)
run 5: ret=0 (PASS), time=256.539165 (5/5)
Results summary:
0: 5 times (100.00%), avg time 257.239 (8.00 varience/2.83 deviation)
Ran command 5 times, 5 passes
retry.py called with ['/home/alex/lsrc/qemu/qemu.git/arm-softmmu/qemu-system-arm', '-machine', 'type=virt', '-display', 'none', '-smp', '1', '-m', '4096', '-cpu', 'cortex-a15', '-serial', 'telnet:127.0.0.1:4444', '-monitor', 'stdio', '-netdev', 'user,id=unet,hostfwd=tcp::2222-:22', '-device', 'virtio-net-device,netdev=unet', '-drive', 'file=/home/alex/lsrc/qemu/images/jessie-arm32.qcow2,id=myblock,index=0,if=none', '-device', 'virtio-blk-device,drive=myblock', '-append', 'console=ttyAMA0 systemd.unit=benchmark-build.service root=/dev/vda1', '-kernel', '/home/alex/lsrc/qemu/images/aarch32-current-linux-kernel-only.img', '-smp', '4', '-name', 'debug-threads=on', '-accel', 'tcg,thread=multi']
run 1: ret=0 (PASS), time=86.597459 (1/1)
run 2: ret=0 (PASS), time=82.843904 (2/2)
run 3: ret=0 (PASS), time=84.095910 (3/3)
run 4: ret=0 (PASS), time=83.844595 (4/4)
run 5: ret=0 (PASS), time=83.594768 (5/5)
Results summary:
0: 5 times (100.00%), avg time 84.195 (2.02 varience/1.42 deviation)
Ran command 5 times, 5 passes
This shows a 30% overhead over the ideal for running multi-threaded, but
still a decent improvement in wall time.
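The "Results summary" figures can be reproduced from the per-run times. The source of retry.py is not shown in this thread, so the following is only a sketch; the reported numbers are consistent with the sample (n-1) variance and standard deviation, which is what it assumes. The function name `summarise` is invented here.

```python
import statistics


def summarise(times):
    """Return (mean, sample variance, sample standard deviation) of run times."""
    return (statistics.mean(times),
            statistics.variance(times),   # n-1 denominator
            statistics.stdev(times))


# Per-run times (seconds) from the thread=multi log above.
multi = [86.597459, 82.843904, 84.095910, 83.844595, 83.594768]
avg, var, dev = summarise(multi)
print(f"avg time {avg:.3f} ({var:.2f} variance/{dev:.2f} deviation)")
```

With five runs the deviation is under two seconds in both modes, so the gap between the single and multi-threaded averages is far larger than the run-to-run noise.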
The test itself boots the system and runs benchmark-build.service:
# A benchmark target
#
# This shuts down once the boot has completed
[Unit]
Description=Default
Requires=basic.target
After=basic.target
AllowIsolate=yes
[Service]
Type=oneshot
ExecStart=/root/mysrc/testcases.git/build-dir.sh /root/src/stress-ng.git/
ExecStartPost=/sbin/poweroff
[Install]
WantedBy=multi-user.target
And the build-dir script is a simple:
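#!/bin/sh
#
# Build the source tree given as $1 with one make job per guest CPU.
NR_CPUS=$(grep -c ^processor /proc/cpuinfo)
# Stop at the first failing command so a broken build is not timed as a pass.
set -e
cd "$1"
make clean
make -j"${NR_CPUS}"
cd -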
Measuring this over increasing -smp
| -smp | time | time as bar | theoretical | % of -smp 1 |
|------+---------+--------------+-------------+-------------|
| 1 | 238.184 | WWWWWWWWWWWW | 238.184 | |
| 2 | 133.402 | WWWWWWh | 119.092 | |
| 3 | 99.531 | WWWWH | 79.394667 | |
| 4 | 82.760 | WWWW: | 59.546 | |
#+TBLFM: $3='(orgtbl-ascii-draw $2 0 238.184 12)::$4=@2$2/$1
--
Alex Bennée
* Re: [Qemu-devel] [RFC v4 00/28] Base enabling patches for MTTCG
From: Alex Bennée @ 2016-08-12 8:02 UTC (permalink / raw)
To: mttcg, qemu-devel, fred.konrad, a.rigo, cota, bobby.prani, nikunj
Cc: mark.burton, pbonzini, jan.kiszka, serge.fdrv, rth, peter.maydell,
claudio.fontana
Alex Bennée <alex.bennee@linaro.org> writes:
> Alex Bennée <alex.bennee@linaro.org> writes:
> Measuring this over increasing -smp
Measuring this over increasing -smp
| -smp | time    | -smp 1 / smp | time as bar  | x faster |
|------+---------+--------------+--------------+----------|
|    1 | 238.184 |      238.184 | WWWWWWWWWWWW |    1.000 |
|    2 | 133.402 |      119.092 | WWWWWWh      |    1.785 |
|    3 |  99.531 |       79.395 | WWWWW        |    2.393 |
|    4 |  82.760 |       59.546 | WWWW.        |    2.878 |
|    5 |  82.513 |       47.637 | WWWW.        |    2.887 |
|    6 |  78.922 |       39.697 | WWWH         |    3.018 |
|    7 |  87.181 |       34.026 | WWWW;        |    2.732 |
|    8 |  87.098 |       29.773 | WWWW;        |    2.735 |
So a more complete analysis shows the benefits start to tail off as we
push past 4 vCPUs. However on my machine which is 4+4 hyperthreads that
could be just as much a feature of the host system. Indeed the results
start getting noisy at 7/8 vCPUs.
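The derived columns of the table can be recomputed from the measured times alone. This is a small sketch, not part of the original test scripts: the `scaling` helper and the extra "efficiency" column (speedup divided by the number of vCPUs) are additions made here to show the tail-off more directly.

```python
# Wall-clock times (seconds) per -smp value, copied from the table above.
TIMES = {1: 238.184, 2: 133.402, 3: 99.531, 4: 82.760,
         5: 82.513, 6: 78.922, 7: 87.181, 8: 87.098}


def scaling(times):
    """Return rows of (smp, time, ideal time, speedup, efficiency)."""
    base = times[1]
    rows = []
    for smp in sorted(times):
        speedup = base / times[smp]
        rows.append((smp, times[smp], base / smp, speedup, speedup / smp))
    return rows


rows = scaling(TIMES)
for smp, t, ideal, speedup, eff in rows:
    print(f"{smp} {t:8.3f} {ideal:8.3f} {speedup:6.3f} {eff:5.0%}")
```

Efficiency drops from about 72% at -smp 4 to about 34% at -smp 8, while the best absolute speedup in this data set is at -smp 6.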
Interestingly a perf run against -smp 6 shows gic_update topping the
graph (3.14% of total execution time). That function does have a big
TODO for optimisation on it ;-)
--
Alex Bennée
* Re: [Qemu-devel] [RFC v4 00/28] Base enabling patches for MTTCG
From: Alex Bennée @ 2016-08-12 13:19 UTC (permalink / raw)
To: G 3; +Cc: QEMU Developers
On 11 August 2016 at 17:43, G 3 <programmingkidx@gmail.com> wrote:
> Your test results look very promising. It looks like you saw a 3x speed
> improvement over single threading. Excellent. I wonder what the numbers
> would be for a 22 core Xeon or 72 core Xeon Phi...
Well the initial results look like they tail off but I need to test on a
more capable machine. I'm going to package up the test case first so
people can easily replicate the test.
> Do you think you could run some tests with an x86 guest like Windows XP?
> There are plenty of benchmark tests for this platform. Video encoding,
> YouTube video playback, and number crunching programs' results would be
> very interesting to see.
I don't have any Windows images to hand I'm afraid. Besides, Windows is a
fairly boring guest from this point of view because:
- it's x86, so why use TCG over KVM
- QEMU TCG generally sucks at media benchmarks due to SIMD emulation
--
Alex Bennée
KVM/QEMU Hacker for Linaro
* Re: [Qemu-devel] [RFC v4 00/28] Base enabling patches for MTTCG
From: G 3 @ 2016-08-12 13:31 UTC (permalink / raw)
To: Alex Bennée; +Cc: QEMU Developers
On Aug 12, 2016, at 9:19 AM, Alex Bennée wrote:
> On 11 August 2016 at 17:43, G 3 <programmingkidx@gmail.com> wrote:
>> Your test results look very promising. It looks like you saw a 3x speed
>> improvement over single threading. Excellent. I wonder what the numbers
>> would be for a 22 core Xeon or 72 core Xeon Phi...
>
> Well the initial results look like they tail off but I need to test on a
> more capable machine. I'm going to package up the test case first so
> people can easily replicate the test.
>
>> Do you think you could run some tests with an x86 guest like Windows XP?
>> There are plenty of benchmark tests for this platform. Video encoding,
>> YouTube video playback, and number crunching programs' results would be
>> very interesting to see.
>
> I don't have any Windows images to hand I'm afraid. Besides, Windows is a
> fairly boring guest from this point of view because:
>
> - it's x86, so why use TCG over KVM
> - QEMU TCG generally sucks at media benchmarks due to SIMD emulation
Mac OS X hosts don't have a hypervisor that QEMU supports (VirtualBox
isn't supported), so TCG is the only thing that can be used. Maybe a
free x86 guest like Linux could be used?
* Re: [Qemu-devel] [RFC v4 00/28] Base enabling patches for MTTCG
From: Alex Bennée @ 2016-08-12 15:01 UTC (permalink / raw)
To: G 3; +Cc: QEMU Developers
G 3 <programmingkidx@gmail.com> writes:
> On Aug 12, 2016, at 9:19 AM, Alex Bennée wrote:
>
>> On 11 August 2016 at 17:43, G 3 <programmingkidx@gmail.com> wrote:
>>> Your test results look very promising. It looks like you saw a 3x speed
>>> improvement over single threading. Excellent. I wonder what the numbers
>>> would be for a 22 core Xeon or 72 core Xeon Phi...
>>
>> Well the initial results look like they tail off but I need to test on a
>> more capable machine. I'm going to package up the test case first so
>> people can easily replicate the test.
>>
>>> Do you think you could run some tests with an x86 guest like Windows XP?
>>> There are plenty of benchmark tests for this platform. Video encoding,
>>> YouTube video playback, and number crunching programs' results would be
>>> very interesting to see.
>>
>> I don't have any Windows images to hand I'm afraid. Besides, Windows is a
>> fairly boring guest from this point of view because:
>>
>> - it's x86, so why use TCG over KVM
>> - QEMU TCG generally sucks at media benchmarks due to SIMD emulation
>
> Mac OS X hosts don't have a hypervisor that QEMU supports (VirtualBox
> isn't supported), so TCG is the only thing that can be used. Maybe a
> free x86 guest like Linux could be used?
Sounds like you have the kit for this test case. Let me know if the
branch boots your test images.
--
Alex Bennée
* Re: [Qemu-devel] [RFC v4 00/28] Base enabling patches for MTTCG
From: Alex Bennée @ 2016-09-06 9:24 UTC (permalink / raw)
To: mttcg, qemu-devel, fred.konrad, a.rigo, cota, bobby.prani, nikunj
Cc: mark.burton, pbonzini, jan.kiszka, serge.fdrv, rth, peter.maydell,
claudio.fontana
Alex Bennée <alex.bennee@linaro.org> writes:
> This is the fourth iteration of the RFC patch set which aims to
> provide the basic framework for MTTCG. I hope this will provide a good
> base for discussion at KVM Forum later this month.
Review ping?
It would be nice to get some review feedback before I re-spin against
the latest async safe work. Unless everyone already thinks the code is
perfect as it is ;-)
--
Alex Bennée