* [Qemu-devel] MTTCG status updates, benchmark results and KVM forum plans

From: Alex Bennée
Date: 2016-08-15 10:46 UTC
To: mttcg, qemu-devel, fred.konrad, a.rigo, cota, bobby.prani, nikunj
Cc: mark.burton, pbonzini, jan.kiszka, serge.fdrv, rth, peter.maydell, claudio.fontana, Dr. David Alan Gilbert, Peter Crosthwaite

Hi,

Numbers!
========

First things first, I ran some more benchmarks on the base patches + cmpxchg branch over the weekend when I had access to some bigger boxen which weren't being used. I also added some KVM runs for comparison:

━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━
 -smp   overdrive01 [1]  x -smp 1   desktop [2]  x -smp 1   hackbox [3]  x -smp 1
─────────────────────────────────────────────────────────────────────────────
  1         36.995        1.000      243.723      1.000      377.035      1.000
  2         21.480        1.722      134.854      1.807      216.337      1.743
  3         16.474        2.246      100.090      2.435      163.316      2.309
  4         13.671        2.706       83.512      2.918      136.180      2.769
  5         12.269        3.015       82.519      2.954      119.261      3.161
  6         11.268        3.283       79.589      3.062      110.393      3.415
  7          n/a           n/a        78.338      3.111      105.244      3.582
  8          n/a           n/a        81.091      3.006      103.032      3.659
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━

Footnotes
─────────

[1] pre-production A57, only 6 cores, KVM with -cpu host,aarch64=off
[2] i7-4770 @ 3.4 GHz; past -smp 5 there is much greater deviation plus some hangs, best times taken
[3] Xeon X5690 @ 3.47 GHz, 24 cores; the -smp 7 number was manually calculated

Comparing the numbers on the Xeon monster to my desktop seems to show that we still get beneficial scaling when the extra cores are real cores instead of hyperthreads. I only ran up to -smp 8 as that is as much as the -M virt model will actually accept.

I have noticed some instability in the test for high -smp values, which caused the test runner's timeout protection to kick in. These look like guest hangs and may be barrier related (store-after-load re-ordering can happen). I plan to apply the barrier patches and see if this improves the stability of the tests.

All in all, the results are pretty promising. I'm now running -smp 4 -accel tcg,thread=multi on a fairly regular basis and appreciating the snappier response on heavy operations (see the sketch of a typical invocation after this message).

MTTCG Call
==========

We've missed a number of the MTTCG calls of late, and given the spread of developers actively working on MTTCG stuff I wonder if we should just shelve the call and move to regular status updates on the list? I'm happy to prompt a status thread every couple of weeks if wanted.

As far as I'm aware the following work is still ongoing:

  Emilio:  cmpxchg atomics
  Alvise:  LL/SC modelling
  Pranith: memory barrier work (GSoC coming to an end this month)
  Nikunj:  PPC support for MTTCG

Anyone want to add their status updates? Is anyone else secretly working on MTTCG related bits who wants to make themselves known?

KVM Forum
=========

I'll be at KVM Forum in Toronto next week. Feel free to grab me at any time, but I'm planning to sign up for a BoF slot on Thursday afternoon to discuss any outstanding issues for MTTCG and the work that still needs to be done to be ready for merging when the 2.8 development cycle opens. From my point of view I think we are looking pretty good for merging, but I would like to get input from the TCG maintainers, who are the ones that will need to accept the work into their tree.
The only current issue I'm aware of is thread safety of the GDB stub. In theory it is not currently MTTCG safe, but it tends to get away with it because the system is halted when updates are made to the break/watchpoint lists. I did post a series to RCUify these a few months ago, but I dropped it (and the debug asserts) from the base patches series as it felt a little orthogonal to the main work. My feeling is this shouldn't be a blocker to MTTCG going in (it doesn't get any worse) and we can fix it up in a later series. However, I would like to get the opinions of the maintainers on this approach.

Are there any other issues we should be aware of?

Looking forward to meeting up with other QEMU hackers in the flesh next week!

Cheers,

--
Alex Bennée
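For reference, an invocation of the kind described above might look like this (a sketch only: the CPU model, kernel image, and disk arguments are placeholders; just -smp 4 and -accel tcg,thread=multi come from the message):

    qemu-system-aarch64 -M virt -cpu cortex-a57 -smp 4 \
        -accel tcg,thread=multi \
        -kernel Image -append "console=ttyAMA0 root=/dev/vda" \
        -drive file=rootfs.img,format=raw,if=virtio -nographic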
* Re: [Qemu-devel] MTTCG status updates, benchmark results and KVM forum plans

From: Peter Maydell
Date: 2016-08-15 11:00 UTC
To: Alex Bennée
Cc: MTTCG Devel, QEMU Developers, KONRAD Frédéric, Alvise Rigo, Emilio G. Cota, pranith kumar, Nikunj A Dadhania, Mark Burton, Paolo Bonzini, J. Kiszka, Sergey Fedorov, Richard Henderson, Claudio Fontana, Dr. David Alan Gilbert, Peter Crosthwaite

On 15 August 2016 at 11:46, Alex Bennée <alex.bennee@linaro.org> wrote:
> I only ran up to -smp 8 as that is as much as the -M virt model will
> actually accept.

FWIW, -machine gic-version=3 should allow you more than 8 cores.

> I have noticed some instability in the test for high -smp values, which
> caused the test runner's timeout protection to kick in. These look like
> guest hangs and may be barrier related (store-after-load re-ordering
> can happen). I plan to apply the barrier patches and see if this
> improves the stability of the tests.

> From my point of view I think we are looking pretty good for merging,
> but I would like to get input from the TCG maintainers, who are the
> ones that will need to accept the work into their tree.

Your note above about instability and hangs is the main thing that makes me nervous about merging...

thanks
-- PMM
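In other words, a GICv3 virt machine should lift the 8-CPU limit of the default (GICv2) configuration; a hypothetical invocation, with the image arguments omitted:

    qemu-system-aarch64 -M virt,gic-version=3 -smp 16 -accel tcg,thread=multi ...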
* Re: [Qemu-devel] MTTCG status updates, benchmark results and KVM forum plans

From: Alex Bennée
Date: 2016-08-15 11:16 UTC
To: Peter Maydell
Cc: MTTCG Devel, QEMU Developers, KONRAD Frédéric, Alvise Rigo, Emilio G. Cota, pranith kumar, Nikunj A Dadhania, Mark Burton, Paolo Bonzini, J. Kiszka, Sergey Fedorov, Richard Henderson, Claudio Fontana, Dr. David Alan Gilbert, Peter Crosthwaite

Peter Maydell <peter.maydell@linaro.org> writes:

> On 15 August 2016 at 11:46, Alex Bennée <alex.bennee@linaro.org> wrote:
>> I only ran up to -smp 8 as that is as much as the -M virt model will
>> actually accept.
>
> FWIW, -machine gic-version=3 should allow you more than 8 cores.

Good to know. Thanks.

>> I have noticed some instability in the test for high -smp values, which
>> caused the test runner's timeout protection to kick in. These look like
>> guest hangs and may be barrier related (store-after-load re-ordering
>> can happen). I plan to apply the barrier patches and see if this
>> improves the stability of the tests.
>
>> From my point of view I think we are looking pretty good for merging,
>> but I would like to get input from the TCG maintainers, who are the
>> ones that will need to accept the work into their tree.
>
> Your note above about instability and hangs is the main thing that
> makes me nervous about merging...

Don't worry, I won't be proposing any merge while I can still provoke hangs in the guest! My point is that you actually have to work quite hard to trigger these, and they are subtle emulation failures that trip up the guest rather than crashes that take down QEMU itself.

I wanted to post the numbers I'd collected so far because I feel we are shaping up quite well. Just one more hill... ;-)

--
Alex Bennée
* Re: [Qemu-devel] MTTCG status updates, benchmark results and KVM forum plans

From: Emilio G. Cota
Date: 2016-08-15 15:46 UTC
To: Alex Bennée
Cc: mttcg, qemu-devel, fred.konrad, a.rigo, bobby.prani, nikunj, mark.burton, pbonzini, jan.kiszka, serge.fdrv, rth, peter.maydell, claudio.fontana, Dr. David Alan Gilbert, Peter Crosthwaite

On Mon, Aug 15, 2016 at 11:46:32 +0100, Alex Bennée wrote:
> As far as I'm aware the following work is still ongoing:
>
>   Emilio: cmpxchg atomics
>   Alvise: LL/SC modelling

I've been tinkering with an experimental patch to do proper LL/SC. The idea is to rely on hardware transactional memory, so that stores don't have to be tracked. The trickiest thing is the fallback path, for which I'm trying to (ab)use EXCP_ATOMIC to execute exclusively from the ldrex all the way to the strex.

To test it, I'm using aarch64-linux-user running qht-bench compiled on an aarch64 machine. I'm running on an Intel Skylake host (Skylake has no known TSX bugs).

However, I'm finding issues that might not have to do with the patch itself:

- On the latest MTTCG+cmpxchg tree (45c11751ed7, a.k.a. bennee/mttcg/base-patches-v4-with-cmpxchg-atomics-v2), QEMU loops forever without making progress in the instruction stream, even with taskset -c 0.

- On the cmpxchg tree (rth's atomic-2 branch [1]), it works more reliably, although tb_lock is held around tb_find_fast so parallelism isn't very high. Still, it sometimes triggers the assert below.

- Applying the "remove tb_lock around hot path" patch makes it easier to trigger this assert in cpu-exec.c:650 (approx.):

      /* Assert that the compiler does not smash local variables. */
      g_assert(cpu == current_cpu);

  I've also seen the assert immediately after that one trigger, as well as the rcu_read_unlock depth assert.

The asserts are usually triggered when all threads exit (by returning NULL) at roughly the same time. However, they cannot be triggered with taskset -c 0, which makes me suspect that somehow start_exclusive isn't working as intended.

Any tips would be appreciated! I'll reply with a patch that uses RTM; the one below takes the fallback path all the way, and is the best one to reproduce the above.

Thanks,

		Emilio

[1] https://github.com/rth7680/qemu/commits/atomic-2

From ed6af6eb364e5a36e81d7cc8143c0e9783c50587 Mon Sep 17 00:00:00 2001
From: "Emilio G. Cota" <cota@braap.org>
Date: Mon, 15 Aug 2016 00:27:42 +0200
Subject: [PATCH] aarch64: use TSX for ldrex/strex (fallback path only)

Signed-off-by: Emilio G. Cota <cota@braap.org>
---
 linux-user/main.c          |  5 +++--
 target-arm/helper-a64.c    | 23 +++++++++++++++++++++++
 target-arm/helper-a64.h    |  4 ++++
 target-arm/translate-a64.c | 15 +++++++++------
 4 files changed, 39 insertions(+), 8 deletions(-)

diff --git a/linux-user/main.c b/linux-user/main.c
index 9880505..6922faa 100644
--- a/linux-user/main.c
+++ b/linux-user/main.c
@@ -192,8 +192,9 @@ static void step_atomic(CPUState *cpu)
     /* Since we got here, we know that parallel_cpus must be true. */
     parallel_cpus = false;
-    cpu_exec_step(cpu);
-    parallel_cpus = true;
+    while (!parallel_cpus) {
+        cpu_exec_step(cpu);
+    }
     end_exclusive();
 }
 
diff --git a/target-arm/helper-a64.c b/target-arm/helper-a64.c
index 8ce518b..a97b631 100644
--- a/target-arm/helper-a64.c
+++ b/target-arm/helper-a64.c
@@ -579,3 +579,26 @@ uint64_t HELPER(paired_cmpxchg64_be)(CPUARMState *env, uint64_t addr,
 
     return !success;
 }
+
+void HELPER(xbegin)(CPUARMState *env)
+{
+    uintptr_t ra = GETPC();
+
+    if (parallel_cpus) {
+        cpu_loop_exit_atomic(ENV_GET_CPU(env), ra);
+    }
+}
+
+void HELPER(xend)(void)
+{
+    assert(!parallel_cpus);
+    parallel_cpus = true;
+}
+
+uint64_t HELPER(x_ok)(void)
+{
+    if (!parallel_cpus) {
+        return 1;
+    }
+    return 0;
+}
diff --git a/target-arm/helper-a64.h b/target-arm/helper-a64.h
index dd32000..e7ede43 100644
--- a/target-arm/helper-a64.h
+++ b/target-arm/helper-a64.h
@@ -48,3 +48,7 @@ DEF_HELPER_FLAGS_3(crc32_64, TCG_CALL_NO_RWG_SE, i64, i64, i64, i32)
 DEF_HELPER_FLAGS_3(crc32c_64, TCG_CALL_NO_RWG_SE, i64, i64, i64, i32)
 DEF_HELPER_FLAGS_4(paired_cmpxchg64_le, TCG_CALL_NO_WG, i64, env, i64, i64, i64)
 DEF_HELPER_FLAGS_4(paired_cmpxchg64_be, TCG_CALL_NO_WG, i64, env, i64, i64, i64)
+
+DEF_HELPER_1(xbegin, void, env)
+DEF_HELPER_0(x_ok, i64)
+DEF_HELPER_0(xend, void)
diff --git a/target-arm/translate-a64.c b/target-arm/translate-a64.c
index 450c359..cfcf440 100644
--- a/target-arm/translate-a64.c
+++ b/target-arm/translate-a64.c
@@ -1760,6 +1760,8 @@ static void gen_load_exclusive(DisasContext *s, int rt, int rt2,
     TCGv_i64 tmp = tcg_temp_new_i64();
     TCGMemOp be = s->be_data;
 
+    gen_helper_xbegin(cpu_env);
+
     g_assert(size <= 3);
     if (is_pair) {
         TCGv_i64 hitmp = tcg_temp_new_i64();
@@ -1825,6 +1827,9 @@ static void gen_store_exclusive(DisasContext *s, int rd, int rt, int rt2,
     tcg_gen_brcond_i64(TCG_COND_NE, addr, cpu_exclusive_addr, fail_label);
 
     tmp = tcg_temp_new_i64();
+    gen_helper_x_ok(tmp);
+    tcg_gen_brcondi_i64(TCG_COND_EQ, tmp, 0, fail_label);
+
     if (is_pair) {
         if (size == 2) {
             TCGv_i64 val = tcg_temp_new_i64();
@@ -1844,16 +1849,14 @@ static void gen_store_exclusive(DisasContext *s, int rd, int rt, int rt2,
         }
     } else {
         TCGv_i64 val = cpu_reg(s, rt);
-        tcg_gen_atomic_cmpxchg_i64(tmp, addr, cpu_exclusive_val, val,
-                                   get_mem_index(s),
-                                   size | MO_ALIGN | s->be_data);
-        tcg_gen_setcond_i64(TCG_COND_NE, tmp, tmp, cpu_exclusive_val);
+        tcg_gen_qemu_st_i64(val, addr, get_mem_index(s), s->be_data + size);
     }
 
     tcg_temp_free_i64(addr);
-
-    tcg_gen_mov_i64(cpu_reg(s, rd), tmp);
     tcg_temp_free_i64(tmp);
+
+    tcg_gen_movi_i64(cpu_reg(s, rd), 0);
+    gen_helper_xend();
     tcg_gen_br(done_label);
 
     gen_set_label(fail_label);
--
2.7.4
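Stripped of the QEMU plumbing, the scheme above maps an exclusive pair onto a hardware transaction roughly like this (a minimal runnable model in host-side C, compiled with -mrtm; ll_sc_increment is a hypothetical illustration, not QEMU code):

    #include <immintrin.h>
    #include <stdbool.h>

    /* "ldrex" becomes a plain load inside the transaction, "strex" a
     * plain store; a conflicting store from another thread aborts the
     * transaction, so no separate store tracking is needed. */
    static bool ll_sc_increment(long *addr)
    {
        if (_xbegin() != _XBEGIN_STARTED) {
            return false;      /* aborted: caller retries or falls back */
        }
        long val = *addr;      /* the "ldrex" */
        *addr = val + 1;       /* the "strex" */
        _xend();               /* commit: the whole pair is atomic */
        return true;
    }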
* [Qemu-devel] [PATCH] aarch64: use TSX for ldrex/strex

From: Emilio G. Cota
Date: 2016-08-15 15:49 UTC
To: Alex Bennée
Cc: mttcg, qemu-devel, fred.konrad, a.rigo, bobby.prani, nikunj, mark.burton, pbonzini, jan.kiszka, serge.fdrv, rth, peter.maydell, claudio.fontana, Dr. David Alan Gilbert, Peter Crosthwaite

Configure with --extra-cflags="-mrtm"

Signed-off-by: Emilio G. Cota <cota@braap.org>
---
 linux-user/main.c          |  5 +++--
 target-arm/helper-a64.c    | 42 ++++++++++++++++++++++++++++++++++++++++++
 target-arm/helper-a64.h    |  4 ++++
 target-arm/translate-a64.c | 15 +++++++++------
 4 files changed, 58 insertions(+), 8 deletions(-)

diff --git a/linux-user/main.c b/linux-user/main.c
index 9880505..6922faa 100644
--- a/linux-user/main.c
+++ b/linux-user/main.c
@@ -192,8 +192,9 @@ static void step_atomic(CPUState *cpu)
     /* Since we got here, we know that parallel_cpus must be true. */
     parallel_cpus = false;
-    cpu_exec_step(cpu);
-    parallel_cpus = true;
+    while (!parallel_cpus) {
+        cpu_exec_step(cpu);
+    }
     end_exclusive();
 }
 
diff --git a/target-arm/helper-a64.c b/target-arm/helper-a64.c
index 8ce518b..af45694 100644
--- a/target-arm/helper-a64.c
+++ b/target-arm/helper-a64.c
@@ -33,6 +33,8 @@
 #include "tcg.h"
 #include <zlib.h> /* For crc32 */
 
+#include <immintrin.h>
+
 /* C2.4.7 Multiply and divide */
 /* special cases for 0 and LLONG_MIN are mandated by the standard */
 uint64_t HELPER(udiv64)(uint64_t num, uint64_t den)
@@ -579,3 +581,43 @@ uint64_t HELPER(paired_cmpxchg64_be)(CPUARMState *env, uint64_t addr,
 
     return !success;
 }
+
+void HELPER(xbegin)(CPUARMState *env)
+{
+    uintptr_t ra = GETPC();
+    int status;
+    int retries = 100;
+
+ retry:
+    status = _xbegin();
+    if (status != _XBEGIN_STARTED) {
+        if (status && retries) {
+            retries--;
+            goto retry;
+        }
+        if (parallel_cpus) {
+            cpu_loop_exit_atomic(ENV_GET_CPU(env), ra);
+        }
+    }
+}
+
+void HELPER(xend)(void)
+{
+    if (_xtest()) {
+        _xend();
+    } else {
+        assert(!parallel_cpus);
+        parallel_cpus = true;
+    }
+}
+
+uint64_t HELPER(x_ok)(void)
+{
+    if (_xtest()) {
+        return 1;
+    }
+    if (!parallel_cpus) {
+        return 1;
+    }
+    return 0;
+}
diff --git a/target-arm/helper-a64.h b/target-arm/helper-a64.h
index dd32000..e7ede43 100644
--- a/target-arm/helper-a64.h
+++ b/target-arm/helper-a64.h
@@ -48,3 +48,7 @@ DEF_HELPER_FLAGS_3(crc32_64, TCG_CALL_NO_RWG_SE, i64, i64, i64, i32)
 DEF_HELPER_FLAGS_3(crc32c_64, TCG_CALL_NO_RWG_SE, i64, i64, i64, i32)
 DEF_HELPER_FLAGS_4(paired_cmpxchg64_le, TCG_CALL_NO_WG, i64, env, i64, i64, i64)
 DEF_HELPER_FLAGS_4(paired_cmpxchg64_be, TCG_CALL_NO_WG, i64, env, i64, i64, i64)
+
+DEF_HELPER_1(xbegin, void, env)
+DEF_HELPER_0(x_ok, i64)
+DEF_HELPER_0(xend, void)
diff --git a/target-arm/translate-a64.c b/target-arm/translate-a64.c
index 450c359..cfcf440 100644
--- a/target-arm/translate-a64.c
+++ b/target-arm/translate-a64.c
@@ -1760,6 +1760,8 @@ static void gen_load_exclusive(DisasContext *s, int rt, int rt2,
     TCGv_i64 tmp = tcg_temp_new_i64();
     TCGMemOp be = s->be_data;
 
+    gen_helper_xbegin(cpu_env);
+
     g_assert(size <= 3);
     if (is_pair) {
         TCGv_i64 hitmp = tcg_temp_new_i64();
@@ -1825,6 +1827,9 @@ static void gen_store_exclusive(DisasContext *s, int rd, int rt, int rt2,
     tcg_gen_brcond_i64(TCG_COND_NE, addr, cpu_exclusive_addr, fail_label);
 
     tmp = tcg_temp_new_i64();
+    gen_helper_x_ok(tmp);
+    tcg_gen_brcondi_i64(TCG_COND_EQ, tmp, 0, fail_label);
+
     if (is_pair) {
         if (size == 2) {
             TCGv_i64 val = tcg_temp_new_i64();
@@ -1844,16 +1849,14 @@ static void gen_store_exclusive(DisasContext *s, int rd, int rt, int rt2,
         }
     } else {
         TCGv_i64 val = cpu_reg(s, rt);
-        tcg_gen_atomic_cmpxchg_i64(tmp, addr, cpu_exclusive_val, val,
-                                   get_mem_index(s),
-                                   size | MO_ALIGN | s->be_data);
-        tcg_gen_setcond_i64(TCG_COND_NE, tmp, tmp, cpu_exclusive_val);
+        tcg_gen_qemu_st_i64(val, addr, get_mem_index(s), s->be_data + size);
    }
 
     tcg_temp_free_i64(addr);
-
-    tcg_gen_mov_i64(cpu_reg(s, rd), tmp);
     tcg_temp_free_i64(tmp);
+
+    tcg_gen_movi_i64(cpu_reg(s, rd), 0);
+    gen_helper_xend();
     tcg_gen_br(done_label);
 
     gen_set_label(fail_label);
--
2.7.4
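For readers unfamiliar with RTM, the xbegin helper above follows the canonical bounded-retry discipline, sketched standalone below. One refinement the later series adopts: only retry when the hardware's _XABORT_RETRY hint says it may help (run_transactional, txn, and fallback are placeholder names):

    #include <immintrin.h>

    #define MAX_RETRIES 100

    static void run_transactional(void (*txn)(void), void (*fallback)(void))
    {
        int retries = MAX_RETRIES;
        unsigned status;

        /* an abort anywhere inside txn() resumes here with the status */
        while ((status = _xbegin()) != _XBEGIN_STARTED) {
            if (!(status & _XABORT_RETRY) || retries-- == 0) {
                fallback();   /* e.g. exclusive/stop-the-world execution */
                return;
            }
        }
        txn();                /* executes inside the transaction */
        _xend();              /* commit */
    }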
* Re: [Qemu-devel] [PATCH] aarch64: use TSX for ldrex/strex

From: Richard Henderson
Date: 2016-08-17 17:22 UTC
To: Emilio G. Cota, Alex Bennée
Cc: mttcg, qemu-devel, fred.konrad, a.rigo, bobby.prani, nikunj, mark.burton, pbonzini, jan.kiszka, serge.fdrv, peter.maydell, claudio.fontana, Dr. David Alan Gilbert, Peter Crosthwaite

On 08/15/2016 08:49 AM, Emilio G. Cota wrote:
> +void HELPER(xbegin)(CPUARMState *env)
> +{
> +    uintptr_t ra = GETPC();
> +    int status;
> +    int retries = 100;
> +
> + retry:
> +    status = _xbegin();
> +    if (status != _XBEGIN_STARTED) {
> +        if (status && retries) {
> +            retries--;
> +            goto retry;
> +        }
> +        if (parallel_cpus) {
> +            cpu_loop_exit_atomic(ENV_GET_CPU(env), ra);
> +        }
> +    }
> +}
> +
> +void HELPER(xend)(void)
> +{
> +    if (_xtest()) {
> +        _xend();
> +    } else {
> +        assert(!parallel_cpus);
> +        parallel_cpus = true;
> +    }
> +}
> +

Interesting idea.

FWIW, there are two other extant HTM implementations: ppc64 and s390x. As I recall, the s390 (but not the ppc64) transactions do not roll back the fp registers, which suggests that we need special support within the TCG prologue. Perhaps folding these operations into special TCG opcodes.

I believe that power8 has HTM, and there's one of those in the gcc compile farm, so this should be relatively easy to try out.

We increase the chances of success of the transaction if we minimize the amount of non-target code that's executed while the transaction is running. That suggests two things:

(1) that it would be doubly helpful to incorporate the transaction start directly into TCG code generation rather than as a helper, and

(2) that we should start a new TB upon encountering a load-exclusive, so that we maximize the chance of the store-exclusive being a part of the same TB and thus have *nothing* extra between the beginning and commit of the transaction.

r~
* Re: [Qemu-devel] [PATCH] aarch64: use TSX for ldrex/strex

From: Emilio G. Cota
Date: 2016-08-17 17:58 UTC
To: Richard Henderson
Cc: Alex Bennée, mttcg, qemu-devel, fred.konrad, a.rigo, bobby.prani, nikunj, mark.burton, pbonzini, jan.kiszka, serge.fdrv, peter.maydell, claudio.fontana, Dr. David Alan Gilbert, Peter Crosthwaite

On Wed, Aug 17, 2016 at 10:22:05 -0700, Richard Henderson wrote:
> On 08/15/2016 08:49 AM, Emilio G. Cota wrote:
> >+void HELPER(xbegin)(CPUARMState *env)
> >+{
> >+    uintptr_t ra = GETPC();
> >+    int status;
> >+    int retries = 100;
> >+
> >+ retry:
> >+    status = _xbegin();
> >+    if (status != _XBEGIN_STARTED) {
> >+        if (status && retries) {
> >+            retries--;
> >+            goto retry;
> >+        }
> >+        if (parallel_cpus) {
> >+            cpu_loop_exit_atomic(ENV_GET_CPU(env), ra);
> >+        }
> >+    }
> >+}
> >+
> >+void HELPER(xend)(void)
> >+{
> >+    if (_xtest()) {
> >+        _xend();
> >+    } else {
> >+        assert(!parallel_cpus);
> >+        parallel_cpus = true;
> >+    }
> >+}
> >+
>
> Interesting idea.
>
> FWIW, there are two other extant HTM implementations: ppc64 and s390x.
> As I recall, the s390 (but not the ppc64) transactions do not roll back
> the fp registers, which suggests that we need special support within
> the TCG prologue. Perhaps folding these operations into special TCG
> opcodes.

I'm not familiar with s390, but as long as the hardware implements 'strong atomicity' ["strong atomicity guarantees atomicity between transactions and non-transactional code"; see http://acg.cis.upenn.edu/papers/cal06_atomic_semantics.pdf], this approach would work, in the sense that stores wouldn't have to be instrumented. Of course, architectural issues like saving the fp registers, as you mention for s390, would have to be taken into account.

> I believe that power8 has HTM, and there's one of those in the gcc
> compile farm, so this should be relatively easy to try out.

Good point! I had forgotten about power8. So far my tests have been on a 4-core Skylake. I have an account on the gcc compile farm, so I will make use of it. The power8 machine in the farm has a lot of cores, so this is pretty exciting.

> We increase the chances of success of the transaction if we minimize
> the amount of non-target code that's executed while the transaction is
> running. That suggests two things:
>
> (1) that it would be doubly helpful to incorporate the transaction
> start directly into TCG code generation rather than as a helper, and

This (and leaving the fallback path in a helper) is simple enough that even I could do it :-)

> (2) that we should start a new TB upon encountering a load-exclusive,
> so that we maximize the chance of the store-exclusive being a part of
> the same TB and thus have *nothing* extra between the beginning and
> commit of the transaction.

I don't know how to do this. If it's easy to do, please let me know how (for aarch64 at least, since that's the target I'm using).

I've run some more tests on the Intel machine, and noticed that failed transactions are very common (up to a 50% abort rate for some SPEC workloads, and I count these as "retrying doesn't help" kinds of aborts), so bringing that down should definitely help.

Another thing I found out is that abusing cpu_exec_step (as it stands right now) for the fallback path is a bad idea: when there are many failed transactions, performance drops dramatically (up to a 5x overall slowdown). It turns out that all this overhead comes from re-translating the code between the ldrex and the strex. Would it be possible to cache this step-by-step code? If not, an alternative would be to have a way to stop the world *without* leaving the CPU loop for the calling thread. I'm more comfortable doing the latter, due to my glaring lack of TCG competence.

Thanks,

		Emilio
* Re: [Qemu-devel] [PATCH] aarch64: use TSX for ldrex/strex

From: Emilio G. Cota
Date: 2016-08-17 18:18 UTC
To: Richard Henderson
Cc: Alex Bennée, mttcg, qemu-devel, fred.konrad, a.rigo, bobby.prani, nikunj, mark.burton, pbonzini, jan.kiszka, serge.fdrv, peter.maydell, claudio.fontana, Dr. David Alan Gilbert, Peter Crosthwaite

On Wed, Aug 17, 2016 at 13:58:00 -0400, Emilio G. Cota wrote:
> due to my glaring lack of TCG competence.

A related note that might be of interest. I benchmarked an alternative implementation that *does* instrument stores. I wrapped every tcg_gen_qemu_st_i64 (those are enough, right? tcg_gen_st_i64 are stores to host memory, which I presume are not "explicit" guest stores and therefore would not go through the soft TLB) with a pre/post pair of helpers.

These helpers first check a bitmap given a masked subset of the physical address of the access, and if the bit is set, then check a QHT with the full physaddr. If an entry exists, they lock/unlock the entry's spinlock around the store, so that no race is possible with an ongoing atomic (atomics always take their corresponding lock).

Overhead is not too bad over cmpxchg, but most of it comes from the helpers. See these numbers for SPEC (NB. the "QEMU" baseline does *not* include QHT for tb_htable, and therefore takes tb_lock around tb_find_fast; that's why it's so slow):

  http://imgur.com/a/SoSHQ

"QHT only" means a QHT lookup is performed on every guest store. The win from having the bitmap in front of the QHT is quite large.

I wonder if things could be sped up further by performing the bitmap check in TCG code. Would that be worth exploring? If so, any help on that would be appreciated (i386 host at least). I tried, but I'm way out of my element.

		E.
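To make the shape of those helpers concrete, here is a sketch of the pre-store path (this code was not posted; ExclEntry, excl_bitmap, excl_ht, excl_entry_cmp, excl_hash, and EXCL_SHIFT are all hypothetical names):

    /* Hypothetical pre-store helper: cheap bitmap filter in front of the
     * QHT; the matching post-store helper would release e->lock. */
    static inline ExclEntry *store_tracking_pre(hwaddr paddr)
    {
        /* one bit covers a 2^EXCL_SHIFT-byte region of physical memory */
        if (likely(!test_bit(paddr >> EXCL_SHIFT, excl_bitmap))) {
            return NULL;        /* fast path: no atomic ever touched this */
        }
        /* slow path: look up the full physical address in the QHT */
        ExclEntry *e = qht_lookup(&excl_ht, excl_entry_cmp, &paddr,
                                  excl_hash(paddr));
        if (e) {
            qemu_spin_lock(&e->lock);  /* serialize vs. ongoing atomics */
        }
        return e;
    }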
* Re: [Qemu-devel] [PATCH] aarch64: use TSX for ldrex/strex

From: Richard Henderson
Date: 2016-08-17 18:41 UTC
To: Emilio G. Cota
Cc: Alex Bennée, mttcg, qemu-devel, fred.konrad, a.rigo, bobby.prani, nikunj, mark.burton, pbonzini, jan.kiszka, serge.fdrv, peter.maydell, claudio.fontana, Dr. David Alan Gilbert, Peter Crosthwaite

On 08/17/2016 10:58 AM, Emilio G. Cota wrote:
>> (2) that we should start a new TB upon encountering a load-exclusive,
>> so that we maximize the chance of the store-exclusive being a part of
>> the same TB and thus have *nothing* extra between the beginning and
>> commit of the transaction.
>
> I don't know how to do this. If it's easy to do, please let me know how
> (for aarch64 at least, since that's the target I'm using).

It's a simple matter of peeking at the next instruction.

One way is to partially decode the insn before advancing the PC:

  static void disas_a64_insn(CPUARMState *env, DisasContext *s, int num_insns)
  {
      uint32_t insn = arm_ldl_code(env, s->pc, s->sctlr_b);
  +
  +    if (num_insns > 1 && (insn & xxx) == yyy) {
  +        /* Start load-exclusive in a new TB. */
  +        s->is_jmp = DISAS_UPDATE;
  +        return;
  +    }
      s->insn = insn;
      s->pc += 4;
      ...

Alternately, store num_insns into DisasContext, and do pc -= 4 in disas_ldst_excl.

r~
* Re: [Qemu-devel] [PATCH] aarch64: use TSX for ldrex/strex

From: Richard Henderson
Date: 2016-08-18 15:38 UTC
To: Emilio G. Cota
Cc: Alex Bennée, mttcg, qemu-devel, fred.konrad, a.rigo, bobby.prani, nikunj, mark.burton, pbonzini, jan.kiszka, serge.fdrv, peter.maydell, claudio.fontana, Dr. David Alan Gilbert, Peter Crosthwaite

On 08/17/2016 11:41 AM, Richard Henderson wrote:
> On 08/17/2016 10:58 AM, Emilio G. Cota wrote:
>>> (2) that we should start a new TB upon encountering a load-exclusive,
>>> so that we maximize the chance of the store-exclusive being a part of
>>> the same TB and thus have *nothing* extra between the beginning and
>>> commit of the transaction.
>>
>> I don't know how to do this. If it's easy to do, please let me know how
>> (for aarch64 at least, since that's the target I'm using).
>
> It's a simple matter of peeking at the next instruction.
>
> One way is to partially decode the insn before advancing the PC [code
> snipped].
>
> Alternately, store num_insns into DisasContext, and do pc -= 4 in
> disas_ldst_excl.

Actually, the mask check is the only really viable solution, and it needs to happen before we do the tcg_gen_insn_start thing.

A couple of other notes, as I've thought about this some more.

If the start and end of the transaction are not in the same TB, the likelihood of transaction failure should be very near 100%. Consider:

* The TB with the ldrex ends before the strex.

* Since the next TB hasn't been built yet, we'll definitely go through tb_find_physical, through the translator, and through the tcg compiler:

  (a) which I think we can definitely assume will exhaust any resources associated with the transaction,
  (b) which will abort the transaction,
  (c) which, with the current code, will retry N times with identical results, failing within the compiler each time,
  (d) which, with the current code, will single-step through to the strex, as you saw.

* Since we proceed to (d) the first time, we'll never succeed in creating the next TB, so we'll always iterate compilation N times, resulting in the single-step. This is probably the real slow-down that you see.

Therefore, we must abort any transaction when we exit tcg-generated code, both through cpu_loop_exit and through the tcg epilogue. We should be able to use the software-controlled bits associated with the abort to tell what kind of event led to the abort. However, we must bear in mind that (for both x86 and ppc at least) we only have an 8-bit abort code, so we can't pass back a pointer, for instance (see the sketch after this message).

We should think about what kinds of limitations we should accept for handling ll/sc via transactions:

* How do we handle an unpaired ldrexd / ldxp? This is used by the compiler, as it's the only way to perform a double-word atomic load. This implies that we need some sort of counter, beyond which we stop trying to succeed via transaction.

* In order to make normal cmpxchg patterns work, we have to be able to handle a branch within a ll/sc sequence. Options:

  * The less complex way is to build a TB, including branches, with a max of N insns along the branch-not-taken path, searching for the strex. Of course this fails to handle legitimate patterns for arm (and other ll/sc guests); however, gcc code generation will generally annotate the cmpxchg failure branch as not-taken, so perhaps this will work well enough in practice.

  * The more complex way is to build a TB, including branches, with a max of N insns along *all* paths, searching for the strex. This runs into problems with, among other things, branches crossing pages.

  * The most complex way is to somehow get all of the TBs built, and linked together, preferably before we even try executing (and failing the transaction in) the first TB.

r~
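To illustrate the 8-bit limit: on x86 the only payload a transactional abort can carry back is the immediate operand of _xabort (a standalone sketch; ABORT_CPU_EXIT is a made-up reason code):

    #include <immintrin.h>

    #define ABORT_CPU_EXIT 0x01          /* hypothetical 8-bit reason code */

    static int aborted_by_exit_path(void)
    {
        unsigned status = _xbegin();

        if (status == _XBEGIN_STARTED) {
            /* ... transactional work; the exit path would then do: */
            _xabort(ABORT_CPU_EXIT);     /* immediate must be a constant */
        }
        /* after the abort, execution resumes here with the code in status */
        return (status & _XABORT_EXPLICIT) &&
               _XABORT_CODE(status) == ABORT_CPU_EXIT;
    }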
* Re: [Qemu-devel] [PATCH] aarch64: use TSX for ldrex/strex

From: Emilio G. Cota
Date: 2016-08-24 21:12 UTC
To: Richard Henderson
Cc: Alex Bennée, mttcg, qemu-devel, fred.konrad, a.rigo, bobby.prani, nikunj, mark.burton, pbonzini, jan.kiszka, serge.fdrv, peter.maydell, claudio.fontana, Dr. David Alan Gilbert, Peter Crosthwaite

On Thu, Aug 18, 2016 at 08:38:47 -0700, Richard Henderson wrote:
> A couple of other notes, as I've thought about this some more.

Thanks for spending time on this.

I have a new patchset (will send as a reply to this e-mail in a few minutes) that has good performance. Its main ideas:

- Use transactions that start on ldrex and finish on strex. On an exception, end (instead of abort) the ongoing transaction, if any. There's little point in aborting, since the subsequent retries would end up in the same exception anyway. This means the translation of the corresponding blocks might happen via the fallback path; that's OK, given that subsequent executions of the TBs will (likely) complete via HTM.

- For the fallback path, add a stop-the-world primitive that stops all other CPUs, without requiring the calling CPU to exit the CPU loop. Not breaking out of the loop keeps the code simple: we can just keep translating/executing normally, with the guarantee that no other CPU can run until we're done.

- The fallback path of the transaction stops the world and then continues execution (from the ldrex) as the only running CPU.

- Only retry when the hardware hints that we may do so. This ends up being rare (I can only get dozens of retries under heavy contention, for instance with 'atomic_add-bench -r 1').

Limitations: for now user-mode only, and I have paid no attention to paired atomics. Also, I'm making no checks for unusual (undefined?) guest code, such as stray ldrex/strex thrown in there.

Performance optimizations like you suggest (e.g. starting a TB on ldrex, or using TCG ops to begin/end the transaction) could be implemented, but at least on Intel TSX (the only one I've tried so far [*]) the transaction buffer seems big enough not to make these optimizations a necessity.

[*] I tried running HTM primitives on the gcc compile farm's Power8, but I get an illegal instruction fault on tbegin. I've filed an issue to report it: https://gna.org/support/?3369

Some observations:

- The peak number of retries I see is for atomic_add-bench -r 1 -n 16 (on an 8-thread machine), at about ~90 retries. So I set the limit to 100.

- The lowest success rate I've seen is ~98%, again for atomic_add-bench under high contention.

Some numbers:

- atomic_add's performance is lower for HTM than for cmpxchg, although under contention performance gets very similar. The reason for the perf gap is that xbegin/xend takes more cycles than cmpxchg, especially under little or no contention; this explains the large difference for threads=1:
  http://imgur.com/5kiT027
  As a side note, contended transactions seem to scale worse than contended cmpxchg when exploiting SMT, but anyway I wouldn't read much into that.

- For more realistic workloads that gap goes away, as the relative impact of the cmpxchg or transaction delays is lower. For QHT, 1000 keys:
  http://imgur.com/l6vcowu
  And for SPEC (note that despite being single-threaded, SPEC executes a lot of atomics, e.g. from mutexes and from forking):
  http://imgur.com/W49YMhJ
  Performance is essentially identical to that of cmpxchg, but of course with HTM we get correct emulation.

Thanks for reading this far!

		Emilio
* [Qemu-devel] [PATCH 1/8] cpu list: convert to RCU QLIST

From: Emilio G. Cota
Date: 2016-08-24 22:17 UTC
To: Richard Henderson
Cc: alex.bennee, mttcg, qemu-devel, fred.konrad, a.rigo, bobby.prani, nikunj, mark.burton, pbonzini, jan.kiszka, serge.fdrv, peter.maydell, claudio.fontana, dgilbert, crosthwaite.peter

This avoids the chance of reading a corrupted list of CPUs in usermode.

Note: this breaks hw/ppc/spapr due to the removal of CPU_FOREACH_REVERSE.

Signed-off-by: Emilio G. Cota <cota@braap.org>
---
 cpus.c               |  2 +-
 exec.c               | 18 +++++++++++++++---
 include/qom/cpu.h    | 16 +++++++--------
 linux-user/main.c    |  2 +-
 linux-user/syscall.c |  2 +-
 5 files changed, 25 insertions(+), 15 deletions(-)

diff --git a/cpus.c b/cpus.c
index a01bbbd..bc573be 100644
--- a/cpus.c
+++ b/cpus.c
@@ -1177,7 +1177,7 @@ static void *qemu_tcg_cpu_thread_fn(void *arg)
                 qemu_clock_notify(QEMU_CLOCK_VIRTUAL);
             }
         }
-        qemu_tcg_wait_io_event(QTAILQ_FIRST(&cpus));
+        qemu_tcg_wait_io_event(first_cpu);
         CPU_FOREACH(cpu) {
             if (cpu->unplug && !cpu_can_run(cpu)) {
                 remove_cpu = cpu;
diff --git a/exec.c b/exec.c
index 806e2fe..70dd869 100644
--- a/exec.c
+++ b/exec.c
@@ -93,7 +93,7 @@ static MemoryRegion io_mem_unassigned;
 
 #endif
 
-struct CPUTailQ cpus = QTAILQ_HEAD_INITIALIZER(cpus);
+struct CPUTailQ cpus = QLIST_HEAD_INITIALIZER(cpus);
 /* current CPU in the current thread. It is only valid inside cpu_exec() */
 __thread CPUState *current_cpu;
 
@@ -651,7 +651,7 @@ void cpu_exec_exit(CPUState *cpu)
         return;
     }
 
-    QTAILQ_REMOVE(&cpus, cpu, node);
+    QLIST_REMOVE_RCU(cpu, node);
     cpu_release_index(cpu);
     cpu->cpu_index = -1;
 #if defined(CONFIG_USER_ONLY)
@@ -703,7 +703,19 @@ void cpu_exec_init(CPUState *cpu, Error **errp)
 #endif
         return;
     }
-    QTAILQ_INSERT_TAIL(&cpus, cpu, node);
+    /* poor man's QLIST_INSERT_TAIL_RCU */
+    if (QLIST_EMPTY_RCU(&cpus)) {
+        QLIST_INSERT_HEAD_RCU(&cpus, cpu, node);
+    } else {
+        CPUState *some_cpu;
+
+        CPU_FOREACH(some_cpu) {
+            if (QLIST_NEXT_RCU(some_cpu, node) == NULL) {
+                QLIST_INSERT_AFTER_RCU(some_cpu, cpu, node);
+                break;
+            }
+        }
+    }
 #if defined(CONFIG_USER_ONLY)
     (void) cc;
     cpu_list_unlock();
diff --git a/include/qom/cpu.h b/include/qom/cpu.h
index 32f3af3..eba48ed 100644
--- a/include/qom/cpu.h
+++ b/include/qom/cpu.h
@@ -24,7 +24,7 @@
 #include "disas/bfd.h"
 #include "exec/hwaddr.h"
 #include "exec/memattrs.h"
-#include "qemu/queue.h"
+#include "qemu/rcu_queue.h"
 #include "qemu/thread.h"
 
 typedef int (*WriteCoreDumpFunction)(const void *buf, size_t size,
@@ -319,7 +319,7 @@ struct CPUState {
     struct GDBRegisterState *gdb_regs;
     int gdb_num_regs;
     int gdb_num_g_regs;
-    QTAILQ_ENTRY(CPUState) node;
+    QLIST_ENTRY(CPUState) node;
 
     /* ice debug support */
     QTAILQ_HEAD(breakpoints_head, CPUBreakpoint) breakpoints;
@@ -362,15 +362,13 @@ struct CPUState {
     uint32_t tcg_exit_req;
 };
 
-QTAILQ_HEAD(CPUTailQ, CPUState);
+QLIST_HEAD(CPUTailQ, CPUState);
 extern struct CPUTailQ cpus;
-#define CPU_NEXT(cpu) QTAILQ_NEXT(cpu, node)
-#define CPU_FOREACH(cpu) QTAILQ_FOREACH(cpu, &cpus, node)
+#define CPU_NEXT(cpu) QLIST_NEXT_RCU(cpu, node)
+#define CPU_FOREACH(cpu) QLIST_FOREACH_RCU(cpu, &cpus, node)
 #define CPU_FOREACH_SAFE(cpu, next_cpu) \
-    QTAILQ_FOREACH_SAFE(cpu, &cpus, node, next_cpu)
-#define CPU_FOREACH_REVERSE(cpu) \
-    QTAILQ_FOREACH_REVERSE(cpu, &cpus, CPUTailQ, node)
-#define first_cpu QTAILQ_FIRST(&cpus)
+    QLIST_FOREACH_SAFE_RCU(cpu, &cpus, node, next_cpu)
+#define first_cpu QLIST_FIRST_RCU(&cpus)
 
 extern __thread CPUState *current_cpu;
 
diff --git a/linux-user/main.c b/linux-user/main.c
index f2f7422..9880505 100644
--- a/linux-user/main.c
+++ b/linux-user/main.c
@@ -131,7 +131,7 @@ void fork_end(int child)
            Discard information about the parent threads.  */
         CPU_FOREACH_SAFE(cpu, next_cpu) {
             if (cpu != thread_cpu) {
-                QTAILQ_REMOVE(&cpus, cpu, node);
+                QLIST_REMOVE_RCU(cpu, node);
             }
         }
         pending_cpus = 0;
diff --git a/linux-user/syscall.c b/linux-user/syscall.c
index 1c17b74..2911319 100644
--- a/linux-user/syscall.c
+++ b/linux-user/syscall.c
@@ -6710,7 +6710,7 @@ abi_long do_syscall(void *cpu_env, int num, abi_long arg1,
             cpu_list_lock();
             /* Remove the CPU from the list.  */
-            QTAILQ_REMOVE(&cpus, cpu, node);
+            QLIST_REMOVE_RCU(cpu, node);
             cpu_list_unlock();
             ts = cpu->opaque;
             if (ts->child_tidptr) {
--
2.5.0
* [Qemu-devel] [PATCH 2/8] cpu-exec: remove tb_lock from hot path

From: Emilio G. Cota
Date: 2016-08-24 22:17 UTC
To: Richard Henderson
Cc: alex.bennee, mttcg, qemu-devel, fred.konrad, a.rigo, bobby.prani, nikunj, mark.burton, pbonzini, jan.kiszka, serge.fdrv, peter.maydell, claudio.fontana, dgilbert, crosthwaite.peter

Signed-off-by: Emilio G. Cota <cota@braap.org>
---
 cpu-exec.c | 38 +++++++++++---------------------------
 1 file changed, 11 insertions(+), 27 deletions(-)

diff --git a/cpu-exec.c b/cpu-exec.c
index 041f8b7..63d739a 100644
--- a/cpu-exec.c
+++ b/cpu-exec.c
@@ -309,34 +309,18 @@ static TranslationBlock *tb_find_slow(CPUState *cpu,
     TranslationBlock *tb;
 
     tb = tb_find_physical(cpu, pc, cs_base, flags);
-    if (tb) {
-        goto found;
-    }
-
-#ifdef CONFIG_USER_ONLY
-    /* mmap_lock is needed by tb_gen_code, and mmap_lock must be
-     * taken outside tb_lock. Since we're momentarily dropping
-     * tb_lock, there's a chance that our desired tb has been
-     * translated.
-     */
-    tb_unlock();
-    mmap_lock();
-    tb_lock();
-    tb = tb_find_physical(cpu, pc, cs_base, flags);
-    if (tb) {
+    if (!tb) {
+        mmap_lock();
+        tb_lock();
+        tb = tb_find_physical(cpu, pc, cs_base, flags);
+        if (!tb) {
+            /* if no translated code available, then translate it now */
+            tb = tb_gen_code(cpu, pc, cs_base, flags, 0);
+        }
+        tb_unlock();
         mmap_unlock();
-        goto found;
     }
-#endif
-
-    /* if no translated code available, then translate it now */
-    tb = tb_gen_code(cpu, pc, cs_base, flags, 0);
-
-#ifdef CONFIG_USER_ONLY
-    mmap_unlock();
-#endif
-
-found:
     /* we add the TB in the virtual pc hash table */
     cpu->tb_jmp_cache[tb_jmp_cache_hash_func(pc)] = tb;
     return tb;
@@ -355,7 +339,6 @@ static inline TranslationBlock *tb_find_fast(CPUState *cpu,
        always be the same before a given translated block
        is executed. */
     cpu_get_tb_cpu_state(env, &pc, &cs_base, &flags);
-    tb_lock();
     tb = cpu->tb_jmp_cache[tb_jmp_cache_hash_func(pc)];
     if (unlikely(!tb || tb->pc != pc || tb->cs_base != cs_base ||
                  tb->flags != flags)) {
@@ -379,9 +362,10 @@ static inline TranslationBlock *tb_find_fast(CPUState *cpu,
 #endif
     /* See if we can patch the calling TB. */
     if (*last_tb && !qemu_loglevel_mask(CPU_LOG_TB_NOCHAIN)) {
+        tb_lock();
         tb_add_jump(*last_tb, tb_exit, tb);
+        tb_unlock();
     }
-    tb_unlock();
     return tb;
 }
--
2.5.0
* [Qemu-devel] [PATCH 3/8] rcu: add rcu_read_lock_held()

From: Emilio G. Cota
Date: 2016-08-24 22:17 UTC
To: Richard Henderson
Cc: alex.bennee, mttcg, qemu-devel, fred.konrad, a.rigo, bobby.prani, nikunj, mark.burton, pbonzini, jan.kiszka, serge.fdrv, peter.maydell, claudio.fontana, dgilbert, crosthwaite.peter

Signed-off-by: Emilio G. Cota <cota@braap.org>
---
 include/qemu/rcu.h | 7 +++++++
 1 file changed, 7 insertions(+)

diff --git a/include/qemu/rcu.h b/include/qemu/rcu.h
index 83ae280..0f6e467 100644
--- a/include/qemu/rcu.h
+++ b/include/qemu/rcu.h
@@ -98,6 +98,13 @@ static inline void rcu_read_unlock(void)
     }
 }
 
+static inline bool rcu_read_lock_held(void)
+{
+    struct rcu_reader_data *p_rcu_reader = &rcu_reader;
+
+    return p_rcu_reader->depth > 0;
+}
+
 extern void synchronize_rcu(void);
 
 /*
--
2.5.0
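As with the Linux kernel primitive of the same name, the natural use for this helper is in debug assertions that document locking contracts; a hypothetical caller:

    /* hypothetical user: assert the RCU read-side contract in debug builds */
    static CPUState *find_cpu_by_index(int index)
    {
        CPUState *cpu;

        assert(rcu_read_lock_held());  /* caller must hold the RCU read lock */
        CPU_FOREACH(cpu) {             /* RCU-safe traversal after patch 1/8 */
            if (cpu->cpu_index == index) {
                return cpu;
            }
        }
        return NULL;
    }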
* [Qemu-devel] [PATCH 4/8] target-arm: helper fixup for paired atomics

From: Emilio G. Cota
Date: 2016-08-24 22:17 UTC
To: Richard Henderson
Cc: alex.bennee, mttcg, qemu-devel, fred.konrad, a.rigo, bobby.prani, nikunj, mark.burton, pbonzini, jan.kiszka, serge.fdrv, peter.maydell, claudio.fontana, dgilbert, crosthwaite.peter

Signed-off-by: Emilio G. Cota <cota@braap.org>
---
 target-arm/helper-a64.c | 4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/target-arm/helper-a64.c b/target-arm/helper-a64.c
index 8ce518b..6f3fd17 100644
--- a/target-arm/helper-a64.c
+++ b/target-arm/helper-a64.c
@@ -453,7 +453,7 @@ uint64_t HELPER(crc32c_64)(uint64_t acc, uint64_t val, uint32_t bytes)
 uint64_t HELPER(paired_cmpxchg64_le)(CPUARMState *env, uint64_t addr,
                                      uint64_t new_lo, uint64_t new_hi)
 {
-#ifndef CONFIG_USER_ONLY
+#if !defined(CONFIG_USER_ONLY) || !defined(CONFIG_ATOMIC128)
     uintptr_t ra = GETPC();
 #endif
     Int128 oldv, cmpv, newv;
@@ -518,7 +518,7 @@ uint64_t HELPER(paired_cmpxchg64_le)(CPUARMState *env, uint64_t addr,
 uint64_t HELPER(paired_cmpxchg64_be)(CPUARMState *env, uint64_t addr,
                                      uint64_t new_lo, uint64_t new_hi)
 {
-#ifndef CONFIG_USER_ONLY
+#if !defined(CONFIG_USER_ONLY) || !defined(CONFIG_ATOMIC128)
     uintptr_t ra = GETPC();
 #endif
     Int128 oldv, cmpv, newv;
--
2.5.0
* [Qemu-devel] [PATCH 5/8] linux-user: add stop-the-world to be called from CPU loop

From: Emilio G. Cota
Date: 2016-08-24 22:18 UTC
To: Richard Henderson
Cc: alex.bennee, mttcg, qemu-devel, fred.konrad, a.rigo, bobby.prani, nikunj, mark.burton, pbonzini, jan.kiszka, serge.fdrv, peter.maydell, claudio.fontana, dgilbert, crosthwaite.peter

Signed-off-by: Emilio G. Cota <cota@braap.org>
---
 cpu-exec.c              |  1 +
 include/exec/exec-all.h |  5 +++
 linux-user/main.c       | 89 +++++++++++++++++++++++++++++++++++++++++++++++++
 linux-user/syscall.c    |  1 +
 4 files changed, 96 insertions(+)

diff --git a/cpu-exec.c b/cpu-exec.c
index 63d739a..8f1adc4 100644
--- a/cpu-exec.c
+++ b/cpu-exec.c
@@ -649,6 +649,7 @@ int cpu_exec(CPUState *cpu)
             g_assert(cc == CPU_GET_CLASS(cpu));
 #endif /* buggy compiler */
             cpu->can_do_io = 1;
+            stop_the_world_reset();
             tb_lock_reset();
         }
     } /* for(;;) */
diff --git a/include/exec/exec-all.h b/include/exec/exec-all.h
index ec72c5a..c483d80 100644
--- a/include/exec/exec-all.h
+++ b/include/exec/exec-all.h
@@ -61,6 +61,11 @@ void QEMU_NORETURN cpu_loop_exit(CPUState *cpu);
 void QEMU_NORETURN cpu_loop_exit_restore(CPUState *cpu, uintptr_t pc);
 void QEMU_NORETURN cpu_loop_exit_atomic(CPUState *cpu, uintptr_t pc);
 
+void stop_the_world_lock(CPUState *cpu);
+void stop_the_world_unlock(void);
+void stop_the_world_reset(void);
+extern __thread bool stw_held;
+
 #if !defined(CONFIG_USER_ONLY)
 void cpu_reloading_memory_map(void);
 /**
diff --git a/linux-user/main.c b/linux-user/main.c
index 9880505..94c6625 100644
--- a/linux-user/main.c
+++ b/linux-user/main.c
@@ -114,11 +114,19 @@ static pthread_cond_t exclusive_cond = PTHREAD_COND_INITIALIZER;
 static pthread_cond_t exclusive_resume = PTHREAD_COND_INITIALIZER;
 static int pending_cpus;
 
+static pthread_cond_t stw_sleep_cond = PTHREAD_COND_INITIALIZER;
+static pthread_cond_t stw_request_cond = PTHREAD_COND_INITIALIZER;
+static pthread_mutex_t stw_lock = PTHREAD_MUTEX_INITIALIZER;
+static int stw_requests;
+static bool stw_ongoing;
+__thread bool stw_held;
+
 /* Make sure everything is in a consistent state for calling fork().  */
 void fork_start(void)
 {
     qemu_mutex_lock(&tcg_ctx.tb_ctx.tb_lock);
     pthread_mutex_lock(&exclusive_lock);
+    pthread_mutex_lock(&stw_lock);
     mmap_fork_start();
 }
 
@@ -137,11 +145,17 @@ void fork_end(int child)
         pending_cpus = 0;
         pthread_mutex_init(&exclusive_lock, NULL);
         pthread_mutex_init(&cpu_list_mutex, NULL);
+        pthread_mutex_init(&stw_lock, NULL);
+        stw_held = false;
+        stw_ongoing = false;
         pthread_cond_init(&exclusive_cond, NULL);
         pthread_cond_init(&exclusive_resume, NULL);
+        pthread_cond_init(&stw_sleep_cond, NULL);
+        pthread_cond_init(&stw_request_cond, NULL);
         qemu_mutex_init(&tcg_ctx.tb_ctx.tb_lock);
         gdbserver_fork(thread_cpu);
     } else {
+        pthread_mutex_unlock(&stw_lock);
         pthread_mutex_unlock(&exclusive_lock);
         qemu_mutex_unlock(&tcg_ctx.tb_ctx.tb_lock);
     }
@@ -198,6 +212,79 @@ static void step_atomic(CPUState *cpu)
     end_exclusive();
 }
 
+void stop_the_world_lock(CPUState *cpu)
+{
+    CPUState *other;
+
+    if (stw_held) {
+        return;
+    }
+    rcu_read_unlock();
+    assert(!rcu_read_lock_held());
+
+    pthread_mutex_lock(&stw_lock);
+    if (stw_ongoing) {
+        stw_requests++;
+        /* wait for ongoing stops to occur */
+        while (stw_ongoing) {
+            pthread_cond_wait(&stw_request_cond, &stw_lock);
+        }
+        stw_requests--;
+    }
+
+    /* it's our turn! */
+    stw_ongoing = true;
+    stw_held = true;
+    CPU_FOREACH(other) {
+        if (other != cpu) {
+            cpu_exit(other);
+        }
+    }
+    synchronize_rcu();
+}
+
+void stop_the_world_unlock(void)
+{
+    if (!stw_held) {
+        return;
+    }
+    assert(stw_ongoing);
+    assert(!rcu_read_lock_held());
+
+    if (stw_requests) {
+        pthread_cond_signal(&stw_request_cond);
+    } else {
+        pthread_cond_broadcast(&stw_sleep_cond);
+    }
+    /*
+     * Make sure the next STW requester (if any) will perceive that we're
+     * in an RCU read critical section
+     */
+    rcu_read_lock();
+    stw_ongoing = false;
+    stw_held = false;
+    pthread_mutex_unlock(&stw_lock);
+}
+
+void stop_the_world_reset(void)
+{
+    if (likely(!stw_held)) {
+        return;
+    }
+    stop_the_world_unlock();
+}
+
+static inline void stop_the_world_sleep(void)
+{
+    pthread_mutex_lock(&stw_lock);
+    if (unlikely(stw_ongoing)) {
+        while (stw_ongoing) {
+            pthread_cond_wait(&stw_sleep_cond, &stw_lock);
+        }
+    }
+    pthread_mutex_unlock(&stw_lock);
+}
+
 /* Wait for exclusive ops to finish, and begin cpu execution.  */
 static inline void cpu_exec_start(CPUState *cpu)
 {
@@ -205,6 +292,8 @@ static inline void cpu_exec_start(CPUState *cpu)
     exclusive_idle();
     cpu->running = true;
     pthread_mutex_unlock(&exclusive_lock);
+
+    stop_the_world_sleep();
 }
 
 /* Mark cpu as not executing, and release pending exclusive ops.  */
diff --git a/linux-user/syscall.c b/linux-user/syscall.c
index 2911319..740af23 100644
--- a/linux-user/syscall.c
+++ b/linux-user/syscall.c
@@ -5403,6 +5403,7 @@ static void *clone_func(void *arg)
     /* Wait until the parent has finshed initializing the tls state.  */
     pthread_mutex_lock(&clone_lock);
     pthread_mutex_unlock(&clone_lock);
+    stw_held = false;
     cpu_loop(env); /* never exits */
     return NULL;
--
2.5.0
* [Qemu-devel] [PATCH 6/8] htm: add header to abstract Hardware Transactional Memory intrinsics

From: Emilio G. Cota
Date: 2016-08-24 22:18 UTC
To: Richard Henderson
Cc: alex.bennee, mttcg, qemu-devel, fred.konrad, a.rigo, bobby.prani, nikunj, mark.burton, pbonzini, jan.kiszka, serge.fdrv, peter.maydell, claudio.fontana, dgilbert, crosthwaite.peter

Signed-off-by: Emilio G. Cota <cota@braap.org>
---
 include/qemu/htm.h | 43 +++++++++++++++++++++++++++++++++++++++++++
 1 file changed, 43 insertions(+)
 create mode 100644 include/qemu/htm.h

diff --git a/include/qemu/htm.h b/include/qemu/htm.h
new file mode 100644
index 0000000..dc84bc1
--- /dev/null
+++ b/include/qemu/htm.h
@@ -0,0 +1,43 @@
+#ifndef HTM_H
+#define HTM_H
+
+enum htm {
+    HTM_OK,
+    HTM_ABORT_RETRY,
+    HTM_ABORT_NORETRY,
+};
+
+#if defined(__x86_64__)
+/* compile with -mrtm */
+#include <immintrin.h>
+
+static inline enum htm htm_begin(void)
+{
+    int status;
+
+    status = _xbegin();
+    if (unlikely(status != _XBEGIN_STARTED)) {
+        if (status & _XABORT_RETRY) {
+            return HTM_ABORT_RETRY;
+        }
+        return HTM_ABORT_NORETRY;
+    }
+    return HTM_OK;
+}
+
+static inline void htm_end(void)
+{
+    _xend();
+}
+
+static inline bool htm_test(void)
+{
+    return _xtest();
+}
+
+static inline void htm_abort(void)
+{
+    _xabort(0);
+}
+#endif /* ISA */
+#endif /* HTM_H */
--
2.5.0
* [Qemu-devel] [PATCH 7/8] htm: add powerpc64 intrinsics

From: Emilio G. Cota
Date: 2016-08-24 22:18 UTC
To: Richard Henderson
Cc: alex.bennee, mttcg, qemu-devel, fred.konrad, a.rigo, bobby.prani, nikunj, mark.burton, pbonzini, jan.kiszka, serge.fdrv, peter.maydell, claudio.fontana, dgilbert, crosthwaite.peter

Signed-off-by: Emilio G. Cota <cota@braap.org>
---
 include/qemu/htm.h | 39 +++++++++++++++++++++++++++++++++++++++
 1 file changed, 39 insertions(+)

diff --git a/include/qemu/htm.h b/include/qemu/htm.h
index dc84bc1..f367ee4 100644
--- a/include/qemu/htm.h
+++ b/include/qemu/htm.h
@@ -39,5 +39,44 @@ static inline void htm_abort(void)
 {
     _xabort(0);
 }
+
+#elif defined(__powerpc64__)
+/* compile with -mhtm */
+#include <htmintrin.h>
+
+static inline int htm_begin(void)
+{
+    unsigned int status;
+
+    status = __builtin_tbegin(0);
+    if (likely(status)) {
+        return HTM_OK;
+    }
+    if (_TEXASRU_FAILURE_PERSISTENT(__builtin_get_texasru())) {
+        return HTM_ABORT_NORETRY;
+    }
+    return HTM_ABORT_RETRY;
+}
+
+static inline void htm_end(void)
+{
+    __builtin_tend(0);
+}
+
+static inline int htm_test(void)
+{
+    unsigned char state = _HTM_STATE(__builtin_ttest());
+
+    if (likely(state == _HTM_TRANSACTIONAL)) {
+        return 1;
+    }
+    return 0;
+}
+
+static inline void htm_abort(void)
+{
+    __builtin_tabort(0);
+}
+
 #endif /* ISA */
 #endif /* HTM_H */
--
2.5.0
* [Qemu-devel] [PATCH 8/8] target-arm/a64: use HTM with stop-the-world fall-back path
  @ 2016-08-24 22:18 ` Emilio G. Cota
  6 siblings, 0 replies; 21+ messages in thread
From: Emilio G. Cota @ 2016-08-24 22:18 UTC (permalink / raw)
To: Richard Henderson
Cc: alex.bennee, mttcg, qemu-devel, fred.konrad, a.rigo, bobby.prani,
    nikunj, mark.burton, pbonzini, jan.kiszka, serge.fdrv, peter.maydell,
    claudio.fontana, dgilbert, crosthwaite.peter

TODO: convert paired atomics as well.

Signed-off-by: Emilio G. Cota <cota@braap.org>
---
 cpu-exec.c                 |  4 ++++
 target-arm/helper-a64.c    | 31 +++++++++++++++++++++++++++++++
 target-arm/helper-a64.h    |  4 ++++
 target-arm/op_helper.c     |  4 ++++
 target-arm/translate-a64.c | 16 ++++++++++------
 5 files changed, 53 insertions(+), 6 deletions(-)

diff --git a/cpu-exec.c b/cpu-exec.c
index 8f1adc4..6e2531f 100644
--- a/cpu-exec.c
+++ b/cpu-exec.c
@@ -26,6 +26,7 @@
 #include "sysemu/qtest.h"
 #include "qemu/timer.h"
 #include "exec/address-spaces.h"
+#include "qemu/htm.h"
 #include "qemu/rcu.h"
 #include "exec/tb-hash.h"
 #include "exec/log.h"
@@ -651,6 +652,9 @@ int cpu_exec(CPUState *cpu)
             cpu->can_do_io = 1;
             stop_the_world_reset();
             tb_lock_reset();
+            if (unlikely(htm_test())) {
+                htm_end();
+            }
         }
     } /* for(;;) */
diff --git a/target-arm/helper-a64.c b/target-arm/helper-a64.c
index 6f3fd17..741e6de 100644
--- a/target-arm/helper-a64.c
+++ b/target-arm/helper-a64.c
@@ -25,6 +25,7 @@
 #include "qemu/log.h"
 #include "sysemu/sysemu.h"
 #include "qemu/bitops.h"
+#include "qemu/htm.h"
 #include "internals.h"
 #include "qemu/crc32c.h"
 #include "exec/exec-all.h"
@@ -579,3 +580,33 @@ uint64_t HELPER(paired_cmpxchg64_be)(CPUARMState *env, uint64_t addr,

     return !success;
 }
+
+void HELPER(xbegin)(CPUARMState *env)
+{
+    int status;
+    int retries = 100;
+
+ retry:
+    status = htm_begin();
+    if (unlikely(status != HTM_OK)) {
+        if ((status & HTM_ABORT_RETRY) && retries) {
+            retries--;
+            goto retry;
+        }
+        stop_the_world_lock(ENV_GET_CPU(env));
+    }
+}
+
+void HELPER(xend)(void)
+{
+    if (likely(htm_test())) {
+        htm_end();
+    } else {
+        stop_the_world_unlock();
+    }
+}
+
+uint64_t HELPER(x_ok)(void)
+{
+    return likely(htm_test()) || stw_held;
+}
diff --git a/target-arm/helper-a64.h b/target-arm/helper-a64.h
index dd32000..e7ede43 100644
--- a/target-arm/helper-a64.h
+++ b/target-arm/helper-a64.h
@@ -48,3 +48,7 @@ DEF_HELPER_FLAGS_3(crc32_64, TCG_CALL_NO_RWG_SE, i64, i64, i64, i32)
 DEF_HELPER_FLAGS_3(crc32c_64, TCG_CALL_NO_RWG_SE, i64, i64, i64, i32)
 DEF_HELPER_FLAGS_4(paired_cmpxchg64_le, TCG_CALL_NO_WG, i64, env, i64, i64, i64)
 DEF_HELPER_FLAGS_4(paired_cmpxchg64_be, TCG_CALL_NO_WG, i64, env, i64, i64, i64)
+
+DEF_HELPER_1(xbegin, void, env)
+DEF_HELPER_0(x_ok, i64)
+DEF_HELPER_0(xend, void)
diff --git a/target-arm/op_helper.c b/target-arm/op_helper.c
index 73da759..91b1413 100644
--- a/target-arm/op_helper.c
+++ b/target-arm/op_helper.c
@@ -17,6 +17,7 @@
  * License along with this library; if not, see <http://www.gnu.org/licenses/>.
  */
 #include "qemu/osdep.h"
+#include "qemu/htm.h"
 #include "cpu.h"
 #include "exec/helper-proto.h"
 #include "internals.h"
@@ -31,6 +32,9 @@ static void raise_exception(CPUARMState *env, uint32_t excp,
 {
     CPUState *cs = CPU(arm_env_get_cpu(env));

+    if (unlikely(htm_test())) {
+        htm_end();
+    }
     assert(!excp_is_internal(excp));
     cs->exception_index = excp;
     env->exception.syndrome = syndrome;
diff --git a/target-arm/translate-a64.c b/target-arm/translate-a64.c
index 450c359..cc3baa0 100644
--- a/target-arm/translate-a64.c
+++ b/target-arm/translate-a64.c
@@ -1760,6 +1760,8 @@ static void gen_load_exclusive(DisasContext *s, int rt, int rt2,
     TCGv_i64 tmp = tcg_temp_new_i64();
     TCGMemOp be = s->be_data;

+    gen_helper_xbegin(cpu_env);
+
     g_assert(size <= 3);
     if (is_pair) {
         TCGv_i64 hitmp = tcg_temp_new_i64();
@@ -1825,6 +1827,10 @@ static void gen_store_exclusive(DisasContext *s, int rd, int rt, int rt2,
     tcg_gen_brcond_i64(TCG_COND_NE, addr, cpu_exclusive_addr, fail_label);

     tmp = tcg_temp_new_i64();
+    /* strex without a prior ldrex should just fail */
+    gen_helper_x_ok(tmp);
+    tcg_gen_brcondi_i64(TCG_COND_EQ, tmp, 0, fail_label);
+
     if (is_pair) {
         if (size == 2) {
             TCGv_i64 val = tcg_temp_new_i64();
@@ -1844,16 +1850,14 @@ static void gen_store_exclusive(DisasContext *s, int rd, int rt, int rt2,
         }
     } else {
         TCGv_i64 val = cpu_reg(s, rt);
-        tcg_gen_atomic_cmpxchg_i64(tmp, addr, cpu_exclusive_val, val,
-                                   get_mem_index(s),
-                                   size | MO_ALIGN | s->be_data);
-        tcg_gen_setcond_i64(TCG_COND_NE, tmp, tmp, cpu_exclusive_val);
+        tcg_gen_qemu_st_i64(val, addr, get_mem_index(s), s->be_data + size);
     }

     tcg_temp_free_i64(addr);
-
-    tcg_gen_mov_i64(cpu_reg(s, rd), tmp);
     tcg_temp_free_i64(tmp);
+
+    tcg_gen_movi_i64(cpu_reg(s, rd), 0);
+    gen_helper_xend();
     tcg_gen_br(done_label);

     gen_set_label(fail_label);
-- 
2.5.0
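The semantics this buys can be seen in a toy model: because everything from
the ldrex to the strex executes atomically (inside a transaction, or under
the stop-the-world lock on the fallback path), the strex can use a plain
store instead of a cmpxchg. The following is a self-contained sketch, not
QEMU code; a single mutex stands in for both mechanisms, and the names
load_exclusive/store_exclusive are illustrative:

    /* Toy model of the ldrex/strex scheme above; compile with -pthread. */
    #include <inttypes.h>
    #include <pthread.h>
    #include <stdint.h>
    #include <stdio.h>

    enum { ITERS = 100000 };

    static pthread_mutex_t atomic_region = PTHREAD_MUTEX_INITIALIZER;
    static uint64_t mem;                      /* the guest memory word */

    /* ldrex: enter the atomic region, then load */
    static uint64_t load_exclusive(void)
    {
        pthread_mutex_lock(&atomic_region);   /* xbegin: HTM or STW lock */
        return mem;
    }

    /* strex: plain store, then commit; cannot fail once the region is held */
    static int store_exclusive(uint64_t val)
    {
        mem = val;
        pthread_mutex_unlock(&atomic_region); /* xend: commit or STW unlock */
        return 0;                             /* 0 == success, as in the patch */
    }

    static void *increment(void *arg)
    {
        for (int i = 0; i < ITERS; i++) {
            store_exclusive(load_exclusive() + 1);
        }
        return NULL;
    }

    int main(void)
    {
        pthread_t a, b;

        pthread_create(&a, NULL, increment, NULL);
        pthread_create(&b, NULL, increment, NULL);
        pthread_join(a, NULL);
        pthread_join(b, NULL);
        printf("%" PRIu64 " (expect %d)\n", mem, 2 * ITERS);
        return 0;
    }

The real patch gets the same effect optimistically: htm_begin() opens a
transaction instead of taking a lock, and only repeated aborts fall back to
serializing every vCPU.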
* Re: [Qemu-devel] MTTCG status updates, benchmark results and KVM forum plans
  @ 2016-08-16 11:16 ` Alex Bennée
  1 sibling, 1 reply; 21+ messages in thread
From: Alex Bennée @ 2016-08-16 11:16 UTC (permalink / raw)
To: Emilio G. Cota
Cc: mttcg, qemu-devel, fred.konrad, a.rigo, bobby.prani, nikunj,
    mark.burton, pbonzini, jan.kiszka, serge.fdrv, rth, peter.maydell,
    claudio.fontana, Dr. David Alan Gilbert, Peter Crosthwaite

Emilio G. Cota <cota@braap.org> writes:

> On Mon, Aug 15, 2016 at 11:46:32 +0100, Alex Bennée wrote:
>> As far as I'm aware the following work is still ongoing:
>>
>> Emilio: cmpxchg atomics
>> Alvise: LL/SC modelling
>
> I've been tinkering with an experimental patch to do proper LL/SC. The idea
> is to rely on hardware transactional memory, so that stores don't have
> to be tracked. The trickiest thing is the fallback path, for which I'm
> trying to (ab)use EXCP_ATOMIC to execute exclusively from the ldrex
> all the way to the strex.
>
> To test it, I'm using aarch64-linux-user running qht-bench compiled on
> an aarch64 machine. I'm running on an Intel Skylake host (Skylake has
> no known TSX bugs).
>
> However, I'm finding issues that might not have to do with the
> patch itself.
>
> - On the latest MTTCG+cmpxchg tree (45c11751ed7 a.k.a.
>   bennee/mttcg/base-patches-v4-with-cmpxchg-atomics-v2), QEMU loops
>   forever without making progress in the instruction stream, even
>   with taskset -c 0.

Could this be a store-after-load barrier issue? I have a branch that adds
Pranith's work:

  https://github.com/stsquad/qemu/tree/mttcg/base-patches-v4-with-cmpxchg-atomics-v2-and-barriers-v4

This seems to have eliminated some of the failure modes (usually the
kernel complaining about stalled tasks), but I'm still seeing my test
case fail from time to time when starting the benchmark task. Currently
I'm not seeing much information about why it's failing to start, though.

> - On the cmpxchg tree (rth's atomic-2 branch [1]), it works more
>   reliably, although tb_lock is held around tb_find_fast so parallelism
>   isn't very high. Still, it sometimes triggers the assert below.
> - Applying the "remove tb_lock around hot path" patch makes it
>   easier to trigger this assert in cpu-exec.c:650 (approx.):
>     /* Assert that the compiler does not smash local variables. */
>     g_assert(cpu == current_cpu)
>   I've also seen the assert immediately after that one trigger, as well
>   as the rcu_read_unlock depth assert.

Odd - these are remnants of a dodgy compiler.

> The asserts are usually triggered when all threads exit (by returning
> NULL) at roughly the same time.
> However, they cannot be triggered with taskset -c 0, which makes me
> suspect that somehow start_exclusive isn't working as intended.
>
> Any tips would be appreciated! I'll reply with a patch that uses RTM;
> the one below is the fallback path all the way, and the best way to
> reproduce the above.

I'll see if I can reproduce the errors you're seeing on my setup.

>
> Thanks,
>
> 		Emilio
>
> [1] https://github.com/rth7680/qemu/commits/atomic-2
>
> From ed6af6eb364e5a36e81d7cc8143c0e9783c50587 Mon Sep 17 00:00:00 2001
> From: "Emilio G. Cota" <cota@braap.org>
> Date: Mon, 15 Aug 2016 00:27:42 +0200
> Subject: [PATCH] aarch64: use TSX for ldrex/strex (fallback path only)
>
> Signed-off-by: Emilio G. Cota <cota@braap.org>
> ---
>  linux-user/main.c          |  5 +++--
>  target-arm/helper-a64.c    | 23 +++++++++++++++++++++++
>  target-arm/helper-a64.h    |  4 ++++
>  target-arm/translate-a64.c | 15 +++++++++------
>  4 files changed, 39 insertions(+), 8 deletions(-)
>
> diff --git a/linux-user/main.c b/linux-user/main.c
> index 9880505..6922faa 100644
> --- a/linux-user/main.c
> +++ b/linux-user/main.c
> @@ -192,8 +192,9 @@ static void step_atomic(CPUState *cpu)
>
>      /* Since we got here, we know that parallel_cpus must be true. */
>      parallel_cpus = false;
> -    cpu_exec_step(cpu);
> -    parallel_cpus = true;
> +    while (!parallel_cpus) {
> +        cpu_exec_step(cpu);
> +    }
>
>      end_exclusive();
>  }
> diff --git a/target-arm/helper-a64.c b/target-arm/helper-a64.c
> index 8ce518b..a97b631 100644
> --- a/target-arm/helper-a64.c
> +++ b/target-arm/helper-a64.c
> @@ -579,3 +579,26 @@ uint64_t HELPER(paired_cmpxchg64_be)(CPUARMState *env, uint64_t addr,
>
>      return !success;
>  }
> +
> +void HELPER(xbegin)(CPUARMState *env)
> +{
> +    uintptr_t ra = GETPC();
> +
> +    if (parallel_cpus) {
> +        cpu_loop_exit_atomic(ENV_GET_CPU(env), ra);
> +    }
> +}
> +
> +void HELPER(xend)(void)
> +{
> +    assert(!parallel_cpus);
> +    parallel_cpus = true;
> +}
> +
> +uint64_t HELPER(x_ok)(void)
> +{
> +    if (!parallel_cpus) {
> +        return 1;
> +    }
> +    return 0;
> +}
> diff --git a/target-arm/helper-a64.h b/target-arm/helper-a64.h
> index dd32000..e7ede43 100644
> --- a/target-arm/helper-a64.h
> +++ b/target-arm/helper-a64.h
> @@ -48,3 +48,7 @@ DEF_HELPER_FLAGS_3(crc32_64, TCG_CALL_NO_RWG_SE, i64, i64, i64, i32)
>  DEF_HELPER_FLAGS_3(crc32c_64, TCG_CALL_NO_RWG_SE, i64, i64, i64, i32)
>  DEF_HELPER_FLAGS_4(paired_cmpxchg64_le, TCG_CALL_NO_WG, i64, env, i64, i64, i64)
>  DEF_HELPER_FLAGS_4(paired_cmpxchg64_be, TCG_CALL_NO_WG, i64, env, i64, i64, i64)
> +
> +DEF_HELPER_1(xbegin, void, env)
> +DEF_HELPER_0(x_ok, i64)
> +DEF_HELPER_0(xend, void)
> diff --git a/target-arm/translate-a64.c b/target-arm/translate-a64.c
> index 450c359..cfcf440 100644
> --- a/target-arm/translate-a64.c
> +++ b/target-arm/translate-a64.c
> @@ -1760,6 +1760,8 @@ static void gen_load_exclusive(DisasContext *s, int rt, int rt2,
>      TCGv_i64 tmp = tcg_temp_new_i64();
>      TCGMemOp be = s->be_data;
>
> +    gen_helper_xbegin(cpu_env);
> +
>      g_assert(size <= 3);
>      if (is_pair) {
>          TCGv_i64 hitmp = tcg_temp_new_i64();
> @@ -1825,6 +1827,9 @@ static void gen_store_exclusive(DisasContext *s, int rd, int rt, int rt2,
>      tcg_gen_brcond_i64(TCG_COND_NE, addr, cpu_exclusive_addr, fail_label);
>
>      tmp = tcg_temp_new_i64();
> +    gen_helper_x_ok(tmp);
> +    tcg_gen_brcondi_i64(TCG_COND_EQ, tmp, 0, fail_label);
> +
>      if (is_pair) {
>          if (size == 2) {
>              TCGv_i64 val = tcg_temp_new_i64();
> @@ -1844,16 +1849,14 @@ static void gen_store_exclusive(DisasContext *s, int rd, int rt, int rt2,
>          }
>      } else {
>          TCGv_i64 val = cpu_reg(s, rt);
> -        tcg_gen_atomic_cmpxchg_i64(tmp, addr, cpu_exclusive_val, val,
> -                                   get_mem_index(s),
> -                                   size | MO_ALIGN | s->be_data);
> -        tcg_gen_setcond_i64(TCG_COND_NE, tmp, tmp, cpu_exclusive_val);
> +        tcg_gen_qemu_st_i64(val, addr, get_mem_index(s), s->be_data + size);
>      }
>
>      tcg_temp_free_i64(addr);
> -
> -    tcg_gen_mov_i64(cpu_reg(s, rd), tmp);
>      tcg_temp_free_i64(tmp);
> +
> +    tcg_gen_movi_i64(cpu_reg(s, rd), 0);
> +    gen_helper_xend();
>      tcg_gen_br(done_label);
>
>      gen_set_label(fail_label);

-- 
Alex Bennée
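For reference, the fallback protocol in the quoted patch works by flipping
parallel_cpus off, re-executing one instruction at a time from the ldrex
while holding exclusive access, and letting the strex's xend helper flip it
back on. A toy model of that loop follows (illustrative only, not QEMU code;
a mutex stands in for linux-user's start_exclusive()/end_exclusive(), and a
fake program counter stands in for the guest instruction stream):

    #include <pthread.h>
    #include <stdbool.h>
    #include <stdio.h>

    static pthread_mutex_t exclusive = PTHREAD_MUTEX_INITIALIZER;
    static bool parallel_cpus = true;

    static void helper_xend(void)
    {
        parallel_cpus = true;         /* strex reached: leave exclusive mode */
    }

    static void cpu_exec_step(int *pc)
    {
        /* "execute" one instruction; the strex lives at pc == 3 */
        if (*pc == 3) {
            helper_xend();
        }
        (*pc)++;
    }

    static void step_atomic(void)
    {
        int pc = 0;                   /* restart at the ldrex */

        pthread_mutex_lock(&exclusive);   /* start_exclusive() stand-in */
        parallel_cpus = false;
        while (!parallel_cpus) {          /* loop until xend is executed */
            cpu_exec_step(&pc);
        }
        pthread_mutex_unlock(&exclusive); /* end_exclusive() stand-in */
    }

    int main(void)
    {
        step_atomic();
        printf("exclusive section retired, parallel_cpus=%d\n",
               parallel_cpus);
        return 0;
    }

The hazard, as the next message explains, is that the real instruction
stream between ldrex and strex is arbitrary, so this loop can hit an
instruction that raises a guest exception instead of ever reaching xend.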
* Re: [Qemu-devel] MTTCG status updates, benchmark results and KVM forum plans
  @ 2016-08-16 21:51 ` Emilio G. Cota
  0 siblings, 0 replies; 21+ messages in thread
From: Emilio G. Cota @ 2016-08-16 21:51 UTC (permalink / raw)
To: Alex Bennée
Cc: mttcg, qemu-devel, fred.konrad, a.rigo, bobby.prani, nikunj,
    mark.burton, pbonzini, jan.kiszka, serge.fdrv, rth, peter.maydell,
    claudio.fontana, Dr. David Alan Gilbert, Peter Crosthwaite

On Tue, Aug 16, 2016 at 12:16:26 +0100, Alex Bennée wrote:
> Emilio G. Cota <cota@braap.org> writes:
> > However, I'm finding issues that might not have to do with the
> > patch itself.

I had some time today to dig deeper -- it turns out the issues *have* to
do with my patch, see below. (And sorry for hijacking this thread.)

> > - Applying the "remove tb_lock around hot path" patch makes it
> >   easier to trigger this assert in cpu-exec.c:650 (approx.):
> >     /* Assert that the compiler does not smash local variables. */
> >     g_assert(cpu == current_cpu)
> >   I've also seen the assert immediately after that one trigger, as well
> >   as the rcu_read_unlock depth assert.
>
> Odd - these are remnants of a dodgy compiler.

The problem is that by calling cpu_exec_step() in a loop, we don't know
what instructions we might execute. Thus, when one of those instructions
(sandwiched between an ldrex and strex) causes an exception (e.g. SVC in
A64), we take the longjmp that lands in cpu_exec's loop, which we did
*not* come from. That explains the odd asserts being triggered.

The reason this is only triggered when pthreads are joined is that the
code there is particularly tricky, with branches and SVC between
ldrex/strex pairs.

The good news is that this still allows me to benchmark the TSX code
vs. cmpxchg (I just print out the results before joining); for 4 cores
(8 HW threads), qht-bench performs just as well with TSX as with cmpxchg
(but with TSX we get full correctness). For 1 thread, atomic_add is
faster with cmpxchg, but the gap is greatly reduced as contention
increases. This gap is due to the fixed cost of calling _xbegin/_xend,
which is quite a few more instructions than just emitting an atomic.

		Emilio
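The "smashed locals" asserts follow directly from C's setjmp/longjmp rules:
a non-volatile automatic variable modified after setjmp() has an
indeterminate value once longjmp() returns there (C11 7.13.2.1). A minimal,
self-contained illustration of that hazard, unrelated to QEMU's code (the
extra twist in QEMU is that the longjmp can additionally land in a loop
iteration it never came from):

    /* Compile with optimization, e.g. gcc -O2, to see the value go stale. */
    #include <setjmp.h>
    #include <stdio.h>

    static jmp_buf env;

    static void do_exception(void)
    {
        longjmp(env, 1);     /* models the guest exception's siglongjmp */
    }

    int main(void)
    {
        int current = 0;     /* models cpu_exec()'s 'cpu' local */

        if (setjmp(env) == 0) {
            current = 42;    /* modified after setjmp() ... */
            do_exception();  /* ... then control jumps back past it */
        }
        /* 'current' may legally read back as 0 here under -O2: the compiler
         * was free to keep it in a register that longjmp() restored. */
        printf("current = %d\n", current);
        return 0;
    }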
Thread overview: 21+ messages (newest: 2016-08-24 22:18 UTC)
2016-08-15 10:46 [Qemu-devel] MTTCG status updates, benchmark results and KVM forum plans Alex Bennée
2016-08-15 11:00 ` Peter Maydell
2016-08-15 11:16 ` Alex Bennée
2016-08-15 15:46 ` Emilio G. Cota
2016-08-15 15:49 ` [Qemu-devel] [PATCH] aarch64: use TSX for ldrex/strex Emilio G. Cota
2016-08-17 17:22 ` Richard Henderson
2016-08-17 17:58 ` Emilio G. Cota
2016-08-17 18:18 ` Emilio G. Cota
2016-08-17 18:41 ` Richard Henderson
2016-08-18 15:38 ` Richard Henderson
2016-08-24 21:12 ` Emilio G. Cota
2016-08-24 22:17 ` [Qemu-devel] [PATCH 1/8] cpu list: convert to RCU QLIST Emilio G. Cota
2016-08-24 22:17 ` [Qemu-devel] [PATCH 2/8] cpu-exec: remove tb_lock from hot path Emilio G. Cota
2016-08-24 22:17 ` [Qemu-devel] [PATCH 3/8] rcu: add rcu_read_lock_held() Emilio G. Cota
2016-08-24 22:17 ` [Qemu-devel] [PATCH 4/8] target-arm: helper fixup for paired atomics Emilio G. Cota
2016-08-24 22:18 ` [Qemu-devel] [PATCH 5/8] linux-user: add stop-the-world to be called from CPU loop Emilio G. Cota
2016-08-24 22:18 ` [Qemu-devel] [PATCH 6/8] htm: add header to abstract Hardware Transactional Memory intrinsics Emilio G. Cota
2016-08-24 22:18 ` [Qemu-devel] [PATCH 7/8] htm: add powerpc64 intrinsics Emilio G. Cota
2016-08-24 22:18 ` [Qemu-devel] [PATCH 8/8] target-arm/a64: use HTM with stop-the-world fall-back path Emilio G. Cota
2016-08-16 11:16 ` [Qemu-devel] MTTCG status updates, benchmark results and KVM forum plans Alex Bennée
2016-08-16 21:51 ` Emilio G. Cota