From: "Emilio G. Cota" <cota@braap.org>
To: Richard Henderson <rth@twiddle.net>
Cc: "Alex Bennée" <alex.bennee@linaro.org>,
	mttcg@greensocs.com, qemu-devel@nongnu.org,
	fred.konrad@greensocs.com, a.rigo@virtualopensystems.com,
	bobby.prani@gmail.com, nikunj@linux.vnet.ibm.com,
	mark.burton@greensocs.com, pbonzini@redhat.com,
	jan.kiszka@siemens.com, serge.fdrv@gmail.com,
	peter.maydell@linaro.org, claudio.fontana@huawei.com,
	"Dr. David Alan Gilbert" <dgilbert@redhat.com>,
	"Peter Crosthwaite" <crosthwaite.peter@gmail.com>
Subject: Re: [Qemu-devel] [PATCH] aarch64: use TSX for ldrex/strex
Date: Wed, 24 Aug 2016 17:12:40 -0400
Message-ID: <20160824211240.GA26546@flamenco>
In-Reply-To: <5b81580e-0b6a-7b30-60a1-3c34548e7997@twiddle.net>

On Thu, Aug 18, 2016 at 08:38:47 -0700, Richard Henderson wrote:
> A couple of other notes, as I've thought about this some more.

Thanks for spending time on this.

I have a new patchset (will send as a reply to this e-mail in a few
minutes) that has good performance. Its main ideas:

- Use transactions that start on ldrex and finish on strex. On
  an exception, end (instead of abort) the ongoing transaction,
  if any. There's little point in aborting, since the subsequent
  retries will end up in the same exception anyway. This means
  the translation of the corresponding blocks might happen via
  the fallback path. That's OK, given that subsequent executions
  of the TBs will (likely) complete via HTM.

- For the fallback path, add a stop-the-world primitive that stops
  all other CPUs, without requiring the calling CPU to exit the CPU loop.
  Not breaking from the loop keeps the code simple--we can just
  keep translating/executing normally, with the guarantee that
  no other CPU can run until we're done.

- The fallback path of the transaction stops the world and then
  continues execution (from ldrex) as the only running CPU.

- Only retry when the hardware hints that a retry may succeed. This
  ends up being rare (I can only get dozens of retries under
  heavy contention, for instance with 'atomic_add-bench -r 1').
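
As a rough illustration of the retry policy in the list above (a sketch,
not the code from the patchset): the loop below retries only when the
abort status carries a retry hint, and gives up after a fixed number of
attempts. TXN_STARTED and TXN_ABORT_RETRY are hypothetical names modeled
on Intel RTM's _XBEGIN_STARTED and _XABORT_RETRY; they are redefined
locally so the sketch builds on machines without TSX. txn_enter, struct
script and scripted_begin are likewise made up for this example, and the
limit of 100 is the one quoted in the observations further down.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical status values, modeled on Intel RTM (<immintrin.h>). */
#define TXN_STARTED      (~0u)      /* transaction is now running */
#define TXN_ABORT_RETRY  (1u << 1)  /* hardware hints a retry may succeed */
#define TXN_MAX_RETRIES  100        /* retry limit quoted in this mail */

enum txn_path { PATH_HTM, PATH_FALLBACK };

/* Stand-in for the "begin transaction" primitive (xbegin on TSX). */
typedef unsigned (*txn_begin_fn)(void *opaque);

/*
 * Try to enter a transaction. Retry only when the abort status has the
 * retry hint set, and at most TXN_MAX_RETRIES times. On PATH_FALLBACK the
 * caller is expected to stop the world and re-execute from ldrex.
 * *retries receives the number of retries that were attempted.
 */
static enum txn_path txn_enter(txn_begin_fn begin, void *opaque,
                               unsigned *retries)
{
    unsigned n = 0;

    assert(begin != NULL && retries != NULL);
    for (;;) {
        unsigned status = begin(opaque);

        if (status == TXN_STARTED) {
            *retries = n;
            return PATH_HTM;
        }
        if (!(status & TXN_ABORT_RETRY) || n >= TXN_MAX_RETRIES) {
            *retries = n;
            return PATH_FALLBACK;
        }
        n++;
    }
}

/* Test double: abort a scripted number of times, then start. */
struct script {
    unsigned aborts_before_start;
    unsigned abort_status;
};

static unsigned scripted_begin(void *opaque)
{
    struct script *s = opaque;

    if (s->aborts_before_start == 0) {
        return TXN_STARTED;
    }
    s->aborts_before_start--;
    return s->abort_status;
}
```

With a script of three retryable aborts, txn_enter ends on PATH_HTM after
three retries; an abort without the retry hint goes straight to
PATH_FALLBACK. On real hardware the begin callback would wrap the
architecture's own primitive instead of the scripted stand-in.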

Limitations: for now user-mode only, and I have paid no attention
to paired atomics. Also, I'm making no checks for unusual (undefined?)
guest code, such as stray ldrex/strex thrown in there.

Performance optimizations like you suggest (e.g. starting a TB
on ldrex, or using TCG ops for beginning/ending the transaction)
could be implemented, but at least on Intel TSX (the only one I've
tried so far[*]), the transaction buffer seems big enough to not
make these optimizations a necessity.

[*] I tried running HTM primitives on the gcc compile farm's Power8,
  but I get an illegal instruction fault on tbegin. I've filed
  an issue to report it: https://gna.org/support/?3369

Some observations:

- The peak number of retries I see is for atomic_add-bench -r 1 -n 16
  (on an 8-thread machine), at about 90 retries. So I set the limit
  to 100.

- The lowest success rate I've seen is ~98%, again for atomic_add-bench
  under high contention.

Some numbers:

- atomic_add's performance is lower for HTM vs cmpxchg, although under
  contention performance gets very similar. The reason for the perf
  gap is that xbegin/xend takes more cycles than cmpxchg, especially
  under little or no contention; this explains the large difference
  for threads=1.
  http://imgur.com/5kiT027
  As a side note, contended transactions seem to scale worse than contended
  cmpxchg when exploiting SMT. But anyway I wouldn't read much into
  that.

- For more realistic workloads that gap goes away, as the relative impact
  of cmpxchg or transaction delays is lower. For QHT, 1000 keys:
  http://imgur.com/l6vcowu
  And for SPEC (note that despite being single-threaded, SPEC executes
  a lot of atomics, e.g. from mutexes and from forking):
  http://imgur.com/W49YMhJ
  Performance is essentially identical to that of cmpxchg, but of course
  with HTM we get correct emulation.
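
To make the "correct emulation" point concrete, here is a minimal sketch
of why a cmpxchg-based ldrex/strex can succeed where real hardware would
most likely fail (the ABA case). It assumes an emulation that remembers
the loaded value and compares against it at strex time; struct vcpu,
emu_ldrex, emu_strex and aba_demo are hypothetical names for this example
and are not taken from QEMU.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical per-vCPU state for a value-comparing emulation. */
struct vcpu {
    _Atomic uint32_t *excl_addr;  /* address marked by the last ldrex */
    uint32_t excl_val;            /* value observed by the last ldrex */
};

/* Emulated load-exclusive: remember the address and the value seen. */
static uint32_t emu_ldrex(struct vcpu *cpu, _Atomic uint32_t *addr)
{
    assert(cpu != NULL && addr != NULL);
    cpu->excl_addr = addr;
    cpu->excl_val = atomic_load(addr);
    return cpu->excl_val;
}

/*
 * Emulated store-exclusive: succeeds iff memory still holds the value
 * seen at ldrex time. Returns true when the store was performed.
 */
static bool emu_strex(struct vcpu *cpu, _Atomic uint32_t *addr,
                      uint32_t newval)
{
    uint32_t expected = cpu->excl_val;

    if (cpu->excl_addr != addr) {
        return false;             /* no matching ldrex outstanding */
    }
    cpu->excl_addr = NULL;        /* the reservation is consumed either way */
    return atomic_compare_exchange_strong(addr, &expected, newval);
}

/*
 * ABA scenario: the location goes 1 -> 2 -> 1 between ldrex and strex.
 * Returns true if the emulated strex nevertheless succeeded.
 */
static bool aba_demo(void)
{
    _Atomic uint32_t mem = 1;
    struct vcpu cpu = { NULL, 0 };

    emu_ldrex(&cpu, &mem);
    atomic_store(&mem, 2);        /* stands in for a store by another CPU */
    atomic_store(&mem, 1);        /* ... which then restores the old value */
    return emu_strex(&cpu, &mem, 5);
}
```

Here aba_demo() returns true: the compare sees the original value and
the store goes through, even though the location was written twice in
between. With the whole ldrex..strex sequence inside a hardware
transaction, the intervening stores would conflict with the transaction
instead of going unnoticed.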

Thanks for reading this far!

		Emilio

Thread overview: 21+ messages
2016-08-15 10:46 [Qemu-devel] MTTCG status updates, benchmark results and KVM forum plans Alex Bennée
2016-08-15 11:00 ` Peter Maydell
2016-08-15 11:16   ` Alex Bennée
2016-08-15 15:46 ` Emilio G. Cota
2016-08-15 15:49   ` [Qemu-devel] [PATCH] aarch64: use TSX for ldrex/strex Emilio G. Cota
2016-08-17 17:22     ` Richard Henderson
2016-08-17 17:58       ` Emilio G. Cota
2016-08-17 18:18         ` Emilio G. Cota
2016-08-17 18:41         ` Richard Henderson
2016-08-18 15:38           ` Richard Henderson
2016-08-24 21:12             ` Emilio G. Cota [this message]
2016-08-24 22:17               ` [Qemu-devel] [PATCH 1/8] cpu list: convert to RCU QLIST Emilio G. Cota
2016-08-24 22:17                 ` [Qemu-devel] [PATCH 2/8] cpu-exec: remove tb_lock from hot path Emilio G. Cota
2016-08-24 22:17                 ` [Qemu-devel] [PATCH 3/8] rcu: add rcu_read_lock_held() Emilio G. Cota
2016-08-24 22:17                 ` [Qemu-devel] [PATCH 4/8] target-arm: helper fixup for paired atomics Emilio G. Cota
2016-08-24 22:18                 ` [Qemu-devel] [PATCH 5/8] linux-user: add stop-the-world to be called from CPU loop Emilio G. Cota
2016-08-24 22:18                 ` [Qemu-devel] [PATCH 6/8] htm: add header to abstract Hardware Transactional Memory intrinsics Emilio G. Cota
2016-08-24 22:18                 ` [Qemu-devel] [PATCH 7/8] htm: add powerpc64 intrinsics Emilio G. Cota
2016-08-24 22:18                 ` [Qemu-devel] [PATCH 8/8] target-arm/a64: use HTM with stop-the-world fall-back path Emilio G. Cota
2016-08-16 11:16   ` [Qemu-devel] MTTCG status updates, benchmark results and KVM forum plans Alex Bennée
2016-08-16 21:51     ` Emilio G. Cota
