Date: Wed, 21 Sep 2016 18:14:19 -0400
From: "Emilio G. Cota"
To: Paolo Bonzini
Cc: qemu-devel@nongnu.org, serge.fdrv@gmail.com, alex.bennee@linaro.org, sergey.fedorov@linaro.org
Subject: Re: [Qemu-devel] [PATCH 16/16] cpus-common: lock-free fast path for cpu_exec_start/end
Message-ID: <20160921221419.GA30386@flamenco>
In-Reply-To: <803ceaca-088a-99b8-1a43-821d3507fd9a@redhat.com>

On Wed, Sep 21, 2016 at 20:19:18 +0200, Paolo Bonzini wrote:
(snip)
> No, this is not true. Barriers order stores and loads within a thread
> _and_ establish synchronizes-with edges.
>
> In the example above you are violating causality:
>
> - cpu0 stores cpu->running before loading pending_cpus
>
> - because pending_cpus == 0, cpu1 stores pending_cpus = 1 after cpu0
>   loads it
>
> - cpu1 loads cpu->running after it stores pending_cpus

OK. So I simplified the example to understand this better:

    cpu0                        cpu1
    ----                        ----
    { x = y = 0; r0 and r1 are private variables }

    x = 1                       y = 1
    smp_mb()                    smp_mb()
    r0 = y                      r1 = x

Turns out this is scenario 10 here: https://lwn.net/Articles/573436/
With both barriers in place, the outcome r0 == 0 && r1 == 0 is
forbidden: at least one CPU must observe the other's store. The source
of my confusion was not paying due attention to the smp_mb()'s, which
are necessary to maintain transitivity.

> > Is there a performance (scalability) reason behind this patch?
>
> Yes: it speeds up all cpu_exec_start/end, _not_ start/end_exclusive.
>
> With this patch, as long as there are no start/end_exclusive (which are
> supposed to be rare) there is no contention on multiple CPUs doing
> cpu_exec_start/end.
>
> Without it, as CPUs increase, the global cpu_list_mutex is going to
> become a bottleneck.

I see. Scalability-wise I wouldn't expect much improvement for MTTCG
full-system emulation, given that the iothread lock is still acquired
on every CPU loop exit (just like in KVM). For user-mode, however,
this should yield measurable improvements =D

Thanks,

		E.
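
P.S. For the archive: here is the litmus test above spelled out with
C11 atomics, as a compile-checked sketch. atomic_thread_fence with
memory_order_seq_cst stands in for smp_mb(); the names are mine, this
is not QEMU code:

    #include <pthread.h>
    #include <stdatomic.h>
    #include <assert.h>
    #include <stdio.h>

    static atomic_int x, y;     /* zero-initialized, i.e. x = y = 0 */
    static int r0, r1;          /* private results, read after join */

    static void *cpu0(void *arg)
    {
        (void)arg;
        atomic_store_explicit(&x, 1, memory_order_relaxed);
        atomic_thread_fence(memory_order_seq_cst);      /* smp_mb() */
        r0 = atomic_load_explicit(&y, memory_order_relaxed);
        return NULL;
    }

    static void *cpu1(void *arg)
    {
        (void)arg;
        atomic_store_explicit(&y, 1, memory_order_relaxed);
        atomic_thread_fence(memory_order_seq_cst);      /* smp_mb() */
        r1 = atomic_load_explicit(&x, memory_order_relaxed);
        return NULL;
    }

    int main(void)
    {
        pthread_t t0, t1;

        pthread_create(&t0, NULL, cpu0, NULL);
        pthread_create(&t1, NULL, cpu1, NULL);
        pthread_join(t0, NULL);
        pthread_join(t1, NULL);

        /* with both fences, r0 == 0 && r1 == 0 is forbidden */
        assert(r0 == 1 || r1 == 1);
        printf("r0=%d r1=%d\n", r0, r1);
        return 0;
    }

Build with -pthread; the assertion should never fire. Drop either
fence and the forbidden outcome becomes reachable even on x86, since
TSO allows exactly the store-load reordering this test exercises.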
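
P.P.S. To connect this back to the patch: as I understand it,
cpu_exec_start() plays the role of cpu0 above and start_exclusive()
that of cpu1. A rough self-contained sketch of that pairing, with
stand-in globals and the slow paths elided -- illustrative only, not
the actual patch:

    #include <stdatomic.h>
    #include <stdbool.h>

    static atomic_int pending_cpus;  /* stand-in for the global counter */
    static atomic_bool running;      /* stand-in for cpu->running */

    /* Fast path of cpu_exec_start(): publish running, then check
     * whether an exclusive section is pending. */
    static void exec_start_sketch(void)
    {
        atomic_store_explicit(&running, true, memory_order_relaxed);
        atomic_thread_fence(memory_order_seq_cst);      /* smp_mb() */
        if (atomic_load_explicit(&pending_cpus, memory_order_relaxed)) {
            /* slow path: take the CPU list lock and wait for the
             * exclusive section to finish (elided) */
        }
    }

    /* start_exclusive(): publish pending_cpus, then check which CPUs
     * are already running and must be waited for. */
    static void start_exclusive_sketch(void)
    {
        atomic_store_explicit(&pending_cpus, 1, memory_order_relaxed);
        atomic_thread_fence(memory_order_seq_cst);      /* smp_mb() */
        if (atomic_load_explicit(&running, memory_order_relaxed)) {
            /* wait for that CPU to leave cpu_exec (elided) */
        }
    }

The SB guarantee is exactly what makes this safe: the two stores
cannot both go unobserved, so either the entering CPU sees
pending_cpus != 0 and takes the slow path, or the exclusive starter
sees it running and waits for it.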