From: Malcolm Crossley
Subject: Re: [PATCH v2] Xen: Spread boot time page scrubbing across all available CPUs
Date: Mon, 30 Sep 2013 14:56:48 +0100
Message-ID: <52498320.7080405@citrix.com>
In-Reply-To: <5249981902000078000F80F2@nat28.tlf.novell.com>
References: <5249981902000078000F80F2@nat28.tlf.novell.com>
To: Jan Beulich
Cc: andrew.cooper3@citrix.com, tim@xen.org, keir@xen.org, xen-devel

On 30/09/13 14:26, Jan Beulich wrote:
>>>> On 30.09.13 at 14:35, Malcolm Crossley wrote:
>> The page scrubbing is done in 128MB chunks in lockstep across all the CPUs.
>> This allows the boot CPU to hold the heap_lock while each chunk is being
>> scrubbed and then release the heap_lock when all CPUs have finished
>> scrubbing their individual chunks. This avoids holding the heap_lock
>> continuously and allows pending softirqs to be serviced periodically
>> across all CPUs.
>>
>> The page scrub memory chunks are allocated to the CPUs in a NUMA-aware
>> fashion to reduce socket interconnect overhead and improve performance.
>>
>> This patch reduces the boot page scrub time on a 128GB, 64-core AMD
>> Opteron 6386 machine from 49 seconds to 3 seconds.
> And is this a NUMA system with heavily different access times
> between local and remote memory?

The AMD 64-core system has 8 NUMA nodes with up to 2 hops between nodes.

This page shows some data on the round-trip bandwidth between different cores:
http://www.cl.cam.ac.uk/research/srg/netos/ipc-bench/details/tmpn2YlFp.html

This paper also shows the difference in memory bandwidth with NUMA-aware threads:
http://www.cs.uchicago.edu/files/tr_authentic/TR-2011-02.pdf
The unstrided results are the only ones we're interested in.

>
> What I'm trying to understand before reviewing the actual patch
> is whether what you do is really necessary: Generally it ought to
> be sufficient to have one CPU on each node scrub that node's
> memory, as a CPU should be able to saturate the bus if it does
> (almost) nothing but memory writes. Hence having multiple cores
> on the same socket (not to speak of multiple threads in a core)
> do this work in parallel is likely not going to be beneficial, and
> hence the logic you're adding here might be more complex than
> necessary.

The second paper cited above shows that roughly 3 times as many cores as
NUMA nodes are required to reach peak NUMA memory bandwidth.

The difference between 1-core-per-node and 3-cores-per-node bandwidth is:

AMD:   30000 MB/s (1 core) vs 48000 MB/s (3 cores)
Intel: 12000 MB/s (1 core) vs 38000 MB/s (3 cores)

So I think it's worth the extra complexity to have multiple cores per node
scrubbing memory.

Malcolm

> Jan
>
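For illustration, below is a minimal, self-contained sketch of the lockstep
chunk pattern the patch description refers to. It is generic C with pthreads,
not the actual Xen code: the worker/coordinator split, the chunk and worker
counts, and the plain mutex standing in for the heap_lock are all made up for
the example, and the point where Xen would call process_pending_softirqs() is
only marked with a comment.

/* Generic illustration of lockstep chunked scrubbing -- not the Xen patch.
 * All names and sizes here are invented for the example. */
#include <pthread.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

#define NR_WORKERS   4
#define CHUNK_BYTES  (1UL << 20)               /* stand-in for the 128MB chunks */
#define TOTAL_BYTES  (16UL * CHUNK_BYTES)

static unsigned char *memory;                  /* stand-in for the heap to scrub */
static pthread_mutex_t heap_lock = PTHREAD_MUTEX_INITIALIZER;
static pthread_barrier_t chunk_barrier;        /* keeps all CPUs in lockstep */

struct worker { unsigned int id; pthread_t tid; };

/* Each worker scrubs its slice of every chunk, then waits at the barrier so
 * the coordinator can drop the lock between chunks. */
static void *worker_fn(void *arg)
{
    struct worker *w = arg;
    size_t slice = CHUNK_BYTES / NR_WORKERS;

    for ( size_t off = 0; off < TOTAL_BYTES; off += CHUNK_BYTES )
    {
        pthread_barrier_wait(&chunk_barrier);            /* chunk start, lock held */
        memset(memory + off + w->id * slice, 0, slice);  /* "scrub" our slice */
        pthread_barrier_wait(&chunk_barrier);            /* chunk finished */
    }
    return NULL;
}

int main(void)
{
    struct worker workers[NR_WORKERS];

    memory = malloc(TOTAL_BYTES);
    pthread_barrier_init(&chunk_barrier, NULL, NR_WORKERS + 1);

    for ( unsigned int i = 0; i < NR_WORKERS; i++ )
    {
        workers[i].id = i;
        pthread_create(&workers[i].tid, NULL, worker_fn, &workers[i]);
    }

    /* Coordinator: hold the lock only for the duration of one chunk, so other
     * pending work can run between chunks even though all workers scrub in
     * parallel. */
    for ( size_t off = 0; off < TOTAL_BYTES; off += CHUNK_BYTES )
    {
        pthread_mutex_lock(&heap_lock);
        pthread_barrier_wait(&chunk_barrier);  /* release workers onto this chunk */
        pthread_barrier_wait(&chunk_barrier);  /* wait until all have finished it */
        pthread_mutex_unlock(&heap_lock);
        /* Xen would service pending softirqs here between chunks. */
    }

    for ( unsigned int i = 0; i < NR_WORKERS; i++ )
        pthread_join(workers[i].tid, NULL);

    printf("scrubbed %lu bytes in %lu-byte lockstep chunks\n",
           TOTAL_BYTES, CHUNK_BYTES);
    free(memory);
    pthread_barrier_destroy(&chunk_barrier);
    return 0;
}

The point of the pattern is that the lock protecting the heap is only held
for one chunk at a time, which is what lets softirqs be serviced periodically
while still scrubbing with many cores per node.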