From: Malcolm Crossley
Subject: Re: [PATCH v2] Xen: Spread boot time page scrubbing across all available CPUs
Date: Mon, 30 Sep 2013 14:56:48 +0100
Message-ID: <52498320.7080405@citrix.com>
In-Reply-To: <5249981902000078000F80F2@nat28.tlf.novell.com>
References: <5249981902000078000F80F2@nat28.tlf.novell.com>
To: Jan Beulich
Cc: andrew.cooper3@citrix.com, tim@xen.org, keir@xen.org, xen-devel

On 30/09/13 14:26, Jan Beulich wrote:
>>>> On 30.09.13 at 14:35, Malcolm Crossley wrote:
>> The page scrubbing is done in 128MB chunks in lockstep across all the CPUs.
>> This allows the boot CPU to hold the heap_lock while each chunk is being
>> scrubbed and then release the heap_lock when all CPUs have finished
>> scrubbing their individual chunks. This avoids holding the heap_lock
>> continuously and allows pending softirqs to be serviced periodically
>> across all CPUs.
>>
>> The page scrub memory chunks are allocated to the CPUs in a NUMA-aware
>> fashion to reduce socket interconnect overhead and improve performance.
>>
>> This patch reduces the boot page scrub time on a 128GB, 64-core AMD
>> Opteron 6386 machine from 49 seconds to 3 seconds.
> And is this a NUMA system with heavily different access times
> between local and remote memory?

The AMD 64-core system has 8 NUMA nodes with up to 2 hops between nodes.

This page shows some data on the round-trip bandwidth between different cores:
http://www.cl.cam.ac.uk/research/srg/netos/ipc-bench/details/tmpn2YlFp.html

This paper also shows the difference in memory bandwidth with NUMA-aware threads:
http://www.cs.uchicago.edu/files/tr_authentic/TR-2011-02.pdf
The unstrided results are the only ones we're interested in.

>
> What I'm trying to understand before reviewing the actual patch
> is whether what you do is really necessary: Generally it ought to
> be sufficient to have one CPU on each node scrub that node's
> memory, as a CPU should be able to saturate the bus if it does
> (almost) nothing but memory writes. Hence having multiple cores
> on the same socket (not to speak of multiple threads in a core)
> do this work in parallel is likely not going to be beneficial, and
> hence the logic you're adding here might be more complex than
> necessary.

The second paper cited above shows that roughly 3 times as many cores as
NUMA nodes are required to reach peak NUMA memory bandwidth.

The difference between 1-core-per-node and 3-cores-per-node bandwidth is:

AMD:   30000 MB/s (1 core) vs 48000 MB/s (3 cores)
Intel: 12000 MB/s (1 core) vs 38000 MB/s (3 cores)

So I think it's worth the extra complexity to have multiple cores per node
scrubbing memory.

Malcolm

> Jan
>
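For illustration, below is a minimal, self-contained sketch of the lockstep
chunk pattern the patch description refers to. It is generic C with pthreads,
not the actual Xen code: the worker/coordinator split, the chunk and worker
counts, and the plain mutex standing in for the heap_lock are all made up for
the example, and the point where Xen would call process_pending_softirqs() is
only marked with a comment.

/* Generic illustration of lockstep chunked scrubbing -- not the Xen patch.
 * All names and sizes here are invented for the example. */
#include <pthread.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

#define NR_WORKERS   4
#define CHUNK_BYTES  (1UL << 20)               /* stand-in for the 128MB chunks */
#define TOTAL_BYTES  (16UL * CHUNK_BYTES)

static unsigned char *memory;                  /* stand-in for the heap to scrub */
static pthread_mutex_t heap_lock = PTHREAD_MUTEX_INITIALIZER;
static pthread_barrier_t chunk_barrier;        /* keeps all CPUs in lockstep */

struct worker { unsigned int id; pthread_t tid; };

/* Each worker scrubs its slice of every chunk, then waits at the barrier so
 * the coordinator can drop the lock between chunks. */
static void *worker_fn(void *arg)
{
    struct worker *w = arg;
    size_t slice = CHUNK_BYTES / NR_WORKERS;

    for ( size_t off = 0; off < TOTAL_BYTES; off += CHUNK_BYTES )
    {
        pthread_barrier_wait(&chunk_barrier);            /* chunk start, lock held */
        memset(memory + off + w->id * slice, 0, slice);  /* "scrub" our slice */
        pthread_barrier_wait(&chunk_barrier);            /* chunk finished */
    }
    return NULL;
}

int main(void)
{
    struct worker workers[NR_WORKERS];

    memory = malloc(TOTAL_BYTES);
    pthread_barrier_init(&chunk_barrier, NULL, NR_WORKERS + 1);

    for ( unsigned int i = 0; i < NR_WORKERS; i++ )
    {
        workers[i].id = i;
        pthread_create(&workers[i].tid, NULL, worker_fn, &workers[i]);
    }

    /* Coordinator: hold the lock only for the duration of one chunk, so other
     * pending work can run between chunks even though all workers scrub in
     * parallel. */
    for ( size_t off = 0; off < TOTAL_BYTES; off += CHUNK_BYTES )
    {
        pthread_mutex_lock(&heap_lock);
        pthread_barrier_wait(&chunk_barrier);  /* release workers onto this chunk */
        pthread_barrier_wait(&chunk_barrier);  /* wait until all have finished it */
        pthread_mutex_unlock(&heap_lock);
        /* Xen would service pending softirqs here between chunks. */
    }

    for ( unsigned int i = 0; i < NR_WORKERS; i++ )
        pthread_join(workers[i].tid, NULL);

    printf("scrubbed %lu bytes in %lu-byte lockstep chunks\n",
           TOTAL_BYTES, CHUNK_BYTES);
    free(memory);
    pthread_barrier_destroy(&chunk_barrier);
    return 0;
}

The point of the pattern is that the lock protecting the heap is only held
for one chunk at a time, which is what lets softirqs be serviced periodically
while still scrubbing with many cores per node.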