From mboxrd@z Thu Jan 1 00:00:00 1970
From: Malcolm Crossley
Subject: Re: [PATCHv3 0/3] Implement per-cpu reader-writer locks
Date: Fri, 18 Dec 2015 10:07:47 +0000
Message-ID: <5673DAF3.2020707@citrix.com>
References: <1450356747-29039-1-git-send-email-malcolm.crossley@citrix.com>
Mime-Version: 1.0
Content-Type: text/plain; charset="us-ascii"
Content-Transfer-Encoding: 7bit
Received: from mail6.bemta5.messagelabs.com ([195.245.231.135])
 by lists.xen.org with esmtp (Exim 4.72) (envelope-from )
 id 1a9rx5-0001bE-Ea for xen-devel@lists.xenproject.org;
 Fri, 18 Dec 2015 10:07:55 +0000
In-Reply-To: <1450356747-29039-1-git-send-email-malcolm.crossley@citrix.com>
Sender: xen-devel-bounces@lists.xen.org
Errors-To: xen-devel-bounces@lists.xen.org
To: JBeulich@suse.com, ian.campbell@citrix.com, andrew.cooper3@citrix.com,
 Marcos.Matsunaga@oracle.com, keir@xen.org, konrad.wilk@oracle.com,
 george.dunlap@eu.citrix.com
Cc: xen-devel@lists.xenproject.org, dario.faggioli@citrix.com,
 stefano.stabellini@citrix.com
List-Id: xen-devel@lists.xenproject.org

I didn't spot the percpu rwlock owner ASSERT being the wrong way round.

Please review version 4 of the series.

Sorry for the noise.

On 17/12/15 12:52, Malcolm Crossley wrote:
> This patch series adds per-cpu reader-writer locks as a generic lock
> implementation and then converts the grant table and p2m rwlocks to
> use the percpu rwlocks, in order to improve multi-socket host performance.
>
> CPU profiling has revealed the rwlocks themselves suffer from severe cache
> line bouncing due to the cmpxchg operation used even when taking a read lock.
> Multiqueue paravirtualised I/O results in heavy contention of the grant table
> and p2m read locks of a specific domain and so I/O throughput is bottlenecked
> by the overhead of the cache line bouncing itself.
>
> Per-cpu read locks avoid lock cache line bouncing by using a per-cpu data
> area to record that a CPU has taken the read lock. Correctness is enforced
> for the write lock by using a per-lock barrier which forces the per-cpu read
> lock to revert to using a standard read lock. The write lock then polls all
> the per-cpu data areas until active readers for the lock have exited.
>
> Removing the cache line bouncing on a multi-socket Haswell-EP system
> dramatically improves performance, with 16 vCPU network IO performance going
> from 15 gb/s to 64 gb/s! The host under test was fully utilising all 40
> logical CPUs at 64 gb/s, so a bigger logical CPU host may see an even better
> IO improvement.
>
> Note: Benchmarking of these performance improvements should be done with
> the non-debug version of the hypervisor, otherwise the map_domain_page
> spinlock is the main bottleneck.
>
> Changes in V3:
> - Add percpu rwlock owner for debug Xen builds
> - Validate percpu rwlock owner at runtime for debug Xen builds
> - Fix hard tab issues
> - Use percpu rwlock wrappers for grant table rwlock users
> - Add comments explaining why rw_is_locked ASSERTs have been removed in
>   grant table code
>
> Changes in V2:
> - Add cover letter
> - Convert p2m rwlock to percpu rwlock
> - Improve percpu rwlock to safely handle simultaneously holding 2 or more
>   locks
> - Move percpu rwlock barrier from global to per lock
> - Move write lock cpumask variable to a percpu variable
> - Add macros to help initialise and use percpu rwlocks
> - Updated IO benchmark results to cover revised locking implementation
>
> Malcolm Crossley (3):
>   rwlock: Add per-cpu reader-writer lock infrastructure
>   grant_table: convert grant table rwlock to percpu rwlock
>   p2m: convert p2m rwlock to percpu rwlock
>
>  xen/arch/arm/mm.c             |   4 +-
>  xen/arch/x86/mm.c             |   4 +-
>  xen/arch/x86/mm/mm-locks.h    |  12 ++--
>  xen/arch/x86/mm/p2m.c         |   1 +
>  xen/common/grant_table.c      | 126 +++++++++++++++++++++++------------------
>  xen/common/spinlock.c         |  46 +++++++++++++++
>  xen/include/asm-arm/percpu.h  |   5 ++
>  xen/include/asm-x86/mm.h      |   2 +-
>  xen/include/asm-x86/percpu.h  |   6 ++
>  xen/include/xen/grant_table.h |  24 +++++++-
>  xen/include/xen/percpu.h      |   4 ++
>  xen/include/xen/spinlock.h    | 115 ++++++++++++++++++++++++++++++++++++++
>  12 files changed, 282 insertions(+), 67 deletions(-)
>