Date: Wed, 06 Jun 2018 12:06:09 +0530
From: "Naveen N. Rao"
Subject: Re: [PATCH v7 0/5] powerpc/64: memcmp() optimization
To: Michael Ellerman, Simon Guo
Cc: Cyril Bur, linuxppc-dev@lists.ozlabs.org
References: <1527672063-6953-1-git-send-email-wei.guo.simon@gmail.com>
 <877eneasg9.fsf@concordia.ellerman.id.au>
 <20180606062153.GA7342@simonLocalRHEL7.x64>
In-Reply-To: <20180606062153.GA7342@simonLocalRHEL7.x64>
Message-Id: <1528266847.dixm3thyfj.naveen@linux.ibm.com>
List-Id: Linux on PowerPC Developers Mail List

Simon Guo wrote:
> Hi Michael,
> On Tue, Jun 05, 2018 at 12:16:22PM +1000, Michael Ellerman wrote:
>> Hi Simon,
>>
>> wei.guo.simon@gmail.com writes:
>> > From: Simon Guo
>> >
>> > There is some room to optimize memcmp() in powerpc 64 bits version for
>> > following 2 cases:
>> > (1) Even src/dst addresses are not aligned with 8 bytes at the beginning,
>> > memcmp() can align them and go with
>> > .Llong comparison mode without
>> > falling back to .Lshort comparison mode to compare the buffers byte by byte.
>> > (2) VMX instructions can be used to speed up large size comparisons;
>> > currently the threshold is set to 4K bytes. Note that the VMX instructions
>> > will lead to a VMX regs save/load penalty. This patch set includes a
>> > patch to add a 32 bytes pre-checking to minimize the penalty.
>> >
>> > It does something similar to glibc commit dec4a7105e (powerpc: Improve memcmp
>> > performance for POWER8). Thanks to Cyril Bur for the information.
>> > This patch set also updates the memcmp selftest case so that it compiles and
>> > covers the large size comparison case.
>>
>> I'm seeing a few crashes with this applied, I haven't had time to look
>> into what is happening yet, sorry.
>>
>
> The bug is that memcmp() invokes a C function, enter_vmx_ops(), which loads
> some PIC value based on r2.
>
> memcmp() doesn't use r2, and if memcmp() is invoked from the kernel
> itself, everything is fine. But if memcmp() is invoked from a module
> [test_user_copy], r2 must be set up correctly. Otherwise enter_vmx_ops() will
> refer to an incorrect/nonexistent data location based on the wrong r2 value.
>
> The following patch will fix this issue:
> ------------
> diff --git a/arch/powerpc/lib/memcmp_64.S b/arch/powerpc/lib/memcmp_64.S
> index 5eba49744a5a..24d093fa89bb 100644
> --- a/arch/powerpc/lib/memcmp_64.S
> +++ b/arch/powerpc/lib/memcmp_64.S
> @@ -102,7 +102,7 @@
>   * 2) src/dst has different offset to the 8 bytes boundary. The handlers
>   *    are named like .Ldiffoffset_xxxx
>   */
> -_GLOBAL(memcmp)
> +_GLOBAL_TOC(memcmp)
>  	cmpdi	cr1,r5,0
>
>  	/* Use the short loop if the src/dst addresses are not
> ----------
>
> It means the memcmp() function entry will have 2 additional instructions. Is
> there any way to save these 2 instructions when memcmp() is actually invoked
> from the kernel itself?

That will be the case.
We will end up entering the function via the
local entry point, skipping the first two instructions. The Global entry
point is only used for cross-module calls.

- Naveen
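
For reference, here is a rough sketch of the two entry points under the
ELFv2 ABI. It approximates what _GLOBAL_TOC(memcmp) expands to and is not
copied from the kernel source; the actual definition lives in
arch/powerpc/include/asm/ppc_asm.h and may differ between kernel versions.
The register names (r2, r5, r12) are assumed to be available as in kernel
.S files.

```asm
	.globl	memcmp
	.type	memcmp,@function
memcmp:					/* global entry point: caller has put &memcmp in r12 */
0:	addis	r2,r12,(.TOC.-0b)@ha	/* compute this object's TOC pointer from r12 */
	addi	r2,r2,(.TOC.-0b)@l
	.localentry memcmp,.-memcmp	/* local entry point: r2 is already valid here */
	cmpdi	cr1,r5,0		/* first instruction of the function body */
```

A caller that shares the callee's TOC (the kernel proper calling memcmp())
has its branch resolved to the local entry point, so the addis/addi pair
is never executed. A call from a module should instead go through a stub
that loads r12 and enters at the global entry point, which is where r2
gets set up before enter_vmx_ops() is reached.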