From mboxrd@z Thu Jan 1 00:00:00 1970
Received: from mail-pf0-x241.google.com (mail-pf0-x241.google.com [IPv6:2607:f8b0:400e:c00::241])
	(using TLSv1.2 with cipher ECDHE-RSA-AES128-GCM-SHA256 (128/128 bits))
	(No client certificate requested)
	by lists.ozlabs.org (Postfix) with ESMTPS id 40msXB1VVqzF1f2
	for ; Thu, 17 May 2018 23:26:34 +1000 (AEST)
Received: by mail-pf0-x241.google.com with SMTP id q22-v6so2103204pff.11
	for ; Thu, 17 May 2018 06:26:33 -0700 (PDT)
Date: Thu, 17 May 2018 23:26:19 +1000
From: Nicholas Piggin
To: Christophe Leroy
Cc: Benjamin Herrenschmidt, Paul Mackerras, Michael Ellerman,
	linuxppc-dev@lists.ozlabs.org, linux-kernel@vger.kernel.org
Subject: Re: [PATCH] powerpc/lib: Remove .balign inside string functions for PPC32
Message-ID: <20180517231938.4b1b8172@roar.ozlabs.ibm.com>
In-Reply-To: <20180517100413.856096F938@po14934vm.idsi0.si.c-s.fr>
References: <20180517100413.856096F938@po14934vm.idsi0.si.c-s.fr>
MIME-Version: 1.0
Content-Type: text/plain; charset=US-ASCII
List-Id: Linux on PowerPC Developers Mail List

On Thu, 17 May 2018 12:04:13 +0200 (CEST)
Christophe Leroy wrote:

> commit 87a156fb18fe1 ("Align hot loops of some string functions")
> degraded the performance of string functions by adding useless
> nops
>
> A simple benchmark on an 8xx calling 100000x a memchr() that
> matches the first byte runs in 41668 TB ticks before this patch
> and in 35986 TB ticks after this patch. So this gives an
> improvement of approx 10%
>
> Another benchmark doing the same with a memchr() matching the 128th
> byte runs in 1011365 TB ticks before this patch and 1005682 TB ticks
> after this patch, so regardless on the number of loops, removing
> those useless nops improves the test by 5683 TB ticks.
>
> Fixes: 87a156fb18fe1 ("Align hot loops of some string functions")
> Signed-off-by: Christophe Leroy
> ---
> Was sent already as part of a serie optimising string functions.
> Resending on itself as it is independent of the other changes in the
> serie
>
> arch/powerpc/lib/string.S | 6 ++++++
> 1 file changed, 6 insertions(+)
>
> diff --git a/arch/powerpc/lib/string.S b/arch/powerpc/lib/string.S
> index a787776822d8..a026d8fa8a99 100644
> --- a/arch/powerpc/lib/string.S
> +++ b/arch/powerpc/lib/string.S
> @@ -23,7 +23,9 @@ _GLOBAL(strncpy)
>  	mtctr	r5
>  	addi	r6,r3,-1
>  	addi	r4,r4,-1
> +#ifdef CONFIG_PPC64
>  	.balign 16
> +#endif
>  1:	lbzu	r0,1(r4)
>  	cmpwi	0,r0,0
>  	stbu	r0,1(r6)

The ifdefs are a bit ugly, but you can't argue with the numbers. These
alignments should be IFETCH_ALIGN_BYTES, which is intended to optimise
the ifetch performance when you have such a loop (although there is
always a tradeoff for a single iteration).

Would it make sense to define that for 32-bit as well, and you could use
it here instead of the ifdefs? Small CPUs could just use 0.

Thanks,
Nick
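
As a rough illustration of the tradeoff above, the sketch below models how
many nops an alignment directive inserts before a loop label. It is a
standalone C model, not kernel code: pad_nops(), loop_pad_nops(), NOP_BYTES
and the default value given to IFETCH_ALIGN_BYTES here are assumptions made
for the example, and treating an alignment of 0 as "no padding" is likewise
an assumption about how the directive would be used.

```c
#include <assert.h>
#include <stddef.h>

/*
 * Hypothetical per-platform setting, for illustration only: 16 models
 * the ".balign 16" in the quoted patch, 0 would model "small CPUs
 * could just use 0". The value is not copied from any kernel header.
 */
#ifndef IFETCH_ALIGN_BYTES
#define IFETCH_ALIGN_BYTES 16
#endif

#define NOP_BYTES 4	/* PowerPC instructions are 4 bytes wide */

/*
 * Number of nops needed to pad from byte offset 'offset' up to the
 * next multiple of 'align'. An 'align' of 0 or 1 is modelled as
 * requesting no alignment at all.
 */
static size_t pad_nops(size_t offset, size_t align)
{
	if (align <= 1)
		return 0;
	assert((align & (align - 1)) == 0);	/* power of two only */
	assert(offset % NOP_BYTES == 0);	/* instruction aligned */
	return ((align - (offset & (align - 1))) & (align - 1)) / NOP_BYTES;
}

/* Same calculation using the platform-wide setting defined above. */
static size_t loop_pad_nops(size_t offset)
{
	return pad_nops(offset, IFETCH_ALIGN_BYTES);
}
```

In this model a loop label that would otherwise sit at offset 4 needs three
nops to reach a 16-byte boundary, while an alignment of 0 needs none. A call
that leaves the loop after one iteration still executes those nops without
getting much back from the aligned fetch, which is the single-iteration cost
mentioned above.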