From mboxrd@z Thu Jan 1 00:00:00 1970 From: jszhang@marvell.com (Jisheng Zhang) Date: Tue, 13 Jan 2015 14:48:51 +0800 Subject: CFT: move outer_cache_sync() out of line In-Reply-To: <20150112163648.GL12302@n2100.arm.linux.org.uk> References: <20150112163648.GL12302@n2100.arm.linux.org.uk> Message-ID: <20150113144851.1944cf0d@xhacker> To: linux-arm-kernel@lists.infradead.org List-Id: linux-arm-kernel.lists.infradead.org Hi Russell, On Mon, 12 Jan 2015 08:36:48 -0800 Russell King - ARM Linux wrote: > The theory is that moving outer_cache_sync() out of line moves the > conditional test from multiple sites to a single location, which > in turn means that the branch predictor stands a better chance of > making the correct decision. > > However, I'm a hardware pauper - I have the choice of testing this > on iMX6 (which seems to have bus limitations, and seems to produce > a wide variation on measured performance which makes it hard to > evaluate any differences) or the Versatile Express, which really is > not suitable for making IO performance measurements. > > So, without help from people with "good" hardware, I can't evaluate > the performance impact of this change. Anyone willing to do some > performance testing on this and feedback any numbers? I applied your patch into marvell internal bsp tree, then did a basic emmc performance test via. dd cmd, the performance is improved by 0.13% the sdhc host driver is drivers/mmc/host/sdhci-pxav3.c, I'm not sure it make use of a lot of non-relaxed IO accessors. > > Obviously, drivers which make use of a lot of non-relaxed IO accessors > will be most affected by this - so you really have to know your > drivers when deciding how to evaluate this (sorry, I can't say "measure > network bandwidth" or "measure SSD sata performance" is a good test.) > > Theoretically, this should help overall system performance, since the > branch predictor should be able to predict this better, but it's entirely > possible that trying to benchmark a single workload won't be measurably > different. > > In terms of kernel size figures, this change alone saves almost 17K of > 10MB of kernel text on my iMX6 kernels - which is bordering on > insignificant since that's not quite a 0.2% saving. > > So... right now I can't justify this change, but I'm hoping some can come > up with some figures which shows that it benefits their workload without > causing a performance regression for others. > > diff --git a/arch/arm/include/asm/outercache.h > b/arch/arm/include/asm/outercache.h index 891a56b35bcf..807e4e71c8e7 100644 > --- a/arch/arm/include/asm/outercache.h > +++ b/arch/arm/include/asm/outercache.h > @@ -133,11 +133,7 @@ static inline void outer_resume(void) { } > * Ensure that all outer cache operations are complete and any store > * buffers are drained. > */ > -static inline void outer_sync(void) > -{ > - if (outer_cache.sync) > - outer_cache.sync(); > -} > +extern void outer_sync(void); > #else > static inline void outer_sync(void) > { } > diff --git a/arch/arm/mm/l2c-common.c b/arch/arm/mm/l2c-common.c > index 10a3cf28c362..b1c24c8c1eb9 100644 > --- a/arch/arm/mm/l2c-common.c > +++ b/arch/arm/mm/l2c-common.c > @@ -7,9 +7,17 @@ > * published by the Free Software Foundation. > */ > #include > +#include > #include > #include > > +void outer_sync(void) > +{ > + if (outer_cache.sync) > + outer_cache.sync(); > +} > +EXPORT_SYMBOL(outer_sync); > + > void outer_disable(void) > { > WARN_ON(!irqs_disabled()); > >