* [PATCH] ARM: implement optimized percpu variable access
@ 2012-11-11 3:20 Rob Herring
2012-11-12 10:23 ` Will Deacon
` (2 more replies)
0 siblings, 3 replies; 35+ messages in thread
From: Rob Herring @ 2012-11-11 3:20 UTC (permalink / raw)
To: linux-arm-kernel
From: Rob Herring <rob.herring@calxeda.com>
Use the previously unused TPIDRPRW register to store percpu offsets.
TPIDRPRW is only accessible in PL1, so it can only be used in the kernel.
This saves 2 loads for each percpu variable access which should yield
improved performance, but the improvement has not been quantified.
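For illustration, the asm-generic path resolves the offset with two
dependent loads (the cpu number from thread_info, then the entry in
__per_cpu_offset[]), whereas TPIDRPRW is a single register read. A rough
sketch of the two approaches (illustrative helper names, not the exact
kernel code):

  /* asm-generic/percpu.h path: 2 loads per access */
  static inline unsigned long generic_cpu_offset(void)
  {
          return __per_cpu_offset[current_thread_info()->cpu];
  }

  /* TPIDRPRW path: one coprocessor register read, no loads */
  static inline unsigned long tpidrprw_cpu_offset(void)
  {
          unsigned long off;
          asm("mrc p15, 0, %0, c13, c0, 4 @ get TPIDRPRW" : "=r" (off));
          return off;
  }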
Signed-off-by: Rob Herring <rob.herring@calxeda.com>
---
arch/arm/include/asm/Kbuild | 1 -
arch/arm/include/asm/percpu.h | 44 +++++++++++++++++++++++++++++++++++++++++
arch/arm/kernel/smp.c | 3 +++
3 files changed, 47 insertions(+), 1 deletion(-)
create mode 100644 arch/arm/include/asm/percpu.h
diff --git a/arch/arm/include/asm/Kbuild b/arch/arm/include/asm/Kbuild
index f70ae17..2ffdaac 100644
--- a/arch/arm/include/asm/Kbuild
+++ b/arch/arm/include/asm/Kbuild
@@ -16,7 +16,6 @@ generic-y += local64.h
generic-y += msgbuf.h
generic-y += param.h
generic-y += parport.h
-generic-y += percpu.h
generic-y += poll.h
generic-y += resource.h
generic-y += sections.h
diff --git a/arch/arm/include/asm/percpu.h b/arch/arm/include/asm/percpu.h
new file mode 100644
index 0000000..9eb7372
--- /dev/null
+++ b/arch/arm/include/asm/percpu.h
@@ -0,0 +1,44 @@
+/*
+ * Copyright 2012 Calxeda, Inc.
+ *
+ * This program is free software; you can redistribute it and/or modify it
+ * under the terms and conditions of the GNU General Public License,
+ * version 2, as published by the Free Software Foundation.
+ *
+ * This program is distributed in the hope it will be useful, but WITHOUT
+ * ANY WARRANTY; without even the implied warranty of MERCHANTABILITY or
+ * FITNESS FOR A PARTICULAR PURPOSE. See the GNU General Public License for
+ * more details.
+ *
+ * You should have received a copy of the GNU General Public License along with
+ * this program. If not, see <http://www.gnu.org/licenses/>.
+ */
+#ifndef _ASM_ARM_PERCPU_H_
+#define _ASM_ARM_PERCPU_H_
+
+/*
+ * Same as asm-generic/percpu.h, except that we store the per cpu offset
+ * in the TPIDRPRW.
+ */
+#if defined(CONFIG_SMP) && (__LINUX_ARM_ARCH__ >= 6)
+
+static inline void set_my_cpu_offset(unsigned long off)
+{
+ asm volatile("mcr p15, 0, %0, c13, c0, 4 @ set TPIDRPRW" : : "r" (off) : "cc" );
+}
+
+static inline unsigned long __my_cpu_offset(void)
+{
+ unsigned long off;
+ asm("mrc p15, 0, %0, c13, c0, 4 @ get TPIDRPRW" : "=r" (off) : );
+ return off;
+}
+#define __my_cpu_offset __my_cpu_offset()
+#else
+#define set_my_cpu_offset(x) do {} while(0)
+
+#endif /* CONFIG_SMP */
+
+#include <asm-generic/percpu.h>
+
+#endif /* _ASM_ARM_PERCPU_H_ */
diff --git a/arch/arm/kernel/smp.c b/arch/arm/kernel/smp.c
index fbc8b26..897ef60 100644
--- a/arch/arm/kernel/smp.c
+++ b/arch/arm/kernel/smp.c
@@ -313,6 +313,8 @@ asmlinkage void __cpuinit secondary_start_kernel(void)
current->active_mm = mm;
cpumask_set_cpu(cpu, mm_cpumask(mm));
+ set_my_cpu_offset(per_cpu_offset(cpu));
+
printk("CPU%u: Booted secondary processor\n", cpu);
cpu_init();
@@ -371,6 +373,7 @@ void __init smp_cpus_done(unsigned int max_cpus)
void __init smp_prepare_boot_cpu(void)
{
+ set_my_cpu_offset(per_cpu_offset(smp_processor_id()));
}
void __init smp_prepare_cpus(unsigned int max_cpus)
--
1.7.10.4
* [PATCH] ARM: implement optimized percpu variable access
2012-11-11 3:20 [PATCH] ARM: implement optimized percpu variable access Rob Herring
@ 2012-11-12 10:23 ` Will Deacon
2012-11-12 13:03 ` Rob Herring
2012-11-12 14:21 ` Rob Herring
2012-11-22 11:34 ` Will Deacon
2012-11-27 17:19 ` Nicolas Pitre
2 siblings, 2 replies; 35+ messages in thread
From: Will Deacon @ 2012-11-12 10:23 UTC (permalink / raw)
To: linux-arm-kernel
Hi Rob,
On Sun, Nov 11, 2012 at 03:20:40AM +0000, Rob Herring wrote:
> From: Rob Herring <rob.herring@calxeda.com>
>
> Use the previously unused TPIDRPRW register to store percpu offsets.
> TPIDRPRW is only accessible in PL1, so it can only be used in the kernel.
>
> This saves 2 loads for each percpu variable access which should yield
> improved performance, but the improvement has not been quantified.
The patch looks largely fine to me (one minor comment below), but we should
try and see what the performance difference is like on a few cores before
merging this. Have you tried something like hackbench to see if the
difference is measurable there? If not, I guess we'll need something more
targeted.
> diff --git a/arch/arm/include/asm/percpu.h b/arch/arm/include/asm/percpu.h
> new file mode 100644
> index 0000000..9eb7372
> --- /dev/null
> +++ b/arch/arm/include/asm/percpu.h
> @@ -0,0 +1,44 @@
> +/*
> + * Copyright 2012 Calxeda, Inc.
> + *
> + * This program is free software; you can redistribute it and/or modify it
> + * under the terms and conditions of the GNU General Public License,
> + * version 2, as published by the Free Software Foundation.
> + *
> + * This program is distributed in the hope it will be useful, but WITHOUT
> + * ANY WARRANTY; without even the implied warranty of MERCHANTABILITY or
> + * FITNESS FOR A PARTICULAR PURPOSE. See the GNU General Public License for
> + * more details.
> + *
> + * You should have received a copy of the GNU General Public License along with
> + * this program. If not, see <http://www.gnu.org/licenses/>.
> + */
> +#ifndef _ASM_ARM_PERCPU_H_
> +#define _ASM_ARM_PERCPU_H_
> +
> +/*
> + * Same as asm-generic/percpu.h, except that we store the per cpu offset
> + * in the TPIDRPRW.
> + */
> +#if defined(CONFIG_SMP) && (__LINUX_ARM_ARCH__ >= 6)
> +
> +static inline void set_my_cpu_offset(unsigned long off)
> +{
> + asm volatile("mcr p15, 0, %0, c13, c0, 4 @ set TPIDRPRW" : : "r" (off) : "cc" );
> +}
You don't need the "cc" here.
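i.e. just drop the clobber list and leave the rest alone:

  asm volatile("mcr p15, 0, %0, c13, c0, 4 @ set TPIDRPRW" : : "r" (off));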
Will
* [PATCH] ARM: implement optimized percpu variable access
2012-11-12 10:23 ` Will Deacon
@ 2012-11-12 13:03 ` Rob Herring
2012-11-12 13:28 ` Will Deacon
2012-11-27 17:29 ` Nicolas Pitre
2012-11-12 14:21 ` Rob Herring
1 sibling, 2 replies; 35+ messages in thread
From: Rob Herring @ 2012-11-12 13:03 UTC (permalink / raw)
To: linux-arm-kernel
On 11/12/2012 04:23 AM, Will Deacon wrote:
> Hi Rob,
>
> On Sun, Nov 11, 2012 at 03:20:40AM +0000, Rob Herring wrote:
>> From: Rob Herring <rob.herring@calxeda.com>
>>
>> Use the previously unused TPIDRPRW register to store percpu offsets.
>> TPIDRPRW is only accessible in PL1, so it can only be used in the kernel.
>>
>> This saves 2 loads for each percpu variable access which should yield
>> improved performance, but the improvement has not been quantified.
>
> The patch looks largely fine to me (one minor comment below), but we should
> try and see what the performance difference is like on a few cores before
> merging this. Have you tried something like hackbench to see if the
> difference is measurable there? If not, I guess we'll need something more
> targetted.
Thanks for the suggestion. I'll give it a try.
>> diff --git a/arch/arm/include/asm/percpu.h b/arch/arm/include/asm/percpu.h
>> new file mode 100644
>> index 0000000..9eb7372
>> --- /dev/null
>> +++ b/arch/arm/include/asm/percpu.h
>> @@ -0,0 +1,44 @@
>> +/*
>> + * Copyright 2012 Calxeda, Inc.
>> + *
>> + * This program is free software; you can redistribute it and/or modify it
>> + * under the terms and conditions of the GNU General Public License,
>> + * version 2, as published by the Free Software Foundation.
>> + *
>> + * This program is distributed in the hope it will be useful, but WITHOUT
>> + * ANY WARRANTY; without even the implied warranty of MERCHANTABILITY or
>> + * FITNESS FOR A PARTICULAR PURPOSE. See the GNU General Public License for
>> + * more details.
>> + *
>> + * You should have received a copy of the GNU General Public License along with
>> + * this program. If not, see <http://www.gnu.org/licenses/>.
>> + */
>> +#ifndef _ASM_ARM_PERCPU_H_
>> +#define _ASM_ARM_PERCPU_H_
>> +
>> +/*
>> + * Same as asm-generic/percpu.h, except that we store the per cpu offset
>> + * in the TPIDRPRW.
>> + */
>> +#if defined(CONFIG_SMP) && (__LINUX_ARM_ARCH__ >= 6)
>> +
>> +static inline void set_my_cpu_offset(unsigned long off)
>> +{
>> + asm volatile("mcr p15, 0, %0, c13, c0, 4 @ set TPIDRPRW" : : "r" (off) : "cc" );
>> +}
>
> You don't need the "cc" here.
You would think so, but the compiler drops this instruction if you
don't. set_cr does the same thing.
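Reduced, the code in question boils down to something like this (a sketch
only; the behaviour was seen in the full kernel build, not this exact
snippet, and percpu_offset_init is just an illustrative caller):

  static inline void set_my_cpu_offset(unsigned long off)
  {
          /* With my toolchain the mcr was dropped from the output unless
           * "cc" was listed, even though the asm is volatile. */
          asm volatile("mcr p15, 0, %0, c13, c0, 4 @ set TPIDRPRW" : : "r" (off));
  }

  void percpu_offset_init(unsigned long off)
  {
          set_my_cpu_offset(off);
  }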
Rob
* [PATCH] ARM: implement optimized percpu variable access
2012-11-12 13:03 ` Rob Herring
@ 2012-11-12 13:28 ` Will Deacon
2012-11-12 14:03 ` Rob Herring
2012-11-27 17:29 ` Nicolas Pitre
1 sibling, 1 reply; 35+ messages in thread
From: Will Deacon @ 2012-11-12 13:28 UTC (permalink / raw)
To: linux-arm-kernel
On Mon, Nov 12, 2012 at 01:03:12PM +0000, Rob Herring wrote:
> On 11/12/2012 04:23 AM, Will Deacon wrote:
> >> +static inline void set_my_cpu_offset(unsigned long off)
> >> +{
> >> + asm volatile("mcr p15, 0, %0, c13, c0, 4 @ set TPIDRPRW" : : "r" (off) : "cc" );
> >> +}
> >
> > You don't need the "cc" here.
>
> You would think so, but the compiler drops this instruction if you
> don't. set_cr does the same thing.
Whoa, that sounds suspicious... if the thing is marked volatile GCC
shouldn't optimise it away if it's reachable. Which toolchain are you
using?
Will
* [PATCH] ARM: implement optimized percpu variable access
2012-11-12 13:28 ` Will Deacon
@ 2012-11-12 14:03 ` Rob Herring
0 siblings, 0 replies; 35+ messages in thread
From: Rob Herring @ 2012-11-12 14:03 UTC (permalink / raw)
To: linux-arm-kernel
On 11/12/2012 07:28 AM, Will Deacon wrote:
> On Mon, Nov 12, 2012 at 01:03:12PM +0000, Rob Herring wrote:
>> On 11/12/2012 04:23 AM, Will Deacon wrote:
>>>> +static inline void set_my_cpu_offset(unsigned long off)
>>>> +{
>>>> + asm volatile("mcr p15, 0, %0, c13, c0, 4 @ set TPIDRPRW" : : "r" (off) : "cc" );
>>>> +}
>>>
>>> You don't need the "cc" here.
>>
>> You would think so, but the compiler drops this instruction if you
>> don't. set_cr does the same thing.
>
> Whoa, that sounds suspicious... if the thing is marked volatile GCC
> shouldn't optimise it away if it's reachable. Which toolchain are you
> using?
Ubuntu 12.10 ARM cross compiler:
gcc version 4.7.2 (Ubuntu/Linaro 4.7.2-1ubuntu1)
Rob
* [PATCH] ARM: implement optimized percpu variable access
2012-11-12 10:23 ` Will Deacon
2012-11-12 13:03 ` Rob Herring
@ 2012-11-12 14:21 ` Rob Herring
2012-11-12 14:41 ` Will Deacon
1 sibling, 1 reply; 35+ messages in thread
From: Rob Herring @ 2012-11-12 14:21 UTC (permalink / raw)
To: linux-arm-kernel
On 11/12/2012 04:23 AM, Will Deacon wrote:
> Hi Rob,
>
> On Sun, Nov 11, 2012 at 03:20:40AM +0000, Rob Herring wrote:
>> From: Rob Herring <rob.herring@calxeda.com>
>>
>> Use the previously unused TPIDRPRW register to store percpu offsets.
>> TPIDRPRW is only accessible in PL1, so it can only be used in the kernel.
>>
>> This saves 2 loads for each percpu variable access which should yield
>> improved performance, but the improvement has not been quantified.
>
> The patch looks largely fine to me (one minor comment below), but we should
> try and see what the performance difference is like on a few cores before
> merging this. Have you tried something like hackbench to see if the
> difference is measurable there? If not, I guess we'll need something more
> targetted.
Looks like it's about a 1.4% improvement on Cortex-A9 (highbank) with
hackbench.
Average of 30 runs of "hackbench -l 1000":
Before: 6.2190666667
After: 6.1347666667
I'll add this data to the commit msg.
Rob
>
>> diff --git a/arch/arm/include/asm/percpu.h b/arch/arm/include/asm/percpu.h
>> new file mode 100644
>> index 0000000..9eb7372
>> --- /dev/null
>> +++ b/arch/arm/include/asm/percpu.h
>> @@ -0,0 +1,44 @@
>> +/*
>> + * Copyright 2012 Calxeda, Inc.
>> + *
>> + * This program is free software; you can redistribute it and/or modify it
>> + * under the terms and conditions of the GNU General Public License,
>> + * version 2, as published by the Free Software Foundation.
>> + *
>> + * This program is distributed in the hope it will be useful, but WITHOUT
>> + * ANY WARRANTY; without even the implied warranty of MERCHANTABILITY or
>> + * FITNESS FOR A PARTICULAR PURPOSE. See the GNU General Public License for
>> + * more details.
>> + *
>> + * You should have received a copy of the GNU General Public License along with
>> + * this program. If not, see <http://www.gnu.org/licenses/>.
>> + */
>> +#ifndef _ASM_ARM_PERCPU_H_
>> +#define _ASM_ARM_PERCPU_H_
>> +
>> +/*
>> + * Same as asm-generic/percpu.h, except that we store the per cpu offset
>> + * in the TPIDRPRW.
>> + */
>> +#if defined(CONFIG_SMP) && (__LINUX_ARM_ARCH__ >= 6)
>> +
>> +static inline void set_my_cpu_offset(unsigned long off)
>> +{
>> + asm volatile("mcr p15, 0, %0, c13, c0, 4 @ set TPIDRPRW" : : "r" (off) : "cc" );
>> +}
>
> You don't need the "cc" here.
>
> Will
>
>
* [PATCH] ARM: implement optimized percpu variable access
2012-11-12 14:21 ` Rob Herring
@ 2012-11-12 14:41 ` Will Deacon
2012-11-12 16:51 ` Will Deacon
0 siblings, 1 reply; 35+ messages in thread
From: Will Deacon @ 2012-11-12 14:41 UTC (permalink / raw)
To: linux-arm-kernel
On Mon, Nov 12, 2012 at 02:21:27PM +0000, Rob Herring wrote:
> On 11/12/2012 04:23 AM, Will Deacon wrote:
> > Hi Rob,
> >
> > On Sun, Nov 11, 2012 at 03:20:40AM +0000, Rob Herring wrote:
> >> From: Rob Herring <rob.herring@calxeda.com>
> >>
> >> Use the previously unused TPIDRPRW register to store percpu offsets.
> >> TPIDRPRW is only accessible in PL1, so it can only be used in the kernel.
> >>
> >> This saves 2 loads for each percpu variable access which should yield
> >> improved performance, but the improvement has not been quantified.
> >
> > The patch looks largely fine to me (one minor comment below), but we should
> > try and see what the performance difference is like on a few cores before
> > merging this. Have you tried something like hackbench to see if the
> > difference is measurable there? If not, I guess we'll need something more
> > targetted.
>
> Looks like it's about a 1.4% improvement on Cortex-A9 (highbank) with
> hackbench.
>
> Average of 30 runs of "hackbench -l 1000":
>
> Before: 6.2190666667
> After: 6.1347666667
>
> I'll add this data to the commit msg.
Wow, that's really cool! I'll take it for a spin on 11MPCore to test the v6
angle...
Will
* [PATCH] ARM: implement optimized percpu variable access
2012-11-12 14:41 ` Will Deacon
@ 2012-11-12 16:51 ` Will Deacon
2012-11-12 21:01 ` Rob Herring
0 siblings, 1 reply; 35+ messages in thread
From: Will Deacon @ 2012-11-12 16:51 UTC (permalink / raw)
To: linux-arm-kernel
On Mon, Nov 12, 2012 at 02:41:22PM +0000, Will Deacon wrote:
> On Mon, Nov 12, 2012 at 02:21:27PM +0000, Rob Herring wrote:
> > On 11/12/2012 04:23 AM, Will Deacon wrote:
> > > Hi Rob,
> > >
> > > On Sun, Nov 11, 2012 at 03:20:40AM +0000, Rob Herring wrote:
> > >> From: Rob Herring <rob.herring@calxeda.com>
> > >>
> > >> Use the previously unused TPIDRPRW register to store percpu offsets.
> > >> TPIDRPRW is only accessible in PL1, so it can only be used in the kernel.
> > >>
> > >> This saves 2 loads for each percpu variable access which should yield
> > >> improved performance, but the improvement has not been quantified.
> > >
> > > The patch looks largely fine to me (one minor comment below), but we should
> > > try and see what the performance difference is like on a few cores before
> > > merging this. Have you tried something like hackbench to see if the
> > > difference is measurable there? If not, I guess we'll need something more
> > > targetted.
> >
> > Looks like it's about a 1.4% improvement on Cortex-A9 (highbank) with
> > hackbench.
> >
> > Average of 30 runs of "hackbench -l 1000":
> >
> > Before: 6.2190666667
> > After: 6.1347666667
> >
> > I'll add this data to the commit msg.
>
> Wow, that's really cool! I'll take it for a spin on 11MPCore to test the v6
> angle...
Ok, similar numbers over here so it looks like this is definitely worth
doing. However, I still object to the "cc", particularly after discussion
with the tools guys here who agree that the behaviour you're seeing is
indicative of a buggy compiler. It may even be part of a larger issue with
GCC's definition of `reachability' for kernel entry points. For interest, I
failed to reproduce with:
gcc version 4.7.3 20121001 (prerelease) (crosstool-NG linaro-1.13.1-4.7-2012.10-20121022 - Linaro GCC 2012.10)
(http://launchpad.net/linaro-toolchain-binaries/trunk/2012.10/+download/gcc-linaro-arm-linux-gnueabihf-4.7-2012.10-20121022_linux.tar.bz2)
which sounds fairly close to the tools that you are using. Please can you
file a bug in launchpad?
Will
* [PATCH] ARM: implement optimized percpu variable access
2012-11-12 16:51 ` Will Deacon
@ 2012-11-12 21:01 ` Rob Herring
2012-11-13 10:40 ` Will Deacon
0 siblings, 1 reply; 35+ messages in thread
From: Rob Herring @ 2012-11-12 21:01 UTC (permalink / raw)
To: linux-arm-kernel
On 11/12/2012 10:51 AM, Will Deacon wrote:
> On Mon, Nov 12, 2012 at 02:41:22PM +0000, Will Deacon wrote:
>> On Mon, Nov 12, 2012 at 02:21:27PM +0000, Rob Herring wrote:
>>> On 11/12/2012 04:23 AM, Will Deacon wrote:
>>>> Hi Rob,
>>>>
>>>> On Sun, Nov 11, 2012 at 03:20:40AM +0000, Rob Herring wrote:
>>>>> From: Rob Herring <rob.herring@calxeda.com>
>>>>>
>>>>> Use the previously unused TPIDRPRW register to store percpu offsets.
>>>>> TPIDRPRW is only accessible in PL1, so it can only be used in the kernel.
>>>>>
>>>>> This saves 2 loads for each percpu variable access which should yield
>>>>> improved performance, but the improvement has not been quantified.
>>>>
>>>> The patch looks largely fine to me (one minor comment below), but we should
>>>> try and see what the performance difference is like on a few cores before
>>>> merging this. Have you tried something like hackbench to see if the
>>>> difference is measurable there? If not, I guess we'll need something more
>>>> targetted.
>>>
>>> Looks like it's about a 1.4% improvement on Cortex-A9 (highbank) with
>>> hackbench.
>>>
>>> Average of 30 runs of "hackbench -l 1000":
>>>
>>> Before: 6.2190666667
>>> After: 6.1347666667
>>>
>>> I'll add this data to the commit msg.
>>
>> Wow, that's really cool! I'll take it for a spin on 11MPCore to test the v6
>> angle...
>
> Ok, similar numbers over here so it looks like this is definitely worth
> doing. However, I still object to the "cc", particularly after discussion
> with the tools guys here who agree that the behaviour you're seeing is
> indicative of a buggy compiler. It may even be part of a larger issue with
> GCC's definition of `reachability' for kernel entry points. For interest, I
> failed to reproduce with:
>
> gcc version 4.7.3 20121001 (prerelease) (crosstool-NG linaro-1.13.1-4.7-2012.10-20121022 - Linaro GCC 2012.10)
> (http://launchpad.net/linaro-toolchain-binaries/trunk/2012.10/+download/gcc-linaro-arm-linux-gnueabihf-4.7-2012.10-20121022_linux.tar.bz2)
>
> which sounds fairly close to the tools that you are using. Please can you
> file a bug in launchpad?
Strangely, I can't reproduce it either now...
Rob
* [PATCH] ARM: implement optimized percpu variable access
2012-11-12 21:01 ` Rob Herring
@ 2012-11-13 10:40 ` Will Deacon
0 siblings, 0 replies; 35+ messages in thread
From: Will Deacon @ 2012-11-13 10:40 UTC (permalink / raw)
To: linux-arm-kernel
On Mon, Nov 12, 2012 at 09:01:58PM +0000, Rob Herring wrote:
> On 11/12/2012 10:51 AM, Will Deacon wrote:
> > Ok, similar numbers over here so it looks like this is definitely worth
> > doing. However, I still object to the "cc", particularly after discussion
> > with the tools guys here who agree that the behaviour you're seeing is
> > indicative of a buggy compiler. It may even be part of a larger issue with
> > GCC's definition of `reachability' for kernel entry points. For interest, I
> > failed to reproduce with:
> >
> > gcc version 4.7.3 20121001 (prerelease) (crosstool-NG linaro-1.13.1-4.7-2012.10-20121022 - Linaro GCC 2012.10)
> > (http://launchpad.net/linaro-toolchain-binaries/trunk/2012.10/+download/gcc-linaro-arm-linux-gnueabihf-4.7-2012.10-20121022_linux.tar.bz2)
> >
> > which sounds fairly close to the tools that you are using. Please can you
> > file a bug in launchpad?
>
> Strangely, I can't reproduce it either now...
Reproduce what? :)
If you remove the "cc" clobber, you can add my ack:
Acked-by: Will Deacon <will.deacon@arm.com>
Will
* [PATCH] ARM: implement optimized percpu variable access
2012-11-11 3:20 [PATCH] ARM: implement optimized percpu variable access Rob Herring
2012-11-12 10:23 ` Will Deacon
@ 2012-11-22 11:34 ` Will Deacon
2012-11-22 11:39 ` Russell King - ARM Linux
` (3 more replies)
2012-11-27 17:19 ` Nicolas Pitre
2 siblings, 4 replies; 35+ messages in thread
From: Will Deacon @ 2012-11-22 11:34 UTC (permalink / raw)
To: linux-arm-kernel
Hi Rob,
On Sun, Nov 11, 2012 at 03:20:40AM +0000, Rob Herring wrote:
> From: Rob Herring <rob.herring@calxeda.com>
>
> Use the previously unused TPIDRPRW register to store percpu offsets.
> TPIDRPRW is only accessible in PL1, so it can only be used in the kernel.
>
> This saves 2 loads for each percpu variable access which should yield
> improved performance, but the improvement has not been quantified.
>
> Signed-off-by: Rob Herring <rob.herring@calxeda.com>
> ---
> arch/arm/include/asm/Kbuild | 1 -
> arch/arm/include/asm/percpu.h | 44 +++++++++++++++++++++++++++++++++++++++++
> arch/arm/kernel/smp.c | 3 +++
> 3 files changed, 47 insertions(+), 1 deletion(-)
> create mode 100644 arch/arm/include/asm/percpu.h
Russell pointed out to me that this patch will break on v6 CPUs if they don't
have the thread ID registers and we're running with SMP_ON_UP=y. Looking at
the TRMs, the only case we care about is 1136 < r1p0, but it does indeed break
there (I have a board on my desk).
There are a few ways to fix this:
(1) Use the SMP alternates to patch the code when running on UP systems. I
tried this and the code is absolutely diabolical (see below).
(2) Rely on the registers being RAZ/WI rather than undef (which seems to be
the case on my board) and add on the pcpu delta manually. This is also
really horrible.
(3) Just make the thing depend on __LINUX_ARM_ARCH__ >= 7. Yes, we lose on
11MPCore, but we win on A8 and the code is much, much simpler.
As an aside, you also need to make the asm block volatile in
__my_cpu_offset -- I can see it being re-ordered before the set for
secondary CPUs otherwise.
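i.e. something like this for the read side (a sketch of the suggested
change, based on the accessor in your patch):

  static inline unsigned long __my_cpu_offset(void)
  {
          unsigned long off;
          /* volatile keeps this read from being reordered before the
           * (volatile) set_my_cpu_offset() on the secondary boot path */
          asm volatile("mrc p15, 0, %0, c13, c0, 4 @ get TPIDRPRW" : "=r" (off));
          return off;
  }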
Will
--->8
From b12ab049b864c2b6f0be1558c934d7213871b223 Mon Sep 17 00:00:00 2001
From: Will Deacon <will.deacon@arm.com>
Date: Wed, 21 Nov 2012 10:53:28 +0000
Subject: [PATCH 1/2] ARM: smp: move SMP_ON_UP alternate macros to common
header file
The ALT_{UP,SMP} macros are used to patch instructions at runtime
depending on whether we are running on an SMP platform or not. This is
generally done in out-of-line assembly code, but there is also the need
for this functionality in inline assembly (e.g. spinlocks).
This patch moves the macros into their own header file, which provides
the correct definitions depending on __ASSEMBLY__.
Signed-off-by: Will Deacon <will.deacon@arm.com>
---
arch/arm/include/asm/assembler.h | 29 +------------------
arch/arm/include/asm/smp_on_up.h | 60 ++++++++++++++++++++++++++++++++++++++++
arch/arm/include/asm/spinlock.h | 21 +++++---------
arch/arm/kernel/head.S | 1 +
arch/arm/kernel/module.c | 8 ++----
5 files changed, 71 insertions(+), 48 deletions(-)
create mode 100644 arch/arm/include/asm/smp_on_up.h
diff --git a/arch/arm/include/asm/assembler.h b/arch/arm/include/asm/assembler.h
index 2ef9581..7da33e0 100644
--- a/arch/arm/include/asm/assembler.h
+++ b/arch/arm/include/asm/assembler.h
@@ -23,6 +23,7 @@
#include <asm/ptrace.h>
#include <asm/domain.h>
#include <asm/opcodes-virt.h>
+#include <asm/smp_on_up.h>
#define IOMEM(x) (x)
@@ -166,34 +167,6 @@
.long 9999b,9001f; \
.popsection
-#ifdef CONFIG_SMP
-#define ALT_SMP(instr...) \
-9998: instr
-/*
- * Note: if you get assembler errors from ALT_UP() when building with
- * CONFIG_THUMB2_KERNEL, you almost certainly need to use
- * ALT_SMP( W(instr) ... )
- */
-#define ALT_UP(instr...) \
- .pushsection ".alt.smp.init", "a" ;\
- .long 9998b ;\
-9997: instr ;\
- .if . - 9997b != 4 ;\
- .error "ALT_UP() content must assemble to exactly 4 bytes";\
- .endif ;\
- .popsection
-#define ALT_UP_B(label) \
- .equ up_b_offset, label - 9998b ;\
- .pushsection ".alt.smp.init", "a" ;\
- .long 9998b ;\
- W(b) . + up_b_offset ;\
- .popsection
-#else
-#define ALT_SMP(instr...)
-#define ALT_UP(instr...) instr
-#define ALT_UP_B(label) b label
-#endif
-
/*
* Instruction barrier
*/
diff --git a/arch/arm/include/asm/smp_on_up.h b/arch/arm/include/asm/smp_on_up.h
new file mode 100644
index 0000000..cc9c527
--- /dev/null
+++ b/arch/arm/include/asm/smp_on_up.h
@@ -0,0 +1,60 @@
+#ifndef __ASM_ARM_SMP_ON_UP_H_
+#define __ASM_ARM_SMP_ON_UP_H_
+
+#ifndef __ASSEMBLY__
+#ifdef CONFIG_SMP_ON_UP
+extern int fixup_smp(const void *addr, unsigned long size);
+#else
+static inline int fixup_smp(const void *addr, unsigned long size)
+{
+ return -EINVAL;
+}
+#endif /* CONFIG_SMP_ON_UP */
+#endif /* __ASSEMBLY__ */
+
+/*
+ * Note: if you get assembler errors from ALT_UP() when building with
+ * CONFIG_THUMB2_KERNEL, you almost certainly need to use
+ * ALT_SMP( W(instr) ... )
+ */
+#ifdef CONFIG_SMP
+#ifndef __ASSEMBLY__
+
+#define ALT_SMP(instr) \
+ "9998: " instr "\n"
+
+#define ALT_UP(instr) \
+ " .pushsection \".alt.smp.init\", \"a\"\n" \
+ " .long 9998b\n" \
+ " " instr "\n" \
+ " .popsection\n"
+
+#else
+
+#define ALT_SMP(instr...) \
+9998: instr
+
+#define ALT_UP(instr...) \
+ .pushsection ".alt.smp.init", "a" ;\
+ .long 9998b ;\
+9997: instr ;\
+ .if . - 9997b != 4 ;\
+ .error "ALT_UP() content must assemble to exactly 4 bytes";\
+ .endif ;\
+ .popsection
+
+#define ALT_UP_B(label) \
+ .equ up_b_offset, label - 9998b ;\
+ .pushsection ".alt.smp.init", "a" ;\
+ .long 9998b ;\
+ W(b) . + up_b_offset ;\
+ .popsection
+
+#endif /* __ASSEMBLY */
+#else
+#define ALT_SMP(instr...)
+#define ALT_UP(instr...) instr
+#define ALT_UP_B(label) b label
+#endif /* CONFIG_SMP */
+
+#endif /* __ASM_ARM_SMP_ON_UP_H_ */
diff --git a/arch/arm/include/asm/spinlock.h b/arch/arm/include/asm/spinlock.h
index b4ca707..58d2f42 100644
--- a/arch/arm/include/asm/spinlock.h
+++ b/arch/arm/include/asm/spinlock.h
@@ -6,20 +6,14 @@
#endif
#include <asm/processor.h>
+#include <asm/smp_on_up.h>
/*
* sev and wfe are ARMv6K extensions. Uniprocessor ARMv6 may not have the K
* extensions, so when running on UP, we have to patch these instructions away.
*/
-#define ALT_SMP(smp, up) \
- "9998: " smp "\n" \
- " .pushsection \".alt.smp.init\", \"a\"\n" \
- " .long 9998b\n" \
- " " up "\n" \
- " .popsection\n"
-
#ifdef CONFIG_THUMB2_KERNEL
-#define SEV ALT_SMP("sev.w", "nop.w")
+#define SEV ALT_SMP("sev.w") ALT_UP("nop.w")
/*
* For Thumb-2, special care is needed to ensure that the conditional WFE
* instruction really does assemble to exactly 4 bytes (as required by
@@ -33,13 +27,12 @@
*/
#define WFE(cond) ALT_SMP( \
"it " cond "\n\t" \
- "wfe" cond ".n", \
- \
- "nop.w" \
-)
+ "wfe" cond ".n") \
+ ALT_UP( \
+ "nop.w")
#else
-#define SEV ALT_SMP("sev", "nop")
-#define WFE(cond) ALT_SMP("wfe" cond, "nop")
+#define SEV ALT_SMP("sev") ALT_UP("nop")
+#define WFE(cond) ALT_SMP("wfe" cond) ALT_UP("nop")
#endif
static inline void dsb_sev(void)
diff --git a/arch/arm/kernel/head.S b/arch/arm/kernel/head.S
index 4eee351..c44b51f 100644
--- a/arch/arm/kernel/head.S
+++ b/arch/arm/kernel/head.S
@@ -495,6 +495,7 @@ smp_on_up:
.text
__do_fixup_smp_on_up:
cmp r4, r5
+ movhs r0, #0
movhs pc, lr
ldmia r4!, {r0, r6}
ARM( str r6, [r0, r3] )
diff --git a/arch/arm/kernel/module.c b/arch/arm/kernel/module.c
index 1e9be5d..f2299d0 100644
--- a/arch/arm/kernel/module.c
+++ b/arch/arm/kernel/module.c
@@ -23,6 +23,7 @@
#include <asm/pgtable.h>
#include <asm/sections.h>
#include <asm/smp_plat.h>
+#include <asm/smp_on_up.h>
#include <asm/unwind.h>
#ifdef CONFIG_XIP_KERNEL
@@ -266,7 +267,6 @@ static const Elf_Shdr *find_mod_section(const Elf32_Ehdr *hdr,
}
extern void fixup_pv_table(const void *, unsigned long);
-extern void fixup_smp(const void *, unsigned long);
int module_finalize(const Elf32_Ehdr *hdr, const Elf_Shdr *sechdrs,
struct module *mod)
@@ -323,11 +323,7 @@ int module_finalize(const Elf32_Ehdr *hdr, const Elf_Shdr *sechdrs,
#endif
s = find_mod_section(hdr, sechdrs, ".alt.smp.init");
if (s && !is_smp())
-#ifdef CONFIG_SMP_ON_UP
- fixup_smp((void *)s->sh_addr, s->sh_size);
-#else
- return -EINVAL;
-#endif
+ return fixup_smp((void *)s->sh_addr, s->sh_size);
return 0;
}
--
1.8.0
From d3a669e8d23b06cfcaf0270fe2ed405d1d1155f6 Mon Sep 17 00:00:00 2001
From: Will Deacon <will.deacon@arm.com>
Date: Wed, 21 Nov 2012 13:57:39 +0000
Subject: [PATCH 2/2] ARM: percpu: fix per-cpu offset accessors for SMP_ON_UP
on early v6 CPUs
Some ARMv6 CPUs (1156, 1136 < r1p0) do not implement the thread ID
registers and as such fail to boot an SMP_ON_UP kernel with the new
per-cpu offset implementation.
This patch adds ALT_{UP,SMP} sequences so that we return the
__per_cpu_offset of the primary CPU for uniprocessor platforms running
an SMP kernel.
Signed-off-by: Will Deacon <will.deacon@arm.com>
---
arch/arm/include/asm/percpu.h | 29 +++++++++++++++++++++++++++--
1 file changed, 27 insertions(+), 2 deletions(-)
diff --git a/arch/arm/include/asm/percpu.h b/arch/arm/include/asm/percpu.h
index 9c8d051..5b3babd 100644
--- a/arch/arm/include/asm/percpu.h
+++ b/arch/arm/include/asm/percpu.h
@@ -16,21 +16,46 @@
#ifndef _ASM_ARM_PERCPU_H_
#define _ASM_ARM_PERCPU_H_
+#include <asm/smp_on_up.h>
+
/*
* Same as asm-generic/percpu.h, except that we store the per cpu offset
* in the TPIDRPRW.
*/
#if defined(CONFIG_SMP) && (__LINUX_ARM_ARCH__ >= 6)
+#ifdef CONFIG_THUMB2_KERNEL
+#define T2_WIDE_INSN(insn) insn ".w"
+#else
+#define T2_WIDE_INSN(insn) insn
+#endif
+
static inline void set_my_cpu_offset(unsigned long off)
{
- asm volatile("mcr p15, 0, %0, c13, c0, 4 @ set TPIDRPRW" : : "r" (off) );
+ asm volatile(
+ ALT_SMP ("mcr p15, 0, %0, c13, c0, 4 @ set TPIDRPRW")
+ ALT_UP (T2_WIDE_INSN("nop"))
+ : : "r" (off));
}
static inline unsigned long __my_cpu_offset(void)
{
unsigned long off;
- asm("mrc p15, 0, %0, c13, c0, 4 @ get TPIDRPRW" : "=r" (off) : );
+
+ asm volatile(
+#ifdef CONFIG_SMP_ON_UP
+ ".align 2\n"
+#endif
+ ALT_SMP ("mrc p15, 0, %0, c13, c0, 4 @ get TPIDRPRW")
+ ALT_UP (T2_WIDE_INSN("ldr")" %0, . + 12")
+ ALT_SMP (T2_WIDE_INSN("nop"))
+ ALT_UP (T2_WIDE_INSN("ldr")" %0, [%0]")
+ ALT_SMP (T2_WIDE_INSN("nop"))
+ ALT_UP (T2_WIDE_INSN("b")" . + 8")
+ ALT_SMP (T2_WIDE_INSN("nop"))
+ ALT_UP (".long __per_cpu_offset")
+ : "=r" (off));
+
return off;
}
#define __my_cpu_offset __my_cpu_offset()
--
1.8.0
* [PATCH] ARM: implement optimized percpu variable access
2012-11-22 11:34 ` Will Deacon
@ 2012-11-22 11:39 ` Russell King - ARM Linux
2012-11-23 17:06 ` Rob Herring
` (2 subsequent siblings)
3 siblings, 0 replies; 35+ messages in thread
From: Russell King - ARM Linux @ 2012-11-22 11:39 UTC (permalink / raw)
To: linux-arm-kernel
On Thu, Nov 22, 2012 at 11:34:01AM +0000, Will Deacon wrote:
> (3) Just make the thing depend on __LINUX_ARM_ARCH__ >= 7. Yes, we lose on
> 11MPCore, but we win on A8 and the code is much, much simpler.
We also lose it in kernels built for multi-platform, such as OMAP, which
needs to cover both ARMv7 and ARMv6.
* [PATCH] ARM: implement optimized percpu variable access
2012-11-22 11:34 ` Will Deacon
2012-11-22 11:39 ` Russell King - ARM Linux
@ 2012-11-23 17:06 ` Rob Herring
2012-11-23 17:12 ` Russell King - ARM Linux
2012-11-23 17:16 ` Will Deacon
2012-11-23 20:32 ` Tony Lindgren
2012-11-25 18:46 ` Rob Herring
3 siblings, 2 replies; 35+ messages in thread
From: Rob Herring @ 2012-11-23 17:06 UTC (permalink / raw)
To: linux-arm-kernel
On 11/22/2012 05:34 AM, Will Deacon wrote:
> Hi Rob,
>
> On Sun, Nov 11, 2012 at 03:20:40AM +0000, Rob Herring wrote:
>> From: Rob Herring <rob.herring@calxeda.com>
>>
>> Use the previously unused TPIDRPRW register to store percpu offsets.
>> TPIDRPRW is only accessible in PL1, so it can only be used in the kernel.
>>
>> This saves 2 loads for each percpu variable access which should yield
>> improved performance, but the improvement has not been quantified.
>>
>> Signed-off-by: Rob Herring <rob.herring@calxeda.com>
>> ---
>> arch/arm/include/asm/Kbuild | 1 -
>> arch/arm/include/asm/percpu.h | 44 +++++++++++++++++++++++++++++++++++++++++
>> arch/arm/kernel/smp.c | 3 +++
>> 3 files changed, 47 insertions(+), 1 deletion(-)
>> create mode 100644 arch/arm/include/asm/percpu.h
>
> Russell pointed out to me that this patch will break on v6 CPUs if they don't
> have the thread ID registers and we're running with SMP_ON_UP=y. Looking at
> the TRMs, the only case we care about is 1136 < r1p0, but it does indeed break
> there (I have a board on my desk).
Are there any non ARM Ltd. cores without v6K we need to worry about? I
wouldn't think there are many 1136 < r1p0 out there (your desk being an
obvious exception).
> There are a few ways to fix this:
>
> (1) Use the SMP alternates to patch the code when running on UP systems. I
> tried this and the code is absolutely diabolical (see below).
>
> (2) Rely on the registers being RAZ/WI rather than undef (which seems to be
> the case on my board) and add on the pcpu delta manually. This is also
> really horrible.
>
> (3) Just make the thing depend on __LINUX_ARM_ARCH__ >= 7. Yes, we lose on
> 11MPCore, but we win on A8 and the code is much, much simpler.
I would lean towards this option. It really only has to depend on v6K
and !v6. We can refine the multi-platform selection to allow v7 and v6K
only builds in addition to v7 and v6. I think we are going to have
enough optimizations with v7 (gcc optimizations, thumb2, unaligned
accesses, etc.) that most people will do v7 only builds.
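Concretely I'd expect the guard in percpu.h to end up as something like
this (a sketch only; the exact Kconfig symbols are still to be decided):

  #if defined(CONFIG_SMP) && !defined(CONFIG_CPU_V6)
  /* TPIDRPRW-based accessors, as in the original patch */
  #else
  /* fall back to asm-generic/percpu.h */
  #endif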
Rob
>
> As an aside, you also need to make the asm block volatile in
> __my_cpu_offset -- I can see it being re-ordered before the set for
> secondary CPUs otherwise.
>
> Will
>
> --->8
>
> From b12ab049b864c2b6f0be1558c934d7213871b223 Mon Sep 17 00:00:00 2001
> From: Will Deacon <will.deacon@arm.com>
> Date: Wed, 21 Nov 2012 10:53:28 +0000
> Subject: [PATCH 1/2] ARM: smp: move SMP_ON_UP alternate macros to common
> header file
>
> The ALT_{UP,SMP} macros are used to patch instructions at runtime
> depending on whether we are running on an SMP platform or not. This is
> generally done in out-of-line assembly code, but there is also the need
> for this functionality in inline assembly (e.g. spinlocks).
>
> This patch moves the macros into their own header file, which provides
> the correct definitions depending on __ASSEMBLY__.
>
> Signed-off-by: Will Deacon <will.deacon@arm.com>
> ---
> arch/arm/include/asm/assembler.h | 29 +------------------
> arch/arm/include/asm/smp_on_up.h | 60 ++++++++++++++++++++++++++++++++++++++++
> arch/arm/include/asm/spinlock.h | 21 +++++---------
> arch/arm/kernel/head.S | 1 +
> arch/arm/kernel/module.c | 8 ++----
> 5 files changed, 71 insertions(+), 48 deletions(-)
> create mode 100644 arch/arm/include/asm/smp_on_up.h
>
> diff --git a/arch/arm/include/asm/assembler.h b/arch/arm/include/asm/assembler.h
> index 2ef9581..7da33e0 100644
> --- a/arch/arm/include/asm/assembler.h
> +++ b/arch/arm/include/asm/assembler.h
> @@ -23,6 +23,7 @@
> #include <asm/ptrace.h>
> #include <asm/domain.h>
> #include <asm/opcodes-virt.h>
> +#include <asm/smp_on_up.h>
>
> #define IOMEM(x) (x)
>
> @@ -166,34 +167,6 @@
> .long 9999b,9001f; \
> .popsection
>
> -#ifdef CONFIG_SMP
> -#define ALT_SMP(instr...) \
> -9998: instr
> -/*
> - * Note: if you get assembler errors from ALT_UP() when building with
> - * CONFIG_THUMB2_KERNEL, you almost certainly need to use
> - * ALT_SMP( W(instr) ... )
> - */
> -#define ALT_UP(instr...) \
> - .pushsection ".alt.smp.init", "a" ;\
> - .long 9998b ;\
> -9997: instr ;\
> - .if . - 9997b != 4 ;\
> - .error "ALT_UP() content must assemble to exactly 4 bytes";\
> - .endif ;\
> - .popsection
> -#define ALT_UP_B(label) \
> - .equ up_b_offset, label - 9998b ;\
> - .pushsection ".alt.smp.init", "a" ;\
> - .long 9998b ;\
> - W(b) . + up_b_offset ;\
> - .popsection
> -#else
> -#define ALT_SMP(instr...)
> -#define ALT_UP(instr...) instr
> -#define ALT_UP_B(label) b label
> -#endif
> -
> /*
> * Instruction barrier
> */
> diff --git a/arch/arm/include/asm/smp_on_up.h b/arch/arm/include/asm/smp_on_up.h
> new file mode 100644
> index 0000000..cc9c527
> --- /dev/null
> +++ b/arch/arm/include/asm/smp_on_up.h
> @@ -0,0 +1,60 @@
> +#ifndef __ASM_ARM_SMP_ON_UP_H_
> +#define __ASM_ARM_SMP_ON_UP_H_
> +
> +#ifndef __ASSEMBLY__
> +#ifdef CONFIG_SMP_ON_UP
> +extern int fixup_smp(const void *addr, unsigned long size);
> +#else
> +static inline int fixup_smp(const void *addr, unsigned long size)
> +{
> + return -EINVAL;
> +}
> +#endif /* CONFIG_SMP_ON_UP */
> +#endif /* __ASSEMBLY__ */
> +
> +/*
> + * Note: if you get assembler errors from ALT_UP() when building with
> + * CONFIG_THUMB2_KERNEL, you almost certainly need to use
> + * ALT_SMP( W(instr) ... )
> + */
> +#ifdef CONFIG_SMP
> +#ifndef __ASSEMBLY__
> +
> +#define ALT_SMP(instr) \
> + "9998: " instr "\n"
> +
> +#define ALT_UP(instr) \
> + " .pushsection \".alt.smp.init\", \"a\"\n" \
> + " .long 9998b\n" \
> + " " instr "\n" \
> + " .popsection\n"
> +
> +#else
> +
> +#define ALT_SMP(instr...) \
> +9998: instr
> +
> +#define ALT_UP(instr...) \
> + .pushsection ".alt.smp.init", "a" ;\
> + .long 9998b ;\
> +9997: instr ;\
> + .if . - 9997b != 4 ;\
> + .error "ALT_UP() content must assemble to exactly 4 bytes";\
> + .endif ;\
> + .popsection
> +
> +#define ALT_UP_B(label) \
> + .equ up_b_offset, label - 9998b ;\
> + .pushsection ".alt.smp.init", "a" ;\
> + .long 9998b ;\
> + W(b) . + up_b_offset ;\
> + .popsection
> +
> +#endif /* __ASSEMBLY */
> +#else
> +#define ALT_SMP(instr...)
> +#define ALT_UP(instr...) instr
> +#define ALT_UP_B(label) b label
> +#endif /* CONFIG_SMP */
> +
> +#endif /* __ASM_ARM_SMP_ON_UP_H_ */
> diff --git a/arch/arm/include/asm/spinlock.h b/arch/arm/include/asm/spinlock.h
> index b4ca707..58d2f42 100644
> --- a/arch/arm/include/asm/spinlock.h
> +++ b/arch/arm/include/asm/spinlock.h
> @@ -6,20 +6,14 @@
> #endif
>
> #include <asm/processor.h>
> +#include <asm/smp_on_up.h>
>
> /*
> * sev and wfe are ARMv6K extensions. Uniprocessor ARMv6 may not have the K
> * extensions, so when running on UP, we have to patch these instructions away.
> */
> -#define ALT_SMP(smp, up) \
> - "9998: " smp "\n" \
> - " .pushsection \".alt.smp.init\", \"a\"\n" \
> - " .long 9998b\n" \
> - " " up "\n" \
> - " .popsection\n"
> -
> #ifdef CONFIG_THUMB2_KERNEL
> -#define SEV ALT_SMP("sev.w", "nop.w")
> +#define SEV ALT_SMP("sev.w") ALT_UP("nop.w")
> /*
> * For Thumb-2, special care is needed to ensure that the conditional WFE
> * instruction really does assemble to exactly 4 bytes (as required by
> @@ -33,13 +27,12 @@
> */
> #define WFE(cond) ALT_SMP( \
> "it " cond "\n\t" \
> - "wfe" cond ".n", \
> - \
> - "nop.w" \
> -)
> + "wfe" cond ".n") \
> + ALT_UP( \
> + "nop.w")
> #else
> -#define SEV ALT_SMP("sev", "nop")
> -#define WFE(cond) ALT_SMP("wfe" cond, "nop")
> +#define SEV ALT_SMP("sev") ALT_UP("nop")
> +#define WFE(cond) ALT_SMP("wfe" cond) ALT_UP("nop")
> #endif
>
> static inline void dsb_sev(void)
> diff --git a/arch/arm/kernel/head.S b/arch/arm/kernel/head.S
> index 4eee351..c44b51f 100644
> --- a/arch/arm/kernel/head.S
> +++ b/arch/arm/kernel/head.S
> @@ -495,6 +495,7 @@ smp_on_up:
> .text
> __do_fixup_smp_on_up:
> cmp r4, r5
> + movhs r0, #0
> movhs pc, lr
> ldmia r4!, {r0, r6}
> ARM( str r6, [r0, r3] )
> diff --git a/arch/arm/kernel/module.c b/arch/arm/kernel/module.c
> index 1e9be5d..f2299d0 100644
> --- a/arch/arm/kernel/module.c
> +++ b/arch/arm/kernel/module.c
> @@ -23,6 +23,7 @@
> #include <asm/pgtable.h>
> #include <asm/sections.h>
> #include <asm/smp_plat.h>
> +#include <asm/smp_on_up.h>
> #include <asm/unwind.h>
>
> #ifdef CONFIG_XIP_KERNEL
> @@ -266,7 +267,6 @@ static const Elf_Shdr *find_mod_section(const Elf32_Ehdr *hdr,
> }
>
> extern void fixup_pv_table(const void *, unsigned long);
> -extern void fixup_smp(const void *, unsigned long);
>
> int module_finalize(const Elf32_Ehdr *hdr, const Elf_Shdr *sechdrs,
> struct module *mod)
> @@ -323,11 +323,7 @@ int module_finalize(const Elf32_Ehdr *hdr, const Elf_Shdr *sechdrs,
> #endif
> s = find_mod_section(hdr, sechdrs, ".alt.smp.init");
> if (s && !is_smp())
> -#ifdef CONFIG_SMP_ON_UP
> - fixup_smp((void *)s->sh_addr, s->sh_size);
> -#else
> - return -EINVAL;
> -#endif
> + return fixup_smp((void *)s->sh_addr, s->sh_size);
> return 0;
> }
>
>
* [PATCH] ARM: implement optimized percpu variable access
2012-11-23 17:06 ` Rob Herring
@ 2012-11-23 17:12 ` Russell King - ARM Linux
2012-11-23 17:16 ` Will Deacon
1 sibling, 0 replies; 35+ messages in thread
From: Russell King - ARM Linux @ 2012-11-23 17:12 UTC (permalink / raw)
To: linux-arm-kernel
On Fri, Nov 23, 2012 at 11:06:07AM -0600, Rob Herring wrote:
> On 11/22/2012 05:34 AM, Will Deacon wrote:
> > Hi Rob,
> >
> > On Sun, Nov 11, 2012 at 03:20:40AM +0000, Rob Herring wrote:
> >> From: Rob Herring <rob.herring@calxeda.com>
> >>
> >> Use the previously unused TPIDRPRW register to store percpu offsets.
> >> TPIDRPRW is only accessible in PL1, so it can only be used in the kernel.
> >>
> >> This saves 2 loads for each percpu variable access which should yield
> >> improved performance, but the improvement has not been quantified.
> >>
> >> Signed-off-by: Rob Herring <rob.herring@calxeda.com>
> >> ---
> >> arch/arm/include/asm/Kbuild | 1 -
> >> arch/arm/include/asm/percpu.h | 44 +++++++++++++++++++++++++++++++++++++++++
> >> arch/arm/kernel/smp.c | 3 +++
> >> 3 files changed, 47 insertions(+), 1 deletion(-)
> >> create mode 100644 arch/arm/include/asm/percpu.h
> >
> > Russell pointed out to me that this patch will break on v6 CPUs if they don't
> > have the thread ID registers and we're running with SMP_ON_UP=y. Looking at
> > the TRMs, the only case we care about is 1136 < r1p0, but it does indeed break
> > there (I have a board on my desk).
>
> Are there any non ARM Ltd. cores without v6K we need to worry about? I
> wouldn't think there are many 1136 < r1p0 out there (your desk being an
> obvious exception).
And my desk...
* [PATCH] ARM: implement optimized percpu variable access
2012-11-23 17:06 ` Rob Herring
2012-11-23 17:12 ` Russell King - ARM Linux
@ 2012-11-23 17:16 ` Will Deacon
2012-11-23 20:34 ` Tony Lindgren
1 sibling, 1 reply; 35+ messages in thread
From: Will Deacon @ 2012-11-23 17:16 UTC (permalink / raw)
To: linux-arm-kernel
On Fri, Nov 23, 2012 at 05:06:07PM +0000, Rob Herring wrote:
> On 11/22/2012 05:34 AM, Will Deacon wrote:
> > On Sun, Nov 11, 2012 at 03:20:40AM +0000, Rob Herring wrote:
> >> From: Rob Herring <rob.herring@calxeda.com>
> >>
> >> Use the previously unused TPIDRPRW register to store percpu offsets.
> >> TPIDRPRW is only accessible in PL1, so it can only be used in the kernel.
> >>
> >> This saves 2 loads for each percpu variable access which should yield
> >> improved performance, but the improvement has not been quantified.
> >>
> >> Signed-off-by: Rob Herring <rob.herring@calxeda.com>
> >> ---
> >> arch/arm/include/asm/Kbuild | 1 -
> >> arch/arm/include/asm/percpu.h | 44 +++++++++++++++++++++++++++++++++++++++++
> >> arch/arm/kernel/smp.c | 3 +++
> >> 3 files changed, 47 insertions(+), 1 deletion(-)
> >> create mode 100644 arch/arm/include/asm/percpu.h
> >
> > Russell pointed out to me that this patch will break on v6 CPUs if they don't
> > have the thread ID registers and we're running with SMP_ON_UP=y. Looking at
> > the TRMs, the only case we care about is 1136 < r1p0, but it does indeed break
> > there (I have a board on my desk).
>
> Are there any non ARM Ltd. cores without v6K we need to worry about? I
> wouldn't think there are many 1136 < r1p0 out there (your desk being an
> obvious exception).
To be honest, I'm not sure. It would be good if Marvell and Qualcomm could
chime in as I wouldn't be surprised if they had some parts that fit this
category.
As for the 1136, we have a few spare if you want one!
> > There are a few ways to fix this:
> >
> > (1) Use the SMP alternates to patch the code when running on UP systems. I
> > tried this and the code is absolutely diabolical (see below).
> >
> > (2) Rely on the registers being RAZ/WI rather than undef (which seems to be
> > the case on my board) and add on the pcpu delta manually. This is also
> > really horrible.
> >
> > (3) Just make the thing depend on __LINUX_ARM_ARCH__ >= 7. Yes, we lose on
> > 11MPCore, but we win on A8 and the code is much, much simpler.
>
> I would lean towards this option. It really only has to depend on v6K
> and !v6. We can refine the multi-platform selection to allow v7 and v6K
> only builds in addition to v7 and v6. I think we are going to have
> enough optimizations with v7 (gcc optimizations, thumb2, unaligned
> accesses, etc.) that most people will do v7 only builds.
Sounds alright to me, and I'm happy to test on the boards that I have.
Will
* [PATCH] ARM: implement optimized percpu variable access
2012-11-22 11:34 ` Will Deacon
2012-11-22 11:39 ` Russell King - ARM Linux
2012-11-23 17:06 ` Rob Herring
@ 2012-11-23 20:32 ` Tony Lindgren
2012-11-25 18:46 ` Rob Herring
3 siblings, 0 replies; 35+ messages in thread
From: Tony Lindgren @ 2012-11-23 20:32 UTC (permalink / raw)
To: linux-arm-kernel
* Will Deacon <will.deacon@arm.com> [121122 03:43]:
> Hi Rob,
>
> On Sun, Nov 11, 2012 at 03:20:40AM +0000, Rob Herring wrote:
> > From: Rob Herring <rob.herring@calxeda.com>
> >
> > Use the previously unused TPIDRPRW register to store percpu offsets.
> > TPIDRPRW is only accessible in PL1, so it can only be used in the kernel.
> >
> > This saves 2 loads for each percpu variable access which should yield
> > improved performance, but the improvement has not been quantified.
> >
> > Signed-off-by: Rob Herring <rob.herring@calxeda.com>
> > ---
> > arch/arm/include/asm/Kbuild | 1 -
> > arch/arm/include/asm/percpu.h | 44 +++++++++++++++++++++++++++++++++++++++++
> > arch/arm/kernel/smp.c | 3 +++
> > 3 files changed, 47 insertions(+), 1 deletion(-)
> > create mode 100644 arch/arm/include/asm/percpu.h
>
> Russell pointed out to me that this patch will break on v6 CPUs if they don't
> have the thread ID registers and we're running with SMP_ON_UP=y. Looking at
> the TRMs, the only case we care about is 1136 < r1p0, but it does indeed break
> there (I have a board on my desk).
Sounds like it will break all omap2420 based systems like Nokia n8x0 :(
FYI those are booting with omap2plus_defconfig with SMP_ON_UP=y.
Regards,
Tony
* [PATCH] ARM: implement optimized percpu variable access
2012-11-23 17:16 ` Will Deacon
@ 2012-11-23 20:34 ` Tony Lindgren
0 siblings, 0 replies; 35+ messages in thread
From: Tony Lindgren @ 2012-11-23 20:34 UTC (permalink / raw)
To: linux-arm-kernel
* Will Deacon <will.deacon@arm.com> [121123 09:23]:
> On Fri, Nov 23, 2012 at 05:06:07PM +0000, Rob Herring wrote:
> > On 11/22/2012 05:34 AM, Will Deacon wrote:
> > > On Sun, Nov 11, 2012 at 03:20:40AM +0000, Rob Herring wrote:
> > >> From: Rob Herring <rob.herring@calxeda.com>
> > >>
> > >> Use the previously unused TPIDRPRW register to store percpu offsets.
> > >> TPIDRPRW is only accessible in PL1, so it can only be used in the kernel.
> > >>
> > >> This saves 2 loads for each percpu variable access which should yield
> > >> improved performance, but the improvement has not been quantified.
> > >>
> > >> Signed-off-by: Rob Herring <rob.herring@calxeda.com>
> > >> ---
> > >> arch/arm/include/asm/Kbuild | 1 -
> > >> arch/arm/include/asm/percpu.h | 44 +++++++++++++++++++++++++++++++++++++++++
> > >> arch/arm/kernel/smp.c | 3 +++
> > >> 3 files changed, 47 insertions(+), 1 deletion(-)
> > >> create mode 100644 arch/arm/include/asm/percpu.h
> > >
> > > Russell pointed out to me that this patch will break on v6 CPUs if they don't
> > > have the thread ID registers and we're running with SMP_ON_UP=y. Looking at
> > > the TRMs, the only case we care about is 1136 < r1p0, but it does indeed break
> > > there (I have a board on my desk).
> >
> > Are there any non ARM Ltd. cores without v6K we need to worry about? I
> > wouldn't think there are many 1136 < r1p0 out there (your desk being an
> > obvious exception).
>
> To be honest, I'm not sure. It would be good if Marvell and Qualcomm could
> chime in as I wouldn't be surprised if they had some parts that fit this
> category.
At least omap2420 is without v6K.
Regards,
Tony
* [PATCH] ARM: implement optimized percpu variable access
2012-11-22 11:34 ` Will Deacon
` (2 preceding siblings ...)
2012-11-23 20:32 ` Tony Lindgren
@ 2012-11-25 18:46 ` Rob Herring
2012-11-26 11:13 ` Will Deacon
3 siblings, 1 reply; 35+ messages in thread
From: Rob Herring @ 2012-11-25 18:46 UTC (permalink / raw)
To: linux-arm-kernel
On 11/22/2012 05:34 AM, Will Deacon wrote:
> Hi Rob,
> As an aside, you also need to make the asm block volatile in
> __my_cpu_offset -- I can see it being re-ordered before the set for
> secondary CPUs otherwise.
I don't think that is right. Doing that means the register is reloaded
on every access and you end up with code like this (from handle_IRQ):
c000eb4c: ee1d2f90 mrc 15, 0, r2, cr13, cr0, {4}
c000eb50: e7926003 ldr r6, [r2, r3]
c000eb54: ee1d2f90 mrc 15, 0, r2, cr13, cr0, {4}
c000eb58: e7821003 str r1, [r2, r3]
c000eb5c: eb006cb1 bl c0029e28 <irq_enter>
I don't really see where there would be a re-ordering issue. There's no
percpu var access before or near the setting that I can see.
Rob
* [PATCH] ARM: implement optimized percpu variable access
2012-11-25 18:46 ` Rob Herring
@ 2012-11-26 11:13 ` Will Deacon
2012-11-26 15:15 ` Will Deacon
0 siblings, 1 reply; 35+ messages in thread
From: Will Deacon @ 2012-11-26 11:13 UTC (permalink / raw)
To: linux-arm-kernel
Hi Rob,
On Sun, Nov 25, 2012 at 06:46:55PM +0000, Rob Herring wrote:
> On 11/22/2012 05:34 AM, Will Deacon wrote:
> > As an aside, you also need to make the asm block volatile in
> > __my_cpu_offset -- I can see it being re-ordered before the set for
> > secondary CPUs otherwise.
>
> I don't think that is right. Doing that means the register is reloaded
> on every access and you end up with code like this (from handle_IRQ):
>
> c000eb4c: ee1d2f90 mrc 15, 0, r2, cr13, cr0, {4}
> c000eb50: e7926003 ldr r6, [r2, r3]
> c000eb54: ee1d2f90 mrc 15, 0, r2, cr13, cr0, {4}
> c000eb58: e7821003 str r1, [r2, r3]
> c000eb5c: eb006cb1 bl c0029e28 <irq_enter>
>
> I don't really see where there would be a re-ordering issue. There's no
> percpu var access before or near the setting that I can see.
Well my A15 doesn't boot with your original patch unless I make that thing
volatile, so something does need tweaking...
The issue is on bringing up the secondary core, so I assumed that a lot
of inlining goes on inside secondary_start_kernel and then the result is
shuffled around, placing a cpu-offset read before we've done the set.
Unfortunately, looking at the disassembly I can't see this happening at
all, so I'll keep digging. The good news is that I've just reproduced the
problem on the model, so I've got more visibility now (although both cores
are just stuck in spinlocks...).
Will
* [PATCH] ARM: implement optimized percpu variable access
2012-11-26 11:13 ` Will Deacon
@ 2012-11-26 15:15 ` Will Deacon
2012-11-26 17:30 ` Rob Herring
` (3 more replies)
0 siblings, 4 replies; 35+ messages in thread
From: Will Deacon @ 2012-11-26 15:15 UTC (permalink / raw)
To: linux-arm-kernel
On Mon, Nov 26, 2012 at 11:13:37AM +0000, Will Deacon wrote:
> On Sun, Nov 25, 2012 at 06:46:55PM +0000, Rob Herring wrote:
> > On 11/22/2012 05:34 AM, Will Deacon wrote:
> > > As an aside, you also need to make the asm block volatile in
> > > __my_cpu_offset -- I can see it being re-ordered before the set for
> > > secondary CPUs otherwise.
> >
> > I don't really see where there would be a re-ordering issue. There's no
> > percpu var access before or near the setting that I can see.
>
> The issue is on bringing up the secondary core, so I assumed that a lot
> of inlining goes on inside secondary_start_kernel and then the result is
> shuffled around, placing a cpu-offset read before we've done the set.
>
> Unfortunately, looking at the disassembly I can't see this happening at
> all, so I'll keep digging. The good news is that I've just reproduced the
> problem on the model, so I've got more visibility now (although both cores
> are just stuck in spinlocks...).
That was a fun bit of debugging -- my hunch was right, but I was looking in the
wrong place because I had an unrelated problem with my bootloader.
What happens is that every man and his dog is inlined into __schedule,
including all the runqueue accessors, such as this_rq(), which make use of
per-cpu offsets to get the correct pointer. The compiler then spits out
something like this near the start of the function:
c02c1d66: af04 add r7, sp, #16
[...]
c02c1d6c: ee1d 3f90 mrc 15, 0, r3, cr13, cr0, {4}
c02c1d70: 199b adds r3, r3, r6
c02c1d72: f8c7 e008 str.w lr, [r7, #8]
c02c1d76: 617b str r3, [r7, #20]
c02c1d78: 613e str r6, [r7, #16]
c02c1d7a: 60fb str r3, [r7, #12]
so the address of the current runqueue has been calculated and stored, with
a bunch of other stuff, in a structure on the stack.
We then do our context_switch dance (which is also inlined) and return as
the next task (since we've done switch_{mm,to}) before doing:
barrier();
/*
* this_rq must be evaluated again because prev may have moved
* CPUs since it called schedule(), thus the 'rq' on its stack
* frame will be invalid.
*/
finish_task_switch(this_rq(), prev);
The problem here is that, because our CPU accessors don't actually make any
memory references, the barrier() has no effect and the old value is just
reloaded off the stack:
c02c1f22: f54a fe49 bl c000cbb8 <__switch_to>
c02c1f26: 4601 mov r1, r0
c02c1f28: 68f8 ldr r0, [r7, #12]
c02c1f2a: f56f ffd5 bl c0031ed8 <finish_task_switch>
which obviously causes complete chaos if the new task has been pulled from
a different runqueue! (this appears as a double spin unlock on rq->lock).
Fixing this without giving up the performance improvement we gain by *avoiding*
the memory access in the first place is going to be tricky...
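To make the failure mode concrete, a stripped-down analogue (hypothetical,
not the kernel code itself):

  /* A non-volatile asm with no memory operands is treated as a pure
   * function of its inputs, so GCC may CSE the second read across
   * barrier() and reuse the value it already spilled to the stack. */
  static inline unsigned long my_cpu_offset(void)
  {
          unsigned long off;
          asm("mrc p15, 0, %0, c13, c0, 4" : "=r" (off));
          return off;
  }

  unsigned long demo(void)
  {
          unsigned long before = my_cpu_offset();
          asm volatile("" : : : "memory");        /* barrier() */
          return my_cpu_offset() - before;        /* may be folded to 0 */
  }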
Will
* [PATCH] ARM: implement optimized percpu variable access
2012-11-26 15:15 ` Will Deacon
@ 2012-11-26 17:30 ` Rob Herring
2012-11-27 13:17 ` Will Deacon
2012-11-26 21:58 ` Jamie Lokier
` (2 subsequent siblings)
3 siblings, 1 reply; 35+ messages in thread
From: Rob Herring @ 2012-11-26 17:30 UTC (permalink / raw)
To: linux-arm-kernel
On 11/26/2012 09:15 AM, Will Deacon wrote:
> On Mon, Nov 26, 2012 at 11:13:37AM +0000, Will Deacon wrote:
>> On Sun, Nov 25, 2012 at 06:46:55PM +0000, Rob Herring wrote:
>>> On 11/22/2012 05:34 AM, Will Deacon wrote:
>>>> As an aside, you also need to make the asm block volatile in
>>>> __my_cpu_offset -- I can see it being re-ordered before the set for
>>>> secondary CPUs otherwise.
>>>
>>> I don't really see where there would be a re-ordering issue. There's no
>>> percpu var access before or near the setting that I can see.
>>
>> The issue is on bringing up the secondary core, so I assumed that a lot
>> of inlining goes on inside secondary_start_kernel and then the result is
>> shuffled around, placing a cpu-offset read before we've done the set.
>>
>> Unfortunately, looking at the disassembly I can't see this happening at
>> all, so I'll keep digging. The good news is that I've just reproduced the
>> problem on the model, so I've got more visibility now (although both cores
>> are just stuck in spinlocks...).
>
> That was a fun bit of debugging -- my hunch was right, but I was looking in the
> wrong place because I had an unrelated problem with my bootloader.
>
> What happens is that every man and his dog is inlined into __schedule,
> including all the runqueue accessors, such as this_rq(), which make use of
> per-cpu offsets to get the correct pointer. The compiler then spits out
> something like this near the start of the function:
>
> c02c1d66: af04 add r7, sp, #16
> [...]
> c02c1d6c: ee1d 3f90 mrc 15, 0, r3, cr13, cr0, {4}
> c02c1d70: 199b adds r3, r3, r6
> c02c1d72: f8c7 e008 str.w lr, [r7, #8]
> c02c1d76: 617b str r3, [r7, #20]
> c02c1d78: 613e str r6, [r7, #16]
> c02c1d7a: 60fb str r3, [r7, #12]
>
> so the address of the current runqueue has been calculated and stored, with
> a bunch of other stuff, in a structure on the stack.
>
> We then do our context_switch dance (which is also inlined) and return as
> the next task (since we've done switch_{mm,to}) before doing:
>
> barrier();
> /*
> * this_rq must be evaluated again because prev may have moved
> * CPUs since it called schedule(), thus the 'rq' on its stack
> * frame will be invalid.
> */
> finish_task_switch(this_rq(), prev);
>
> The problem here is that, because our CPU accessors don't actually make any
> memory references, the barrier() has no effect and the old value is just
> reloaded off the stack:
>
> c02c1f22: f54a fe49 bl c000cbb8 <__switch_to>
> c02c1f26: 4601 mov r1, r0
> c02c1f28: 68f8 ldr r0, [r7, #12]
> c02c1f2a: f56f ffd5 bl c0031ed8 <finish_task_switch>
>
> which obviously causes complete chaos if the new task has been pulled from
> a different runqueue! (this appears as a double spin unlock on rq->lock).
>
> Fixing this without giving up the performance improvement we gain by *avoiding*
> the memory access in the first place is going to be tricky...
What compiler and config are you using? I get a reload of the register here:
c0350fba: f001 fa9d bl c03524f8 <__switch_to>
c0350fbe: 4601 mov r1, r0
c0350fc0: ee1d 0f90 mrc 15, 0, r0, cr13, cr0, {4}
c0350fc4: 4c2a ldr r4, [pc, #168] ; (c0351070 <__schedule+0x390>)
c0350fc6: f649 1590 movw r5, #39312 ; 0x9990
c0350fca: 1900 adds r0, r0, r4
c0350fcc: f2cc 0556 movt r5, #49238 ; 0xc056
c0350fd0: f4e7 fd56 bl c0038a80 <finish_task_switch>
Rob
^ permalink raw reply [flat|nested] 35+ messages in thread
* [PATCH] ARM: implement optimized percpu variable access
2012-11-26 15:15 ` Will Deacon
2012-11-26 17:30 ` Rob Herring
@ 2012-11-26 21:58 ` Jamie Lokier
2012-11-26 23:50 ` Jamie Lokier
2012-11-27 1:02 ` Jamie Lokier
2012-11-27 17:35 ` Nicolas Pitre
3 siblings, 1 reply; 35+ messages in thread
From: Jamie Lokier @ 2012-11-26 21:58 UTC (permalink / raw)
To: linux-arm-kernel
Will Deacon wrote:
> That was a fun bit of debugging -- my hunch was right, but I was looking in the
> wrong place because I had an unrelated problem with my bootloader.
>
> What happens is that every man and his dog is inlined into __schedule,
> including all the runqueue accessors, such as this_rq(), which make use of
> per-cpu offsets to get the correct pointer. The compiler then spits out
> something like this near the start of the function:
>
[...]
>
> so the address of the current runqueue has been calculated and stored, with
> a bunch of other stuff, in a structure on the stack.
>
[...]
>
> barrier();
> /*
> * this_rq must be evaluated again because prev may have moved
> * CPUs since it called schedule(), thus the 'rq' on its stack
> * frame will be invalid.
> */
> finish_task_switch(this_rq(), prev);
>
> The problem here is that, because our CPU accessors don't actually make any
> memory references, the barrier() has no effect and the old value is just
> reloaded off the stack:
>
[...]
>
> which obviously causes complete chaos if the new task has been pulled from
> a different runqueue! (this appears as a double spin unlock on rq->lock).
>
> Fixing this without giving up the performance improvement we gain by *avoiding*
> the memory access in the first place is going to be tricky...
Perhaps look at x86's approach, in arch/x86/include, <asm/current.h>,
<asm/percpu.h> and <asm/switch_to.h>.
current uses this_cpu_read_stable():
/*
* this_cpu_read() makes gcc load the percpu variable every time it is
* accessed while this_cpu_read_stable() allows the value to be cached.
* this_cpu_read_stable() is more efficient and can be used if its value
* is guaranteed to be valid across cpus. The current users include
* get_current() and get_thread_info() both of which are actually
* per-thread variables implemented as per-cpu variables and thus
* stable for the duration of the respective task.
*/
So how does x86 do it?
After lots of staring at the x86 headers, I think
this_cpu_read_stable() is a non-volatile asm which depends only on the
address &current_task, which is constant (GCC doesn't know the address
is dereferenced), and I can't see how reloading the value after
switch_to() (32 or 64 bit versions) is guaranteed, unless the "p" asm
constraint implies something more than "the value is a pointer".
I assume I've missed something very subtle, but if not, x86 has the
same bug/issue as discussed in this thread.
-- Jamie
^ permalink raw reply [flat|nested] 35+ messages in thread
* [PATCH] ARM: implement optimized percpu variable access
2012-11-26 21:58 ` Jamie Lokier
@ 2012-11-26 23:50 ` Jamie Lokier
0 siblings, 0 replies; 35+ messages in thread
From: Jamie Lokier @ 2012-11-26 23:50 UTC (permalink / raw)
To: linux-arm-kernel
Jamie Lokier wrote:
> So how does x86 do it?
>
> After lots of staring at the x86 headers, I think
> this_cpu_read_stable() is a non-volatile asm which depends only on the
> address &current_task, which is constant (GCC doesn't know the address
> is dereferenced), and I can't see how reloading the value after
> switch_to() (32 or 64 bit versions) is guaranteed, unless the "p" asm
> constraint implies something more than "the value is a pointer".
>
> I assume I've missed something very subtle, but if not, x86 has the
> same bug/issue as discussed in this thread.
Well I wasn't thinking too clearly there.
As long as the various scheduler functions don't _care_ that anything
which calls this_cpu_read_stable() is a function of the current task,
and hence current stack, it's fine to have them cached on the stack
and/or in registers across the context switch.
The general rule: When this_cpu... is used for something which is a
constant function of the current stack (task), not really a function
of the CPU (that being just an optimisation), this_cpu_read_stable()
is fine. When it's for something CPU dependent, non-caching
this_cpu... is needed. Not caching the per-CPU variable address isn't
special to context switching. It's unsafe to cache the address in any
preemptible context, and across anything which may call schedule().
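To make that concrete, a minimal sketch of the usual safe pattern (the per-CPU variable here is purely illustrative): the cached address is only held while preemption is off, so migration cannot invalidate it.

#include <linux/percpu.h>
#include <linux/preempt.h>

DEFINE_PER_CPU(int, myvar1);            /* illustrative variable */

static void safe_percpu_update(void)
{
        int *p;

        preempt_disable();
        p = this_cpu_ptr(&myvar1);      /* address valid only until preempt_enable() */
        (*p)++;
        preempt_enable();

        /* Equivalent shorthand for the same pattern: */
        get_cpu_var(myvar1)++;          /* disables preemption ... */
        put_cpu_var(myvar1);            /* ... and re-enables it */
}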
Groovy.
-- Jamie
^ permalink raw reply [flat|nested] 35+ messages in thread
* [PATCH] ARM: implement optimized percpu variable access
2012-11-26 15:15 ` Will Deacon
2012-11-26 17:30 ` Rob Herring
2012-11-26 21:58 ` Jamie Lokier
@ 2012-11-27 1:02 ` Jamie Lokier
2012-11-27 22:02 ` Rob Herring
2012-11-28 12:34 ` Will Deacon
2012-11-27 17:35 ` Nicolas Pitre
3 siblings, 2 replies; 35+ messages in thread
From: Jamie Lokier @ 2012-11-27 1:02 UTC (permalink / raw)
To: linux-arm-kernel
Will Deacon wrote:
> On Mon, Nov 26, 2012 at 11:13:37AM +0000, Will Deacon wrote:
> > On Sun, Nov 25, 2012 at 06:46:55PM +0000, Rob Herring wrote:
> > > On 11/22/2012 05:34 AM, Will Deacon wrote:
> > > > As an aside, you also need to make the asm block volatile in
> > > > __my_cpu_offset -- I can see it being re-ordered before the set for
> > > > secondary CPUs otherwise.
> > >
> > > I don't really see where there would be a re-ordering issue. There's no
> > > percpu var access before or near the setting that I can see.
> >
> > The issue is on bringing up the secondary core, so I assumed that a lot
> > of inlining goes on inside secondary_start_kernel and then the result is
> > shuffled around, placing a cpu-offset read before we've done the set.
> >
> > Unfortunately, looking at the disassembly I can't see this happening at
> > all, so I'll keep digging. The good news is that I've just reproduced the
> > problem on the model, so I've got more visibility now (although both cores
> > are just stuck in spinlocks...).
>
> That was a fun bit of debugging -- my hunch was right,
Yes it was.
I'll sum up what I found looking at the x86 version.
Brief summary:
1. Due to preemption, it's not safe to cache per-CPU values within
a function in general.
2. Except, if they are per-thread values (like current and
current_thread_info) that don't depend on the CPU and just use
per-CPU for efficient reading.
3. So, implement separate this_cpu_read_stable() and
this_cpu_read() if you want GCC to cache certain values that are
safe to cache.
4. Use asm volatile, not just an "m" constraint. I think x86 has a
bug by using just "m" for this_cpu_read().
Long version:
- It's not really about the code in schedule(). It's about when
context switch can happen. Which is every instruction in a
preemptible context.
- this_cpu (and __my_cpu_offset) need to reload the per-CPU offset
each time. This is because per-CPU offset can change on every
instruction in a preemptible context.
- For task-dependent values like current and current_thread_info(),
x86 uses a variant called this_cpu_read_stable(). That is
allowed to be cached by GCC across barrier() and in general.
- Probably this_cpu_read_stable() ought to be available more
generically across archs
- That means interaction with barrier(), and with switch_to(),
aren't relevant. There are two flavours: Safe to cache always,
and safe to cache never. (If you count !CONFIG_PREEMPT maybe
it's different, but that's tweaking.)
- x86 __my_cpu_offset does a memory read, using an "m" asm
constraint on a per-CPU variable (this_cpu_off). This makes it
sensitive to barrier(), and otherwise cache the value. It's a
nice trick, if it's safe.
- That seems like a nice trick, allowing some reads to be reused
between instructions. It can be replicated even on other archs,
using an "m" constraint on a dummy extern variable (no need to
actually read anything!). But I'm not convinced this isn't a bug
on x86. Consider this code:
#include <linux/percpu.h>
#include <linux/preempt.h>
#include <linux/spinlock.h>
DEFINE_PER_CPU(int, myvar1) = 0;
DEFINE_PER_CPU(int, myvar2) = 0;
static spinlock_t mylock = SPIN_LOCK_UNLOCKED;
int mytest(void)
{
unsigned long flags;
int x, y, z;
x = this_cpu_read(myvar1);
spin_lock_irqsave(&mylock, flags);
/*
* These two values should be consistent due to irqsave: No
* preemption, no interrupts. But on x86, GCC can reuse the x
* above for the value of y. If preemption happened before the
* irqsave, y and z are not consistent.
*/
y = this_cpu_read(myvar1);
z = this_cpu_read(myvar2);
spin_unlock_irqrestore(&mylock, flags);
return y + z;
}
I think the optimisation behaviour of this_cpu_read() is unsafe on x86
for the reason in the comment above. I have tested with GCC 4.5.2 on
x86-32 and it really does what the comment says.
Curiously all uses of __this_cpu_ptr() are fine: that asm uses
volatile. I think this_cpu_read() also needs the volatile; but not
this_cpu_read_stable().
It's rather subtle and I wouldn't be surprised if I'm mistaken.
Best,
-- Jamie
^ permalink raw reply [flat|nested] 35+ messages in thread
* [PATCH] ARM: implement optimized percpu variable access
2012-11-26 17:30 ` Rob Herring
@ 2012-11-27 13:17 ` Will Deacon
2012-11-27 13:26 ` Russell King - ARM Linux
0 siblings, 1 reply; 35+ messages in thread
From: Will Deacon @ 2012-11-27 13:17 UTC (permalink / raw)
To: linux-arm-kernel
On Mon, Nov 26, 2012 at 05:30:55PM +0000, Rob Herring wrote:
> On 11/26/2012 09:15 AM, Will Deacon wrote:
> > The problem here is that, because our CPU accessors don't actually make any
> > memory references, the barrier() has no effect and the old value is just
> > reloaded off the stack:
> >
> > c02c1f22: f54a fe49 bl c000cbb8 <__switch_to>
> > c02c1f26: 4601 mov r1, r0
> > c02c1f28: 68f8 ldr r0, [r7, #12]
> > c02c1f2a: f56f ffd5 bl c0031ed8 <finish_task_switch>
> >
> > which obviously causes complete chaos if the new task has been pulled from
> > a different runqueue! (this appears as a double spin unlock on rq->lock).
> >
> > Fixing this without giving up the performance improvement we gain by *avoiding*
> > the memory access in the first place is going to be tricky...
>
> What compiler and config are you using? I get a reload of the register here:
>
> c0350fba: f001 fa9d bl c03524f8 <__switch_to>
> c0350fbe: 4601 mov r1, r0
> c0350fc0: ee1d 0f90 mrc 15, 0, r0, cr13, cr0, {4}
> c0350fc4: 4c2a ldr r4, [pc, #168] ; (c0351070 <__schedule+0x390>)
> c0350fc6: f649 1590 movw r5, #39312 ; 0x9990
> c0350fca: 1900 adds r0, r0, r4
> c0350fcc: f2cc 0556 movt r5, #49238 ; 0xc056
> c0350fd0: f4e7 fd56 bl c0038a80 <finish_task_switch>
I tried both Linaro 12.07 and 12.10 GCC builds, although the problem would
only occur if I did a make clean and then a fresh build on top of that.
Just building the relevant object files didn't seem to tickle the problem.
I can mail you my .config if you like?
Will
^ permalink raw reply [flat|nested] 35+ messages in thread
* [PATCH] ARM: implement optimized percpu variable access
2012-11-27 13:17 ` Will Deacon
@ 2012-11-27 13:26 ` Russell King - ARM Linux
0 siblings, 0 replies; 35+ messages in thread
From: Russell King - ARM Linux @ 2012-11-27 13:26 UTC (permalink / raw)
To: linux-arm-kernel
On Tue, Nov 27, 2012 at 01:17:42PM +0000, Will Deacon wrote:
> I tried both Linaro 12.07 and 12.10 GCC builds, although the problem would
> only occur if I did a make clean and then a fresh build on top of that.
> Just building the relavant object files didn't seem to tickle the problem.
Given that the GCC optimiser does different things depending on the
direction the wind is blowing, that doesn't surprise me (anyone who's
looked at the output of modern gcc for even fairly simple functions
will know that how gcc optimises depends on a _lot_ of factors - even
what the preceding functions in the same compilation unit are.)
So I would not read too much into the output; if it's _possible_
for GCC to create an output which is not what we'd call correct, then
it's possible for it to create it, and because someone elses GCC doesn't
that's no reason to dismiss it (it could really be some subtle difference
causing the optimiser to behave slightly differently at that point.)
It _could_ be that the optimiser has decided on Jamie's test that it's
cheaper to recompute the percpu value, rather than yours where it's
decided to cache it on the stack.
So... if the point where the percpu stuff needs to be reloaded has a
compiler barrier, and that compiler barrier does not have an effect on
our percpu stuff being reloaded, then we still need to fix that.
^ permalink raw reply [flat|nested] 35+ messages in thread
* [PATCH] ARM: implement optimized percpu variable access
2012-11-11 3:20 [PATCH] ARM: implement optimized percpu variable access Rob Herring
2012-11-12 10:23 ` Will Deacon
2012-11-22 11:34 ` Will Deacon
@ 2012-11-27 17:19 ` Nicolas Pitre
2012-11-27 19:37 ` Rob Herring
2 siblings, 1 reply; 35+ messages in thread
From: Nicolas Pitre @ 2012-11-27 17:19 UTC (permalink / raw)
To: linux-arm-kernel
On Sat, 10 Nov 2012, Rob Herring wrote:
> From: Rob Herring <rob.herring@calxeda.com>
>
> Use the previously unused TPIDRPRW register to store percpu offsets.
> TPIDRPRW is only accessible in PL1, so it can only be used in the kernel.
>
> This saves 2 loads for each percpu variable access which should yield
> improved performance, but the improvement has not been quantified.
>
> Signed-off-by: Rob Herring <rob.herring@calxeda.com>
I've just got around to wrap my brain around this patch and the
discussion that ensued.
Isn't your patch lacking the preserving and restoring of the TPIDRPRW in
the suspend and resume paths.
> ---
> arch/arm/include/asm/Kbuild | 1 -
> arch/arm/include/asm/percpu.h | 44 +++++++++++++++++++++++++++++++++++++++++
> arch/arm/kernel/smp.c | 3 +++
> 3 files changed, 47 insertions(+), 1 deletion(-)
> create mode 100644 arch/arm/include/asm/percpu.h
>
> diff --git a/arch/arm/include/asm/Kbuild b/arch/arm/include/asm/Kbuild
> index f70ae17..2ffdaac 100644
> --- a/arch/arm/include/asm/Kbuild
> +++ b/arch/arm/include/asm/Kbuild
> @@ -16,7 +16,6 @@ generic-y += local64.h
> generic-y += msgbuf.h
> generic-y += param.h
> generic-y += parport.h
> -generic-y += percpu.h
> generic-y += poll.h
> generic-y += resource.h
> generic-y += sections.h
> diff --git a/arch/arm/include/asm/percpu.h b/arch/arm/include/asm/percpu.h
> new file mode 100644
> index 0000000..9eb7372
> --- /dev/null
> +++ b/arch/arm/include/asm/percpu.h
> @@ -0,0 +1,44 @@
> +/*
> + * Copyright 2012 Calxeda, Inc.
> + *
> + * This program is free software; you can redistribute it and/or modify it
> + * under the terms and conditions of the GNU General Public License,
> + * version 2, as published by the Free Software Foundation.
> + *
> + * This program is distributed in the hope it will be useful, but WITHOUT
> + * ANY WARRANTY; without even the implied warranty of MERCHANTABILITY or
> + * FITNESS FOR A PARTICULAR PURPOSE. See the GNU General Public License for
> + * more details.
> + *
> + * You should have received a copy of the GNU General Public License along with
> + * this program. If not, see <http://www.gnu.org/licenses/>.
> + */
> +#ifndef _ASM_ARM_PERCPU_H_
> +#define _ASM_ARM_PERCPU_H_
> +
> +/*
> + * Same as asm-generic/percpu.h, except that we store the per cpu offset
> + * in the TPIDRPRW.
> + */
> +#if defined(CONFIG_SMP) && (__LINUX_ARM_ARCH__ >= 6)
> +
> +static inline void set_my_cpu_offset(unsigned long off)
> +{
> + asm volatile("mcr p15, 0, %0, c13, c0, 4 @ set TPIDRPRW" : : "r" (off) : "cc" );
> +}
> +
> +static inline unsigned long __my_cpu_offset(void)
> +{
> + unsigned long off;
> + asm("mrc p15, 0, %0, c13, c0, 4 @ get TPIDRPRW" : "=r" (off) : );
> + return off;
> +}
> +#define __my_cpu_offset __my_cpu_offset()
> +#else
> +#define set_my_cpu_offset(x) do {} while(0)
> +
> +#endif /* CONFIG_SMP */
> +
> +#include <asm-generic/percpu.h>
> +
> +#endif /* _ASM_ARM_PERCPU_H_ */
> diff --git a/arch/arm/kernel/smp.c b/arch/arm/kernel/smp.c
> index fbc8b26..897ef60 100644
> --- a/arch/arm/kernel/smp.c
> +++ b/arch/arm/kernel/smp.c
> @@ -313,6 +313,8 @@ asmlinkage void __cpuinit secondary_start_kernel(void)
> current->active_mm = mm;
> cpumask_set_cpu(cpu, mm_cpumask(mm));
>
> + set_my_cpu_offset(per_cpu_offset(cpu));
> +
> printk("CPU%u: Booted secondary processor\n", cpu);
>
> cpu_init();
> @@ -371,6 +373,7 @@ void __init smp_cpus_done(unsigned int max_cpus)
>
> void __init smp_prepare_boot_cpu(void)
> {
> + set_my_cpu_offset(per_cpu_offset(smp_processor_id()));
> }
>
> void __init smp_prepare_cpus(unsigned int max_cpus)
> --
> 1.7.10.4
>
^ permalink raw reply [flat|nested] 35+ messages in thread
* [PATCH] ARM: implement optimized percpu variable access
2012-11-12 13:03 ` Rob Herring
2012-11-12 13:28 ` Will Deacon
@ 2012-11-27 17:29 ` Nicolas Pitre
1 sibling, 0 replies; 35+ messages in thread
From: Nicolas Pitre @ 2012-11-27 17:29 UTC (permalink / raw)
To: linux-arm-kernel
On Mon, 12 Nov 2012, Rob Herring wrote:
> On 11/12/2012 04:23 AM, Will Deacon wrote:
> > On Sun, Nov 11, 2012 at 03:20:40AM +0000, Rob Herring wrote:
> >> +static inline void set_my_cpu_offset(unsigned long off)
> >> +{
> >> + asm volatile("mcr p15, 0, %0, c13, c0, 4 @ set TPIDRPRW" : : "r" (off) : "cc" );
> >> +}
> >
> > You don't need the "cc" here.
>
> You would think so, but the compiler drops this instruction if you
> don't. set_cr does the same thing.
If set_cr does it, that's because of historical reasons. A long long
time ago, you could use predicates inside inline assembly statements.
The %? was needed to conditionally execute the included instructions in
case gcc had decided not to branch over your code. So you had to write:
asm volatile("mcr%? p15, 0, %0, c13, c0, 4" : : "r" (off));
Or, if you didn't want to bother with %?, you had to clobber the
condition code to force gcc to emit a branch over the inline assembly
block when it shouldn't have been executed.
This subtlety was the source of many bugs as most people didn't use %?
properly, so CC is always assumed to be clobbered by inline assembly
now.
Nicolas
^ permalink raw reply [flat|nested] 35+ messages in thread
* [PATCH] ARM: implement optimized percpu variable access
2012-11-26 15:15 ` Will Deacon
` (2 preceding siblings ...)
2012-11-27 1:02 ` Jamie Lokier
@ 2012-11-27 17:35 ` Nicolas Pitre
2012-11-27 19:27 ` Nicolas Pitre
3 siblings, 1 reply; 35+ messages in thread
From: Nicolas Pitre @ 2012-11-27 17:35 UTC (permalink / raw)
To: linux-arm-kernel
On Mon, 26 Nov 2012, Will Deacon wrote:
> On Mon, Nov 26, 2012 at 11:13:37AM +0000, Will Deacon wrote:
> > On Sun, Nov 25, 2012 at 06:46:55PM +0000, Rob Herring wrote:
> > > On 11/22/2012 05:34 AM, Will Deacon wrote:
> > > > As an aside, you also need to make the asm block volatile in
> > > > __my_cpu_offset -- I can see it being re-ordered before the set for
> > > > secondary CPUs otherwise.
> > >
> > > I don't really see where there would be a re-ordering issue. There's no
> > > percpu var access before or near the setting that I can see.
> >
> > The issue is on bringing up the secondary core, so I assumed that a lot
> > of inlining goes on inside secondary_start_kernel and then the result is
> > shuffled around, placing a cpu-offset read before we've done the set.
> >
> > Unfortunately, looking at the disassembly I can't see this happening at
> > all, so I'll keep digging. The good news is that I've just reproduced the
> > problem on the model, so I've got more visibility now (although both cores
> > are just stuck in spinlocks...).
>
> That was a fun bit of debugging -- my hunch was right, but I was looking in the
> wrong place because I had an unrelated problem with my bootloader.
>
> What happens is that every man and his dog is inlined into __schedule,
> including all the runqueue accessors, such as this_rq(), which make use of
> per-cpu offsets to get the correct pointer. The compiler then spits out
> something like this near the start of the function:
>
> c02c1d66: af04 add r7, sp, #16
> [...]
> c02c1d6c: ee1d 3f90 mrc 15, 0, r3, cr13, cr0, {4}
> c02c1d70: 199b adds r3, r3, r6
> c02c1d72: f8c7 e008 str.w lr, [r7, #8]
> c02c1d76: 617b str r3, [r7, #20]
> c02c1d78: 613e str r6, [r7, #16]
> c02c1d7a: 60fb str r3, [r7, #12]
>
> so the address of the current runqueue has been calculated and stored, with
> a bunch of other stuff, in a structure on the stack.
>
> We then do our context_switch dance (which is also inlined) and return as
> the next task (since we've done switch_{mm,to}) before doing:
>
> barrier();
> /*
> * this_rq must be evaluated again because prev may have moved
> * CPUs since it called schedule(), thus the 'rq' on its stack
> * frame will be invalid.
> */
> finish_task_switch(this_rq(), prev);
>
> The problem here is that, because our CPU accessors don't actually make any
> memory references, the barrier() has no effect and the old value is just
> reloaded off the stack:
>
> c02c1f22: f54a fe49 bl c000cbb8 <__switch_to>
> c02c1f26: 4601 mov r1, r0
> c02c1f28: 68f8 ldr r0, [r7, #12]
> c02c1f2a: f56f ffd5 bl c0031ed8 <finish_task_switch>
>
> which obviously causes complete chaos if the new task has been pulled from
> a different runqueue! (this appears as a double spin unlock on rq->lock).
>
> Fixing this without giving up the performance improvement we gain by *avoiding*
> the memory access in the first place is going to be tricky...
What about adding a memory constraint in the offset accessor to create a
dependency upon which the barrier will have an effect, but without
actually making any memory access?
Nicolas
^ permalink raw reply [flat|nested] 35+ messages in thread
* [PATCH] ARM: implement optimized percpu variable access
2012-11-27 17:35 ` Nicolas Pitre
@ 2012-11-27 19:27 ` Nicolas Pitre
0 siblings, 0 replies; 35+ messages in thread
From: Nicolas Pitre @ 2012-11-27 19:27 UTC (permalink / raw)
To: linux-arm-kernel
On Tue, 27 Nov 2012, Nicolas Pitre wrote:
> On Mon, 26 Nov 2012, Will Deacon wrote:
>
> > On Mon, Nov 26, 2012 at 11:13:37AM +0000, Will Deacon wrote:
> > > On Sun, Nov 25, 2012 at 06:46:55PM +0000, Rob Herring wrote:
> > > > On 11/22/2012 05:34 AM, Will Deacon wrote:
> > > > > As an aside, you also need to make the asm block volatile in
> > > > > __my_cpu_offset -- I can see it being re-ordered before the set for
> > > > > secondary CPUs otherwise.
> > > >
> > > > I don't really see where there would be a re-ordering issue. There's no
> > > > percpu var access before or near the setting that I can see.
> > >
> > > The issue is on bringing up the secondary core, so I assumed that a lot
> > > of inlining goes on inside secondary_start_kernel and then the result is
> > > shuffled around, placing a cpu-offset read before we've done the set.
> > >
> > > Unfortunately, looking at the disassembly I can't see this happening at
> > > all, so I'll keep digging. The good news is that I've just reproduced the
> > > problem on the model, so I've got more visibility now (although both cores
> > > are just stuck in spinlocks...).
> >
> > That was a fun bit of debugging -- my hunch was right, but I was looking in the
> > wrong place because I had an unrelated problem with my bootloader.
> >
> > What happens is that every man and his dog is inlined into __schedule,
> > including all the runqueue accessors, such as this_rq(), which make use of
> > per-cpu offsets to get the correct pointer. The compiler then spits out
> > something like this near the start of the function:
> >
> > c02c1d66: af04 add r7, sp, #16
> > [...]
> > c02c1d6c: ee1d 3f90 mrc 15, 0, r3, cr13, cr0, {4}
> > c02c1d70: 199b adds r3, r3, r6
> > c02c1d72: f8c7 e008 str.w lr, [r7, #8]
> > c02c1d76: 617b str r3, [r7, #20]
> > c02c1d78: 613e str r6, [r7, #16]
> > c02c1d7a: 60fb str r3, [r7, #12]
> >
> > so the address of the current runqueue has been calculated and stored, with
> > a bunch of other stuff, in a structure on the stack.
> >
> > We then do our context_switch dance (which is also inlined) and return as
> > the next task (since we've done switch_{mm,to}) before doing:
> >
> > barrier();
> > /*
> > * this_rq must be evaluated again because prev may have moved
> > * CPUs since it called schedule(), thus the 'rq' on its stack
> > * frame will be invalid.
> > */
> > finish_task_switch(this_rq(), prev);
> >
> > The problem here is that, because our CPU accessors don't actually make any
> > memory references, the barrier() has no effect and the old value is just
> > reloaded off the stack:
> >
> > c02c1f22: f54a fe49 bl c000cbb8 <__switch_to>
> > c02c1f26: 4601 mov r1, r0
> > c02c1f28: 68f8 ldr r0, [r7, #12]
> > c02c1f2a: f56f ffd5 bl c0031ed8 <finish_task_switch>
> >
> > which obviously causes complete chaos if the new task has been pulled from
> > a different runqueue! (this appears as a double spin unlock on rq->lock).
> >
> > Fixing this without giving up the performance improvement we gain by *avoiding*
> > the memory access in the first place is going to be tricky...
>
> What about adding a memory constraint in the offset accessor to create a
> dependency upon which the barrier will have an effect, but without
> actually making any memory access?
OK, whatever Jamie said. I see that he pushed the analysis much further
already.
Nicolas
^ permalink raw reply [flat|nested] 35+ messages in thread
* [PATCH] ARM: implement optimized percpu variable access
2012-11-27 17:19 ` Nicolas Pitre
@ 2012-11-27 19:37 ` Rob Herring
2012-11-27 20:42 ` Rob Herring
0 siblings, 1 reply; 35+ messages in thread
From: Rob Herring @ 2012-11-27 19:37 UTC (permalink / raw)
To: linux-arm-kernel
On 11/27/2012 11:19 AM, Nicolas Pitre wrote:
> On Sat, 10 Nov 2012, Rob Herring wrote:
>
>> From: Rob Herring <rob.herring@calxeda.com>
>>
>> Use the previously unused TPIDRPRW register to store percpu offsets.
>> TPIDRPRW is only accessible in PL1, so it can only be used in the kernel.
>>
>> This saves 2 loads for each percpu variable access which should yield
>> improved performance, but the improvement has not been quantified.
>>
>> Signed-off-by: Rob Herring <rob.herring@calxeda.com>
>
> I've just got around to wrap my brain around this patch and the
> discussion that ensued.
>
> Isn't your patch lacking the preserving and restoring of the TPIDRPRW in
> the suspend and resume paths.
Why yes, you are right. And I noticed the v6 save/restore is missing the
user thread ID for v6K as well. I haven't looked closer, but perhaps it
is never used.
Rob
>
>> ---
>> arch/arm/include/asm/Kbuild | 1 -
>> arch/arm/include/asm/percpu.h | 44 +++++++++++++++++++++++++++++++++++++++++
>> arch/arm/kernel/smp.c | 3 +++
>> 3 files changed, 47 insertions(+), 1 deletion(-)
>> create mode 100644 arch/arm/include/asm/percpu.h
>>
>> diff --git a/arch/arm/include/asm/Kbuild b/arch/arm/include/asm/Kbuild
>> index f70ae17..2ffdaac 100644
>> --- a/arch/arm/include/asm/Kbuild
>> +++ b/arch/arm/include/asm/Kbuild
>> @@ -16,7 +16,6 @@ generic-y += local64.h
>> generic-y += msgbuf.h
>> generic-y += param.h
>> generic-y += parport.h
>> -generic-y += percpu.h
>> generic-y += poll.h
>> generic-y += resource.h
>> generic-y += sections.h
>> diff --git a/arch/arm/include/asm/percpu.h b/arch/arm/include/asm/percpu.h
>> new file mode 100644
>> index 0000000..9eb7372
>> --- /dev/null
>> +++ b/arch/arm/include/asm/percpu.h
>> @@ -0,0 +1,44 @@
>> +/*
>> + * Copyright 2012 Calxeda, Inc.
>> + *
>> + * This program is free software; you can redistribute it and/or modify it
>> + * under the terms and conditions of the GNU General Public License,
>> + * version 2, as published by the Free Software Foundation.
>> + *
>> + * This program is distributed in the hope it will be useful, but WITHOUT
>> + * ANY WARRANTY; without even the implied warranty of MERCHANTABILITY or
>> + * FITNESS FOR A PARTICULAR PURPOSE. See the GNU General Public License for
>> + * more details.
>> + *
>> + * You should have received a copy of the GNU General Public License along with
>> + * this program. If not, see <http://www.gnu.org/licenses/>.
>> + */
>> +#ifndef _ASM_ARM_PERCPU_H_
>> +#define _ASM_ARM_PERCPU_H_
>> +
>> +/*
>> + * Same as asm-generic/percpu.h, except that we store the per cpu offset
>> + * in the TPIDRPRW.
>> + */
>> +#if defined(CONFIG_SMP) && (__LINUX_ARM_ARCH__ >= 6)
>> +
>> +static inline void set_my_cpu_offset(unsigned long off)
>> +{
>> + asm volatile("mcr p15, 0, %0, c13, c0, 4 @ set TPIDRPRW" : : "r" (off) : "cc" );
>> +}
>> +
>> +static inline unsigned long __my_cpu_offset(void)
>> +{
>> + unsigned long off;
>> + asm("mrc p15, 0, %0, c13, c0, 4 @ get TPIDRPRW" : "=r" (off) : );
>> + return off;
>> +}
>> +#define __my_cpu_offset __my_cpu_offset()
>> +#else
>> +#define set_my_cpu_offset(x) do {} while(0)
>> +
>> +#endif /* CONFIG_SMP */
>> +
>> +#include <asm-generic/percpu.h>
>> +
>> +#endif /* _ASM_ARM_PERCPU_H_ */
>> diff --git a/arch/arm/kernel/smp.c b/arch/arm/kernel/smp.c
>> index fbc8b26..897ef60 100644
>> --- a/arch/arm/kernel/smp.c
>> +++ b/arch/arm/kernel/smp.c
>> @@ -313,6 +313,8 @@ asmlinkage void __cpuinit secondary_start_kernel(void)
>> current->active_mm = mm;
>> cpumask_set_cpu(cpu, mm_cpumask(mm));
>>
>> + set_my_cpu_offset(per_cpu_offset(cpu));
>> +
>> printk("CPU%u: Booted secondary processor\n", cpu);
>>
>> cpu_init();
>> @@ -371,6 +373,7 @@ void __init smp_cpus_done(unsigned int max_cpus)
>>
>> void __init smp_prepare_boot_cpu(void)
>> {
>> + set_my_cpu_offset(per_cpu_offset(smp_processor_id()));
>> }
>>
>> void __init smp_prepare_cpus(unsigned int max_cpus)
>> --
>> 1.7.10.4
>>
^ permalink raw reply [flat|nested] 35+ messages in thread
* [PATCH] ARM: implement optimized percpu variable access
2012-11-27 19:37 ` Rob Herring
@ 2012-11-27 20:42 ` Rob Herring
2012-11-27 22:02 ` Nicolas Pitre
0 siblings, 1 reply; 35+ messages in thread
From: Rob Herring @ 2012-11-27 20:42 UTC (permalink / raw)
To: linux-arm-kernel
On 11/27/2012 01:37 PM, Rob Herring wrote:
> On 11/27/2012 11:19 AM, Nicolas Pitre wrote:
>> On Sat, 10 Nov 2012, Rob Herring wrote:
>>
>>> From: Rob Herring <rob.herring@calxeda.com>
>>>
>>> Use the previously unused TPIDRPRW register to store percpu offsets.
>>> TPIDRPRW is only accessible in PL1, so it can only be used in the kernel.
>>>
>>> This saves 2 loads for each percpu variable access which should yield
>>> improved performance, but the improvement has not been quantified.
>>>
>>> Signed-off-by: Rob Herring <rob.herring@calxeda.com>
>>
>> I've just got around to wrap my brain around this patch and the
>> discussion that ensued.
>>
>> Isn't your patch lacking the preserving and restoring of the TPIDRPRW in
>> the suspend and resume paths.
>
> Why yes, you are right. And I noticed the v6 save/restore is missing the
> user thread ID for v6K as well. I haven't looked closer, but perhaps it
> is never used.
Moving the setup from secondary_start_kernel to cpu_init will address
this. cpu_init is called for both secondary boot and resume of the boot
cpu. There is no need to save it; we can just reinitialize.
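A rough sketch of that shape (illustrative only, not the actual follow-up patch):

/* arch/arm/kernel/setup.c -- sketch */
void cpu_init(void)
{
        unsigned int cpu = smp_processor_id();

        /* Runs both on secondary boot and on resume of the boot CPU,
         * so the offset gets (re)initialised in both paths. */
        set_my_cpu_offset(per_cpu_offset(cpu));

        /* ... rest of the existing cpu_init() body ... */
}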
Rob
^ permalink raw reply [flat|nested] 35+ messages in thread
* [PATCH] ARM: implement optimized percpu variable access
2012-11-27 20:42 ` Rob Herring
@ 2012-11-27 22:02 ` Nicolas Pitre
0 siblings, 0 replies; 35+ messages in thread
From: Nicolas Pitre @ 2012-11-27 22:02 UTC (permalink / raw)
To: linux-arm-kernel
On Tue, 27 Nov 2012, Rob Herring wrote:
> On 11/27/2012 01:37 PM, Rob Herring wrote:
> > On 11/27/2012 11:19 AM, Nicolas Pitre wrote:
> >> On Sat, 10 Nov 2012, Rob Herring wrote:
> >>
> >>> From: Rob Herring <rob.herring@calxeda.com>
> >>>
> >>> Use the previously unused TPIDRPRW register to store percpu offsets.
> >>> TPIDRPRW is only accessible in PL1, so it can only be used in the kernel.
> >>>
> >>> This saves 2 loads for each percpu variable access which should yield
> >>> improved performance, but the improvement has not been quantified.
> >>>
> >>> Signed-off-by: Rob Herring <rob.herring@calxeda.com>
> >>
> >> I've just got around to wrap my brain around this patch and the
> >> discussion that ensued.
> >>
> >> Isn't your patch lacking the preserving and restoring of the TPIDRPRW in
> >> the suspend and resume paths.
> >
> > Why yes, you are right. And I noticed the v6 save/restore is missing the
> > user thread ID for v6K as well. I haven't looked closer, but perhaps it
> > is never used.
>
> Moving the setup from secondary_start_kernel to cpu_init will address
> this. cpu_init is called for both secondary boot and resume of the boot
> cpu. There is no need to save it, we can just reinitialize.
Hmmm. Agree on not having to save it. However, cpu_init is called a bit
late in secondary_start_kernel, and probably that ought to be moved
earlier. It seems to me that a lot of the stuff called prior to cpu_init has
the potential to use per-CPU variables somewhere.
Nicolas
^ permalink raw reply [flat|nested] 35+ messages in thread
* [PATCH] ARM: implement optimized percpu variable access
2012-11-27 1:02 ` Jamie Lokier
@ 2012-11-27 22:02 ` Rob Herring
2012-11-28 12:34 ` Will Deacon
1 sibling, 0 replies; 35+ messages in thread
From: Rob Herring @ 2012-11-27 22:02 UTC (permalink / raw)
To: linux-arm-kernel
On 11/26/2012 07:02 PM, Jamie Lokier wrote:
> Will Deacon wrote:
>> On Mon, Nov 26, 2012 at 11:13:37AM +0000, Will Deacon wrote:
>>> On Sun, Nov 25, 2012 at 06:46:55PM +0000, Rob Herring wrote:
>>>> On 11/22/2012 05:34 AM, Will Deacon wrote:
>>>>> As an aside, you also need to make the asm block volatile in
>>>>> __my_cpu_offset -- I can see it being re-ordered before the set for
>>>>> secondary CPUs otherwise.
>>>>
>>>> I don't really see where there would be a re-ordering issue. There's no
>>>> percpu var access before or near the setting that I can see.
>>>
>>> The issue is on bringing up the secondary core, so I assumed that a lot
>>> of inlining goes on inside secondary_start_kernel and then the result is
>>> shuffled around, placing a cpu-offset read before we've done the set.
>>>
>>> Unfortunately, looking at the disassembly I can't see this happening at
>>> all, so I'll keep digging. The good news is that I've just reproduced the
>>> problem on the model, so I've got more visibility now (although both cores
>>> are just stuck in spinlocks...).
>>
>> That was a fun bit of debugging -- my hunch was right,
>
> Yes it was.
>
> I'll sum up what I found looking at the x86 version.
Thanks for your analysis.
> Brief summary:
>
> 1. Due to preemption, it's not safe to cache per-CPU values within
> a function in general.
> 2. Except, if they are per-thread values (like current and
> current_thread_info) that don't depend on the CPU and just use
> per-CPU for efficient reading.
> 3. So, implement separate this_cpu_read_stable() and
> this_cpu_read() if you want GCC to cache certain values that are
> safe to cache.
> 4. Use asm volatile, not just an "m" constraint. I think x86 has a
> bug by using just "m" for this_cpu_read().
>
> Long version:
>
> - It's not really about the code in schedule(). It's about when
> context switch can happen. Which is every instruction in a
> preemptible context.
>
> - this_cpu (and __my_cpu_offset) need to reload the per-CPU offset
> each time. This is because per-CPU offset can change on every
> instruction in a preemptible context.
An access itself is not atomic, which means accesses to percpu variables
can only be done with preemption disabled.
If I'm reading the x86 versions of this_cpu_read correctly, they are a
single instruction and therefore don't need to disable preemption. I
don't think we could do that on ARM. This also means they don't have
barriers around the access (from preempt_enable and preempt_disable)
like the generic versions do.
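Roughly, the generic fallback looks like this (paraphrased sketch, not the exact header text): the access is bracketed by preempt_disable()/preempt_enable(), which also act as compiler barriers, so nothing is cached across it.

/* Paraphrase of the generic this_cpu_read() fallback path. */
#define this_cpu_read_generic_sketch(pcp)               \
({                                                      \
        typeof(pcp) ret__;                              \
        preempt_disable();                              \
        ret__ = *this_cpu_ptr(&(pcp));                  \
        preempt_enable();                               \
        ret__;                                          \
})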
>
> - For task-dependent values like current and current_thread_info(),
> x86 uses a variant called this_cpu_read_stable(). That is
> allowed to be cached by GCC across barrier() and in general.
>
> - Probably this_cpu_read_stable() ought to be available more
> generically across archs
This is probably irrelevant for this discussion. ARM uses the stack
pointer to get current like x86-32.
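For reference, a paraphrased sketch of how ARM derives it: current_thread_info() comes straight from the stack pointer, so neither current nor current_thread_info() involves the per-cpu offset at all.

static inline struct thread_info *current_thread_info(void)
{
        register unsigned long sp asm ("sp");
        return (struct thread_info *)(sp & ~(THREAD_SIZE - 1));
}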
> - That means interaction with barrier(), and with switch_to(),
> aren't relevant. There are two flavours: Safe to cache always,
> and safe to cache never. (If you count !CONFIG_PREEMPT maybe
> it's different, but that's tweaking.)
>
> - x86 __my_cpu_offset does a memory read, using an "m" asm
> constraint on a per-CPU variable (this_cpu_off). This makes it
> sensitive to barrier(), and otherwise cache the value. It's a
> nice trick, if it's safe.
>
> - That seems like a nice trick, allowing some reads to be reused
> between instructions. It can be replicated even on other archs,
> using an "m" constraint on a dummy extern variable (no need to
> actually read anything!). But I'm not convinced this isn't a bug
> on x86. Consider this code:
>
> #include <linux/percpu.h>
> #include <linux/preempt.h>
> #include <linux/spinlock.h>
>
> DEFINE_PER_CPU(int, myvar1) = 0;
> DEFINE_PER_CPU(int, myvar2) = 0;
>
> static spinlock_t mylock = SPIN_LOCK_UNLOCKED;
>
> int mytest(void)
> {
> unsigned long flags;
> int x, y, z;
>
> x = this_cpu_read(myvar1);
>
> spin_lock_irqsave(&mylock, flags);
>
> /*
> * These two values should be consistent due to irqsave: No
> * preemption, no interrupts. But on x86, GCC can reuse the x
> * above for the value of y. If preemption happened before the
> * irqsave, y and z are not consistent.
> */
> y = this_cpu_read(myvar1);
> z = this_cpu_read(myvar2);
> spin_unlock_irqrestore(&mylock, flags);
>
> return y + z;
> }
>
> I think the optimisation behaviour of this_cpu_read() is unsafe on x86
> for the reason in the comment above. I have tested with GCC 4.5.2 on
> x86-32 and it really does what the comment says.
I find that the first read into x is just discarded. If I make x part of the
return value (y + z + x), then the load happens.
I've tried a variety of builds with PREEMPT and !PREEMPT and gcc seems
to honor the compiler barriers even without an "m" constraint. With
PREEMPT, I get a load (mrc) for each access. With !PREEMPT, I get a
single load for all accesses.
Rob
>
> Curiously all uses of __this_cpu_ptr() are fine: that asm uses
> volatile. I think this_cpu_read() also needs the volatile; but not
> this_cpu_read_stable().
>
> It's rather subtle and I wouldn't be surprised if I'm mistaken.
>
> Best,
> -- Jamie
>
^ permalink raw reply [flat|nested] 35+ messages in thread
* [PATCH] ARM: implement optimized percpu variable access
2012-11-27 1:02 ` Jamie Lokier
2012-11-27 22:02 ` Rob Herring
@ 2012-11-28 12:34 ` Will Deacon
1 sibling, 0 replies; 35+ messages in thread
From: Will Deacon @ 2012-11-28 12:34 UTC (permalink / raw)
To: linux-arm-kernel
Hi Jamie,
On Tue, Nov 27, 2012 at 01:02:03AM +0000, Jamie Lokier wrote:
> Will Deacon wrote:
> > That was a fun bit of debugging -- my hunch was right,
>
> Yes it was.
>
> I'll sum up what I found looking at the x86 version.
> Brief summary:
>
> 1. Due to preemption, it's not safe to cache per-CPU values within
> a function in general.
> 2. Except, if they are per-thread values (like current and
> current_thread_info) that don't depend on the CPU and just use
> per-CPU for efficient reading.
> 3. So, implement separate this_cpu_read_stable() and
> this_cpu_read() if you want GCC to cache certain values that are
> safe to cache.
> 4. Use asm volatile, not just an "m" constraint. I think x86 has a
> bug by using just "m" for this_cpu_read().
>
> Long version:
>
> - It's not really about the code in schedule(). It's about when
> context switch can happen. Which is every instruction in a
> preemptible context.
I disagree. You don't get magically pre-empted: control must pass through
context_switch to put you on a runqueue and choose a different task. With
PREEMPT, this can happen in response to an interrupt but the state of the
interrupted context is still correctly saved/restored via switch_to and
friends.
If a function accesses per-cpu data when preemptible(), then it must be
prepared to handle the pointer being incorrect (and this does seem to be
used: see slab_alloc_node, called during kmalloc, for example). So actually,
the only case of note *is* __schedule. Why? Because preemption is disabled,
but half of the function may appear to execute on one CPU and the other half
(i.e. once the task has been rescheduled from a different runqueue) may
execute on a different CPU.
The solution is to make the per-cpu offset reader hazard with
context-switch/barrier(). I tried Nico's suggestion of adding a memory
clobber and it seems to work pretty well, without any noticeable degradation
in quality of the generated code that I could spot:
diff --git a/arch/arm/include/asm/percpu.h b/arch/arm/include/asm/percpu.h
index 9c8d051..2e58a1d 100644
--- a/arch/arm/include/asm/percpu.h
+++ b/arch/arm/include/asm/percpu.h
@@ -24,13 +24,15 @@
static inline void set_my_cpu_offset(unsigned long off)
{
- asm volatile("mcr p15, 0, %0, c13, c0, 4 @ set TPIDRPRW" : : "r" (off) );
+ /* Set TPIDRPRW */
+ asm volatile("mcr p15, 0, %0, c13, c0, 4" : : "r" (off) : "memory");
}
static inline unsigned long __my_cpu_offset(void)
{
unsigned long off;
- asm("mrc p15, 0, %0, c13, c0, 4 @ get TPIDRPRW" : "=r" (off) : );
+ /* Read TPIDRPRW */
+ asm("mrc p15, 0, %0, c13, c0, 4" : "=r" (off) : : "memory");
return off;
}
#define __my_cpu_offset __my_cpu_offset()
> - That seems like a nice trick, allowing some reads to be reused
> between instructions. It can be replicated even on other archs,
> using an "m" constraint on a dummy extern variable (no need to
> actually read anything!). But I'm not convinced this isn't a bug
> on x86. Consider this code:
The "m" constraint probably isn't what we want. Firstly, it will cause GCC
to emit instructions to calculate the address of whatever the dummy extern
variable is (probably a PC-relative load to get its address) and secondly,
GCC may generate a post-increment/decrement addressing mode with the
assumption that it will be evaluated exactly once in the asm block. You
could use "o", but you still have to load the address.
> #include <linux/percpu.h>
> #include <linux/preempt.h>
> #include <linux/spinlock.h>
>
> DEFINE_PER_CPU(int, myvar1) = 0;
> DEFINE_PER_CPU(int, myvar2) = 0;
>
> static spinlock_t mylock = SPIN_LOCK_UNLOCKED;
Guessing this should be raw?
> int mytest(void)
> {
> unsigned long flags;
> int x, y, z;
>
> x = this_cpu_read(myvar1);
If it's important that the CPU doesn't change, this read into x should happen
inside the critical section.
> spin_lock_irqsave(&mylock, flags);
>
> /*
> * These two values should be consistent due to irqsave: No
> * preemption, no interrupts. But on x86, GCC can reuse the x
> * above for the value of y. If preemption happened before the
> * irqsave, y and z are not consistent.
> */
> y = this_cpu_read(myvar1);
> z = this_cpu_read(myvar2);
> spin_unlock_irqrestore(&mylock, flags);
>
> return y + z;
> }
Will
^ permalink raw reply related [flat|nested] 35+ messages in thread
end of thread, other threads:[~2012-11-28 12:34 UTC | newest]
Thread overview: 35+ messages
2012-11-11 3:20 [PATCH] ARM: implement optimized percpu variable access Rob Herring
2012-11-12 10:23 ` Will Deacon
2012-11-12 13:03 ` Rob Herring
2012-11-12 13:28 ` Will Deacon
2012-11-12 14:03 ` Rob Herring
2012-11-27 17:29 ` Nicolas Pitre
2012-11-12 14:21 ` Rob Herring
2012-11-12 14:41 ` Will Deacon
2012-11-12 16:51 ` Will Deacon
2012-11-12 21:01 ` Rob Herring
2012-11-13 10:40 ` Will Deacon
2012-11-22 11:34 ` Will Deacon
2012-11-22 11:39 ` Russell King - ARM Linux
2012-11-23 17:06 ` Rob Herring
2012-11-23 17:12 ` Russell King - ARM Linux
2012-11-23 17:16 ` Will Deacon
2012-11-23 20:34 ` Tony Lindgren
2012-11-23 20:32 ` Tony Lindgren
2012-11-25 18:46 ` Rob Herring
2012-11-26 11:13 ` Will Deacon
2012-11-26 15:15 ` Will Deacon
2012-11-26 17:30 ` Rob Herring
2012-11-27 13:17 ` Will Deacon
2012-11-27 13:26 ` Russell King - ARM Linux
2012-11-26 21:58 ` Jamie Lokier
2012-11-26 23:50 ` Jamie Lokier
2012-11-27 1:02 ` Jamie Lokier
2012-11-27 22:02 ` Rob Herring
2012-11-28 12:34 ` Will Deacon
2012-11-27 17:35 ` Nicolas Pitre
2012-11-27 19:27 ` Nicolas Pitre
2012-11-27 17:19 ` Nicolas Pitre
2012-11-27 19:37 ` Rob Herring
2012-11-27 20:42 ` Rob Herring
2012-11-27 22:02 ` Nicolas Pitre