* [PATCH v2 01/16] ARM: introduce common set_auxcr/get_auxcr functions
2013-01-24 6:27 [PATCH v2 00/16] low-level CPU and cluster power management Nicolas Pitre
@ 2013-01-24 6:27 ` Nicolas Pitre
2013-01-28 14:39 ` Will Deacon
2013-01-24 6:27 ` [PATCH v2 02/16] ARM: b.L: secondary kernel entry code Nicolas Pitre
` (14 subsequent siblings)
15 siblings, 1 reply; 25+ messages in thread
From: Nicolas Pitre @ 2013-01-24 6:27 UTC (permalink / raw)
To: linux-arm-kernel
From: Rob Herring <rob.herring@calxeda.com>
Move the private set_auxcr/get_auxcr functions from
drivers/cpuidle/cpuidle-calxeda.c so they can be used across platforms.
Signed-off-by: Rob Herring <rob.herring@calxeda.com>
Cc: Russell King <linux@arm.linux.org.uk>
Acked-by: Tony Lindgren <tony@atomide.com>
Signed-off-by: Nicolas Pitre <nico@linaro.org>
---
arch/arm/include/asm/cp15.h | 14 ++++++++++++++
drivers/cpuidle/cpuidle-calxeda.c | 14 --------------
2 files changed, 14 insertions(+), 14 deletions(-)
diff --git a/arch/arm/include/asm/cp15.h b/arch/arm/include/asm/cp15.h
index 5ef4d8015a..ef0094abf2 100644
--- a/arch/arm/include/asm/cp15.h
+++ b/arch/arm/include/asm/cp15.h
@@ -59,6 +59,20 @@ static inline void set_cr(unsigned int val)
isb();
}
+static inline unsigned int get_auxcr(void)
+{
+ unsigned int val;
+ asm("mrc p15, 0, %0, c1, c0, 1 @ get AUXCR" : "=r" (val) : : "cc");
+ return val;
+}
+
+static inline void set_auxcr(unsigned int val)
+{
+ asm volatile("mcr p15, 0, %0, c1, c0, 1 @ set AUXCR"
+ : : "r" (val) : "cc");
+ isb();
+}
+
#ifndef CONFIG_SMP
extern void adjust_cr(unsigned long mask, unsigned long set);
#endif
diff --git a/drivers/cpuidle/cpuidle-calxeda.c b/drivers/cpuidle/cpuidle-calxeda.c
index e1aab38c5a..ece83d6e04 100644
--- a/drivers/cpuidle/cpuidle-calxeda.c
+++ b/drivers/cpuidle/cpuidle-calxeda.c
@@ -37,20 +37,6 @@ extern void *scu_base_addr;
static struct cpuidle_device __percpu *calxeda_idle_cpuidle_devices;
-static inline unsigned int get_auxcr(void)
-{
- unsigned int val;
- asm("mrc p15, 0, %0, c1, c0, 1 @ get AUXCR" : "=r" (val) : : "cc");
- return val;
-}
-
-static inline void set_auxcr(unsigned int val)
-{
- asm volatile("mcr p15, 0, %0, c1, c0, 1 @ set AUXCR"
- : : "r" (val) : "cc");
- isb();
-}
-
static noinline void calxeda_idle_restore(void)
{
set_cr(get_cr() | CR_C);
--
1.8.0
* [PATCH v2 01/16] ARM: introduce common set_auxcr/get_auxcr functions
2013-01-24 6:27 ` [PATCH v2 01/16] ARM: introduce common set_auxcr/get_auxcr functions Nicolas Pitre
@ 2013-01-28 14:39 ` Will Deacon
2013-01-28 15:23 ` Nicolas Pitre
0 siblings, 1 reply; 25+ messages in thread
From: Will Deacon @ 2013-01-28 14:39 UTC (permalink / raw)
To: linux-arm-kernel
On Thu, Jan 24, 2013 at 06:27:44AM +0000, Nicolas Pitre wrote:
> From: Rob Herring <rob.herring@calxeda.com>
>
> Move the private set_auxcr/get_auxcr functions from
> drivers/cpuidle/cpuidle-calxeda.c so they can be used across platforms.
>
> Signed-off-by: Rob Herring <rob.herring@calxeda.com>
> Cc: Russell King <linux@arm.linux.org.uk>
> Acked-by: Tony Lindgren <tony@atomide.com>
> Signed-off-by: Nicolas Pitre <nico@linaro.org>
> ---
> arch/arm/include/asm/cp15.h | 14 ++++++++++++++
> drivers/cpuidle/cpuidle-calxeda.c | 14 --------------
> 2 files changed, 14 insertions(+), 14 deletions(-)
>
> diff --git a/arch/arm/include/asm/cp15.h b/arch/arm/include/asm/cp15.h
> index 5ef4d8015a..ef0094abf2 100644
> --- a/arch/arm/include/asm/cp15.h
> +++ b/arch/arm/include/asm/cp15.h
> @@ -59,6 +59,20 @@ static inline void set_cr(unsigned int val)
> isb();
> }
>
> +static inline unsigned int get_auxcr(void)
> +{
> + unsigned int val;
> + asm("mrc p15, 0, %0, c1, c0, 1 @ get AUXCR" : "=r" (val) : : "cc");
> + return val;
> +}
> +
> +static inline void set_auxcr(unsigned int val)
> +{
> + asm volatile("mcr p15, 0, %0, c1, c0, 1 @ set AUXCR"
> + : : "r" (val) : "cc");
> + isb();
> +}
Oh no! It's the return of the magic "cc" clobber! Could we have an extra
patch to remove those please (since this is just a move)?
Will
* [PATCH v2 01/16] ARM: introduce common set_auxcr/get_auxcr functions
2013-01-28 14:39 ` Will Deacon
@ 2013-01-28 15:23 ` Nicolas Pitre
0 siblings, 0 replies; 25+ messages in thread
From: Nicolas Pitre @ 2013-01-28 15:23 UTC (permalink / raw)
To: linux-arm-kernel
On Mon, 28 Jan 2013, Will Deacon wrote:
> On Thu, Jan 24, 2013 at 06:27:44AM +0000, Nicolas Pitre wrote:
> > From: Rob Herring <rob.herring@calxeda.com>
> >
> > Move the private set_auxcr/get_auxcr functions from
> > drivers/cpuidle/cpuidle-calxeda.c so they can be used across platforms.
> >
> > Signed-off-by: Rob Herring <rob.herring@calxeda.com>
> > Cc: Russell King <linux@arm.linux.org.uk>
> > Acked-by: Tony Lindgren <tony@atomide.com>
> > Signed-off-by: Nicolas Pitre <nico@linaro.org>
> > ---
> > arch/arm/include/asm/cp15.h | 14 ++++++++++++++
> > drivers/cpuidle/cpuidle-calxeda.c | 14 --------------
> > 2 files changed, 14 insertions(+), 14 deletions(-)
> >
> > diff --git a/arch/arm/include/asm/cp15.h b/arch/arm/include/asm/cp15.h
> > index 5ef4d8015a..ef0094abf2 100644
> > --- a/arch/arm/include/asm/cp15.h
> > +++ b/arch/arm/include/asm/cp15.h
> > @@ -59,6 +59,20 @@ static inline void set_cr(unsigned int val)
> > isb();
> > }
> >
> > +static inline unsigned int get_auxcr(void)
> > +{
> > + unsigned int val;
> > + asm("mrc p15, 0, %0, c1, c0, 1 @ get AUXCR" : "=r" (val) : : "cc");
> > + return val;
> > +}
> > +
> > +static inline void set_auxcr(unsigned int val)
> > +{
> > + asm volatile("mcr p15, 0, %0, c1, c0, 1 @ set AUXCR"
> > + : : "r" (val) : "cc");
> > + isb();
> > +}
>
> Oh no! It's the return of the magic "cc" clobber! Could we have an extra
> patch to remove those please (since this is just a move)?
I've removed the CC clobber from my copy of this patch.
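For the record, the moved helpers in my copy now read as follows (the same
code as quoted above, simply without the "cc" clobber):

	static inline unsigned int get_auxcr(void)
	{
		unsigned int val;
		asm("mrc p15, 0, %0, c1, c0, 1	@ get AUXCR" : "=r" (val));
		return val;
	}

	static inline void set_auxcr(unsigned int val)
	{
		asm volatile("mcr p15, 0, %0, c1, c0, 1	@ set AUXCR"
		  : : "r" (val));
		isb();
	}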
Nicolas
* [PATCH v2 02/16] ARM: b.L: secondary kernel entry code
2013-01-24 6:27 [PATCH v2 00/16] low-level CPU and cluster power management Nicolas Pitre
2013-01-24 6:27 ` [PATCH v2 01/16] ARM: introduce common set_auxcr/get_auxcr functions Nicolas Pitre
@ 2013-01-24 6:27 ` Nicolas Pitre
2013-01-28 14:46 ` Will Deacon
2013-01-24 6:27 ` [PATCH v2 03/16] ARM: b.L: introduce the CPU/cluster power API Nicolas Pitre
` (13 subsequent siblings)
15 siblings, 1 reply; 25+ messages in thread
From: Nicolas Pitre @ 2013-01-24 6:27 UTC (permalink / raw)
To: linux-arm-kernel
CPUs in big.LITTLE systems have special needs when entering the kernel
due to a hotplug event, or when resuming from a deep sleep mode.
This is vectorized so multiple CPUs can enter the kernel in parallel
without serialization.
Only the basic structure is introduced here. This will be extended
later.
TODO: MPIDR based indexing should eventually be made runtime adjusted.
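For illustration, the intended use from platform code looks roughly like
this (sketch only; the my_platform_* names are made up and not part of this
patch):

	#include <asm/bL_entry.h>
	#include <asm/memory.h>		/* virt_to_phys() */

	/* assumed platform-specific hooks, for illustration only: */
	extern void my_platform_secondary_resume(void);
	extern void my_platform_release_from_reset(unsigned int cpu,
						   unsigned int cluster,
						   unsigned long entry_phys);

	static void my_platform_power_up_cpu(unsigned int cpu, unsigned int cluster)
	{
		/* where the CPU should branch once bL_entry_point releases it: */
		bL_set_entry_vector(cpu, cluster, my_platform_secondary_resume);

		/*
		 * Platform-specific: make the CPU start executing at the
		 * physical address of bL_entry_point when it leaves reset.
		 */
		my_platform_release_from_reset(cpu, cluster,
					       virt_to_phys(bL_entry_point));
	}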
Signed-off-by: Nicolas Pitre <nico@linaro.org>
---
arch/arm/Kconfig | 6 +++
arch/arm/common/Makefile | 1 +
arch/arm/common/bL_entry.c | 29 +++++++++++++++
arch/arm/common/bL_head.S | 81 +++++++++++++++++++++++++++++++++++++++++
arch/arm/include/asm/bL_entry.h | 35 ++++++++++++++++++
5 files changed, 152 insertions(+)
create mode 100644 arch/arm/common/bL_entry.c
create mode 100644 arch/arm/common/bL_head.S
create mode 100644 arch/arm/include/asm/bL_entry.h
diff --git a/arch/arm/Kconfig b/arch/arm/Kconfig
index 67874b82a4..3dd5591c79 100644
--- a/arch/arm/Kconfig
+++ b/arch/arm/Kconfig
@@ -1584,6 +1584,12 @@ config HAVE_ARM_TWD
help
This options enables support for the ARM timer and watchdog unit
+config BIG_LITTLE
+ bool "big.LITTLE support"
+ depends on CPU_V7 && SMP
+ help
+ This option enables support for the big.LITTLE architecture.
+
choice
prompt "Memory split"
default VMSPLIT_3G
diff --git a/arch/arm/common/Makefile b/arch/arm/common/Makefile
index e8a4e58f1b..8025899a20 100644
--- a/arch/arm/common/Makefile
+++ b/arch/arm/common/Makefile
@@ -13,3 +13,4 @@ obj-$(CONFIG_SHARP_PARAM) += sharpsl_param.o
obj-$(CONFIG_SHARP_SCOOP) += scoop.o
obj-$(CONFIG_PCI_HOST_ITE8152) += it8152.o
obj-$(CONFIG_ARM_TIMER_SP804) += timer-sp.o
+obj-$(CONFIG_BIG_LITTLE) += bL_head.o bL_entry.o
diff --git a/arch/arm/common/bL_entry.c b/arch/arm/common/bL_entry.c
new file mode 100644
index 0000000000..4e1044612d
--- /dev/null
+++ b/arch/arm/common/bL_entry.c
@@ -0,0 +1,29 @@
+/*
+ * arch/arm/common/bL_entry.c -- big.LITTLE kernel re-entry point
+ *
+ * Created by: Nicolas Pitre, March 2012
+ * Copyright: (C) 2012-2013 Linaro Limited
+ *
+ * This program is free software; you can redistribute it and/or modify
+ * it under the terms of the GNU General Public License version 2 as
+ * published by the Free Software Foundation.
+ */
+
+#include <linux/kernel.h>
+#include <linux/init.h>
+
+#include <asm/bL_entry.h>
+#include <asm/barrier.h>
+#include <asm/proc-fns.h>
+#include <asm/cacheflush.h>
+
+extern volatile unsigned long bL_entry_vectors[BL_MAX_CLUSTERS][BL_MAX_CPUS_PER_CLUSTER];
+
+void bL_set_entry_vector(unsigned cpu, unsigned cluster, void *ptr)
+{
+ unsigned long val = ptr ? virt_to_phys(ptr) : 0;
+ bL_entry_vectors[cluster][cpu] = val;
+ __cpuc_flush_dcache_area((void *)&bL_entry_vectors[cluster][cpu], 4);
+ outer_clean_range(__pa(&bL_entry_vectors[cluster][cpu]),
+ __pa(&bL_entry_vectors[cluster][cpu + 1]));
+}
diff --git a/arch/arm/common/bL_head.S b/arch/arm/common/bL_head.S
new file mode 100644
index 0000000000..072a13da20
--- /dev/null
+++ b/arch/arm/common/bL_head.S
@@ -0,0 +1,81 @@
+/*
+ * arch/arm/common/bL_head.S -- big.LITTLE kernel re-entry point
+ *
+ * Created by: Nicolas Pitre, March 2012
+ * Copyright: (C) 2012-2013 Linaro Limited
+ *
+ * This program is free software; you can redistribute it and/or modify
+ * it under the terms of the GNU General Public License version 2 as
+ * published by the Free Software Foundation.
+ */
+
+#include <linux/linkage.h>
+#include <asm/bL_entry.h>
+
+ .macro pr_dbg cpu, string
+#if defined(CONFIG_DEBUG_LL) && defined(DEBUG)
+ b 1901f
+1902: .ascii "CPU 0: \0CPU 1: \0CPU 2: \0CPU 3: \0"
+ .ascii "CPU 4: \0CPU 5: \0CPU 6: \0CPU 7: \0"
+1903: .asciz "\string"
+ .align
+1901: adr r0, 1902b
+ add r0, r0, \cpu, lsl #3
+ bl printascii
+ adr r0, 1903b
+ bl printascii
+#endif
+ .endm
+
+ .arm
+ .align
+
+ENTRY(bL_entry_point)
+
+ THUMB( adr r12, BSYM(1f) )
+ THUMB( bx r12 )
+ THUMB( .thumb )
+1:
+ mrc p15, 0, r0, c0, c0, 5 @ MPIDR
+ ubfx r9, r0, #0, #4 @ r9 = cpu
+ ubfx r10, r0, #8, #4 @ r10 = cluster
+ mov r3, #BL_MAX_CPUS_PER_CLUSTER
+ mla r4, r3, r10, r9 @ r4 = canonical CPU index
+ cmp r4, #(BL_MAX_CPUS_PER_CLUSTER * BL_MAX_CLUSTERS)
+ blo 2f
+
+ /* We didn't expect this CPU. Try to cheaply make it quiet. */
+1: wfi
+ wfe
+ b 1b
+
+2: pr_dbg r4, "kernel bL_entry_point\n"
+
+ /*
+ * MMU is off so we need to get to bL_entry_vectors in a
+ * position independent way.
+ */
+ adr r5, 3f
+ ldr r6, [r5]
+ add r6, r5, r6 @ r6 = bL_entry_vectors
+
+bL_entry_gated:
+ ldr r5, [r6, r4, lsl #2] @ r5 = CPU entry vector
+ cmp r5, #0
+ wfeeq
+ beq bL_entry_gated
+ pr_dbg r4, "released\n"
+ bx r5
+
+ .align 2
+
+3: .word bL_entry_vectors - .
+
+ENDPROC(bL_entry_point)
+
+ .bss
+ .align 5
+
+ .type bL_entry_vectors, #object
+ENTRY(bL_entry_vectors)
+ .space 4 * BL_MAX_CLUSTERS * BL_MAX_CPUS_PER_CLUSTER
diff --git a/arch/arm/include/asm/bL_entry.h b/arch/arm/include/asm/bL_entry.h
new file mode 100644
index 0000000000..7525614243
--- /dev/null
+++ b/arch/arm/include/asm/bL_entry.h
@@ -0,0 +1,35 @@
+/*
+ * arch/arm/include/asm/bL_entry.h
+ *
+ * Created by: Nicolas Pitre, April 2012
+ * Copyright: (C) 2012-2013 Linaro Limited
+ *
+ * This program is free software; you can redistribute it and/or modify
+ * it under the terms of the GNU General Public License version 2 as
+ * published by the Free Software Foundation.
+ */
+
+#ifndef BL_ENTRY_H
+#define BL_ENTRY_H
+
+#define BL_MAX_CPUS_PER_CLUSTER 4
+#define BL_MAX_CLUSTERS 2
+
+#ifndef __ASSEMBLY__
+
+/*
+ * Platform specific code should use this symbol to set up secondary
+ * entry location for processors to use when released from reset.
+ */
+extern void bL_entry_point(void);
+
+/*
+ * This is used to indicate where the given CPU from given cluster should
+ * branch once it is ready to re-enter the kernel using ptr, or NULL if it
+ * should be gated. A gated CPU is held in a WFE loop until its vector
+ * becomes non NULL.
+ */
+void bL_set_entry_vector(unsigned cpu, unsigned cluster, void *ptr);
+
+#endif /* ! __ASSEMBLY__ */
+#endif
--
1.8.0
* [PATCH v2 02/16] ARM: b.L: secondary kernel entry code
2013-01-24 6:27 ` [PATCH v2 02/16] ARM: b.L: secondary kernel entry code Nicolas Pitre
@ 2013-01-28 14:46 ` Will Deacon
2013-01-28 15:07 ` Nicolas Pitre
0 siblings, 1 reply; 25+ messages in thread
From: Will Deacon @ 2013-01-28 14:46 UTC (permalink / raw)
To: linux-arm-kernel
Hi Nicolas,
On Thu, Jan 24, 2013 at 06:27:45AM +0000, Nicolas Pitre wrote:
> CPUs in a big.LITTLE systems have special needs when entering the kernel
> due to a hotplug event, or when resuming from a deep sleep mode.
>
> This is vectorized so multiple CPUs can enter the kernel in parallel
> without serialization.
>
> Only the basic structure is introduced here. This will be extended
> later.
>
> TODO: MPIDR based indexing should eventually be made runtime adjusted.
What's your plan for this TODO? Do you aim to merge the code first and add
that later? If so, maybe add a TODO comment in the code as well?
> diff --git a/arch/arm/common/bL_entry.c b/arch/arm/common/bL_entry.c
> new file mode 100644
> index 0000000000..4e1044612d
> --- /dev/null
> +++ b/arch/arm/common/bL_entry.c
> @@ -0,0 +1,29 @@
> +/*
> + * arch/arm/common/bL_entry.c -- big.LITTLE kernel re-entry point
> + *
> + * Created by: Nicolas Pitre, March 2012
> + * Copyright: (C) 2012-2013 Linaro Limited
> + *
> + * This program is free software; you can redistribute it and/or modify
> + * it under the terms of the GNU General Public License version 2 as
> + * published by the Free Software Foundation.
> + */
> +
> +#include <linux/kernel.h>
> +#include <linux/init.h>
> +
> +#include <asm/bL_entry.h>
Similarly with the naming... was there any consensus to replace bL_ with
something else? I personally find the capitalisation pretty jarring and at
odds with the rest of the kernel, but "bl" is branch-and-link so that's also
not much better.
> diff --git a/arch/arm/include/asm/bL_entry.h b/arch/arm/include/asm/bL_entry.h
> new file mode 100644
> index 0000000000..7525614243
> --- /dev/null
> +++ b/arch/arm/include/asm/bL_entry.h
> @@ -0,0 +1,35 @@
> +/*
> + * arch/arm/include/asm/bL_entry.h
> + *
> + * Created by: Nicolas Pitre, April 2012
> + * Copyright: (C) 2012-2013 Linaro Limited
> + *
> + * This program is free software; you can redistribute it and/or modify
> + * it under the terms of the GNU General Public License version 2 as
> + * published by the Free Software Foundation.
> + */
> +
> +#ifndef BL_ENTRY_H
> +#define BL_ENTRY_H
> +
> +#define BL_MAX_CPUS_PER_CLUSTER 4
> +#define BL_MAX_CLUSTERS 2
Again, do you have any ideas/plans on how to remove these constant limits?
Cheers,
Will
* [PATCH v2 02/16] ARM: b.L: secondary kernel entry code
2013-01-28 14:46 ` Will Deacon
@ 2013-01-28 15:07 ` Nicolas Pitre
0 siblings, 0 replies; 25+ messages in thread
From: Nicolas Pitre @ 2013-01-28 15:07 UTC (permalink / raw)
To: linux-arm-kernel
On Mon, 28 Jan 2013, Will Deacon wrote:
> Hi Nicolas,
>
> On Thu, Jan 24, 2013 at 06:27:45AM +0000, Nicolas Pitre wrote:
> > CPUs in a big.LITTLE systems have special needs when entering the kernel
> > due to a hotplug event, or when resuming from a deep sleep mode.
> >
> > This is vectorized so multiple CPUs can enter the kernel in parallel
> > without serialization.
> >
> > Only the basic structure is introduced here. This will be extended
> > later.
> >
> > TODO: MPIDR based indexing should eventually be made runtime adjusted.
>
> What's your plan for this TODO? Do you aim to merge the code first and add
> that later? If so, maybe add a TODO comment in the code as well?
That should come later, as this is probably going to be a non-trivial
task. We might also decide that the current code is good enough for now,
and that is likely to be the case for a few years.
> > diff --git a/arch/arm/common/bL_entry.c b/arch/arm/common/bL_entry.c
> > new file mode 100644
> > index 0000000000..4e1044612d
> > --- /dev/null
> > +++ b/arch/arm/common/bL_entry.c
> > @@ -0,0 +1,29 @@
> > +/*
> > + * arch/arm/common/bL_entry.c -- big.LITTLE kernel re-entry point
> > + *
> > + * Created by: Nicolas Pitre, March 2012
> > + * Copyright: (C) 2012-2013 Linaro Limited
> > + *
> > + * This program is free software; you can redistribute it and/or modify
> > + * it under the terms of the GNU General Public License version 2 as
> > + * published by the Free Software Foundation.
> > + */
> > +
> > +#include <linux/kernel.h>
> > +#include <linux/init.h>
> > +
> > +#include <asm/bL_entry.h>
>
> Similarly with the naming... was there any consensus to replace bL_ with
> something else? I personally find the capitalisation pretty jarring and at
> odds with the rest of the kernel, but "bl" is branch-and-link so that's also
> not much better.
I really find the capitalisation incredibly nice and to the point.
There is no better way to unambiguously refer to "big.LITTLE" with a
2-letter prefix.
But as I said, the naming is something I want to change to make this
code appear more generic as there is nothing really b.L specific here.
That is a trivial change, once I'm settled on something.
> > diff --git a/arch/arm/include/asm/bL_entry.h b/arch/arm/include/asm/bL_entry.h
> > new file mode 100644
> > index 0000000000..7525614243
> > --- /dev/null
> > +++ b/arch/arm/include/asm/bL_entry.h
> > @@ -0,0 +1,35 @@
> > +/*
> > + * arch/arm/include/asm/bL_entry.h
> > + *
> > + * Created by: Nicolas Pitre, April 2012
> > + * Copyright: (C) 2012-2013 Linaro Limited
> > + *
> > + * This program is free software; you can redistribute it and/or modify
> > + * it under the terms of the GNU General Public License version 2 as
> > + * published by the Free Software Foundation.
> > + */
> > +
> > +#ifndef BL_ENTRY_H
> > +#define BL_ENTRY_H
> > +
> > +#define BL_MAX_CPUS_PER_CLUSTER 4
> > +#define BL_MAX_CLUSTERS 2
>
> Again, do you have any ideas/plans on how to remove these constant limits?
Not for the initial merge. Again, this should serve the systems to come
in the next few years just fine. That gives us plenty of time to
enhance this code with proper dynamic allocation.
Nicolas
* [PATCH v2 03/16] ARM: b.L: introduce the CPU/cluster power API
2013-01-24 6:27 [PATCH v2 00/16] low-level CPU and cluster power management Nicolas Pitre
2013-01-24 6:27 ` [PATCH v2 01/16] ARM: introduce common set_auxcr/get_auxcr functions Nicolas Pitre
2013-01-24 6:27 ` [PATCH v2 02/16] ARM: b.L: secondary kernel entry code Nicolas Pitre
@ 2013-01-24 6:27 ` Nicolas Pitre
2013-01-24 6:27 ` [PATCH v2 04/16] ARM: b.L: introduce helpers for platform coherency exit/setup Nicolas Pitre
` (12 subsequent siblings)
15 siblings, 0 replies; 25+ messages in thread
From: Nicolas Pitre @ 2013-01-24 6:27 UTC (permalink / raw)
To: linux-arm-kernel
This is the basic API used to handle the powering up/down of individual
CPUs in a big.LITTLE system. The platform specific backend implementation
is also responsible for handling the cluster level power when
the first/last CPU in a cluster is brought up/down.
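As a rough illustration of how a backend is expected to plug in (sketch
only; the my_pm_* names are invented and not part of this patch):

	#include <linux/init.h>
	#include <linux/types.h>
	#include <asm/bL_entry.h>

	static int my_pm_power_up(unsigned int cpu, unsigned int cluster)
	{
		/* ungate power/clocks and deassert reset for the CPU,
		 * bringing the cluster up first if it was down */
		return 0;
	}

	static void my_pm_power_down(void)
	{
		/* flush and disable the local CPU's cache, exit coherency,
		 * tear the cluster down if we are the last man, then WFI */
	}

	static void my_pm_suspend(u64 expected_residency)
	{
		/* like power_down(), choosing a sleep state based on the
		 * expected residency hint */
	}

	static const struct bL_platform_power_ops my_pm_power_ops = {
		.power_up	= my_pm_power_up,
		.power_down	= my_pm_power_down,
		.suspend	= my_pm_suspend,
	};

	static int __init my_pm_init(void)
	{
		return bL_platform_power_register(&my_pm_power_ops);
	}
	early_initcall(my_pm_init);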
Signed-off-by: Nicolas Pitre <nico@linaro.org>
---
arch/arm/common/bL_entry.c | 88 +++++++++++++++++++++++++++++++++++++++
arch/arm/include/asm/bL_entry.h | 92 +++++++++++++++++++++++++++++++++++++++++
2 files changed, 180 insertions(+)
diff --git a/arch/arm/common/bL_entry.c b/arch/arm/common/bL_entry.c
index 4e1044612d..54bf0e572f 100644
--- a/arch/arm/common/bL_entry.c
+++ b/arch/arm/common/bL_entry.c
@@ -11,11 +11,13 @@
#include <linux/kernel.h>
#include <linux/init.h>
+#include <linux/irqflags.h>
#include <asm/bL_entry.h>
#include <asm/barrier.h>
#include <asm/proc-fns.h>
#include <asm/cacheflush.h>
+#include <asm/idmap.h>
extern volatile unsigned long bL_entry_vectors[BL_MAX_CLUSTERS][BL_MAX_CPUS_PER_CLUSTER];
@@ -27,3 +29,89 @@ void bL_set_entry_vector(unsigned cpu, unsigned cluster, void *ptr)
outer_clean_range(__pa(&bL_entry_vectors[cluster][cpu]),
__pa(&bL_entry_vectors[cluster][cpu + 1]));
}
+
+static const struct bL_platform_power_ops *platform_ops;
+
+int __init bL_platform_power_register(const struct bL_platform_power_ops *ops)
+{
+ if (platform_ops)
+ return -EBUSY;
+ platform_ops = ops;
+ return 0;
+}
+
+int bL_cpu_power_up(unsigned int cpu, unsigned int cluster)
+{
+ if (!platform_ops)
+ return -EUNATCH;
+ might_sleep();
+ return platform_ops->power_up(cpu, cluster);
+}
+
+typedef void (*phys_reset_t)(unsigned long);
+
+void bL_cpu_power_down(void)
+{
+ phys_reset_t phys_reset;
+
+ BUG_ON(!platform_ops);
+ BUG_ON(!irqs_disabled());
+
+ /*
+ * Do this before calling into the power_down method,
+ * as it might not always be safe to do afterwards.
+ */
+ setup_mm_for_reboot();
+
+ platform_ops->power_down();
+
+ /*
+ * It is possible for a power_up request to happen concurrently
+ * with a power_down request for the same CPU. In this case the
+ * power_down method might not be able to actually enter a
+ * powered down state with the WFI instruction if the power_up
+ * method has removed the required reset condition. The
+ * power_down method is then allowed to return. We must perform
+ * a re-entry in the kernel as if the power_up method just had
+ * deasserted reset on the CPU.
+ *
+ * To simplify race issues, the platform specific implementation
+ * must accommodate the possibility of unordered calls to
+ * power_down and power_up with a usage count. Therefore, if a
+ * call to power_up is issued for a CPU that is not down, then
+ * the next call to power_down must not attempt a full shutdown
+ * but only do the minimum (normally disabling L1 cache and CPU
+ * coherency) and return just as if a concurrent power_up request
+ * had happened as described above.
+ */
+
+ phys_reset = (phys_reset_t)(unsigned long)virt_to_phys(cpu_reset);
+ phys_reset(virt_to_phys(bL_entry_point));
+
+ /* should never get here */
+ BUG();
+}
+
+void bL_cpu_suspend(u64 expected_residency)
+{
+ phys_reset_t phys_reset;
+
+ BUG_ON(!platform_ops);
+ BUG_ON(!irqs_disabled());
+
+ /* Very similar to bL_cpu_power_down() */
+ setup_mm_for_reboot();
+ platform_ops->suspend(expected_residency);
+ phys_reset = (phys_reset_t)(unsigned long)virt_to_phys(cpu_reset);
+ phys_reset(virt_to_phys(bL_entry_point));
+ BUG();
+}
+
+int bL_cpu_powered_up(void)
+{
+ if (!platform_ops)
+ return -EUNATCH;
+ if (platform_ops->powered_up)
+ platform_ops->powered_up();
+ return 0;
+}
diff --git a/arch/arm/include/asm/bL_entry.h b/arch/arm/include/asm/bL_entry.h
index 7525614243..adf8706c76 100644
--- a/arch/arm/include/asm/bL_entry.h
+++ b/arch/arm/include/asm/bL_entry.h
@@ -31,5 +31,97 @@ extern void bL_entry_point(void);
*/
void bL_set_entry_vector(unsigned cpu, unsigned cluster, void *ptr);
+/*
+ * CPU/cluster power operations API for higher subsystems to use.
+ */
+
+/**
+ * bL_cpu_power_up - make given CPU in given cluster runnable
+ *
+ * @cpu: CPU number within given cluster
+ * @cluster: cluster number for the CPU
+ *
+ * The identified CPU is brought out of reset. If the cluster was powered
+ * down then it is brought up as well, taking care not to let the other CPUs
+ * in the cluster run, and ensuring appropriate cluster setup.
+ *
+ * Caller must ensure the appropriate entry vector is initialized with
+ * bL_set_entry_vector() prior to calling this.
+ *
+ * This must be called in a sleepable context. However, the implementation
+ * is strongly encouraged to return early and let the operation happen
+ * asynchronously, especially when significant delays are expected.
+ *
+ * If the operation cannot be performed then an error code is returned.
+ */
+int bL_cpu_power_up(unsigned int cpu, unsigned int cluster);
+
+/**
+ * bL_cpu_power_down - power the calling CPU down
+ *
+ * The calling CPU is powered down.
+ *
+ * If this CPU is found to be the "last man standing" in the cluster
+ * then the cluster is prepared for power-down too.
+ *
+ * This must be called with interrupts disabled.
+ *
+ * This does not return. Re-entry in the kernel is expected via
+ * bL_entry_point.
+ */
+void bL_cpu_power_down(void);
+
+/**
+ * bL_cpu_suspend - bring the calling CPU in a suspended state
+ *
+ * @expected_residency: duration in microseconds the CPU is expected
+ * to remain suspended, or 0 if unknown/infinity.
+ *
+ * The calling CPU is suspended. The expected residency argument is used
+ * as a hint by the platform specific backend to implement the appropriate
+ * sleep state level according to the knowledge it has on wake-up latency
+ * for the given hardware.
+ *
+ * If this CPU is found to be the "last man standing" in the cluster
+ * then the cluster may be prepared for power-down too, if the expected
+ * residency makes it worthwhile.
+ *
+ * This must be called with interrupts disabled.
+ *
+ * This does not return. Re-entry in the kernel is expected via
+ * bL_entry_point.
+ */
+void bL_cpu_suspend(u64 expected_residency);
+
+/**
+ * bL_cpu_powered_up - housekeeping work after a CPU has been powered up
+ *
+ * This lets the platform specific backend code perform needed housekeeping
+ * work. This must be called by the newly activated CPU as soon as it is
+ * fully operational in kernel space, before it enables interrupts.
+ *
+ * If the operation cannot be performed then an error code is returned.
+ */
+int bL_cpu_powered_up(void);
+
+/*
+ * Platform specific methods used in the implementation of the above API.
+ */
+struct bL_platform_power_ops {
+ int (*power_up)(unsigned int cpu, unsigned int cluster);
+ void (*power_down)(void);
+ void (*suspend)(u64);
+ void (*powered_up)(void);
+};
+
+/**
+ * bL_platform_power_register - register platform specific power methods
+ *
+ * @ops: bL_platform_power_ops structure to register
+ *
+ * An error is returned if the registration has been done previously.
+ */
+int __init bL_platform_power_register(const struct bL_platform_power_ops *ops);
+
#endif /* ! __ASSEMBLY__ */
#endif
--
1.8.0
* [PATCH v2 04/16] ARM: b.L: introduce helpers for platform coherency exit/setup
2013-01-24 6:27 [PATCH v2 00/16] low-level CPU and cluster power management Nicolas Pitre
` (2 preceding siblings ...)
2013-01-24 6:27 ` [PATCH v2 03/16] ARM: b.L: introduce the CPU/cluster power API Nicolas Pitre
@ 2013-01-24 6:27 ` Nicolas Pitre
2013-01-24 6:27 ` [PATCH v2 05/16] ARM: b.L: Add baremetal voting mutexes Nicolas Pitre
` (11 subsequent siblings)
15 siblings, 0 replies; 25+ messages in thread
From: Nicolas Pitre @ 2013-01-24 6:27 UTC (permalink / raw)
To: linux-arm-kernel
From: Dave Martin <dave.martin@linaro.org>
This provides helper methods to coordinate between CPUs coming down
and CPUs going up, as well as documentation of the algorithms used,
so that cluster teardown and setup operations are not performed on a
cluster simultaneously.
For use in the power_down() implementation:
* __bL_cpu_going_down(unsigned int cluster, unsigned int cpu)
* __bL_outbound_enter_critical(unsigned int cluster)
* __bL_outbound_leave_critical(unsigned int cluster)
* __bL_cpu_down(unsigned int cluster, unsigned int cpu)
The power_up_setup() helper should do platform-specific setup in
preparation for turning the CPU on, such as invalidating local caches
or entering coherency. It must be assembler for now, since it must
run before the MMU can be switched on. It is passed the affinity level
which should be initialized.
Because the bL_cluster_sync_struct content is looked up and modified
with the cache enabled or disabled depending on the code path, it is
crucial to always ensure proper cache maintenance to update main memory
right away. Therefore, any cached write must be followed by a cache
clean operation and any cached read must be preceded by a cache
invalidate operation (actually a cache flush, i.e. clean+invalidate, to
avoid discarding possible concurrent writes) on the accessed memory.
Also, in order to prevent a cached writer from interfering with an
adjacent non-cached writer, we ensure each state variable is placed in
a separate cache line.
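Concretely, the pattern used by the helpers below is (sketch only, excerpted
from the code added by this patch; the local variable "state" is just for
illustration):

	/* writer side, e.g. __bL_cpu_going_down(): */
	bL_sync.clusters[cluster].cpus[cpu].cpu = CPU_GOING_DOWN;
	sync_w(&bL_sync.clusters[cluster].cpus[cpu].cpu);  /* clean to memory */

	/* reader side, e.g. __bL_cluster_state(): */
	sync_r(&bL_sync.clusters[cluster].cluster);        /* flush stale copy */
	state = bL_sync.clusters[cluster].cluster;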
Thanks to Nicolas Pitre and Achin Gupta for the help with this
patch.
Signed-off-by: Dave Martin <dave.martin@linaro.org>
---
.../arm/big.LITTLE/cluster-pm-race-avoidance.txt | 498 +++++++++++++++++++++
arch/arm/common/bL_entry.c | 197 ++++++++
arch/arm/common/bL_head.S | 106 ++++-
arch/arm/include/asm/bL_entry.h | 63 +++
4 files changed, 862 insertions(+), 2 deletions(-)
create mode 100644 Documentation/arm/big.LITTLE/cluster-pm-race-avoidance.txt
diff --git a/Documentation/arm/big.LITTLE/cluster-pm-race-avoidance.txt b/Documentation/arm/big.LITTLE/cluster-pm-race-avoidance.txt
new file mode 100644
index 0000000000..ba6dadb0d4
--- /dev/null
+++ b/Documentation/arm/big.LITTLE/cluster-pm-race-avoidance.txt
@@ -0,0 +1,498 @@
+Big.LITTLE cluster Power-up/power-down race avoidance algorithm
+===============================================================
+
+This file documents the algorithm which is used to coordinate CPU and
+cluster setup and teardown operations and to manage hardware coherency
+controls safely.
+
+The section "Rationale" explains what the algorithm is for and why it is
+needed. "Basic model" explains general concepts using a simplified view
+of the system. The other sections explain the actual details of the
+algorithm in use.
+
+
+Rationale
+---------
+
+In a system containing multiple CPUs, it is desirable to have the
+ability to turn off individual CPUs when the system is idle, reducing
+power consumption and thermal dissipation.
+
+In a system containing multiple clusters of CPUs, it is also desirable
+to have the ability to turn off entire clusters.
+
+Turning entire clusters off and on is a risky business, because it
+involves performing potentially destructive operations affecting a group
+of independently running CPUs, while the OS continues to run. This
+means that we need some coordination in order to ensure that critical
+cluster-level operations are only performed when it is truly safe to do
+so.
+
+Simple locking may not be sufficient to solve this problem, because
+mechanisms like Linux spinlocks may rely on coherency mechanisms which
+are not immediately enabled when a cluster powers up. Since enabling or
+disabling those mechanisms may itself be a non-atomic operation (such as
+writing some hardware registers and invalidating large caches), other
+methods of coordination are required in order to guarantee safe
+power-down and power-up at the cluster level.
+
+The mechanism presented in this document describes a coherent memory
+based protocol for performing the needed coordination. It aims to be as
+lightweight as possible, while providing the required safety properties.
+
+
+Basic model
+-----------
+
+Each cluster and CPU is assigned a state, as follows:
+
+ DOWN
+ COMING_UP
+ UP
+ GOING_DOWN
+
+            +---------> UP ----------+
+            |                        v
+
+        COMING_UP                GOING_DOWN
+
+            ^                        |
+            +--------- DOWN <--------+
+
+
+DOWN: The CPU or cluster is not coherent, and is either powered off or
+ suspended, or is ready to be powered off or suspended.
+
+COMING_UP: The CPU or cluster has committed to moving to the UP state.
+ It may be part way through the process of initialisation and
+ enabling coherency.
+
+UP: The CPU or cluster is active and coherent at the hardware
+ level. A CPU in this state is not necessarily being used
+ actively by the kernel.
+
+GOING_DOWN: The CPU or cluster has committed to moving to the DOWN
+ state. It may be part way through the process of teardown and
+ coherency exit.
+
+
+Each CPU has one of these states assigned to it at any point in time.
+The CPU states are described in the "CPU state" section, below.
+
+Each cluster is also assigned a state, but it is necessary to split the
+state value into two parts (the "cluster" state and "inbound" state) and
+to introduce additional states in order to avoid races between different
+CPUs in the cluster simultaneously modifying the state. The cluster-
+level states are described in the "Cluster state" section.
+
+To help distinguish the CPU states from cluster states in this
+discussion, the state names are given a CPU_ prefix for the CPU states,
+and a CLUSTER_ or INBOUND_ prefix for the cluster states.
+
+
+CPU state
+---------
+
+In this algorithm, each individual core in a multi-core processor is
+referred to as a "CPU". CPUs are assumed to be single-threaded:
+therefore, a CPU can only be doing one thing@a single point in time.
+
+This means that CPUs fit the basic model closely.
+
+The algorithm defines the following states for each CPU in the system:
+
+ CPU_DOWN
+ CPU_COMING_UP
+ CPU_UP
+ CPU_GOING_DOWN
+
+         cluster setup and
+        CPU setup complete          policy decision
+              +-----------> CPU_UP ------------+
+              |                                v
+
+        CPU_COMING_UP                   CPU_GOING_DOWN
+
+              ^                                |
+              +----------- CPU_DOWN <----------+
+             policy decision              CPU teardown complete
+          or hardware event
+
+
+The definitions of the four states correspond closely to the states of
+the basic model.
+
+Transitions between states occur as follows.
+
+A trigger event (spontaneous) means that the CPU can transition to the
+next state as a result of making local progress only, with no
+requirement for any external event to happen.
+
+
+CPU_DOWN:
+
+ A CPU reaches the CPU_DOWN state when it is ready for
+ power-down. On reaching this state, the CPU will typically
+ power itself down or suspend itself, via a WFI instruction or a
+ firmware call.
+
+ Next state: CPU_COMING_UP
+ Conditions: none
+
+ Trigger events:
+
+ a) an explicit hardware power-up operation, resulting
+ from a policy decision on another CPU;
+
+ b) a hardware event, such as an interrupt.
+
+
+CPU_COMING_UP:
+
+ A CPU cannot start participating in hardware coherency until the
+ cluster is set up and coherent. If the cluster is not ready,
+ then the CPU will wait in the CPU_COMING_UP state until the
+ cluster has been set up.
+
+ Next state: CPU_UP
+ Conditions: The CPU's parent cluster must be in CLUSTER_UP.
+ Trigger events: Transition of the parent cluster to CLUSTER_UP.
+
+ Refer to the "Cluster state" section for a description of the
+ CLUSTER_UP state.
+
+
+CPU_UP:
+ When a CPU reaches the CPU_UP state, it is safe for the CPU to
+ start participating in local coherency.
+
+ This is done by jumping to the kernel's CPU resume code.
+
+ Note that the definition of this state is slightly different
+ from the basic model definition: CPU_UP does not mean that the
+ CPU is coherent yet, but it does mean that it is safe to resume
+ the kernel. The kernel handles the rest of the resume
+ procedure, so the remaining steps are not visible as part of the
+ race avoidance algorithm.
+
+ The CPU remains in this state until an explicit policy decision
+ is made to shut down or suspend the CPU.
+
+ Next state: CPU_GOING_DOWN
+ Conditions: none
+ Trigger events: explicit policy decision
+
+
+CPU_GOING_DOWN:
+
+ While in this state, the CPU exits coherency, including any
+ operations required to achieve this (such as cleaning data
+ caches).
+
+ Next state: CPU_DOWN
+ Conditions: local CPU teardown complete
+ Trigger events: (spontaneous)
+
+
+Cluster state
+-------------
+
+A cluster is a group of connected CPUs with some common resources.
+Because a cluster contains multiple CPUs, it can be doing multiple
+things at the same time. This has some implications. In particular, a
+CPU can start up while another CPU is tearing the cluster down.
+
+In this discussion, the "outbound side" is the view of the cluster state
+as seen by a CPU tearing the cluster down. The "inbound side" is the
+view of the cluster state as seen by a CPU setting the CPU up.
+
+In order to enable safe coordination in such situations, it is important
+that a CPU which is setting up the cluster can advertise its state
+independently of the CPU which is tearing down the cluster. For this
+reason, the cluster state is split into two parts:
+
+ "cluster" state: The global state of the cluster; or the state
+ on the outbound side:
+
+ CLUSTER_DOWN
+ CLUSTER_UP
+ CLUSTER_GOING_DOWN
+
+ "inbound" state: The state of the cluster on the inbound side.
+
+ INBOUND_NOT_COMING_UP
+ INBOUND_COMING_UP
+
+
+ The different pairings of these states result in six possible
+ states for the cluster as a whole:
+
+                    CLUSTER_UP
+          +==========> INBOUND_NOT_COMING_UP -------------+
+          #                                               |
+                                                          |
+     CLUSTER_UP     <----+                                |
+  INBOUND_COMING_UP      |                                v
+
+          ^             CLUSTER_GOING_DOWN       CLUSTER_GOING_DOWN
+          #              INBOUND_COMING_UP <=== INBOUND_NOT_COMING_UP
+
+    CLUSTER_DOWN         |                                |
+  INBOUND_COMING_UP <----+                                |
+                                                          |
+          ^                                               |
+          +===========     CLUSTER_DOWN      <------------+
+                       INBOUND_NOT_COMING_UP
+
+ Transitions -----> can only be made by the outbound CPU, and
+ only involve changes to the "cluster" state.
+
+ Transitions ===##> can only be made by the inbound CPU, and only
+ involve changes to the "inbound" state, except where there is no
+ further transition possible on the outbound side (i.e., the
+ outbound CPU has put the cluster into the CLUSTER_DOWN state).
+
+ The race avoidance algorithm does not provide a way to determine
+ which exact CPUs within the cluster play these roles. This must
+ be decided in advance by some other means. Refer to the section
+ "Last man and first man selection" for more explanation.
+
+
+ CLUSTER_DOWN/INBOUND_NOT_COMING_UP is the only state where the
+ cluster can actually be powered down.
+
+ The parallelism of the inbound and outbound CPUs is observed by
+ the existence of two different paths from CLUSTER_GOING_DOWN/
+ INBOUND_NOT_COMING_UP (corresponding to GOING_DOWN in the basic
+ model) to CLUSTER_DOWN/INBOUND_COMING_UP (corresponding to
+ COMING_UP in the basic model). The second path avoids cluster
+ teardown completely.
+
+ CLUSTER_UP/INBOUND_COMING_UP is equivalent to UP in the basic
+ model. The final transition to CLUSTER_UP/INBOUND_NOT_COMING_UP
+ is trivial and merely resets the state machine ready for the
+ next cycle.
+
+ Details of the allowable transitions follow.
+
+ The next state in each case is notated
+
+ <cluster state>/<inbound state> (<transitioner>)
+
+ where the <transitioner> is the side on which the transition
+ can occur; either the inbound or the outbound side.
+
+
+CLUSTER_DOWN/INBOUND_NOT_COMING_UP:
+
+ Next state: CLUSTER_DOWN/INBOUND_COMING_UP (inbound)
+ Conditions: none
+ Trigger events:
+
+ a) an explicit hardware power-up operation, resulting
+ from a policy decision on another CPU;
+
+ b) a hardware event, such as an interrupt.
+
+
+CLUSTER_DOWN/INBOUND_COMING_UP:
+
+ In this state, an inbound CPU sets up the cluster, including
+ enabling of hardware coherency at the cluster level and any
+ other operations (such as cache invalidation) which are required
+ in order to achieve this.
+
+ The purpose of this state is to do sufficient cluster-level
+ setup to enable other CPUs in the cluster to enter coherency
+ safely.
+
+ Next state: CLUSTER_UP/INBOUND_COMING_UP (inbound)
+ Conditions: cluster-level setup and hardware coherency complete
+ Trigger events: (spontaneous)
+
+
+CLUSTER_UP/INBOUND_COMING_UP:
+
+ Cluster-level setup is complete and hardware coherency is
+ enabled for the cluster. Other CPUs in the cluster can safely
+ enter coherency.
+
+ This is a transient state, leading immediately to
+ CLUSTER_UP/INBOUND_NOT_COMING_UP. All other CPUs in the cluster
+ should treat these two states as equivalent.
+
+ Next state: CLUSTER_UP/INBOUND_NOT_COMING_UP (inbound)
+ Conditions: none
+ Trigger events: (spontaneous)
+
+
+CLUSTER_UP/INBOUND_NOT_COMING_UP:
+
+ Cluster-level setup is complete and hardware coherency is
+ enabled for the cluster. Other CPUs in the cluster can safely
+ enter coherency.
+
+ The cluster will remain in this state until a policy decision is
+ made to power the cluster down.
+
+ Next state: CLUSTER_GOING_DOWN/INBOUND_NOT_COMING_UP (outbound)
+ Conditions: none
+ Trigger events: policy decision to power down the cluster
+
+
+CLUSTER_GOING_DOWN/INBOUND_NOT_COMING_UP:
+
+ An outbound CPU is tearing the cluster down. The selected CPU
+ must wait in this state until all CPUs in the cluster are in the
+ CPU_DOWN state.
+
+ When all CPUs are in the CPU_DOWN state, the cluster can be torn
+ down, for example by cleaning data caches and exiting
+ cluster-level coherency.
+
+ To avoid unnecessary teardown operations, the outbound side
+ should check the inbound cluster state for asynchronous
+ transitions to INBOUND_COMING_UP. Alternatively, individual
+ CPUs can be checked for entry into CPU_COMING_UP or CPU_UP.
+
+
+ Next states:
+
+ CLUSTER_DOWN/INBOUND_NOT_COMING_UP (outbound)
+ Conditions: cluster torn down and ready to power off
+ Trigger events: (spontaneous)
+
+ CLUSTER_GOING_DOWN/INBOUND_COMING_UP (inbound)
+ Conditions: none
+ Trigger events:
+
+ a) an explicit hardware power-up operation,
+ resulting from a policy decision on another
+ CPU;
+
+ b) a hardware event, such as an interrupt.
+
+
+CLUSTER_GOING_DOWN/INBOUND_COMING_UP:
+
+ The cluster is (or was) being torn down, but another CPU has
+ come online in the meantime and is trying to set up the cluster
+ again.
+
+ If the outbound CPU observes this state, it has two choices:
+
+ a) back out of teardown, restoring the cluster to the
+ CLUSTER_UP state;
+
+ b) finish tearing the cluster down and put the cluster
+ in the CLUSTER_DOWN state; the inbound CPU will
+ set up the cluster again from there.
+
+ Choice (a) permits the removal of some latency by avoiding
+ unnecessary teardown and setup operations in situations where
+ the cluster is not really going to be powered down.
+
+
+ Next states:
+
+ CLUSTER_UP/INBOUND_COMING_UP (outbound)
+ Conditions: cluster-level setup and hardware
+ coherency complete
+ Trigger events: (spontaneous)
+
+ CLUSTER_DOWN/INBOUND_COMING_UP (outbound)
+ Conditions: cluster torn down and ready to power off
+ Trigger events: (spontaneous)
+
+
+Last man and First man selection
+--------------------------------
+
+The CPU which performs cluster tear-down operations on the outbound side
+is commonly referred to as the "last man".
+
+The CPU which performs cluster setup on the inbound side is commonly
+referred to as the "first man".
+
+The race avoidance algorithm documented above does not provide a
+mechanism to choose which CPUs should play these roles.
+
+
+Last man:
+
+When shutting down the cluster, all the CPUs involved are initially
+executing Linux and hence coherent. Therefore, ordinary spinlocks can
+be used to select a last man safely, before the CPUs become
+non-coherent.
+
+
+First man:
+
+Because CPUs may power up asynchronously in response to external wake-up
+events, a dynamic mechanism is needed to make sure that only one CPU
+attempts to play the first man role and do the cluster-level
+initialisation: any other CPUs must wait for this to complete before
+proceeding.
+
+Cluster-level initialisation may involve actions such as configuring
+coherency controls in the bus fabric.
+
+The current implementation in bL_head.S uses a separate mutual exclusion
+mechanism to do this arbitration. This mechanism is documented in
+detail in vlocks.txt.
+
+
+Features and Limitations
+------------------------
+
+Implementation:
+
+ The current ARM-based implementation is split between
+ arch/arm/common/bL_head.S (low-level inbound CPU operations) and
+ arch/arm/common/bL_entry.c (everything else):
+
+ __bL_cpu_going_down() signals the transition of a CPU to the
+ CPU_GOING_DOWN state.
+
+ __bL_cpu_down() signals the transition of a CPU to the CPU_DOWN
+ state.
+
+ A CPU transitions to CPU_COMING_UP and then to CPU_UP via the
+ low-level power-up code in bL_head.S. This could
+ involve CPU-specific setup code, but in the current
+ implementation it does not.
+
+ __bL_outbound_enter_critical() and __bL_outbound_leave_critical()
+ handle transitions from CLUSTER_UP to CLUSTER_GOING_DOWN
+ and from there to CLUSTER_DOWN or back to CLUSTER_UP (in
+ the case of an aborted cluster power-down).
+
+ These functions are more complex than the __bL_cpu_*()
+ functions due to the extra inter-CPU coordination which
+ is needed for safe transitions at the cluster level.
+
+ A cluster transitions from CLUSTER_DOWN back to CLUSTER_UP via
+ the low-level power-up code in bL_head.S. This
+ typically involves platform-specific setup code,
+ provided by the platform-specific power_up_setup
+ function registered via bL_cluster_sync_init.
+
+Deep topologies:
+
+ As currently described and implemented, the algorithm does not
+ support CPU topologies involving more than two levels (i.e.,
+ clusters of clusters are not supported). The algorithm could be
+ extended by replicating the cluster-level states for the
+ additional topological levels, and modifying the transition
+ rules for the intermediate (non-outermost) cluster levels.
+
+
+Colophon
+--------
+
+Originally created and documented by Dave Martin for Linaro Limited, in
+collaboration with Nicolas Pitre and Achin Gupta.
+
+Copyright (C) 2012-2013 Linaro Limited
+Distributed under the terms of Version 2 of the GNU General Public
+License, as defined in linux/COPYING.
diff --git a/arch/arm/common/bL_entry.c b/arch/arm/common/bL_entry.c
index 54bf0e572f..14d72b97ad 100644
--- a/arch/arm/common/bL_entry.c
+++ b/arch/arm/common/bL_entry.c
@@ -18,6 +18,7 @@
#include <asm/proc-fns.h>
#include <asm/cacheflush.h>
#include <asm/idmap.h>
+#include <asm/cputype.h>
extern volatile unsigned long bL_entry_vectors[BL_MAX_CLUSTERS][BL_MAX_CPUS_PER_CLUSTER];
@@ -115,3 +116,199 @@ int bL_cpu_powered_up(void)
platform_ops->powered_up();
return 0;
}
+
+struct bL_sync_struct bL_sync;
+
+/*
+ * There is no __cpuc_clean_dcache_area but we use it anyway for
+ * code clarity, and alias it to __cpuc_flush_dcache_area.
+ */
+#define __cpuc_clean_dcache_area __cpuc_flush_dcache_area
+
+/*
+ * Ensure preceding writes to *p by this CPU are visible to
+ * subsequent reads by other CPUs:
+ */
+static void __sync_range_w(volatile void *p, size_t size)
+{
+ char *_p = (char *)p;
+
+ __cpuc_clean_dcache_area(_p, size);
+ outer_clean_range(__pa(_p), __pa(_p + size));
+}
+
+/*
+ * Ensure preceding writes to *p by other CPUs are visible to
+ * subsequent reads by this CPU. We must be careful not to
+ * discard data simultaneously written by another CPU, hence the
+ * usage of flush rather than invalidate operations.
+ */
+static void __sync_range_r(volatile void *p, size_t size)
+{
+ char *_p = (char *)p;
+
+#ifdef CONFIG_OUTER_CACHE
+ if (outer_cache.flush_range) {
+ /*
+ * Ensure dirty data migrated from other CPUs into our cache
+ * are cleaned out safely before the outer cache is cleaned:
+ */
+ __cpuc_clean_dcache_area(_p, size);
+
+ /* Clean and invalidate stale data for *p from outer ... */
+ outer_flush_range(__pa(_p), __pa(_p + size));
+ }
+#endif
+
+ /* ... and inner cache: */
+ __cpuc_flush_dcache_area(_p, size);
+}
+
+#define sync_w(ptr) __sync_range_w(ptr, sizeof *(ptr))
+#define sync_r(ptr) __sync_range_r(ptr, sizeof *(ptr))
+
+/*
+ * __bL_cpu_going_down: Indicates that the cpu is being torn down.
+ * This must be called at the point of committing to teardown of a CPU.
+ * The CPU cache (SCTRL.C bit) is expected to still be active.
+ */
+void __bL_cpu_going_down(unsigned int cpu, unsigned int cluster)
+{
+ bL_sync.clusters[cluster].cpus[cpu].cpu = CPU_GOING_DOWN;
+ sync_w(&bL_sync.clusters[cluster].cpus[cpu].cpu);
+}
+
+/*
+ * __bL_cpu_down: Indicates that cpu teardown is complete and that the
+ * cluster can be torn down without disrupting this CPU.
+ * To avoid deadlocks, this must be called before a CPU is powered down.
+ * The CPU cache (SCTRL.C bit) is expected to be off.
+ */
+void __bL_cpu_down(unsigned int cpu, unsigned int cluster)
+{
+ dmb();
+ bL_sync.clusters[cluster].cpus[cpu].cpu = CPU_DOWN;
+ sync_w(&bL_sync.clusters[cluster].cpus[cpu].cpu);
+ dsb_sev();
+}
+
+/*
+ * __bL_outbound_leave_critical: Leave the cluster teardown critical section.
+ * @state: the final state of the cluster:
+ * CLUSTER_UP: no destructive teardown was done and the cluster has been
+ * restored to the previous state (CPU cache still active); or
+ * CLUSTER_DOWN: the cluster has been torn-down, ready for power-off
+ * (CPU cache disabled).
+ */
+void __bL_outbound_leave_critical(unsigned int cluster, int state)
+{
+ dmb();
+ bL_sync.clusters[cluster].cluster = state;
+ sync_w(&bL_sync.clusters[cluster].cluster);
+ dsb_sev();
+}
+
+/*
+ * __bL_outbound_enter_critical: Enter the cluster teardown critical section.
+ * This function should be called by the last man, after local CPU teardown
+ * is complete. CPU cache expected to be active.
+ *
+ * Returns:
+ * false: the critical section was not entered because an inbound CPU was
+ * observed, or the cluster is already being set up;
+ * true: the critical section was entered: it is now safe to tear down the
+ * cluster.
+ */
+bool __bL_outbound_enter_critical(unsigned int cpu, unsigned int cluster)
+{
+ unsigned int i;
+ struct bL_cluster_sync_struct *c = &bL_sync.clusters[cluster];
+
+ /* Warn inbound CPUs that the cluster is being torn down: */
+ c->cluster = CLUSTER_GOING_DOWN;
+ sync_w(&c->cluster);
+
+ /* Back out if the inbound cluster is already in the critical region: */
+ sync_r(&c->inbound);
+ if (c->inbound == INBOUND_COMING_UP)
+ goto abort;
+
+ /*
+ * Wait for all CPUs to get out of the GOING_DOWN state, so that local
+ * teardown is complete on each CPU before tearing down the cluster.
+ *
+ * If any CPU has been woken up again from the DOWN state, then we
+ * shouldn't be taking the cluster down at all: abort in that case.
+ */
+ sync_r(&c->cpus);
+ for (i = 0; i < BL_MAX_CPUS_PER_CLUSTER; i++) {
+ int cpustate;
+
+ if (i == cpu)
+ continue;
+
+ while (1) {
+ cpustate = c->cpus[i].cpu;
+ if (cpustate != CPU_GOING_DOWN)
+ break;
+
+ wfe();
+ sync_r(&c->cpus[i].cpu);
+ }
+
+ switch (cpustate) {
+ case CPU_DOWN:
+ continue;
+
+ default:
+ goto abort;
+ }
+ }
+
+ return true;
+
+abort:
+ __bL_outbound_leave_critical(cluster, CLUSTER_UP);
+ return false;
+}
+
+int __bL_cluster_state(unsigned int cluster)
+{
+ sync_r(&bL_sync.clusters[cluster].cluster);
+ return bL_sync.clusters[cluster].cluster;
+}
+
+extern unsigned long bL_power_up_setup_phys;
+
+int __init bL_cluster_sync_init(
+ void (*power_up_setup)(unsigned int affinity_level))
+{
+ unsigned int i, j, mpidr, this_cluster;
+
+ BUILD_BUG_ON(BL_SYNC_CLUSTER_SIZE * BL_MAX_CLUSTERS != sizeof bL_sync);
+ BUG_ON((unsigned long)&bL_sync & (__CACHE_WRITEBACK_GRANULE - 1));
+
+ /*
+ * Set initial CPU and cluster states.
+ * Only one cluster is assumed to be active at this point.
+ */
+ for (i = 0; i < BL_MAX_CLUSTERS; i++) {
+ bL_sync.clusters[i].cluster = CLUSTER_DOWN;
+ bL_sync.clusters[i].inbound = INBOUND_NOT_COMING_UP;
+ for (j = 0; j < BL_MAX_CPUS_PER_CLUSTER; j++)
+ bL_sync.clusters[i].cpus[j].cpu = CPU_DOWN;
+ }
+ mpidr = read_cpuid_mpidr();
+ this_cluster = MPIDR_AFFINITY_LEVEL(mpidr, 1);
+ for_each_online_cpu(i)
+ bL_sync.clusters[this_cluster].cpus[i].cpu = CPU_UP;
+ bL_sync.clusters[this_cluster].cluster = CLUSTER_UP;
+ sync_w(&bL_sync);
+
+ if (power_up_setup) {
+ bL_power_up_setup_phys = virt_to_phys(power_up_setup);
+ sync_w(&bL_power_up_setup_phys);
+ }
+
+ return 0;
+}
diff --git a/arch/arm/common/bL_head.S b/arch/arm/common/bL_head.S
index 072a13da20..a226cdf4ce 100644
--- a/arch/arm/common/bL_head.S
+++ b/arch/arm/common/bL_head.S
@@ -7,11 +7,19 @@
* This program is free software; you can redistribute it and/or modify
* it under the terms of the GNU General Public License version 2 as
* published by the Free Software Foundation.
+ *
+ *
+ * Refer to Documentation/arm/big.LITTLE/cluster-pm-race-avoidance.txt
+ * for details of the synchronisation algorithms used here.
*/
#include <linux/linkage.h>
#include <asm/bL_entry.h>
+.if BL_SYNC_CLUSTER_CPUS
+.error "cpus must be the first member of struct bL_cluster_sync_struct"
+.endif
+
.macro pr_dbg cpu, string
#if defined(CONFIG_DEBUG_LL) && defined(DEBUG)
b 1901f
@@ -52,24 +60,114 @@ ENTRY(bL_entry_point)
2: pr_dbg r4, "kernel bL_entry_point\n"
/*
- * MMU is off so we need to get to bL_entry_vectors in a
+ * MMU is off so we need to get to various variables in a
* position independent way.
*/
adr r5, 3f
- ldr r6, [r5]
+ ldmia r5, {r6, r7, r8}
add r6, r5, r6 @ r6 = bL_entry_vectors
+ ldr r7, [r5, r7] @ r7 = bL_power_up_setup_phys
+ add r8, r5, r8 @ r8 = bL_sync
+
+ mov r0, #BL_SYNC_CLUSTER_SIZE
+ mla r8, r0, r10, r8 @ r8 = bL_sync cluster base
+
+ @ Signal that this CPU is coming UP:
+ mov r0, #CPU_COMING_UP
+ mov r5, #BL_SYNC_CPU_SIZE
+ mla r5, r9, r5, r8 @ r5 = bL_sync cpu address
+ strb r0, [r5]
+
+ @ At this point, the cluster cannot unexpectedly enter the GOING_DOWN
+ @ state, because there is at least one active CPU (this CPU).
+
+ @ Note: the following is racy as another CPU might be testing
+ @ the same flag at the same moment. That'll be fixed later.
+ ldrb r0, [r8, #BL_SYNC_CLUSTER_CLUSTER]
+ cmp r0, #CLUSTER_UP @ cluster already up?
+ bne cluster_setup @ if not, set up the cluster
+
+ @ Otherwise, skip setup:
+ b cluster_setup_complete
+
+cluster_setup:
+ @ Control dependency implies strb not observable before previous ldrb.
+
+ @ Signal that the cluster is being brought up:
+ mov r0, #INBOUND_COMING_UP
+ strb r0, [r8, #BL_SYNC_CLUSTER_INBOUND]
+ dmb
+
+ @ Any CPU trying to take the cluster into CLUSTER_GOING_DOWN from this
+ @ point onwards will observe INBOUND_COMING_UP and abort.
+
+ @ Wait for any previously-pending cluster teardown operations to abort
+ @ or complete:
+cluster_teardown_wait:
+ ldrb r0, [r8, #BL_SYNC_CLUSTER_CLUSTER]
+ cmp r0, #CLUSTER_GOING_DOWN
+ bne first_man_setup
+ wfe
+ b cluster_teardown_wait
+
+first_man_setup:
+ dmb
+
+ @ If the outbound gave up before teardown started, skip cluster setup:
+
+ cmp r0, #CLUSTER_UP
+ beq cluster_setup_leave
+
+ @ power_up_setup is now responsible for setting up the cluster:
+
+ cmp r7, #0
+ mov r0, #1 @ second (cluster) affinity level
+ blxne r7 @ Call power_up_setup if defined
+ dmb
+
+ mov r0, #CLUSTER_UP
+ strb r0, [r8, #BL_SYNC_CLUSTER_CLUSTER]
+ dmb
+
+cluster_setup_leave:
+ @ Leave the cluster setup critical section:
+
+ mov r0, #INBOUND_NOT_COMING_UP
+ strb r0, [r8, #BL_SYNC_CLUSTER_INBOUND]
+ dsb
+ sev
+
+cluster_setup_complete:
+ @ If a platform-specific CPU setup hook is needed, it is
+ @ called from here.
+
+ cmp r7, #0
+ mov r0, #0 @ first (CPU) affinity level
+ blxne r7 @ Call power_up_setup if defined
+ dmb
+
+ @ Mark the CPU as up:
+
+ mov r0, #CPU_UP
+ strb r0, [r5]
+
+ @ Observability order of CPU_UP and opening of the gate does not matter.
bL_entry_gated:
ldr r5, [r6, r4, lsl #2] @ r5 = CPU entry vector
cmp r5, #0
wfeeq
beq bL_entry_gated
+ dmb
+
pr_dbg r4, "released\n"
bx r5
.align 2
3: .word bL_entry_vectors - .
+ .word bL_power_up_setup_phys - 3b
+ .word bL_sync - 3b
ENDPROC(bL_entry_point)
@@ -79,3 +177,7 @@ ENDPROC(bL_entry_point)
.type bL_entry_vectors, #object
ENTRY(bL_entry_vectors)
.space 4 * BL_MAX_CLUSTERS * BL_MAX_CPUS_PER_CLUSTER
+
+ .type bL_power_up_setup_phys, #object
+ENTRY(bL_power_up_setup_phys)
+ .space 4 @ set by bL_cluster_sync_init()
diff --git a/arch/arm/include/asm/bL_entry.h b/arch/arm/include/asm/bL_entry.h
index adf8706c76..0736575783 100644
--- a/arch/arm/include/asm/bL_entry.h
+++ b/arch/arm/include/asm/bL_entry.h
@@ -15,8 +15,37 @@
#define BL_MAX_CPUS_PER_CLUSTER 4
#define BL_MAX_CLUSTERS 2
+/* Definitions for bL_cluster_sync_struct */
+#define CPU_DOWN 0x11
+#define CPU_COMING_UP 0x12
+#define CPU_UP 0x13
+#define CPU_GOING_DOWN 0x14
+
+#define CLUSTER_DOWN 0x21
+#define CLUSTER_UP 0x22
+#define CLUSTER_GOING_DOWN 0x23
+
+#define INBOUND_NOT_COMING_UP 0x31
+#define INBOUND_COMING_UP 0x32
+
+/* This is a complete guess. */
+#define __CACHE_WRITEBACK_ORDER 6
+#define __CACHE_WRITEBACK_GRANULE (1 << __CACHE_WRITEBACK_ORDER)
+
+/* Offsets for the bL_cluster_sync_struct members, for use in asm: */
+#define BL_SYNC_CLUSTER_CPUS 0
+#define BL_SYNC_CPU_SIZE __CACHE_WRITEBACK_GRANULE
+#define BL_SYNC_CLUSTER_CLUSTER \
+ (BL_SYNC_CLUSTER_CPUS + BL_SYNC_CPU_SIZE * BL_MAX_CPUS_PER_CLUSTER)
+#define BL_SYNC_CLUSTER_INBOUND \
+ (BL_SYNC_CLUSTER_CLUSTER + __CACHE_WRITEBACK_GRANULE)
+#define BL_SYNC_CLUSTER_SIZE \
+ (BL_SYNC_CLUSTER_INBOUND + __CACHE_WRITEBACK_GRANULE)
+
#ifndef __ASSEMBLY__
+#include <linux/types.h>
+
/*
* Platform specific code should use this symbol to set up secondary
* entry location for processors to use when released from reset.
@@ -123,5 +152,39 @@ struct bL_platform_power_ops {
*/
int __init bL_platform_power_register(const struct bL_platform_power_ops *ops);
+/* Synchronisation structures for coordinating safe cluster setup/teardown: */
+
+/*
+ * When modifying this structure, make sure you update the BL_SYNC_ defines
+ * to match.
+ */
+struct bL_cluster_sync_struct {
+ /* individual CPU states */
+ struct {
+ volatile s8 cpu __aligned(__CACHE_WRITEBACK_GRANULE);
+ } cpus[BL_MAX_CPUS_PER_CLUSTER];
+
+ /* cluster state */
+ volatile s8 cluster __aligned(__CACHE_WRITEBACK_GRANULE);
+
+ /* inbound-side state */
+ volatile s8 inbound __aligned(__CACHE_WRITEBACK_GRANULE);
+};
+
+struct bL_sync_struct {
+ struct bL_cluster_sync_struct clusters[BL_MAX_CLUSTERS];
+};
+
+extern unsigned long bL_sync_phys; /* physical address of *bL_sync */
+
+void __bL_cpu_going_down(unsigned int cpu, unsigned int cluster);
+void __bL_cpu_down(unsigned int cpu, unsigned int cluster);
+void __bL_outbound_leave_critical(unsigned int cluster, int state);
+bool __bL_outbound_enter_critical(unsigned int this_cpu, unsigned int cluster);
+int __bL_cluster_state(unsigned int cluster);
+
+int __init bL_cluster_sync_init(
+ void (*power_up_setup)(unsigned int affinity_level));
+
#endif /* ! __ASSEMBLY__ */
#endif
--
1.8.0
^ permalink raw reply related [flat|nested] 25+ messages in thread
* [PATCH v2 05/16] ARM: b.L: Add baremetal voting mutexes
2013-01-24 6:27 [PATCH v2 00/16] low-level CPU and cluster power management Nicolas Pitre
` (3 preceding siblings ...)
2013-01-24 6:27 ` [PATCH v2 04/16] ARM: b.L: introduce helpers for platform coherency exit/setup Nicolas Pitre
@ 2013-01-24 6:27 ` Nicolas Pitre
2013-01-24 6:27 ` [PATCH v2 06/16] ARM: bL_head.S: vlock-based first man election Nicolas Pitre
` (10 subsequent siblings)
15 siblings, 0 replies; 25+ messages in thread
From: Nicolas Pitre @ 2013-01-24 6:27 UTC (permalink / raw)
To: linux-arm-kernel
From: Dave Martin <dave.martin@linaro.org>
This patch adds a simple low-level voting mutex implementation
to be used to arbitrate during first man selection when no load/store
exclusive instructions are usable.
For want of a better name, these are called "vlocks". (I was
tempted to call them ballot locks, but "block" is way too confusing
an abbreviation...)
There is no function to wait for the lock to be released, and no
vlock_lock() function since we don't need these at the moment.
These could straightforwardly be added if vlocks get used for other
purposes.
For architectural correctness even Strongly-Ordered memory accesses
require barriers in order to guarantee that multiple CPUs have a
coherent view of the ordering of memory accesses. Whether or not
this matters depends on hardware implementation details of the
memory system. Since the purpose of this code is to provide a clean,
generic locking mechanism with no platform-specific dependencies the
barriers should be present to avoid unpleasant surprises on future
platforms.
Note:
* When taking the lock, we don't care about implicit background
memory operations and other signalling which may be pending,
because those are not part of the critical section anyway.
A DMB is sufficient to ensure correctly observed ordering of
the explicit memory accesses in vlock_trylock (see the annotated
sketch after this note).
* No barrier is required after checking the election result,
because the result is determined by the store to
VLOCK_OWNER_OFFSET and is already globally observed due to the
barriers in voting_end. This means that global agreement on
the winner is guaranteed, even before the winner is known
locally.
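To make the intended barrier placement concrete, here is a rough,
illustrative C-style sketch of vlock_trylock, using the same pseudo-code
conventions as Documentation/arm/big.LITTLE/vlocks.txt below.  The
DMB/DSB/SEV/WFE annotations mirror the voting_begin/voting_end macros in
the assembly; this is only a sketch, not a drop-in implementation:

    bool vlock_trylock(int this_cpu)
    {
        currently_voting[this_cpu] = 1;
        /* DMB: make our intent visible before looking at the owner */

        if (last_vote != -1) {          /* election already decided/started */
                /* DMB */
                currently_voting[this_cpu] = 0;
                /* DSB + SEV: wake anyone polling our flag */
                return false;
        }

        /* The control dependency on the load above ensures the store
           below cannot be observed before that load. */
        last_vote = this_cpu;

        /* DMB */
        currently_voting[this_cpu] = 0;
        /* DSB + SEV */

        /* wait for the current round of voting to finish */
        for_each_cpu(i) {
                while (currently_voting[i] != 0)
                        /* WFE */;
        }

        /* DMB before reading the result */
        return last_vote == this_cpu;
    }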
Signed-off-by: Dave Martin <dave.martin@linaro.org>
Signed-off-by: Nicolas Pitre <nicolas.pitre@linaro.org>
---
Documentation/arm/big.LITTLE/vlocks.txt | 211 ++++++++++++++++++++++++++++++++
arch/arm/common/vlock.S | 108 ++++++++++++++++
arch/arm/common/vlock.h | 28 +++++
3 files changed, 347 insertions(+)
create mode 100644 Documentation/arm/big.LITTLE/vlocks.txt
create mode 100644 arch/arm/common/vlock.S
create mode 100644 arch/arm/common/vlock.h
diff --git a/Documentation/arm/big.LITTLE/vlocks.txt b/Documentation/arm/big.LITTLE/vlocks.txt
new file mode 100644
index 0000000000..415960a9ba
--- /dev/null
+++ b/Documentation/arm/big.LITTLE/vlocks.txt
@@ -0,0 +1,211 @@
+vlocks for Bare-Metal Mutual Exclusion
+======================================
+
+Voting Locks, or "vlocks" provide a simple low-level mutual exclusion
+mechanism, with reasonable but minimal requirements on the memory
+system.
+
+These are intended to be used to coordinate critical activity among CPUs
+which are otherwise non-coherent, in situations where the hardware
+provides no other mechanism to support this and ordinary spinlocks
+cannot be used.
+
+
+vlocks make use of the atomicity provided by the memory system for
+writes to a single memory location. To arbitrate, every CPU "votes for
+itself", by storing a unique number to a common memory location. The
+final value seen in that memory location when all the votes have been
+cast identifies the winner.
+
+In order to make sure that the election produces an unambiguous result
+in finite time, a CPU will only enter the election in the first place if
+no winner has been chosen and the election does not appear to have
+started yet.
+
+
+Algorithm
+---------
+
+The easiest way to explain the vlocks algorithm is with some pseudo-code:
+
+
+ int currently_voting[NR_CPUS] = { 0, };
+ int last_vote = -1; /* no votes yet */
+
+ bool vlock_trylock(int this_cpu)
+ {
+ /* signal our desire to vote */
+ currently_voting[this_cpu] = 1;
+ if (last_vote != -1) {
+ /* someone already volunteered himself */
+ currently_voting[this_cpu] = 0;
+ return false; /* not ourself */
+ }
+
+ /* let's suggest ourself */
+ last_vote = this_cpu;
+ currently_voting[this_cpu] = 0;
+
+ /* then wait until everyone else is done voting */
+ for_each_cpu(i) {
+ while (currently_voting[i] != 0)
+ /* wait */;
+ }
+
+ /* result */
+ if (last_vote == this_cpu)
+ return true; /* we won */
+ return false;
+ }
+
+ void vlock_unlock(void)
+ {
+ last_vote = -1;
+ }
+
+
+The currently_voting[] array provides a way for the CPUs to determine
+whether an election is in progress, and plays a role analogous to the
+"entering" array in Lamport's bakery algorithm [1].
+
+However, once the election has started, the underlying memory system
+atomicity is used to pick the winner. This avoids the need for a static
+priority rule to act as a tie-breaker, or any counters which could
+overflow.
+
+As long as the last_vote variable is globally visible to all CPUs, it
+will contain only one value that won't change once every CPU has cleared
+its currently_voting flag.
+
+
+Features and limitations
+------------------------
+
+ * vlocks are not intended to be fair. In the contended case, it is the
+ _last_ CPU to attempt to get the lock that is most likely
+ to win.
+
+ vlocks are therefore best suited to situations where it is necessary
+ to pick a unique winner, but it does not matter which CPU actually
+ wins.
+
+ * Like other similar mechanisms, vlocks will not scale well to a large
+ number of CPUs.
+
+ vlocks can be cascaded in a voting hierarchy to permit better scaling
+ if necessary, as in the following hypothetical example for 4096 CPUs:
+
+ /* first level: local election */
+ my_town = towns[(this_cpu >> 4) & 0xf];
+ I_won = vlock_trylock(my_town, this_cpu & 0xf);
+ if (I_won) {
+ /* we won the town election, let's go for the state */
+ my_state = states[(this_cpu >> 8) & 0xf];
+ I_won = vlock_lock(my_state, this_cpu & 0xf);
+ if (I_won) {
+ /* and so on */
+ I_won = vlock_lock(the_whole_country, this_cpu & 0xf);
+ if (I_won) {
+ /* ... */
+ }
+ vlock_unlock(the_whole_country);
+ }
+ vlock_unlock(my_state);
+ }
+ vlock_unlock(my_town);
+
+
+ARM implementation
+------------------
+
+The current ARM implementation [2] contains some optimisations beyond
+the basic algorithm:
+
+ * By packing the members of the currently_voting array close together,
+ we can read the whole array in one transaction (providing the number
+ of CPUs potentially contending the lock is small enough). This
+ reduces the number of round-trips required to external memory.
+
+ In the ARM implementation, this means that we can use a single load
+ and comparison:
+
+ LDR Rt, [Rn]
+ CMP Rt, #0
+
+ ...in place of code equivalent to:
+
+ LDRB Rt, [Rn]
+ CMP Rt, #0
+ LDRBEQ Rt, [Rn, #1]
+ CMPEQ Rt, #0
+ LDRBEQ Rt, [Rn, #2]
+ CMPEQ Rt, #0
+ LDRBEQ Rt, [Rn, #3]
+ CMPEQ Rt, #0
+
+ This cuts down on the fast-path latency, as well as potentially
+ reducing bus contention in contended cases.
+
+ The optimisation relies on the fact that the ARM memory system
+ guarantees coherency between overlapping memory accesses of
+ different sizes, similarly to many other architectures. Note that
+ we do not care which element of currently_voting appears in which
+ bits of Rt, so there is no need to worry about endianness in this
+ optimisation.
+
+ If there are too many CPUs to read the currently_voting array in
+ one transaction, then multiple transactions are still required. The
+ implementation uses a simple loop of word-sized loads for this
+ case. The number of transactions is still fewer than would be
+ required if bytes were loaded individually.
+
+
+ In principle, we could aggregate further by using LDRD or LDM, but
+ to keep the code simple this was not attempted in the initial
+ implementation.
+
+
+ * vlocks are currently only used to coordinate between CPUs which are
+ unable to enable their caches yet. This means that the
+ implementation removes many of the barriers which would be required
+ when executing the algorithm in cached memory.
+
+ Packing of the currently_voting array does not work with cached
+ memory unless all CPUs contending the lock are cache-coherent, due
+ to cache writebacks from one CPU clobbering values written by other
+ CPUs. (Though if all the CPUs are cache-coherent, you should
+ probably be using proper spinlocks instead anyway).
+
+
+ * The "no votes yet" value used for the last_vote variable is 0 (not
+ -1 as in the pseudocode). This allows statically-allocated vlocks
+ to be implicitly initialised to an unlocked state simply by putting
+ them in .bss.
+
+ An offset is added to each CPU's ID for the purpose of setting this
+ variable, so that no CPU uses the value 0 for its ID.
+
+
+Colophon
+--------
+
+Originally created and documented by Dave Martin for Linaro Limited, for
+use in ARM-based big.LITTLE platforms, with review and input gratefully
+received from Nicolas Pitre and Achin Gupta. Thanks to Nicolas for
+grabbing most of this text out of the relevant mail thread and writing
+up the pseudocode.
+
+Copyright (C) 2012-2013 Linaro Limited
+Distributed under the terms of Version 2 of the GNU General Public
+License, as defined in linux/COPYING.
+
+
+References
+----------
+
+[1] Lamport, L. "A New Solution of Dijkstra's Concurrent Programming
+ Problem", Communications of the ACM 17, 8 (August 1974), 453-455.
+
+ http://en.wikipedia.org/wiki/Lamport%27s_bakery_algorithm
+
+[2] linux/arch/arm/common/vlock.S, www.kernel.org.
diff --git a/arch/arm/common/vlock.S b/arch/arm/common/vlock.S
new file mode 100644
index 0000000000..9109825606
--- /dev/null
+++ b/arch/arm/common/vlock.S
@@ -0,0 +1,108 @@
+/*
+ * vlock.S - simple voting lock implementation for ARM
+ *
+ * Created by: Dave Martin, 2012-08-16
+ * Copyright: (C) 2012-2013 Linaro Limited
+ *
+ * This program is free software; you can redistribute it and/or modify
+ * it under the terms of the GNU General Public License version 2 as
+ * published by the Free Software Foundation.
+ *
+ * This program is distributed in the hope that it will be useful,
+ * but WITHOUT ANY WARRANTY; without even the implied warranty of
+ * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the
+ * GNU General Public License for more details.
+ *
+ *
+ * This algorithm is described in more detail in
+ * Documentation/arm/big.LITTLE/vlocks.txt.
+ */
+
+#include <linux/linkage.h>
+#include "vlock.h"
+
+/* Select different code if voting flags can fit in a single word. */
+#if VLOCK_VOTING_SIZE > 4
+#define FEW(x...)
+#define MANY(x...) x
+#else
+#define FEW(x...) x
+#define MANY(x...)
+#endif
+
+@ voting lock for first-man coordination
+
+.macro voting_begin rbase:req, rcpu:req, rscratch:req
+ mov \rscratch, #1
+ strb \rscratch, [\rbase, \rcpu]
+ dmb
+.endm
+
+.macro voting_end rbase:req, rcpu:req, rscratch:req
+ dmb
+ mov \rscratch, #0
+ strb \rscratch, [\rbase, \rcpu]
+ dsb
+ sev
+.endm
+
+/*
+ * The vlock structure must reside in Strongly-Ordered or Device memory.
+ * This implementation deliberately eliminates most of the barriers which
+ * would be required for other memory types, and assumes that independent
+ * writes to neighbouring locations within a cacheline do not interfere
+ * with one another.
+ */
+
+@ r0: lock structure base
+@ r1: CPU ID (0-based index within cluster)
+ENTRY(vlock_trylock)
+ add r1, r1, #VLOCK_VOTING_OFFSET
+
+ voting_begin r0, r1, r2
+
+ ldrb r2, [r0, #VLOCK_OWNER_OFFSET] @ check whether lock is held
+ cmp r2, #VLOCK_OWNER_NONE
+ bne trylock_fail @ fail if so
+
+ @ Control dependency implies strb not observable before previous ldrb.
+
+ strb r1, [r0, #VLOCK_OWNER_OFFSET] @ submit my vote
+
+ voting_end r0, r1, r2 @ implies DMB
+
+ @ Wait for the current round of voting to finish:
+
+ MANY( mov r3, #VLOCK_VOTING_OFFSET )
+0:
+ MANY( ldr r2, [r0, r3] )
+ FEW( ldr r2, [r0, #VLOCK_VOTING_OFFSET] )
+ cmp r2, #0
+ wfene
+ bne 0b
+ MANY( add r3, r3, #4 )
+ MANY( cmp r3, #VLOCK_VOTING_OFFSET + VLOCK_VOTING_SIZE )
+ MANY( bne 0b )
+
+ @ Check who won:
+
+ dmb
+ ldrb r2, [r0, #VLOCK_OWNER_OFFSET]
+ eor r0, r1, r2 @ zero if I won, else nonzero
+ bx lr
+
+trylock_fail:
+ voting_end r0, r1, r2
+ mov r0, #1 @ nonzero indicates that I lost
+ bx lr
+ENDPROC(vlock_trylock)
+
+@ r0: lock structure base
+ENTRY(vlock_unlock)
+ dmb
+ mov r1, #VLOCK_OWNER_NONE
+ strb r1, [r0, #VLOCK_OWNER_OFFSET]
+ dsb
+ sev
+ bx lr
+ENDPROC(vlock_unlock)
diff --git a/arch/arm/common/vlock.h b/arch/arm/common/vlock.h
new file mode 100644
index 0000000000..bd4d649af2
--- /dev/null
+++ b/arch/arm/common/vlock.h
@@ -0,0 +1,28 @@
+/*
+ * vlock.h - simple voting lock implementation
+ *
+ * Created by: Dave Martin, 2012-08-16
+ * Copyright: (C) 2012-2013 Linaro Limited
+ *
+ * This program is free software; you can redistribute it and/or modify
+ * it under the terms of the GNU General Public License version 2 as
+ * published by the Free Software Foundation.
+ *
+ * This program is distributed in the hope that it will be useful,
+ * but WITHOUT ANY WARRANTY; without even the implied warranty of
+ * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the
+ * GNU General Public License for more details.
+ */
+
+#ifndef __VLOCK_H
+#define __VLOCK_H
+
+#include <asm/bL_entry.h>
+
+#define VLOCK_OWNER_OFFSET 0
+#define VLOCK_VOTING_OFFSET 4
+#define VLOCK_VOTING_SIZE ((BL_MAX_CPUS_PER_CLUSTER + 3) / 4 * 4)
+#define VLOCK_SIZE (VLOCK_VOTING_OFFSET + VLOCK_VOTING_SIZE)
+#define VLOCK_OWNER_NONE 0
+
+#endif /* ! __VLOCK_H */
--
1.8.0
^ permalink raw reply related [flat|nested] 25+ messages in thread
* [PATCH v2 06/16] ARM: bL_head.S: vlock-based first man election
2013-01-24 6:27 [PATCH v2 00/16] low-level CPU and cluster power management Nicolas Pitre
` (4 preceding siblings ...)
2013-01-24 6:27 ` [PATCH v2 05/16] ARM: b.L: Add baremetal voting mutexes Nicolas Pitre
@ 2013-01-24 6:27 ` Nicolas Pitre
2013-01-28 17:18 ` Will Deacon
2013-01-24 6:27 ` [PATCH v2 07/16] ARM: b.L: generic SMP secondary bringup and hotplug support Nicolas Pitre
` (9 subsequent siblings)
15 siblings, 1 reply; 25+ messages in thread
From: Nicolas Pitre @ 2013-01-24 6:27 UTC (permalink / raw)
To: linux-arm-kernel
From: Dave Martin <dave.martin@linaro.org>
Instead of requiring the first man to be elected in advance (which
can be suboptimal in some situations), this patch uses a per-
cluster mutex to co-ordinate selection of the first man.
This should also make it more feasible to reuse this code path for
asynchronous cluster resume (as in CPUidle scenarios).
We must ensure that the vlock data doesn't share a cacheline with
anything else, or dirty cache eviction could corrupt it.
Signed-off-by: Dave Martin <dave.martin@linaro.org>
Signed-off-by: Nicolas Pitre <nicolas.pitre@linaro.org>
---
arch/arm/common/Makefile | 2 +-
arch/arm/common/bL_head.S | 41 ++++++++++++++++++++++++++++++++++++-----
2 files changed, 37 insertions(+), 6 deletions(-)
diff --git a/arch/arm/common/Makefile b/arch/arm/common/Makefile
index 8025899a20..aa797237a7 100644
--- a/arch/arm/common/Makefile
+++ b/arch/arm/common/Makefile
@@ -13,4 +13,4 @@ obj-$(CONFIG_SHARP_PARAM) += sharpsl_param.o
obj-$(CONFIG_SHARP_SCOOP) += scoop.o
obj-$(CONFIG_PCI_HOST_ITE8152) += it8152.o
obj-$(CONFIG_ARM_TIMER_SP804) += timer-sp.o
-obj-$(CONFIG_BIG_LITTLE) += bL_head.o bL_entry.o
+obj-$(CONFIG_BIG_LITTLE) += bL_head.o bL_entry.o vlock.o
diff --git a/arch/arm/common/bL_head.S b/arch/arm/common/bL_head.S
index a226cdf4ce..86bcc8a003 100644
--- a/arch/arm/common/bL_head.S
+++ b/arch/arm/common/bL_head.S
@@ -16,6 +16,8 @@
#include <linux/linkage.h>
#include <asm/bL_entry.h>
+#include "vlock.h"
+
.if BL_SYNC_CLUSTER_CPUS
.error "cpus must be the first member of struct bL_cluster_sync_struct"
.endif
@@ -64,10 +66,11 @@ ENTRY(bL_entry_point)
* position independent way.
*/
adr r5, 3f
- ldmia r5, {r6, r7, r8}
+ ldmia r5, {r6, r7, r8, r11}
add r6, r5, r6 @ r6 = bL_entry_vectors
ldr r7, [r5, r7] @ r7 = bL_power_up_setup_phys
add r8, r5, r8 @ r8 = bL_sync
+ add r11, r5, r11 @ r11 = first_man_locks
mov r0, #BL_SYNC_CLUSTER_SIZE
mla r8, r0, r10, r8 @ r8 = bL_sync cluster base
@@ -81,13 +84,22 @@ ENTRY(bL_entry_point)
@ At this point, the cluster cannot unexpectedly enter the GOING_DOWN
@ state, because there is at least one active CPU (this CPU).
- @ Note: the following is racy as another CPU might be testing
- @ the same flag at the same moment. That'll be fixed later.
+ mov r0, #VLOCK_SIZE
+ mla r11, r0, r10, r11 @ r11 = cluster first man lock
+ mov r0, r11
+ mov r1, r9 @ cpu
+ bl vlock_trylock @ implies DMB
+
+ cmp r0, #0 @ failed to get the lock?
+ bne cluster_setup_wait @ wait for cluster setup if so
+
ldrb r0, [r8, #BL_SYNC_CLUSTER_CLUSTER]
cmp r0, #CLUSTER_UP @ cluster already up?
bne cluster_setup @ if not, set up the cluster
- @ Otherwise, skip setup:
+ @ Otherwise, release the first man lock and skip setup:
+ mov r0, r11
+ bl vlock_unlock
b cluster_setup_complete
cluster_setup:
@@ -137,6 +149,19 @@ cluster_setup_leave:
dsb
sev
+ mov r0, r11
+ bl vlock_unlock @ implies DMB
+ b cluster_setup_complete
+
+ @ In the contended case, non-first men wait here for cluster setup
+ @ to complete:
+cluster_setup_wait:
+ ldrb r0, [r8, #BL_SYNC_CLUSTER_CLUSTER]
+ cmp r0, #CLUSTER_UP
+ wfene
+ bne cluster_setup_wait
+ dmb
+
cluster_setup_complete:
@ If a platform-specific CPU setup hook is needed, it is
@ called from here.
@@ -168,11 +193,17 @@ bL_entry_gated:
3: .word bL_entry_vectors - .
.word bL_power_up_setup_phys - 3b
.word bL_sync - 3b
+ .word first_man_locks - 3b
ENDPROC(bL_entry_point)
.bss
- .align 5
+
+ .align __CACHE_WRITEBACK_ORDER
+ .type first_man_locks, #object
+first_man_locks:
+ .space VLOCK_SIZE * BL_MAX_CLUSTERS
+ .align __CACHE_WRITEBACK_ORDER
.type bL_entry_vectors, #object
ENTRY(bL_entry_vectors)
--
1.8.0
^ permalink raw reply related [flat|nested] 25+ messages in thread
* [PATCH v2 06/16] ARM: bL_head.S: vlock-based first man election
2013-01-24 6:27 ` [PATCH v2 06/16] ARM: bL_head.S: vlock-based first man election Nicolas Pitre
@ 2013-01-28 17:18 ` Will Deacon
2013-01-28 17:58 ` Nicolas Pitre
0 siblings, 1 reply; 25+ messages in thread
From: Will Deacon @ 2013-01-28 17:18 UTC (permalink / raw)
To: linux-arm-kernel
On Thu, Jan 24, 2013 at 06:27:49AM +0000, Nicolas Pitre wrote:
> From: Dave Martin <dave.martin@linaro.org>
>
> Instead of requiring the first man to be elected in advance (which
> can be suboptimal in some situations), this patch uses a per-
> cluster mutex to co-ordinate selection of the first man.
>
> This should also make it more feasible to reuse this code path for
> asynchronous cluster resume (as in CPUidle scenarios).
>
> We must ensure that the vlock data doesn't share a cacheline with
> anything else, or dirty cache eviction could corrupt it.
>
> Signed-off-by: Dave Martin <dave.martin@linaro.org>
> Signed-off-by: Nicolas Pitre <nicolas.pitre@linaro.org>
[...]
> +
> + .align __CACHE_WRITEBACK_ORDER
> + .type first_man_locks, #object
> +first_man_locks:
> + .space VLOCK_SIZE * BL_MAX_CLUSTERS
> + .align __CACHE_WRITEBACK_ORDER
>
> .type bL_entry_vectors, #object
> ENTRY(bL_entry_vectors)
I've just been chatting to Dave about this and __CACHE_WRITEBACK_ORDER
isn't really the correct solution here.
To summarise the problem: although vlocks are only accessed by CPUs with
their caches disabled, the lock structures could reside in the same
cacheline (at some level of cache) as cacheable data being written by
another CPU. This comes about because the vlock code has a cacheable alias
via the kernel linear mapping and means that when the cacheable data is
evicted, it clobbers the vlocks with stale values which are part of the
dirty cacheline.
Now, we also have this problem for DMA mappings, as mentioned here:
http://lists.infradead.org/pipermail/linux-arm-kernel/2012-October/124276.html
It seems to me that we actually want a mechanism for allocating/managing
physically contiguous blocks of memory such that the cacheable alias is
removed from the linear mapping (perhaps we could use PAGE_NONE to avoid
confusing the mm code?).
Will
^ permalink raw reply [flat|nested] 25+ messages in thread
* [PATCH v2 06/16] ARM: bL_head.S: vlock-based first man election
2013-01-28 17:18 ` Will Deacon
@ 2013-01-28 17:58 ` Nicolas Pitre
0 siblings, 0 replies; 25+ messages in thread
From: Nicolas Pitre @ 2013-01-28 17:58 UTC (permalink / raw)
To: linux-arm-kernel
On Mon, 28 Jan 2013, Will Deacon wrote:
> On Thu, Jan 24, 2013 at 06:27:49AM +0000, Nicolas Pitre wrote:
> > From: Dave Martin <dave.martin@linaro.org>
> >
> > Instead of requiring the first man to be elected in advance (which
> > can be suboptimal in some situations), this patch uses a per-
> > cluster mutex to co-ordinate selection of the first man.
> >
> > This should also make it more feasible to reuse this code path for
> > asynchronous cluster resume (as in CPUidle scenarios).
> >
> > We must ensure that the vlock data doesn't share a cacheline with
> > anything else, or dirty cache eviction could corrupt it.
> >
> > Signed-off-by: Dave Martin <dave.martin@linaro.org>
> > Signed-off-by: Nicolas Pitre <nicolas.pitre@linaro.org>
>
> [...]
>
> > +
> > + .align __CACHE_WRITEBACK_ORDER
> > + .type first_man_locks, #object
> > +first_man_locks:
> > + .space VLOCK_SIZE * BL_MAX_CLUSTERS
> > + .align __CACHE_WRITEBACK_ORDER
> >
> > .type bL_entry_vectors, #object
> > ENTRY(bL_entry_vectors)
>
> I've just been chatting to Dave about this and __CACHE_WRITEBACK_ORDER
> isn't really the correct solution here.
>
> To summarise the problem: although vlocks are only accessed by CPUs with
> their caches disabled, the lock structures could reside in the same
> cacheline (at some level of cache) as cacheable data being written by
> another CPU. This comes about because the vlock code has a cacheable alias
> via the kernel linear mapping and means that when the cacheable data is
> evicted, it clobbers the vlocks with stale values which are part of the
> dirty cacheline.
>
> Now, we also have this problem for DMA mappings, as mentioned here:
>
> http://lists.infradead.org/pipermail/linux-arm-kernel/2012-October/124276.html
>
> It seems to me that we actually want a mechanism for allocating/managing
> physically contiguous blocks of memory such that the cacheable alias is
> removed from the linear mapping (perhaps we could use PAGE_NONE to avoid
> confusing the mm code?).
Well, I partly disagree.
I don't dispute the need for a mechanism to allocate physically
contiguous blocks of memory in the DMA case or other similar users of
largish dynamic allocations.
But that's not the case here. In the vlock case, what we actually need
in practice is equivalent to a _single_ cache line of cache free memory.
Requiring a dynamic allocation infrastructure tailored for this specific
case is going to waste far more CPU cycles and memory in the end than
what this static allocation is currently doing, even if it were
overallocating.
Your suggestion would be needed when we get to the point where dynamic
sizing of the number of clusters is required. But, as I said in response
to your previous comment, let's not fall into the trap of
overengineering this solution for the time being. Better approach this
incrementally if actual usage does indicate that some more
sophistication is needed. The whole stack is already complex enough as
it is and I'd prefer if people could get familiar with the simpler
version initially.
Nicolas
^ permalink raw reply [flat|nested] 25+ messages in thread
* [PATCH v2 07/16] ARM: b.L: generic SMP secondary bringup and hotplug support
2013-01-24 6:27 [PATCH v2 00/16] low-level CPU and cluster power management Nicolas Pitre
` (5 preceding siblings ...)
2013-01-24 6:27 ` [PATCH v2 06/16] ARM: bL_head.S: vlock-based first man election Nicolas Pitre
@ 2013-01-24 6:27 ` Nicolas Pitre
2013-01-24 6:27 ` [PATCH v2 08/16] ARM: bL_platsmp.c: close the kernel entry gate before hot-unplugging a CPU Nicolas Pitre
` (8 subsequent siblings)
15 siblings, 0 replies; 25+ messages in thread
From: Nicolas Pitre @ 2013-01-24 6:27 UTC (permalink / raw)
To: linux-arm-kernel
Now that the b.L power API is in place, we can use it for SMP secondary
bringup and CPU hotplug in a generic fashion.
Signed-off-by: Nicolas Pitre <nico@linaro.org>
---
arch/arm/common/Makefile | 2 +-
arch/arm/common/bL_platsmp.c | 79 ++++++++++++++++++++++++++++++++++++++++++++
2 files changed, 80 insertions(+), 1 deletion(-)
create mode 100644 arch/arm/common/bL_platsmp.c
diff --git a/arch/arm/common/Makefile b/arch/arm/common/Makefile
index aa797237a7..d1eaae8eb7 100644
--- a/arch/arm/common/Makefile
+++ b/arch/arm/common/Makefile
@@ -13,4 +13,4 @@ obj-$(CONFIG_SHARP_PARAM) += sharpsl_param.o
obj-$(CONFIG_SHARP_SCOOP) += scoop.o
obj-$(CONFIG_PCI_HOST_ITE8152) += it8152.o
obj-$(CONFIG_ARM_TIMER_SP804) += timer-sp.o
-obj-$(CONFIG_BIG_LITTLE) += bL_head.o bL_entry.o vlock.o
+obj-$(CONFIG_BIG_LITTLE) += bL_head.o bL_entry.o bL_platsmp.o vlock.o
diff --git a/arch/arm/common/bL_platsmp.c b/arch/arm/common/bL_platsmp.c
new file mode 100644
index 0000000000..e28dd50640
--- /dev/null
+++ b/arch/arm/common/bL_platsmp.c
@@ -0,0 +1,79 @@
+/*
+ * linux/arch/arm/common/bL_platsmp.c
+ *
+ * Created by: Nicolas Pitre, November 2012
+ * Copyright: (C) 2012-2013 Linaro Limited
+ *
+ * This program is free software; you can redistribute it and/or modify
+ * it under the terms of the GNU General Public License version 2 as
+ * published by the Free Software Foundation.
+ *
+ * Code to handle secondary CPU bringup and hotplug for the bL power API.
+ */
+
+#include <linux/init.h>
+#include <linux/smp.h>
+
+#include <asm/bL_entry.h>
+#include <asm/smp_plat.h>
+#include <asm/hardware/gic.h>
+
+static void __init simple_smp_init_cpus(void)
+{
+ set_smp_cross_call(gic_raise_softirq);
+}
+
+static int __cpuinit bL_boot_secondary(unsigned int cpu, struct task_struct *idle)
+{
+ unsigned int pcpu, pcluster, ret;
+ extern void secondary_startup(void);
+
+ pcpu = cpu_logical_map(cpu) & 0xff;
+ pcluster = (cpu_logical_map(cpu) >> 8) & 0xff;
+ pr_debug("%s: logical CPU %d is physical CPU %d cluster %d\n",
+ __func__, cpu, pcpu, pcluster);
+
+ bL_set_entry_vector(pcpu, pcluster, NULL);
+ ret = bL_cpu_power_up(pcpu, pcluster);
+ if (ret)
+ return ret;
+ bL_set_entry_vector(pcpu, pcluster, secondary_startup);
+ gic_raise_softirq(cpumask_of(cpu), 0);
+ dsb_sev();
+ return 0;
+}
+
+static void __cpuinit bL_secondary_init(unsigned int cpu)
+{
+ bL_cpu_powered_up();
+ gic_secondary_init(0);
+}
+
+#ifdef CONFIG_HOTPLUG_CPU
+
+static int bL_cpu_disable(unsigned int cpu)
+{
+ /*
+ * We assume all CPUs may be shut down.
+ * This would be the hook to use for eventual Secure
+ * OS migration requests.
+ */
+ return 0;
+}
+
+static void __ref bL_cpu_die(unsigned int cpu)
+{
+ bL_cpu_power_down();
+}
+
+#endif
+
+struct smp_operations __initdata bL_smp_ops = {
+ .smp_init_cpus = simple_smp_init_cpus,
+ .smp_boot_secondary = bL_boot_secondary,
+ .smp_secondary_init = bL_secondary_init,
+#ifdef CONFIG_HOTPLUG_CPU
+ .cpu_disable = bL_cpu_disable,
+ .cpu_die = bL_cpu_die,
+#endif
+};
--
1.8.0
^ permalink raw reply related [flat|nested] 25+ messages in thread
* [PATCH v2 08/16] ARM: bL_platsmp.c: close the kernel entry gate before hot-unplugging a CPU
2013-01-24 6:27 [PATCH v2 00/16] low-level CPU and cluster power management Nicolas Pitre
` (6 preceding siblings ...)
2013-01-24 6:27 ` [PATCH v2 07/16] ARM: b.L: generic SMP secondary bringup and hotplug support Nicolas Pitre
@ 2013-01-24 6:27 ` Nicolas Pitre
2013-01-24 6:27 ` [PATCH v2 09/16] ARM: vexpress: Select the correct SMP operations at run-time Nicolas Pitre
` (7 subsequent siblings)
15 siblings, 0 replies; 25+ messages in thread
From: Nicolas Pitre @ 2013-01-24 6:27 UTC (permalink / raw)
To: linux-arm-kernel
If for whatever reason a CPU is unexpectedly awakened, it shouldn't
re-enter the kernel through whatever entry vector might have
been set by a previous operation.
Signed-off-by: Nicolas Pitre <nico@linaro.org>
---
arch/arm/common/bL_platsmp.c | 5 +++++
1 file changed, 5 insertions(+)
diff --git a/arch/arm/common/bL_platsmp.c b/arch/arm/common/bL_platsmp.c
index e28dd50640..c3835b0e29 100644
--- a/arch/arm/common/bL_platsmp.c
+++ b/arch/arm/common/bL_platsmp.c
@@ -63,6 +63,11 @@ static int bL_cpu_disable(unsigned int cpu)
static void __ref bL_cpu_die(unsigned int cpu)
{
+ unsigned int mpidr, pcpu, pcluster;
+ mpidr = read_cpuid_mpidr();
+ pcpu = MPIDR_AFFINITY_LEVEL(mpidr, 0);
+ pcluster = MPIDR_AFFINITY_LEVEL(mpidr, 1);
+ bL_set_entry_vector(pcpu, pcluster, NULL);
bL_cpu_power_down();
}
--
1.8.0
^ permalink raw reply related [flat|nested] 25+ messages in thread
* [PATCH v2 09/16] ARM: vexpress: Select the correct SMP operations at run-time
2013-01-24 6:27 [PATCH v2 00/16] low-level CPU and cluster power management Nicolas Pitre
` (7 preceding siblings ...)
2013-01-24 6:27 ` [PATCH v2 08/16] ARM: bL_platsmp.c: close the kernel entry gate before hot-unplugging a CPU Nicolas Pitre
@ 2013-01-24 6:27 ` Nicolas Pitre
2013-01-24 11:56 ` Jon Medhurst (Tixy)
2013-01-24 6:27 ` [PATCH v2 10/16] ARM: vexpress: introduce DCSCB support Nicolas Pitre
` (6 subsequent siblings)
15 siblings, 1 reply; 25+ messages in thread
From: Nicolas Pitre @ 2013-01-24 6:27 UTC (permalink / raw)
To: linux-arm-kernel
From: Jon Medhurst <tixy@linaro.org>
Signed-off-by: Jon Medhurst <tixy@linaro.org>
---
arch/arm/include/asm/mach/arch.h | 3 +++
arch/arm/kernel/setup.c | 5 ++++-
arch/arm/mach-vexpress/core.h | 2 ++
arch/arm/mach-vexpress/platsmp.c | 12 ++++++++++++
arch/arm/mach-vexpress/v2m.c | 2 +-
5 files changed, 22 insertions(+), 2 deletions(-)
diff --git a/arch/arm/include/asm/mach/arch.h b/arch/arm/include/asm/mach/arch.h
index 917d4fcfd9..3d01c6d6c3 100644
--- a/arch/arm/include/asm/mach/arch.h
+++ b/arch/arm/include/asm/mach/arch.h
@@ -17,8 +17,10 @@ struct pt_regs;
struct smp_operations;
#ifdef CONFIG_SMP
#define smp_ops(ops) (&(ops))
+#define smp_init_ops(ops) (&(ops))
#else
#define smp_ops(ops) (struct smp_operations *)NULL
+#define smp_init_ops(ops) (void (*)(void))NULL
#endif
struct machine_desc {
@@ -42,6 +44,7 @@ struct machine_desc {
unsigned char reserve_lp2 :1; /* never has lp2 */
char restart_mode; /* default restart mode */
struct smp_operations *smp; /* SMP operations */
+ void (*smp_init)(void);
void (*fixup)(struct tag *, char **,
struct meminfo *);
void (*reserve)(void);/* reserve mem blocks */
diff --git a/arch/arm/kernel/setup.c b/arch/arm/kernel/setup.c
index 3f6cbb2e3e..41edca8582 100644
--- a/arch/arm/kernel/setup.c
+++ b/arch/arm/kernel/setup.c
@@ -768,7 +768,10 @@ void __init setup_arch(char **cmdline_p)
arm_dt_init_cpu_maps();
#ifdef CONFIG_SMP
if (is_smp()) {
- smp_set_ops(mdesc->smp);
+ if(mdesc->smp_init)
+ (*mdesc->smp_init)();
+ else
+ smp_set_ops(mdesc->smp);
smp_init_cpus();
}
#endif
diff --git a/arch/arm/mach-vexpress/core.h b/arch/arm/mach-vexpress/core.h
index f134cd4a85..3a761fd76c 100644
--- a/arch/arm/mach-vexpress/core.h
+++ b/arch/arm/mach-vexpress/core.h
@@ -6,6 +6,8 @@
void vexpress_dt_smp_map_io(void);
+void vexpress_smp_init_ops(void);
+
extern struct smp_operations vexpress_smp_ops;
extern void vexpress_cpu_die(unsigned int cpu);
diff --git a/arch/arm/mach-vexpress/platsmp.c b/arch/arm/mach-vexpress/platsmp.c
index c5d70de9bb..e62a08b561 100644
--- a/arch/arm/mach-vexpress/platsmp.c
+++ b/arch/arm/mach-vexpress/platsmp.c
@@ -12,6 +12,7 @@
#include <linux/errno.h>
#include <linux/smp.h>
#include <linux/io.h>
+#include <linux/of.h>
#include <linux/of_fdt.h>
#include <linux/vexpress.h>
@@ -206,3 +207,14 @@ struct smp_operations __initdata vexpress_smp_ops = {
.cpu_die = vexpress_cpu_die,
#endif
};
+
+void __init vexpress_smp_init_ops(void)
+{
+ struct smp_operations *ops = &vexpress_smp_ops;
+#ifdef CONFIG_BIG_LITTLE
+ extern struct smp_operations bL_smp_ops;
+ if(of_find_compatible_node(NULL, NULL, "arm,cci"))
+ ops = &bL_smp_ops;
+#endif
+ smp_set_ops(ops);
+}
diff --git a/arch/arm/mach-vexpress/v2m.c b/arch/arm/mach-vexpress/v2m.c
index 011661a6c5..34172bd504 100644
--- a/arch/arm/mach-vexpress/v2m.c
+++ b/arch/arm/mach-vexpress/v2m.c
@@ -494,7 +494,7 @@ static const char * const v2m_dt_match[] __initconst = {
DT_MACHINE_START(VEXPRESS_DT, "ARM-Versatile Express")
.dt_compat = v2m_dt_match,
- .smp = smp_ops(vexpress_smp_ops),
+ .smp_init = smp_init_ops(vexpress_smp_init_ops),
.map_io = v2m_dt_map_io,
.init_early = v2m_dt_init_early,
.init_irq = v2m_dt_init_irq,
--
1.8.0
^ permalink raw reply related [flat|nested] 25+ messages in thread
* [PATCH v2 09/16] ARM: vexpress: Select the correct SMP operations at run-time
2013-01-24 6:27 ` [PATCH v2 09/16] ARM: vexpress: Select the correct SMP operations at run-time Nicolas Pitre
@ 2013-01-24 11:56 ` Jon Medhurst (Tixy)
0 siblings, 0 replies; 25+ messages in thread
From: Jon Medhurst (Tixy) @ 2013-01-24 11:56 UTC (permalink / raw)
To: linux-arm-kernel
On Thu, 2013-01-24 at 01:27 -0500, Nicolas Pitre wrote:
> From: Jon Medhurst <tixy@linaro.org>
>
> Signed-off-by: Jon Medhurst <tixy@linaro.org>
> ---
Should this patch be split into two: one to introduce the new smp_init
hook, and one to make vexpress use it? With descriptions like:
-----------------------------------------------------------------------
ARM: kernel: Enable selection of SMP operations at boot time
Add a new 'smp_init' hook to machine_desc so platforms can specify a
function to be used to setup smp ops instead of having a statically
defined value.
-----------------------------------------------------------------------
ARM: vexpress: Select multi-cluster SMP operation if required
-----------------------------------------------------------------------
--
Tixy
> arch/arm/include/asm/mach/arch.h | 3 +++
> arch/arm/kernel/setup.c | 5 ++++-
> arch/arm/mach-vexpress/core.h | 2 ++
> arch/arm/mach-vexpress/platsmp.c | 12 ++++++++++++
> arch/arm/mach-vexpress/v2m.c | 2 +-
> 5 files changed, 22 insertions(+), 2 deletions(-)
>
> diff --git a/arch/arm/include/asm/mach/arch.h b/arch/arm/include/asm/mach/arch.h
> index 917d4fcfd9..3d01c6d6c3 100644
> --- a/arch/arm/include/asm/mach/arch.h
> +++ b/arch/arm/include/asm/mach/arch.h
> @@ -17,8 +17,10 @@ struct pt_regs;
> struct smp_operations;
> #ifdef CONFIG_SMP
> #define smp_ops(ops) (&(ops))
> +#define smp_init_ops(ops) (&(ops))
> #else
> #define smp_ops(ops) (struct smp_operations *)NULL
> +#define smp_init_ops(ops) (void (*)(void))NULL
> #endif
>
> struct machine_desc {
> @@ -42,6 +44,7 @@ struct machine_desc {
> unsigned char reserve_lp2 :1; /* never has lp2 */
> char restart_mode; /* default restart mode */
> struct smp_operations *smp; /* SMP operations */
> + void (*smp_init)(void);
> void (*fixup)(struct tag *, char **,
> struct meminfo *);
> void (*reserve)(void);/* reserve mem blocks */
> diff --git a/arch/arm/kernel/setup.c b/arch/arm/kernel/setup.c
> index 3f6cbb2e3e..41edca8582 100644
> --- a/arch/arm/kernel/setup.c
> +++ b/arch/arm/kernel/setup.c
> @@ -768,7 +768,10 @@ void __init setup_arch(char **cmdline_p)
> arm_dt_init_cpu_maps();
> #ifdef CONFIG_SMP
> if (is_smp()) {
> - smp_set_ops(mdesc->smp);
> + if(mdesc->smp_init)
> + (*mdesc->smp_init)();
> + else
> + smp_set_ops(mdesc->smp);
> smp_init_cpus();
> }
> #endif
> diff --git a/arch/arm/mach-vexpress/core.h b/arch/arm/mach-vexpress/core.h
> index f134cd4a85..3a761fd76c 100644
> --- a/arch/arm/mach-vexpress/core.h
> +++ b/arch/arm/mach-vexpress/core.h
> @@ -6,6 +6,8 @@
>
> void vexpress_dt_smp_map_io(void);
>
> +void vexpress_smp_init_ops(void);
> +
> extern struct smp_operations vexpress_smp_ops;
>
> extern void vexpress_cpu_die(unsigned int cpu);
> diff --git a/arch/arm/mach-vexpress/platsmp.c b/arch/arm/mach-vexpress/platsmp.c
> index c5d70de9bb..e62a08b561 100644
> --- a/arch/arm/mach-vexpress/platsmp.c
> +++ b/arch/arm/mach-vexpress/platsmp.c
> @@ -12,6 +12,7 @@
> #include <linux/errno.h>
> #include <linux/smp.h>
> #include <linux/io.h>
> +#include <linux/of.h>
> #include <linux/of_fdt.h>
> #include <linux/vexpress.h>
>
> @@ -206,3 +207,14 @@ struct smp_operations __initdata vexpress_smp_ops = {
> .cpu_die = vexpress_cpu_die,
> #endif
> };
> +
> +void __init vexpress_smp_init_ops(void)
> +{
> + struct smp_operations *ops = &vexpress_smp_ops;
> +#ifdef CONFIG_BIG_LITTLE
> + extern struct smp_operations bL_smp_ops;
> + if(of_find_compatible_node(NULL, NULL, "arm,cci"))
> + ops = &bL_smp_ops;
> +#endif
> + smp_set_ops(ops);
> +}
> diff --git a/arch/arm/mach-vexpress/v2m.c b/arch/arm/mach-vexpress/v2m.c
> index 011661a6c5..34172bd504 100644
> --- a/arch/arm/mach-vexpress/v2m.c
> +++ b/arch/arm/mach-vexpress/v2m.c
> @@ -494,7 +494,7 @@ static const char * const v2m_dt_match[] __initconst = {
>
> DT_MACHINE_START(VEXPRESS_DT, "ARM-Versatile Express")
> .dt_compat = v2m_dt_match,
> - .smp = smp_ops(vexpress_smp_ops),
> + .smp_init = smp_init_ops(vexpress_smp_init_ops),
> .map_io = v2m_dt_map_io,
> .init_early = v2m_dt_init_early,
> .init_irq = v2m_dt_init_irq,
^ permalink raw reply [flat|nested] 25+ messages in thread
* [PATCH v2 10/16] ARM: vexpress: introduce DCSCB support
2013-01-24 6:27 [PATCH v2 00/16] low-level CPU and cluster power management Nicolas Pitre
` (8 preceding siblings ...)
2013-01-24 6:27 ` [PATCH v2 09/16] ARM: vexpress: Select the correct SMP operations at run-time Nicolas Pitre
@ 2013-01-24 6:27 ` Nicolas Pitre
2013-01-24 6:27 ` [PATCH v2 11/16] ARM: vexpress/dcscb: add CPU use counts to the power up/down API implementation Nicolas Pitre
` (5 subsequent siblings)
15 siblings, 0 replies; 25+ messages in thread
From: Nicolas Pitre @ 2013-01-24 6:27 UTC (permalink / raw)
To: linux-arm-kernel
This adds basic CPU and cluster reset controls on RTSM for the
A15x4-A7x4 model configuration using the Dual Cluster System
Configuration Block (DCSCB).
The cache coherency interconnect (CCI) is not handled yet.
Signed-off-by: Nicolas Pitre <nico@linaro.org>
---
arch/arm/mach-vexpress/Kconfig | 8 ++
arch/arm/mach-vexpress/Makefile | 1 +
arch/arm/mach-vexpress/dcscb.c | 159 ++++++++++++++++++++++++++++++++++++++++
3 files changed, 168 insertions(+)
create mode 100644 arch/arm/mach-vexpress/dcscb.c
diff --git a/arch/arm/mach-vexpress/Kconfig b/arch/arm/mach-vexpress/Kconfig
index 52d315b792..50d6587d0c 100644
--- a/arch/arm/mach-vexpress/Kconfig
+++ b/arch/arm/mach-vexpress/Kconfig
@@ -52,4 +52,12 @@ config ARCH_VEXPRESS_CORTEX_A5_A9_ERRATA
config ARCH_VEXPRESS_CA9X4
bool "Versatile Express Cortex-A9x4 tile"
+config ARCH_VEXPRESS_DCSCB
+ bool "Dual Cluster System Control Block (DCSCB) support"
+ depends on BIG_LITTLE
+ help
+ Support for the Dual Cluster System Configuration Block (DCSCB).
+ This is needed to provide CPU and cluster power management
+ on RTSM.
+
endmenu
diff --git a/arch/arm/mach-vexpress/Makefile b/arch/arm/mach-vexpress/Makefile
index 80b64971fb..2253644054 100644
--- a/arch/arm/mach-vexpress/Makefile
+++ b/arch/arm/mach-vexpress/Makefile
@@ -6,5 +6,6 @@ ccflags-$(CONFIG_ARCH_MULTIPLATFORM) := -I$(srctree)/$(src)/include \
obj-y := v2m.o reset.o
obj-$(CONFIG_ARCH_VEXPRESS_CA9X4) += ct-ca9x4.o
+obj-$(CONFIG_ARCH_VEXPRESS_DCSCB) += dcscb.o
obj-$(CONFIG_SMP) += platsmp.o
obj-$(CONFIG_HOTPLUG_CPU) += hotplug.o
diff --git a/arch/arm/mach-vexpress/dcscb.c b/arch/arm/mach-vexpress/dcscb.c
new file mode 100644
index 0000000000..220b990edd
--- /dev/null
+++ b/arch/arm/mach-vexpress/dcscb.c
@@ -0,0 +1,159 @@
+/*
+ * arch/arm/mach-vexpress/dcscb.c - Dual Cluster System Control Block
+ *
+ * Created by: Nicolas Pitre, May 2012
+ * Copyright: (C) 2012-2013 Linaro Limited
+ *
+ * This program is free software; you can redistribute it and/or modify
+ * it under the terms of the GNU General Public License version 2 as
+ * published by the Free Software Foundation.
+ */
+
+#include <linux/init.h>
+#include <linux/kernel.h>
+#include <linux/io.h>
+#include <linux/spinlock.h>
+#include <linux/errno.h>
+#include <linux/vexpress.h>
+
+#include <asm/bL_entry.h>
+#include <asm/proc-fns.h>
+#include <asm/cacheflush.h>
+#include <asm/cputype.h>
+#include <asm/cp15.h>
+
+
+#define DCSCB_PHYS_BASE 0x60000000
+
+#define RST_HOLD0 0x0
+#define RST_HOLD1 0x4
+#define SYS_SWRESET 0x8
+#define RST_STAT0 0xc
+#define RST_STAT1 0x10
+#define EAG_CFG_R 0x20
+#define EAG_CFG_W 0x24
+#define KFC_CFG_R 0x28
+#define KFC_CFG_W 0x2c
+#define DCS_CFG_R 0x30
+
+/*
+ * We can't use regular spinlocks. In the switcher case, it is possible
+ * for an outbound CPU to call power_down() after its inbound counterpart
+ * is already live using the same logical CPU number which trips lockdep
+ * debugging.
+ */
+static arch_spinlock_t dcscb_lock = __ARCH_SPIN_LOCK_UNLOCKED;
+
+static void __iomem *dcscb_base;
+
+static int dcscb_power_up(unsigned int cpu, unsigned int cluster)
+{
+ unsigned int rst_hold, cpumask = (1 << cpu);
+
+ pr_debug("%s: cpu %u cluster %u\n", __func__, cpu, cluster);
+ if (cpu >= 4 || cluster >= 2)
+ return -EINVAL;
+
+ /*
+ * Since this is called with IRQs enabled, and no arch_spin_lock_irq
+ * variant exists, we need to disable IRQs manually here.
+ */
+ local_irq_disable();
+ arch_spin_lock(&dcscb_lock);
+
+ rst_hold = readl_relaxed(dcscb_base + RST_HOLD0 + cluster * 4);
+ if (rst_hold & (1 << 8)) {
+ /* remove cluster reset and add individual CPU's reset */
+ rst_hold &= ~(1 << 8);
+ rst_hold |= 0xf;
+ }
+ rst_hold &= ~(cpumask | (cpumask << 4));
+ writel(rst_hold, dcscb_base + RST_HOLD0 + cluster * 4);
+
+ arch_spin_unlock(&dcscb_lock);
+ local_irq_enable();
+
+ return 0;
+}
+
+static void dcscb_power_down(void)
+{
+ unsigned int mpidr, cpu, cluster, rst_hold, cpumask, last_man;
+
+ mpidr = read_cpuid_mpidr();
+ cpu = MPIDR_AFFINITY_LEVEL(mpidr, 0);
+ cluster = MPIDR_AFFINITY_LEVEL(mpidr, 1);
+ cpumask = (1 << cpu);
+
+ pr_debug("%s: cpu %u cluster %u\n", __func__, cpu, cluster);
+ BUG_ON(cpu >= 4 || cluster >= 2);
+
+ arch_spin_lock(&dcscb_lock);
+ rst_hold = readl_relaxed(dcscb_base + RST_HOLD0 + cluster * 4);
+ rst_hold |= cpumask;
+ if (((rst_hold | (rst_hold >> 4)) & 0xf) == 0xf)
+ rst_hold |= (1 << 8);
+ writel(rst_hold, dcscb_base + RST_HOLD0 + cluster * 4);
+ arch_spin_unlock(&dcscb_lock);
+ last_man = (rst_hold & (1 << 8));
+
+ /*
+ * Now let's clean our L1 cache and shut ourself down.
+ * If we're the last CPU in this cluster then clean L2 too.
+ */
+
+ /*
+ * A15/A7 can hit in the cache with SCTLR.C=0, so we don't need
+ * a preliminary flush here for those CPUs. At least, that's
+ * the theory -- without the extra flush, Linux explodes on
+ * RTSM (maybe not needed anymore, to be investigated).
+ */
+ flush_cache_louis();
+ cpu_proc_fin();
+
+ if (!last_man) {
+ flush_cache_louis();
+ } else {
+ flush_cache_all();
+ outer_flush_all();
+ }
+
+ /* Disable local coherency by clearing the ACTLR "SMP" bit: */
+ set_auxcr(get_auxcr() & ~(1 << 6));
+
+ /* Now we are prepared for power-down, do it: */
+ dsb();
+ wfi();
+
+ /* Not dead at this point? Let our caller cope. */
+}
+
+static const struct bL_platform_power_ops dcscb_power_ops = {
+ .power_up = dcscb_power_up,
+ .power_down = dcscb_power_down,
+};
+
+static int __init dcscb_init(void)
+{
+ int ret;
+
+ dcscb_base = ioremap(DCSCB_PHYS_BASE, 0x1000);
+ if (!dcscb_base)
+ return -ENOMEM;
+
+ ret = bL_platform_power_register(&dcscb_power_ops);
+ if (ret) {
+ iounmap(dcscb_base);
+ return ret;
+ }
+
+ /*
+ * Future entries into the kernel can now go
+ * through the b.L entry vectors.
+ */
+ vexpress_flags_set(virt_to_phys(bL_entry_point));
+
+ return 0;
+}
+
+early_initcall(dcscb_init);
--
1.8.0
^ permalink raw reply related [flat|nested] 25+ messages in thread
* [PATCH v2 11/16] ARM: vexpress/dcscb: add CPU use counts to the power up/down API implementation
2013-01-24 6:27 [PATCH v2 00/16] low-level CPU and cluster power management Nicolas Pitre
` (9 preceding siblings ...)
2013-01-24 6:27 ` [PATCH v2 10/16] ARM: vexpress: introduce DCSCB support Nicolas Pitre
@ 2013-01-24 6:27 ` Nicolas Pitre
2013-01-24 6:27 ` [PATCH v2 12/16] ARM: vexpress/dcscb: do not hardcode number of CPUs per cluster Nicolas Pitre
` (4 subsequent siblings)
15 siblings, 0 replies; 25+ messages in thread
From: Nicolas Pitre @ 2013-01-24 6:27 UTC (permalink / raw)
To: linux-arm-kernel
It is possible for a CPU to be told to power up before it managed
to power itself down. Solve this race with a usage count as mandated
by the API definition.
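As a rough summary of the counting scheme (state values and transitions
paraphrased from the patch below, not an exact excerpt):

	/*
	 * dcscb_use_count[cpu][cluster]:
	 *   0 = CPU down
	 *   1 = CPU (still) up
	 *   2 = CPU requested to be up before it had a chance to
	 *       actually make itself down
	 *
	 * power_up():   count++; deassert reset only on the 0 -> 1 transition
	 * power_down(): count--; on 1 -> 0 assert reset and WFI;
	 *               on 2 -> 1 a power_up overtook us, so skip the WFI
	 *               but still perform the expected cache cleaning
	 */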
Signed-off-by: Nicolas Pitre <nico@linaro.org>
---
arch/arm/mach-vexpress/dcscb.c | 77 +++++++++++++++++++++++++++++++++---------
1 file changed, 61 insertions(+), 16 deletions(-)
diff --git a/arch/arm/mach-vexpress/dcscb.c b/arch/arm/mach-vexpress/dcscb.c
index 220b990edd..cde9d3a8d2 100644
--- a/arch/arm/mach-vexpress/dcscb.c
+++ b/arch/arm/mach-vexpress/dcscb.c
@@ -45,6 +45,7 @@
static arch_spinlock_t dcscb_lock = __ARCH_SPIN_LOCK_UNLOCKED;
static void __iomem *dcscb_base;
+static int dcscb_use_count[4][2];
static int dcscb_power_up(unsigned int cpu, unsigned int cluster)
{
@@ -61,14 +62,27 @@ static int dcscb_power_up(unsigned int cpu, unsigned int cluster)
local_irq_disable();
arch_spin_lock(&dcscb_lock);
- rst_hold = readl_relaxed(dcscb_base + RST_HOLD0 + cluster * 4);
- if (rst_hold & (1 << 8)) {
- /* remove cluster reset and add individual CPU's reset */
- rst_hold &= ~(1 << 8);
- rst_hold |= 0xf;
+ dcscb_use_count[cpu][cluster]++;
+ if (dcscb_use_count[cpu][cluster] == 1) {
+ rst_hold = readl_relaxed(dcscb_base + RST_HOLD0 + cluster * 4);
+ if (rst_hold & (1 << 8)) {
+ /* remove cluster reset and add individual CPU's reset */
+ rst_hold &= ~(1 << 8);
+ rst_hold |= 0xf;
+ }
+ rst_hold &= ~(cpumask | (cpumask << 4));
+ writel(rst_hold, dcscb_base + RST_HOLD0 + cluster * 4);
+ } else if (dcscb_use_count[cpu][cluster] != 2) {
+ /*
+ * The only possible values are:
+ * 0 = CPU down
+ * 1 = CPU (still) up
+ * 2 = CPU requested to be up before it had a chance
+ * to actually make itself down.
+ * Any other value is a bug.
+ */
+ BUG();
}
- rst_hold &= ~(cpumask | (cpumask << 4));
- writel(rst_hold, dcscb_base + RST_HOLD0 + cluster * 4);
arch_spin_unlock(&dcscb_lock);
local_irq_enable();
@@ -78,7 +92,8 @@ static int dcscb_power_up(unsigned int cpu, unsigned int cluster)
static void dcscb_power_down(void)
{
- unsigned int mpidr, cpu, cluster, rst_hold, cpumask, last_man;
+ unsigned int mpidr, cpu, cluster, rst_hold, cpumask;
+ bool last_man = false, skip_wfi = false;
mpidr = read_cpuid_mpidr();
cpu = MPIDR_AFFINITY_LEVEL(mpidr, 0);
@@ -89,13 +104,26 @@ static void dcscb_power_down(void)
BUG_ON(cpu >= 4 || cluster >= 2);
arch_spin_lock(&dcscb_lock);
- rst_hold = readl_relaxed(dcscb_base + RST_HOLD0 + cluster * 4);
- rst_hold |= cpumask;
- if (((rst_hold | (rst_hold >> 4)) & 0xf) == 0xf)
- rst_hold |= (1 << 8);
- writel(rst_hold, dcscb_base + RST_HOLD0 + cluster * 4);
+ dcscb_use_count[cpu][cluster]--;
+ if (dcscb_use_count[cpu][cluster] == 0) {
+ rst_hold = readl_relaxed(dcscb_base + RST_HOLD0 + cluster * 4);
+ rst_hold |= cpumask;
+ if (((rst_hold | (rst_hold >> 4)) & 0xf) == 0xf) {
+ rst_hold |= (1 << 8);
+ last_man = true;
+ }
+ writel(rst_hold, dcscb_base + RST_HOLD0 + cluster * 4);
+ } else if (dcscb_use_count[cpu][cluster] == 1) {
+ /*
+ * A power_up request went ahead of us.
+ * Even if we do not want to shut this CPU down,
+ * the caller expects a certain state as if the WFI
+ * was aborted. So let's continue with cache cleaning.
+ */
+ skip_wfi = true;
+ } else
+ BUG();
arch_spin_unlock(&dcscb_lock);
- last_man = (rst_hold & (1 << 8));
/*
* Now let's clean our L1 cache and shut ourself down.
@@ -122,8 +150,10 @@ static void dcscb_power_down(void)
set_auxcr(get_auxcr() & ~(1 << 6));
/* Now we are prepared for power-down, do it: */
- dsb();
- wfi();
+ if (!skip_wfi) {
+ dsb();
+ wfi();
+ }
/* Not dead at this point? Let our caller cope. */
}
@@ -133,6 +163,19 @@ static const struct bL_platform_power_ops dcscb_power_ops = {
.power_down = dcscb_power_down,
};
+static void __init dcscb_usage_count_init(void)
+{
+ unsigned int mpidr, cpu, cluster;
+
+ mpidr = read_cpuid_mpidr();
+ cpu = MPIDR_AFFINITY_LEVEL(mpidr, 0);
+ cluster = MPIDR_AFFINITY_LEVEL(mpidr, 1);
+
+ pr_debug("%s: cpu %u cluster %u\n", __func__, cpu, cluster);
+ BUG_ON(cpu >= 4 || cluster >= 2);
+ dcscb_use_count[cpu][cluster] = 1;
+}
+
static int __init dcscb_init(void)
{
int ret;
@@ -141,6 +184,8 @@ static int __init dcscb_init(void)
if (!dcscb_base)
return -ENOMEM;
+ dcscb_usage_count_init();
+
ret = bL_platform_power_register(&dcscb_power_ops);
if (ret) {
iounmap(dcscb_base);
--
1.8.0
^ permalink raw reply related [flat|nested] 25+ messages in thread
* [PATCH v2 12/16] ARM: vexpress/dcscb: do not hardcode number of CPUs per cluster
2013-01-24 6:27 [PATCH v2 00/16] low-level CPU and cluster power management Nicolas Pitre
` (10 preceding siblings ...)
2013-01-24 6:27 ` [PATCH v2 11/16] ARM: vexpress/dcscb: add CPU use counts to the power up/down API implementation Nicolas Pitre
@ 2013-01-24 6:27 ` Nicolas Pitre
2013-01-24 6:27 ` [PATCH v2 13/16] drivers/bus: add ARM CCI support Nicolas Pitre
` (3 subsequent siblings)
15 siblings, 0 replies; 25+ messages in thread
From: Nicolas Pitre @ 2013-01-24 6:27 UTC (permalink / raw)
To: linux-arm-kernel
If 4 CPUs are assumed, the A15x1-A7x1 model configuration would never
shut down the initial cluster as the 0xf reset bit mask will never be
observed. Let's construct this mask based on the provided information
in the DCSCB config register for the number of CPUs per cluster.
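As an illustration of the resulting arithmetic (field layout as assumed
from the DCS_CFG_R handling in the patch below):

	cfg = readl_relaxed(dcscb_base + DCS_CFG_R);
	/* bits [19:16] = nr of CPUs in cluster 0, bits [23:20] = cluster 1 */
	nr_cpus = (cfg >> (16 + (cluster << 2))) & 0xf;
	cluster_mask = (1 << nr_cpus) - 1;	/* 4 CPUs -> 0xf, 1 CPU -> 0x1 */

so on an A15x1-A7x1 model the mask for each cluster becomes 0x1 and a
single CPU going down is correctly seen as the last man.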
Signed-off-by: Nicolas Pitre <nico@linaro.org>
---
arch/arm/mach-vexpress/dcscb.c | 14 ++++++++++----
1 file changed, 10 insertions(+), 4 deletions(-)
diff --git a/arch/arm/mach-vexpress/dcscb.c b/arch/arm/mach-vexpress/dcscb.c
index cde9d3a8d2..e61ccce3c5 100644
--- a/arch/arm/mach-vexpress/dcscb.c
+++ b/arch/arm/mach-vexpress/dcscb.c
@@ -46,10 +46,12 @@ static arch_spinlock_t dcscb_lock = __ARCH_SPIN_LOCK_UNLOCKED;
static void __iomem *dcscb_base;
static int dcscb_use_count[4][2];
+static int dcscb_cluster_cpu_mask[2];
static int dcscb_power_up(unsigned int cpu, unsigned int cluster)
{
unsigned int rst_hold, cpumask = (1 << cpu);
+ unsigned int cluster_mask = dcscb_cluster_cpu_mask[cluster];
pr_debug("%s: cpu %u cluster %u\n", __func__, cpu, cluster);
if (cpu >= 4 || cluster >= 2)
@@ -68,7 +70,7 @@ static int dcscb_power_up(unsigned int cpu, unsigned int cluster)
if (rst_hold & (1 << 8)) {
/* remove cluster reset and add individual CPU's reset */
rst_hold &= ~(1 << 8);
- rst_hold |= 0xf;
+ rst_hold |= cluster_mask;
}
rst_hold &= ~(cpumask | (cpumask << 4));
writel(rst_hold, dcscb_base + RST_HOLD0 + cluster * 4);
@@ -92,13 +94,14 @@ static int dcscb_power_up(unsigned int cpu, unsigned int cluster)
static void dcscb_power_down(void)
{
- unsigned int mpidr, cpu, cluster, rst_hold, cpumask;
+ unsigned int mpidr, cpu, cluster, rst_hold, cpumask, cluster_mask;
bool last_man = false, skip_wfi = false;
mpidr = read_cpuid_mpidr();
cpu = MPIDR_AFFINITY_LEVEL(mpidr, 0);
cluster = MPIDR_AFFINITY_LEVEL(mpidr, 1);
cpumask = (1 << cpu);
+ cluster_mask = dcscb_cluster_cpu_mask[cluster];
pr_debug("%s: cpu %u cluster %u\n", __func__, cpu, cluster);
BUG_ON(cpu >= 4 || cluster >= 2);
@@ -108,7 +111,7 @@ static void dcscb_power_down(void)
if (dcscb_use_count[cpu][cluster] == 0) {
rst_hold = readl_relaxed(dcscb_base + RST_HOLD0 + cluster * 4);
rst_hold |= cpumask;
- if (((rst_hold | (rst_hold >> 4)) & 0xf) == 0xf) {
+ if (((rst_hold | (rst_hold >> 4)) & cluster_mask) == cluster_mask) {
rst_hold |= (1 << 8);
last_man = true;
}
@@ -178,12 +181,15 @@ static void __init dcscb_usage_count_init(void)
static int __init dcscb_init(void)
{
+ unsigned int cfg;
int ret;
dcscb_base = ioremap(DCSCB_PHYS_BASE, 0x1000);
if (!dcscb_base)
return -ENOMEM;
-
+ cfg = readl_relaxed(dcscb_base + DCS_CFG_R);
+ dcscb_cluster_cpu_mask[0] = (1 << (((cfg >> 16) >> (0 << 2)) & 0xf)) - 1;
+ dcscb_cluster_cpu_mask[1] = (1 << (((cfg >> 16) >> (1 << 2)) & 0xf)) - 1;
dcscb_usage_count_init();
ret = bL_platform_power_register(&dcscb_power_ops);
--
1.8.0
^ permalink raw reply related [flat|nested] 25+ messages in thread
* [PATCH v2 13/16] drivers/bus: add ARM CCI support
2013-01-24 6:27 [PATCH v2 00/16] low-level CPU and cluster power management Nicolas Pitre
` (11 preceding siblings ...)
2013-01-24 6:27 ` [PATCH v2 12/16] ARM: vexpress/dcscb: do not hardcode number of CPUs per cluster Nicolas Pitre
@ 2013-01-24 6:27 ` Nicolas Pitre
2013-01-24 21:05 ` saeed bishara
2013-01-24 6:27 ` [PATCH v2 14/16] ARM: CCI: ensure powerdown-time data is flushed from cache Nicolas Pitre
` (2 subsequent siblings)
15 siblings, 1 reply; 25+ messages in thread
From: Nicolas Pitre @ 2013-01-24 6:27 UTC (permalink / raw)
To: linux-arm-kernel
From: Lorenzo Pieralisi <lorenzo.pieralisi@arm.com>
On ARM multi-cluster systems coherency between cores running on
different clusters is managed by the cache-coherent interconnect (CCI).
It allows broadcasting of TLB invalidates and memory barriers and it
guarantees cache coherency at system level.
This patch enables the basic infrastructure required in Linux to
handle and programme the CCI component. The first implementation is
based on a platform device, its corresponding DT compatible property and
a simple programming interface.
Signed-off-by: Nicolas Pitre <nico@linaro.org>
---
drivers/bus/Kconfig | 4 ++
drivers/bus/Makefile | 2 +
drivers/bus/arm-cci.c | 107 ++++++++++++++++++++++++++++++++++++++++++++++++
include/linux/arm-cci.h | 30 ++++++++++++++
4 files changed, 143 insertions(+)
create mode 100644 drivers/bus/arm-cci.c
create mode 100644 include/linux/arm-cci.h
diff --git a/drivers/bus/Kconfig b/drivers/bus/Kconfig
index 0f51ed687d..d032f74ff2 100644
--- a/drivers/bus/Kconfig
+++ b/drivers/bus/Kconfig
@@ -19,4 +19,8 @@ config OMAP_INTERCONNECT
help
Driver to enable OMAP interconnect error handling driver.
+
+config ARM_CCI
+ bool "ARM CCI driver support"
+
endmenu
diff --git a/drivers/bus/Makefile b/drivers/bus/Makefile
index 45d997c854..55aac809e5 100644
--- a/drivers/bus/Makefile
+++ b/drivers/bus/Makefile
@@ -6,3 +6,5 @@ obj-$(CONFIG_OMAP_OCP2SCP) += omap-ocp2scp.o
# Interconnect bus driver for OMAP SoCs.
obj-$(CONFIG_OMAP_INTERCONNECT) += omap_l3_smx.o omap_l3_noc.o
+
+obj-$(CONFIG_ARM_CCI) += arm-cci.o
diff --git a/drivers/bus/arm-cci.c b/drivers/bus/arm-cci.c
new file mode 100644
index 0000000000..5de3aa3d1f
--- /dev/null
+++ b/drivers/bus/arm-cci.c
@@ -0,0 +1,107 @@
+/*
+ * ARM Cache Coherency Interconnect (CCI400) support
+ *
+ * Copyright (C) 2012-2013 ARM Ltd.
+ * Author: Lorenzo Pieralisi <lorenzo.pieralisi@arm.com>
+ *
+ * This program is free software; you can redistribute it and/or modify
+ * it under the terms of the GNU General Public License version 2 as
+ * published by the Free Software Foundation.
+ *
+ * This program is distributed "as is" WITHOUT ANY WARRANTY of any
+ * kind, whether express or implied; without even the implied warranty
+ * of MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the
+ * GNU General Public License for more details.
+ */
+
+#include <linux/device.h>
+#include <linux/io.h>
+#include <linux/module.h>
+#include <linux/platform_device.h>
+#include <linux/slab.h>
+#include <linux/arm-cci.h>
+
+#define CCI400_EAG_OFFSET 0x4000
+#define CCI400_KF_OFFSET 0x5000
+
+#define DRIVER_NAME "CCI"
+struct cci_drvdata {
+ void __iomem *baseaddr;
+ spinlock_t lock;
+};
+
+static struct cci_drvdata *info;
+
+void disable_cci(int cluster)
+{
+ u32 cci_reg = cluster ? CCI400_KF_OFFSET : CCI400_EAG_OFFSET;
+ writel_relaxed(0x0, info->baseaddr + cci_reg);
+
+ while (readl_relaxed(info->baseaddr + 0xc) & 0x1)
+ ;
+}
+EXPORT_SYMBOL_GPL(disable_cci);
+
+static int cci_driver_probe(struct platform_device *pdev)
+{
+ struct resource *res;
+ int ret = 0;
+
+ info = kzalloc(sizeof(*info), GFP_KERNEL);
+ if (!info) {
+ dev_err(&pdev->dev, "unable to allocate mem\n");
+ return -ENOMEM;
+ }
+
+ res = platform_get_resource(pdev, IORESOURCE_MEM, 0);
+ if (!res) {
+ dev_err(&pdev->dev, "No memory resource\n");
+ ret = -EINVAL;
+ goto mem_free;
+ }
+
+ if (!request_mem_region(res->start, resource_size(res),
+ dev_name(&pdev->dev))) {
+ dev_err(&pdev->dev, "address 0x%x in use\n", (u32) res->start);
+ ret = -EBUSY;
+ goto mem_free;
+ }
+
+ info->baseaddr = ioremap(res->start, resource_size(res));
+ if (!info->baseaddr) {
+ ret = -ENXIO;
+ goto ioremap_err;
+ }
+
+ platform_set_drvdata(pdev, info);
+
+ pr_info("CCI loaded at %p\n", info->baseaddr);
+ return ret;
+
+ioremap_err:
+ release_mem_region(res->start, resource_size(res));
+mem_free:
+ kfree(info);
+
+ return ret;
+}
+
+static const struct of_device_id arm_cci_matches[] = {
+ {.compatible = "arm,cci"},
+ {},
+};
+
+static struct platform_driver cci_platform_driver = {
+ .driver = {
+ .name = DRIVER_NAME,
+ .of_match_table = arm_cci_matches,
+ },
+ .probe = cci_driver_probe,
+};
+
+static int __init cci_init(void)
+{
+ return platform_driver_register(&cci_platform_driver);
+}
+
+core_initcall(cci_init);
diff --git a/include/linux/arm-cci.h b/include/linux/arm-cci.h
new file mode 100644
index 0000000000..86ae587817
--- /dev/null
+++ b/include/linux/arm-cci.h
@@ -0,0 +1,30 @@
+/*
+ * CCI support
+ *
+ * Copyright (C) 2012-2013 ARM Ltd.
+ *
+ * This program is free software; you can redistribute it and/or modify
+ * it under the terms of the GNU General Public License as published by
+ * the Free Software Foundation; either version 2 of the License, or
+ * (at your option) any later version.
+ *
+ * This program is distributed in the hope that it will be useful,
+ * but WITHOUT ANY WARRANTY; without even the implied warranty of
+ * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the
+ * GNU General Public License for more details.
+ *
+ * You should have received a copy of the GNU General Public License
+ * along with this program; if not, write to the Free Software
+ * Foundation, Inc., 59 Temple Place, Suite 330, Boston, MA 02111-1307 USA
+ */
+
+#ifndef __LINUX_ARM_CCI_H
+#define __LINUX_ARM_CCI_H
+
+#ifdef CONFIG_ARM_CCI
+extern void disable_cci(int cluster);
+#else
+static inline void disable_cci(int cluster) { }
+#endif
+
+#endif
--
1.8.0
^ permalink raw reply related [flat|nested] 25+ messages in thread
* [PATCH v2 13/16] drivers/bus: add ARM CCI support
2013-01-24 6:27 ` [PATCH v2 13/16] drivers/bus: add ARM CCI support Nicolas Pitre
@ 2013-01-24 21:05 ` saeed bishara
0 siblings, 0 replies; 25+ messages in thread
From: saeed bishara @ 2013-01-24 21:05 UTC (permalink / raw)
To: linux-arm-kernel
On Thu, Jan 24, 2013 at 8:27 AM, Nicolas Pitre <nicolas.pitre@linaro.org> wrote:
> +static int cci_driver_probe(struct platform_device *pdev)
> +{
> + struct resource *res;
> + int ret = 0;
> +
> + info = kzalloc(sizeof(*info), GFP_KERNEL);
devm_kzalloc() and managed resource allocation can save you a few lines of code; see the sketch at the end of this message.
> + if (!info) {
> + dev_err(&pdev->dev, "unable to allocate mem\n");
> + return -ENOMEM;
> + }
> +
> + res = platform_get_resource(pdev, IORESOURCE_MEM, 0);
> + if (!res) {
> + dev_err(&pdev->dev, "No memory resource\n");
> + ret = -EINVAL;
> + goto mem_free;
> + }
> +
> + if (!request_mem_region(res->start, resource_size(res),
> + dev_name(&pdev->dev))) {
> + dev_err(&pdev->dev, "address 0x%x in use\n", (u32) res->start);
I suggest that you use %pR or %pr (more info in Documentation/printk-formats.txt).
Also, request_mem_region() can fail for reasons other than the address being in use.
> + ret = -EBUSY;
> + goto mem_free;
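A minimal sketch of one possible shape of that conversion (illustrative
only, the details are up to you; it reuses the includes already in
arm-cci.c):

/*
 * Sketch: cci_driver_probe() converted to devm_* helpers, so the
 * error paths need no explicit unwinding, and %pR prints the resource.
 */
static int cci_driver_probe(struct platform_device *pdev)
{
	struct resource *res;

	info = devm_kzalloc(&pdev->dev, sizeof(*info), GFP_KERNEL);
	if (!info)
		return -ENOMEM;

	res = platform_get_resource(pdev, IORESOURCE_MEM, 0);
	if (!res) {
		dev_err(&pdev->dev, "no memory resource\n");
		return -EINVAL;
	}

	if (!devm_request_mem_region(&pdev->dev, res->start,
				     resource_size(res),
				     dev_name(&pdev->dev))) {
		dev_err(&pdev->dev, "region %pR unavailable\n", res);
		return -EBUSY;
	}

	info->baseaddr = devm_ioremap(&pdev->dev, res->start,
				      resource_size(res));
	if (!info->baseaddr)
		return -ENXIO;

	platform_set_drvdata(pdev, info);
	pr_info("CCI loaded at %p\n", info->baseaddr);
	return 0;
}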
^ permalink raw reply [flat|nested] 25+ messages in thread
* [PATCH v2 14/16] ARM: CCI: ensure powerdown-time data is flushed from cache
2013-01-24 6:27 [PATCH v2 00/16] low-level CPU and cluster power management Nicolas Pitre
` (12 preceding siblings ...)
2013-01-24 6:27 ` [PATCH v2 13/16] drivers/bus: add ARM CCI support Nicolas Pitre
@ 2013-01-24 6:27 ` Nicolas Pitre
2013-01-24 6:27 ` [PATCH v2 15/16] ARM: vexpress/dcscb: handle platform coherency exit/setup and CCI Nicolas Pitre
2013-01-24 6:27 ` [PATCH v2 16/16] ARM: vexpress/dcscb: probe via device tree Nicolas Pitre
15 siblings, 0 replies; 25+ messages in thread
From: Nicolas Pitre @ 2013-01-24 6:27 UTC (permalink / raw)
To: linux-arm-kernel
From: Dave Martin <dave.martin@linaro.org>
Non-local variables used by the CCI management function called after
disabling the cache must be flushed out to main memory in advance,
otherwise incoherency of those values may occur if they are sitting
in the cache of some other CPU when disable_cci() executes.
This patch adds the appropriate flushing to the CCI driver to ensure
that the relevant data is available in RAM ahead of time.
Because this creates a dependency on arch-specific cacheflushing
functions, this patch also makes ARM_CCI depend on ARM.
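The general pattern is sketched below; 'struct foo' and 'foo_ptr' are
placeholders for any data reached by the cache-off code path:

#include <asm/cacheflush.h>
#include <asm/memory.h>
#include <asm/outercache.h>

/* Placeholder data used by code running with the cache disabled. */
struct foo { int state; };
static struct foo *foo_ptr;

/*
 * Sketch: clean both the object and the pointer used to reach it out
 * of all cache levels, so a CPU reading them with SCTLR.C=0 (and
 * possibly outside the coherency domain) sees the current values.
 */
static void publish_for_cache_off_readers(void)
{
	__cpuc_flush_dcache_area(foo_ptr, sizeof(*foo_ptr));
	__cpuc_flush_dcache_area(&foo_ptr, sizeof(foo_ptr));
	outer_clean_range(virt_to_phys(foo_ptr), virt_to_phys(foo_ptr + 1));
	outer_clean_range(virt_to_phys(&foo_ptr), virt_to_phys(&foo_ptr + 1));
}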
Signed-off-by: Dave Martin <dave.martin@linaro.org>
Signed-off-by: Nicolas Pitre <nico@linaro.org>
---
drivers/bus/Kconfig | 1 +
drivers/bus/arm-cci.c | 21 +++++++++++++++++++--
2 files changed, 20 insertions(+), 2 deletions(-)
diff --git a/drivers/bus/Kconfig b/drivers/bus/Kconfig
index d032f74ff2..cd4ac9f001 100644
--- a/drivers/bus/Kconfig
+++ b/drivers/bus/Kconfig
@@ -22,5 +22,6 @@ config OMAP_INTERCONNECT
config ARM_CCI
bool "ARM CCI driver support"
+ depends on ARM
endmenu
diff --git a/drivers/bus/arm-cci.c b/drivers/bus/arm-cci.c
index 5de3aa3d1f..11689f166d 100644
--- a/drivers/bus/arm-cci.c
+++ b/drivers/bus/arm-cci.c
@@ -21,8 +21,16 @@
#include <linux/slab.h>
#include <linux/arm-cci.h>
-#define CCI400_EAG_OFFSET 0x4000
-#define CCI400_KF_OFFSET 0x5000
+#include <asm/cacheflush.h>
+#include <asm/memory.h>
+#include <asm/outercache.h>
+
+#include <asm/irq_regs.h>
+#include <asm/pmu.h>
+
+#define CCI400_PMCR 0x0100
+#define CCI400_EAG_OFFSET 0x4000
+#define CCI400_KF_OFFSET 0x5000
#define DRIVER_NAME "CCI"
struct cci_drvdata {
@@ -73,6 +81,15 @@ static int cci_driver_probe(struct platform_device *pdev)
goto ioremap_err;
}
+ /*
+ * Multi-cluster systems may need this data when non-coherent, during
+ * cluster power-up/power-down. Make sure it reaches main memory:
+ */
+ __cpuc_flush_dcache_area(info, sizeof *info);
+ __cpuc_flush_dcache_area(&info, sizeof info);
+ outer_clean_range(virt_to_phys(info), virt_to_phys(info + 1));
+ outer_clean_range(virt_to_phys(&info), virt_to_phys(&info + 1));
+
platform_set_drvdata(pdev, info);
pr_info("CCI loaded at %p\n", info->baseaddr);
--
1.8.0
^ permalink raw reply related [flat|nested] 25+ messages in thread
* [PATCH v2 15/16] ARM: vexpress/dcscb: handle platform coherency exit/setup and CCI
2013-01-24 6:27 [PATCH v2 00/16] low-level CPU and cluster power management Nicolas Pitre
` (13 preceding siblings ...)
2013-01-24 6:27 ` [PATCH v2 14/16] ARM: CCI: ensure powerdown-time data is flushed from cache Nicolas Pitre
@ 2013-01-24 6:27 ` Nicolas Pitre
2013-01-24 6:27 ` [PATCH v2 16/16] ARM: vexpress/dcscb: probe via device tree Nicolas Pitre
15 siblings, 0 replies; 25+ messages in thread
From: Nicolas Pitre @ 2013-01-24 6:27 UTC (permalink / raw)
To: linux-arm-kernel
From: Dave Martin <dave.martin@linaro.org>
Add the required code to the DCSCB power down method to properly
handle race-free platform coherency exit.
The power_up_setup callback is used to enable the CCI interface for
the cluster being brought up. This must be done in assembly before
the kernel environment is entered.
Thanks to Achin Gupta and Nicolas Pitre for their help and
contributions.
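Roughly, at the cluster affinity level the assembly below does the
equivalent of the following C (illustration only: the real code must be
assembly because it runs with the MMU off, before the kernel environment
exists; the function name and the cci_base parameter are made up, and
the offsets mirror the #defines in dcscb_setup.S):

#include <linux/types.h>

/* RTSM: A15 cluster is CCI slave port 3, A7 cluster is slave port 4. */
static void cci_snoop_enable_illustration(volatile u32 *cci_base,
					  unsigned int cluster)
{
	unsigned int slave = cluster ? 4 : 3;
	volatile u32 *snoopctl = cci_base + (0x1000 + 0x1000 * slave) / 4;

	/* Enable snoops and DVM messages on this cluster's slave port */
	*snoopctl |= (1 << 0) | (1 << 1);

	/* Wait for the snoop control change to complete (status bit 0) */
	while (cci_base[0xc / 4] & 1)
		;

	/* The assembly then issues a dsb to synchronise the side effects */
}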
Signed-off-by: Dave Martin <dave.martin@linaro.org>
Signed-off-by: Nicolas Pitre <nico@linaro.org>
---
arch/arm/mach-vexpress/Kconfig | 1 +
arch/arm/mach-vexpress/Makefile | 2 +-
arch/arm/mach-vexpress/dcscb.c | 74 ++++++++++++++++++++++++---------
arch/arm/mach-vexpress/dcscb_setup.S | 80 ++++++++++++++++++++++++++++++++++++
4 files changed, 137 insertions(+), 20 deletions(-)
create mode 100644 arch/arm/mach-vexpress/dcscb_setup.S
diff --git a/arch/arm/mach-vexpress/Kconfig b/arch/arm/mach-vexpress/Kconfig
index 50d6587d0c..2feee6dfa4 100644
--- a/arch/arm/mach-vexpress/Kconfig
+++ b/arch/arm/mach-vexpress/Kconfig
@@ -55,6 +55,7 @@ config ARCH_VEXPRESS_CA9X4
config ARCH_VEXPRESS_DCSCB
bool "Dual Cluster System Control Block (DCSCB) support"
depends on BIG_LITTLE
+ select ARM_CCI
help
Support for the Dual Cluster System Configuration Block (DCSCB).
This is needed to provide CPU and cluster power management
diff --git a/arch/arm/mach-vexpress/Makefile b/arch/arm/mach-vexpress/Makefile
index 2253644054..f6e90f3272 100644
--- a/arch/arm/mach-vexpress/Makefile
+++ b/arch/arm/mach-vexpress/Makefile
@@ -6,6 +6,6 @@ ccflags-$(CONFIG_ARCH_MULTIPLATFORM) := -I$(srctree)/$(src)/include \
obj-y := v2m.o reset.o
obj-$(CONFIG_ARCH_VEXPRESS_CA9X4) += ct-ca9x4.o
-obj-$(CONFIG_ARCH_VEXPRESS_DCSCB) += dcscb.o
+obj-$(CONFIG_ARCH_VEXPRESS_DCSCB) += dcscb.o dcscb_setup.o
obj-$(CONFIG_SMP) += platsmp.o
obj-$(CONFIG_HOTPLUG_CPU) += hotplug.o
diff --git a/arch/arm/mach-vexpress/dcscb.c b/arch/arm/mach-vexpress/dcscb.c
index e61ccce3c5..575c489a4c 100644
--- a/arch/arm/mach-vexpress/dcscb.c
+++ b/arch/arm/mach-vexpress/dcscb.c
@@ -15,6 +15,7 @@
#include <linux/spinlock.h>
#include <linux/errno.h>
#include <linux/vexpress.h>
+#include <linux/arm-cci.h>
#include <asm/bL_entry.h>
#include <asm/proc-fns.h>
@@ -106,6 +107,8 @@ static void dcscb_power_down(void)
pr_debug("%s: cpu %u cluster %u\n", __func__, cpu, cluster);
BUG_ON(cpu >= 4 || cluster >= 2);
+ __bL_cpu_going_down(cpu, cluster);
+
arch_spin_lock(&dcscb_lock);
dcscb_use_count[cpu][cluster]--;
if (dcscb_use_count[cpu][cluster] == 0) {
@@ -113,6 +116,7 @@ static void dcscb_power_down(void)
rst_hold |= cpumask;
if (((rst_hold | (rst_hold >> 4)) & cluster_mask) == cluster_mask) {
rst_hold |= (1 << 8);
+ BUG_ON(__bL_cluster_state(cluster) != CLUSTER_UP);
last_man = true;
}
writel(rst_hold, dcscb_base + RST_HOLD0 + cluster * 4);
@@ -126,31 +130,59 @@ static void dcscb_power_down(void)
skip_wfi = true;
} else
BUG();
- arch_spin_unlock(&dcscb_lock);
- /*
- * Now let's clean our L1 cache and shut ourself down.
- * If we're the last CPU in this cluster then clean L2 too.
- */
-
- /*
- * A15/A7 can hit in the cache with SCTLR.C=0, so we don't need
- * a preliminary flush here for those CPUs. At least, that's
- * the theory -- without the extra flush, Linux explodes on
- * RTSM (maybe not needed anymore, to be investigated)..
- */
- flush_cache_louis();
- cpu_proc_fin();
+ if (last_man && __bL_outbound_enter_critical(cpu, cluster)) {
+ arch_spin_unlock(&dcscb_lock);
- if (!last_man) {
- flush_cache_louis();
- } else {
+ /*
+ * Flush all cache levels for this cluster.
+ *
+ * A15/A7 can hit in the cache with SCTLR.C=0, so we don't need
+ * a preliminary flush here for those CPUs. At least, that's
+ * the theory -- without the extra flush, Linux explodes on
+ * RTSM (maybe not needed anymore, to be investigated).
+ */
flush_cache_all();
+ cpu_proc_fin(); /* disable allocation into internal caches */
+ flush_cache_all();
+
+ /*
+ * This is a harmless no-op. On platforms with a real
+ * outer cache this might either be needed or not,
+ * depending on where the outer cache sits.
+ */
outer_flush_all();
+
+ /* Disable local coherency by clearing the ACTLR "SMP" bit: */
+ set_auxcr(get_auxcr() & ~(1 << 6));
+
+ /*
+ * Disable cluster-level coherency by masking
+ * incoming snoops and DVM messages:
+ */
+ disable_cci(cluster);
+
+ __bL_outbound_leave_critical(cluster, CLUSTER_DOWN);
+ } else {
+ arch_spin_unlock(&dcscb_lock);
+
+ /*
+ * Flush the local CPU cache.
+ *
+ * A15/A7 can hit in the cache with SCTLR.C=0, so we don't need
+ * a preliminary flush here for those CPUs. At least, that's
+ * the theory -- without the extra flush, Linux explodes on
+ * RTSM (maybe not needed anymore, to be investigated).
+ */
+ flush_cache_louis();
+ cpu_proc_fin(); /* disable allocation into internal caches */
+ flush_cache_louis();
+
+ /* Disable local coherency by clearing the ACTLR "SMP" bit: */
+ set_auxcr(get_auxcr() & ~(1 << 6));
}
- /* Disable local coherency by clearing the ACTLR "SMP" bit: */
- set_auxcr(get_auxcr() & ~(1 << 6));
+ __bL_cpu_down(cpu, cluster);
/* Now we are prepared for power-down, do it: */
if (!skip_wfi) {
@@ -179,6 +211,8 @@ static void __init dcscb_usage_count_init(void)
dcscb_use_count[cpu][cluster] = 1;
}
+extern void dcscb_power_up_setup(unsigned int affinity_level);
+
static int __init dcscb_init(void)
{
unsigned int cfg;
@@ -193,6 +227,8 @@ static int __init dcscb_init(void)
dcscb_usage_count_init();
ret = bL_platform_power_register(&dcscb_power_ops);
+ if (!ret)
+ ret = bL_cluster_sync_init(dcscb_power_up_setup);
if (ret) {
iounmap(dcscb_base);
return ret;
diff --git a/arch/arm/mach-vexpress/dcscb_setup.S b/arch/arm/mach-vexpress/dcscb_setup.S
new file mode 100644
index 0000000000..d61a2f3552
--- /dev/null
+++ b/arch/arm/mach-vexpress/dcscb_setup.S
@@ -0,0 +1,80 @@
+/*
+ * arch/arm/mach-vexpress/dcscb_setup.S
+ *
+ * Created by: Dave Martin, 2012-06-22
+ * Copyright: (C) 2012-2013 Linaro Limited
+ *
+ * This program is free software; you can redistribute it and/or modify
+ * it under the terms of the GNU General Public License version 2 as
+ * published by the Free Software Foundation.
+ */
+
+
+#include <linux/linkage.h>
+#include <asm/bL_entry.h>
+
+
+#define SLAVE_SNOOPCTL_OFFSET 0
+#define SNOOPCTL_SNOOP_ENABLE (1 << 0)
+#define SNOOPCTL_DVM_ENABLE (1 << 1)
+
+#define CCI_STATUS_OFFSET 0xc
+#define STATUS_CHANGE_PENDING (1 << 0)
+
+#define CCI_SLAVE_OFFSET(n) (0x1000 + 0x1000 * (n))
+
+#define RTSM_CCI_PHYS_BASE 0x2c090000
+#define RTSM_CCI_SLAVE_A15 3
+#define RTSM_CCI_SLAVE_A7 4
+
+#define RTSM_CCI_A15_OFFSET CCI_SLAVE_OFFSET(RTSM_CCI_SLAVE_A15)
+#define RTSM_CCI_A7_OFFSET CCI_SLAVE_OFFSET(RTSM_CCI_SLAVE_A7)
+
+
+ENTRY(dcscb_power_up_setup)
+
+ cmp r0, #0 @ check affinity level
+ beq 2f
+
+/*
+ * Enable cluster-level coherency, in preparation for turning on the MMU.
+ * The ACTLR SMP bit does not need to be set here, because cpu_resume()
+ * already restores that.
+ */
+
+ mrc p15, 0, r0, c0, c0, 5 @ MPIDR
+ ubfx r0, r0, #8, #4 @ cluster
+
+ @ A15/A7 may not require explicit L2 invalidation on reset, depending
+ @ on hardware integration decisions.
+ @ For now, this code assumes that L2 is either already invalidated, or
+ @ invalidation is not required.
+
+ ldr r3, =RTSM_CCI_PHYS_BASE + RTSM_CCI_A15_OFFSET
+ cmp r0, #0 @ A15 cluster?
+ addne r3, r3, #RTSM_CCI_A7_OFFSET - RTSM_CCI_A15_OFFSET
+
+ @ r3 now points to the correct CCI slave register block
+
+ ldr r0, [r3, #SLAVE_SNOOPCTL_OFFSET]
+ orr r0, r0, #SNOOPCTL_SNOOP_ENABLE | SNOOPCTL_DVM_ENABLE
+ str r0, [r3, #SLAVE_SNOOPCTL_OFFSET] @ enable CCI snoops
+
+ @ Wait for snoop control change to complete:
+
+ ldr r3, =RTSM_CCI_PHYS_BASE
+
+1: ldr r0, [r3, #CCI_STATUS_OFFSET]
+ tst r0, #STATUS_CHANGE_PENDING
+ bne 1b
+
+ dsb @ Synchronise side-effects of enabling CCI
+
+ bx lr
+
+2: @ Implementation-specific local CPU setup operations should go here,
+ @ if any. In this case, there is nothing to do.
+
+ bx lr
+
+ENDPROC(dcscb_power_up_setup)
--
1.8.0
^ permalink raw reply related [flat|nested] 25+ messages in thread
* [PATCH v2 16/16] ARM: vexpress/dcscb: probe via device tree
2013-01-24 6:27 [PATCH v2 00/16] low-level CPU and cluster power management Nicolas Pitre
` (14 preceding siblings ...)
2013-01-24 6:27 ` [PATCH v2 15/16] ARM: vexpress/dcscb: handle platform coherency exit/setup and CCI Nicolas Pitre
@ 2013-01-24 6:27 ` Nicolas Pitre
15 siblings, 0 replies; 25+ messages in thread
From: Nicolas Pitre @ 2013-01-24 6:27 UTC (permalink / raw)
To: linux-arm-kernel
This allows DCSCB support to be compiled in and selected
at run time.
Signed-off-by: Nicolas Pitre <nico@linaro.org>
---
arch/arm/mach-vexpress/dcscb.c | 11 +++++++----
1 file changed, 7 insertions(+), 4 deletions(-)
diff --git a/arch/arm/mach-vexpress/dcscb.c b/arch/arm/mach-vexpress/dcscb.c
index 575c489a4c..17e410e4fe 100644
--- a/arch/arm/mach-vexpress/dcscb.c
+++ b/arch/arm/mach-vexpress/dcscb.c
@@ -14,6 +14,7 @@
#include <linux/io.h>
#include <linux/spinlock.h>
#include <linux/errno.h>
+#include <linux/of_address.h>
#include <linux/vexpress.h>
#include <linux/arm-cci.h>
@@ -24,8 +25,6 @@
#include <asm/cp15.h>
-#define DCSCB_PHYS_BASE 0x60000000
-
#define RST_HOLD0 0x0
#define RST_HOLD1 0x4
#define SYS_SWRESET 0x8
@@ -215,12 +214,16 @@ extern void dcscb_power_up_setup(unsigned int affinity_level);
static int __init dcscb_init(void)
{
+ struct device_node *node;
unsigned int cfg;
int ret;
- dcscb_base = ioremap(DCSCB_PHYS_BASE, 0x1000);
+ node = of_find_compatible_node(NULL, NULL, "arm,dcscb");
+ if (!node)
+ return -ENODEV;
+ dcscb_base = of_iomap(node, 0);
if (!dcscb_base)
- return -ENOMEM;
+ return -EINVAL;
cfg = readl_relaxed(dcscb_base + DCS_CFG_R);
dcscb_cluster_cpu_mask[0] = (1 << (((cfg >> 16) >> (0 << 2)) & 0xf)) - 1;
dcscb_cluster_cpu_mask[1] = (1 << (((cfg >> 16) >> (1 << 2)) & 0xf)) - 1;
--
1.8.0
^ permalink raw reply related [flat|nested] 25+ messages in thread