LinuxPPC-Dev Archive on lore.kernel.org

LinuxPPC-Dev Archive on lore.kernel.org
 help / color / mirror / Atom feed

* [RFC PATCH 17/18] powerpc/boot: add support for 64bit little endian wrapper
From: Cédric Le Goater @ 2014-02-07 15:59 UTC (permalink / raw)
  To: benh; +Cc: Cédric Le Goater, linuxppc-dev
In-Reply-To: <1391788771-16405-1-git-send-email-clg@fr.ibm.com>

Compilation is changed for little endian and entry points between the
wrapper and the kernel are modified to fix endian order with the famous
FIXUP_ENDIAN trampoline. This is PowerPC magic.

Signed-off-by: Cédric Le Goater <clg@fr.ibm.com>
---
 arch/powerpc/boot/Makefile       |    7 +++++--
 arch/powerpc/boot/crt0.S         |    1 +
 arch/powerpc/boot/ppc_asm.h      |   12 ++++++++++++
 arch/powerpc/boot/pseries-head.S |    3 +++
 arch/powerpc/boot/wrapper        |    1 +
 5 files changed, 22 insertions(+), 2 deletions(-)

diff --git a/arch/powerpc/boot/Makefile b/arch/powerpc/boot/Makefile
index bedbd46273e4..de71d2a0af3b 100644
--- a/arch/powerpc/boot/Makefile
+++ b/arch/powerpc/boot/Makefile
@@ -22,11 +22,14 @@ all: $(obj)/zImage
 BOOTCFLAGS    := -Wall -Wundef -Wstrict-prototypes -Wno-trigraphs \
 		 -fno-strict-aliasing -Os -msoft-float -pipe \
 		 -fomit-frame-pointer -fno-builtin -fPIC -nostdinc \
-		 -isystem $(shell $(CROSS32CC) -print-file-name=include) \
-		 -mbig-endian
+		 -isystem $(shell $(CROSS32CC) -print-file-name=include)
 ifdef CONFIG_PPC64
 BOOTCFLAGS	+= -m64
 endif
+ifdef CONFIG_CPU_BIG_ENDIAN
+BOOTCFLAGS	+= -mbig-endian
+endif
+
 BOOTAFLAGS	:= -D__ASSEMBLY__ $(BOOTCFLAGS) -traditional -nostdinc
 
 ifdef CONFIG_DEBUG_INFO
diff --git a/arch/powerpc/boot/crt0.S b/arch/powerpc/boot/crt0.S
index 689290561e69..14de4f8778a7 100644
--- a/arch/powerpc/boot/crt0.S
+++ b/arch/powerpc/boot/crt0.S
@@ -275,6 +275,7 @@ prom:
 	rfid
 
 1:	/* Return from OF */
+	FIXUP_ENDIAN
 
 	/* Restore registers and return. */
 	rldicl  r1,r1,0,32
diff --git a/arch/powerpc/boot/ppc_asm.h b/arch/powerpc/boot/ppc_asm.h
index eb0e98be69e0..35ea60c1f070 100644
--- a/arch/powerpc/boot/ppc_asm.h
+++ b/arch/powerpc/boot/ppc_asm.h
@@ -62,4 +62,16 @@
 #define SPRN_TBRL	268
 #define SPRN_TBRU	269
 
+#define FIXUP_ENDIAN						   \
+	tdi   0, 0, 0x48; /* Reverse endian of b . + 8		*/ \
+	b     $+36;	  /* Skip trampoline if endian is good	*/ \
+	.long 0x05009f42; /* bcl 20,31,$+4			*/ \
+	.long 0xa602487d; /* mflr r10				*/ \
+	.long 0x1c004a39; /* addi r10,r10,28			*/ \
+	.long 0xa600607d; /* mfmsr r11				*/ \
+	.long 0x01006b69; /* xori r11,r11,1			*/ \
+	.long 0xa6035a7d; /* mtsrr0 r10				*/ \
+	.long 0xa6037b7d; /* mtsrr1 r11				*/ \
+	.long 0x2400004c  /* rfid				*/
+
 #endif /* _PPC64_PPC_ASM_H */
diff --git a/arch/powerpc/boot/pseries-head.S b/arch/powerpc/boot/pseries-head.S
index 655c3d2c321b..6ef6e02e80f9 100644
--- a/arch/powerpc/boot/pseries-head.S
+++ b/arch/powerpc/boot/pseries-head.S
@@ -1,5 +1,8 @@
+#include "ppc_asm.h"
+
 	.text
 
 	.globl _zimage_start
 _zimage_start:
+	FIXUP_ENDIAN
 	b _zimage_start_lib
diff --git a/arch/powerpc/boot/wrapper b/arch/powerpc/boot/wrapper
index 6eca1b4ecfa4..a1654b510ff3 100755
--- a/arch/powerpc/boot/wrapper
+++ b/arch/powerpc/boot/wrapper
@@ -139,6 +139,7 @@ fi
 
 elfformat="`${CROSS}objdump -p "$kernel" | grep 'file format' | awk '{print $4}'`"
 case "$elfformat" in
+    elf64-powerpcle)	format=elf64lppc	;;
     elf64-powerpc)	format=elf64ppc	;;
     elf32-powerpc)	format=elf32ppc	;;
 esac
-- 
1.7.10.4

^ permalink raw reply related

* [RFC PATCH 15/18] powerpc/boot: add a global entry point for pseries
From: Cédric Le Goater @ 2014-02-07 15:59 UTC (permalink / raw)
  To: benh; +Cc: Cédric Le Goater, linuxppc-dev
In-Reply-To: <1391788771-16405-1-git-send-email-clg@fr.ibm.com>

When entering the boot wrapper in little endian, we will need to fix
the endian order using a fixup trampoline like in the kernel. This
patch overrides the _zimage_start entry point for this purpose.

Signed-off-by: Cédric Le Goater <clg@fr.ibm.com>
---
 arch/powerpc/boot/Makefile       |    2 ++
 arch/powerpc/boot/pseries-head.S |    5 +++++
 arch/powerpc/boot/wrapper        |    2 +-
 3 files changed, 8 insertions(+), 1 deletion(-)
 create mode 100644 arch/powerpc/boot/pseries-head.S

diff --git a/arch/powerpc/boot/Makefile b/arch/powerpc/boot/Makefile
index ca7f08cc4afd..4c4ec163a7a1 100644
--- a/arch/powerpc/boot/Makefile
+++ b/arch/powerpc/boot/Makefile
@@ -99,6 +99,8 @@ src-plat-$(CONFIG_EMBEDDED6xx) += cuboot-pq2.c cuboot-mpc7448hpc2.c \
 src-plat-$(CONFIG_AMIGAONE) += cuboot-amigaone.c
 src-plat-$(CONFIG_PPC_PS3) += ps3-head.S ps3-hvcall.S ps3.c
 src-plat-$(CONFIG_EPAPR_BOOT) += epapr.c epapr-wrapper.c
+src-plat-$(CONFIG_PPC_PSERIES) += pseries-head.S
+
 
 src-wlib := $(sort $(src-wlib-y))
 src-plat := $(sort $(src-plat-y))
diff --git a/arch/powerpc/boot/pseries-head.S b/arch/powerpc/boot/pseries-head.S
new file mode 100644
index 000000000000..655c3d2c321b
--- /dev/null
+++ b/arch/powerpc/boot/pseries-head.S
@@ -0,0 +1,5 @@
+	.text
+
+	.globl _zimage_start
+_zimage_start:
+	b _zimage_start_lib
diff --git a/arch/powerpc/boot/wrapper b/arch/powerpc/boot/wrapper
index 2e1af74a64be..cd0101f1d457 100755
--- a/arch/powerpc/boot/wrapper
+++ b/arch/powerpc/boot/wrapper
@@ -152,7 +152,7 @@ of)
     make_space=n
     ;;
 pseries)
-    platformo="$object/of.o $object/epapr.o"
+    platformo="$object/pseries-head.o $object/of.o $object/epapr.o"
     link_address='0x4000000'
     make_space=n
     ;;
-- 
1.7.10.4

^ permalink raw reply related

* [RFC PATCH 18/18] powerpc/boot: add PPC64_BOOT_WRAPPER config option
From: Cédric Le Goater @ 2014-02-07 15:59 UTC (permalink / raw)
  To: benh; +Cc: Cédric Le Goater, linuxppc-dev
In-Reply-To: <1391788771-16405-1-git-send-email-clg@fr.ibm.com>

The previous patch broke compatibility for 64bit big endian kernels.

This patch adds a config option to compile the boot wrapper in 64bit
only when CPU_LITTLE_ENDIAN is selected. It restores 32bit compilation
and linking for the big endian kernel.

Signed-off-by: Cédric Le Goater <clg@fr.ibm.com>
---
 arch/powerpc/boot/Makefile             |    2 +-
 arch/powerpc/boot/wrapper              |    2 +-
 arch/powerpc/boot/zImage.lds.S         |    8 ++++----
 arch/powerpc/platforms/Kconfig.cputype |    5 +++++
 4 files changed, 11 insertions(+), 6 deletions(-)

diff --git a/arch/powerpc/boot/Makefile b/arch/powerpc/boot/Makefile
index de71d2a0af3b..7870620033c5 100644
--- a/arch/powerpc/boot/Makefile
+++ b/arch/powerpc/boot/Makefile
@@ -23,7 +23,7 @@ BOOTCFLAGS    := -Wall -Wundef -Wstrict-prototypes -Wno-trigraphs \
 		 -fno-strict-aliasing -Os -msoft-float -pipe \
 		 -fomit-frame-pointer -fno-builtin -fPIC -nostdinc \
 		 -isystem $(shell $(CROSS32CC) -print-file-name=include)
-ifdef CONFIG_PPC64
+ifdef CONFIG_PPC64_BOOT_WRAPPER
 BOOTCFLAGS	+= -m64
 endif
 ifdef CONFIG_CPU_BIG_ENDIAN
diff --git a/arch/powerpc/boot/wrapper b/arch/powerpc/boot/wrapper
index a1654b510ff3..299f327fb510 100755
--- a/arch/powerpc/boot/wrapper
+++ b/arch/powerpc/boot/wrapper
@@ -140,7 +140,7 @@ fi
 elfformat="`${CROSS}objdump -p "$kernel" | grep 'file format' | awk '{print $4}'`"
 case "$elfformat" in
     elf64-powerpcle)	format=elf64lppc	;;
-    elf64-powerpc)	format=elf64ppc	;;
+    elf64-powerpc)	format=elf32ppc	;;
     elf32-powerpc)	format=elf32ppc	;;
 esac
 
diff --git a/arch/powerpc/boot/zImage.lds.S b/arch/powerpc/boot/zImage.lds.S
index afecab0aff5c..861e72109df2 100644
--- a/arch/powerpc/boot/zImage.lds.S
+++ b/arch/powerpc/boot/zImage.lds.S
@@ -1,6 +1,6 @@
 #include <asm-generic/vmlinux.lds.h>
 
-#ifdef CONFIG_PPC64
+#ifdef CONFIG_PPC64_BOOT_WRAPPER
 OUTPUT_ARCH(powerpc:common64)
 #else
 OUTPUT_ARCH(powerpc:common)
@@ -22,7 +22,7 @@ SECTIONS
     *(.rodata*)
     *(.data*)
     *(.sdata*)
-#ifdef CONFIG_PPC32
+#ifndef CONFIG_PPC64_BOOT_WRAPPER
     *(.got2)
 #endif
   }
@@ -37,7 +37,7 @@ SECTIONS
   .interp : { *(.interp) }
   .rela.dyn :
   {
-#ifdef CONFIG_PPC64
+#ifdef CONFIG_PPC64_BOOT_WRAPPER
     __rela_dyn_start = .;
 #endif
     *(.rela*)
@@ -67,7 +67,7 @@ SECTIONS
     _initrd_end =  .;
   }
 
-#ifdef CONFIG_PPC64
+#ifdef CONFIG_PPC64_BOOT_WRAPPER
   .got :
   {
     __toc_start = .;
diff --git a/arch/powerpc/platforms/Kconfig.cputype b/arch/powerpc/platforms/Kconfig.cputype
index bca2465a9c34..9476aacf59f6 100644
--- a/arch/powerpc/platforms/Kconfig.cputype
+++ b/arch/powerpc/platforms/Kconfig.cputype
@@ -420,6 +420,7 @@ config CPU_BIG_ENDIAN
 
 config CPU_LITTLE_ENDIAN
 	bool "Build little endian kernel"
+	select PPC64_BOOT_WRAPPER
 	help
 	  Build a little endian kernel.
 
@@ -428,3 +429,7 @@ config CPU_LITTLE_ENDIAN
 	  little endian powerpc.
 
 endchoice
+
+config PPC64_BOOT_WRAPPER
+	def_bool n
+	depends on CPU_LITTLE_ENDIAN
-- 
1.7.10.4

^ permalink raw reply related

* Re: [PATCH] Convert powerpc simple spinlocks into ticket locks
From: Peter Zijlstra @ 2014-02-07 16:10 UTC (permalink / raw)
  To: Kumar Gala
  Cc: Tom Musta, linux-kernel, Torsten Duwe, Anton Blanchard,
	Scott Wood, Paul Mackerras, Paul E. McKenney, linuxppc-dev,
	Ingo Molnar
In-Reply-To: <87C29DBB-41E7-4B6C-9089-3C7756FBAE07@kernel.crashing.org>

On Fri, Feb 07, 2014 at 09:51:16AM -0600, Kumar Gala wrote:
> 
> On Feb 7, 2014, at 3:02 AM, Torsten Duwe <duwe@lst.de> wrote:
> 
> > On Thu, Feb 06, 2014 at 02:19:52PM -0600, Scott Wood wrote:
> >> On Thu, 2014-02-06 at 18:37 +0100, Torsten Duwe wrote:
> >>> On Thu, Feb 06, 2014 at 05:38:37PM +0100, Peter Zijlstra wrote:
> >> 
> >>>> Can you pair lwarx with sthcx ? I couldn't immediately find the answer
> >>>> in the PowerISA doc. If so I think you can do better by being able to
> >>>> atomically load both tickets but only storing the head without affecting
> >>>> the tail.
> > 
> > Can I simply write the half word, without a reservation, or will the HW caches
> > mess up the other half? Will it ruin the cache coherency on some (sub)architectures?
> 
> The coherency should be fine, I just can’t remember if you’ll lose the reservation by doing this.

It should; I suppose; seeing how you 'destroy' the state it got from the
load.

> >> Plus, sthcx doesn't exist on all PPC chips.
> > 
> > Which ones are lacking it? Do all have at least a simple 16-bit store?
> 
> Everything implements a simple 16-bit store, just not everything implements the store conditional of 16-bit data.

Ok, so then the last version I posted should work on those machines.

^ permalink raw reply

* [PATCH v2] powerpc ticket locks
From: Torsten Duwe @ 2014-02-07 16:58 UTC (permalink / raw)
  To: Benjamin Herrenschmidt, Paul Mackerras, Anton Blanchard,
	Paul E. McKenney, Peter Zijlstra, Scott Wood, Tom Musta,
	Ingo Molnar
  Cc: linuxppc-dev, linux-kernel

Ticket locks for ppc, version 2. Changes since v1:
* The atomically exchanged entity is always 32 bits.
* asm inline string variations thus removed.
* Carry the additional holder hint only #if defined(CONFIG_PPC_SPLPAR)

Signed-off-by: Torsten Duwe <duwe@suse.de>
--
 arch/powerpc/include/asm/spinlock_types.h |   29 +++++
 arch/powerpc/include/asm/spinlock.h       |  146 +++++++++++++++++++++++-------
 arch/powerpc/lib/locks.c                  |    6 -
 3 files changed, 144 insertions(+), 37 deletions(-)

diff --git a/arch/powerpc/include/asm/spinlock_types.h b/arch/powerpc/include/asm/spinlock_types.h
index 2351adc..fce1383 100644
--- a/arch/powerpc/include/asm/spinlock_types.h
+++ b/arch/powerpc/include/asm/spinlock_types.h
@@ -5,11 +5,34 @@
 # error "please don't include this file directly"
 #endif
 
+typedef u16 __ticket_t;
+typedef u32 __ticketpair_t;
+
+#define TICKET_LOCK_INC	((__ticket_t)1)
+
+#define TICKET_SHIFT	(sizeof(__ticket_t) * 8)
+
 typedef struct {
-	volatile unsigned int slock;
-} arch_spinlock_t;
+	union {
+		__ticketpair_t head_tail;
+		struct __raw_tickets {
+#ifdef __BIG_ENDIAN__		/* The "tail" part should be in the MSBs */
+			__ticket_t tail, head;
+#else
+			__ticket_t head, tail;
+#endif
+		} tickets;
+	};
+#if defined(CONFIG_PPC_SPLPAR)
+	u32 holder;
+#endif
+} arch_spinlock_t __aligned(4);
 
-#define __ARCH_SPIN_LOCK_UNLOCKED	{ 0 }
+#if defined(CONFIG_PPC_SPLPAR)
+#define __ARCH_SPIN_LOCK_UNLOCKED	{ { 0 }, 0 }
+#else
+#define __ARCH_SPIN_LOCK_UNLOCKED	{ { 0 } }
+#endif
 
 typedef struct {
 	volatile signed int lock;
diff --git a/arch/powerpc/include/asm/spinlock.h b/arch/powerpc/include/asm/spinlock.h
index 5f54a74..a931514 100644
--- a/arch/powerpc/include/asm/spinlock.h
+++ b/arch/powerpc/include/asm/spinlock.h
@@ -9,8 +9,7 @@
  * Copyright (C) 2001 Anton Blanchard <anton@au.ibm.com>, IBM
  * Copyright (C) 2002 Dave Engebretsen <engebret@us.ibm.com>, IBM
  *	Rework to support virtual processors
- *
- * Type of int is used as a full 64b word is not necessary.
+ * Copyright (C) 2014 Torsten Duwe <duwe@suse.de>, ticket lock port
  *
  * This program is free software; you can redistribute it and/or
  * modify it under the terms of the GNU General Public License
@@ -28,7 +27,20 @@
 #include <asm/synch.h>
 #include <asm/ppc-opcode.h>
 
-#define arch_spin_is_locked(x)		((x)->slock != 0)
+static inline int arch_spin_is_locked(arch_spinlock_t *lock)
+{
+	struct __raw_tickets tmp = ACCESS_ONCE(lock->tickets);
+
+	return tmp.tail != tmp.head;
+}
+
+static inline int arch_spin_is_contended(arch_spinlock_t *lock)
+{
+	struct __raw_tickets tmp = ACCESS_ONCE(lock->tickets);
+
+	return (__ticket_t)(tmp.tail - tmp.head) > TICKET_LOCK_INC;
+}
+#define arch_spin_is_contended	arch_spin_is_contended
 
 #ifdef CONFIG_PPC64
 /* use 0x800000yy when locked, where yy == CPU number */
@@ -55,33 +67,59 @@
 #endif
 
 /*
- * This returns the old value in the lock, so we succeeded
- * in getting the lock if the return value is 0.
+ * Our own cmpxchg, operating on spinlock_t's.  Returns 0 iff value
+ * read at lock was equal to "old" AND the cmpxchg succeeded
+ * uninterruptedly.
  */
-static inline unsigned long __arch_spin_trylock(arch_spinlock_t *lock)
+static __always_inline int __arch_spin_cmpxchg_eq(arch_spinlock_t *lock,
+						  __ticketpair_t old,
+						  __ticketpair_t new)
 {
-	unsigned long tmp, token;
+	register int retval = 1;
+	register __ticketpair_t tmp;
 
-	token = LOCK_TOKEN;
-	__asm__ __volatile__(
-"1:	" PPC_LWARX(%0,0,%2,1) "\n\
-	cmpwi		0,%0,0\n\
-	bne-		2f\n\
-	stwcx.		%1,0,%2\n\
-	bne-		1b\n"
-	PPC_ACQUIRE_BARRIER
+	__asm__ __volatile__ (
+"	li %0,1\n"		/* default to "fail" */
+	PPC_RELEASE_BARRIER
+"1:	lwarx	%2,0,%5		# __arch_spin_cmpxchg_eq\n"
+"	cmp	0,0,%3,%2\n"
+"	bne-	2f\n"
+	PPC405_ERR77(0, "%5")
+"	stwcx.	%4,0,%5\n"
+"	bne-	1b\n"
+"	isync\n"
+"	li %0,0\n"
 "2:"
-	: "=&r" (tmp)
-	: "r" (token), "r" (&lock->slock)
-	: "cr0", "memory");
+	: "=&r" (retval), "+m" (*lock)
+	: "r" (tmp), "r" (old), "r" (new), "r" (lock)
+	: "cc", "memory");
 
-	return tmp;
+	return retval;
+}
+
+static __always_inline int __arch_spin_trylock(arch_spinlock_t *lock)
+{
+	arch_spinlock_t old, new;
+
+	old.tickets = ACCESS_ONCE(lock->tickets);
+	if (old.tickets.head != old.tickets.tail)
+		return 0;
+
+	new.head_tail = old.head_tail + (TICKET_LOCK_INC << TICKET_SHIFT);
+
+	if (__arch_spin_cmpxchg_eq(lock, old.head_tail, new.head_tail))
+		return 0;
+
+#if defined(CONFIG_PPC_SPLPAR)
+	lock->holder = LOCK_TOKEN;
+#endif
+	return 1;
 }
 
 static inline int arch_spin_trylock(arch_spinlock_t *lock)
 {
 	CLEAR_IO_SYNC;
-	return __arch_spin_trylock(lock) == 0;
+	return  __arch_spin_trylock(lock);
 }
 
 /*
@@ -93,9 +131,8 @@ static inline int arch_spin_trylock(arch_spinlock_t *lock)
  * rest of our timeslice to the lock holder.
  *
  * So that we can tell which virtual processor is holding a lock,
- * we put 0x80000000 | smp_processor_id() in the lock when it is
- * held.  Conveniently, we have a word in the paca that holds this
- * value.
+ * we put 0x80000000 | smp_processor_id() into lock->holder.
+ * Conveniently, we have a word in the paca that holds this value.
  */
 
 #if defined(CONFIG_PPC_SPLPAR)
@@ -109,19 +146,55 @@ extern void __rw_yield(arch_rwlock_t *lock);
 #define SHARED_PROCESSOR	0
 #endif
 
-static inline void arch_spin_lock(arch_spinlock_t *lock)
+/*
+ * Ticket locks are conceptually two parts, one indicating the current head of
+ * the queue, and the other indicating the current tail. The lock is acquired
+ * by atomically noting the tail and incrementing it by one (thus adding
+ * ourself to the queue and noting our position), then waiting until the head
+ * becomes equal to the the initial value of the tail.
+ *
+ * We use an asm covering *both* parts of the lock, to increment the tail and
+ * also load the position of the head, which takes care of memory ordering
+ * issues and should be optimal for the uncontended case. Note the tail must be
+ * in the high part, because a wide add increment of the low part would carry
+ * up and contaminate the high part.
+ */
+static __always_inline void arch_spin_lock(arch_spinlock_t *lock)
 {
+	register struct __raw_tickets old, tmp,
+		inc = { .tail = TICKET_LOCK_INC };
+
 	CLEAR_IO_SYNC;
-	while (1) {
-		if (likely(__arch_spin_trylock(lock) == 0))
-			break;
+	__asm__ __volatile__(
+"1:	lwarx	%0,0,%4		# arch_spin_lock\n"
+"	add	%1,%3,%0\n"
+	PPC405_ERR77(0, "%4")
+"	stwcx.	%1,0,%4\n"
+"	bne-	1b"
+	: "=&r" (old), "=&r" (tmp), "+m" (lock->tickets)
+	: "r" (inc), "r" (&lock->tickets)
+	: "cc");
+
+	if (likely(old.head == old.tail))
+		goto out;
+
+	for (;;) {
+		unsigned count = 100;
+
 		do {
+			if (ACCESS_ONCE(lock->tickets.head) == old.tail)
+				goto out;
 			HMT_low();
 			if (SHARED_PROCESSOR)
 				__spin_yield(lock);
-		} while (unlikely(lock->slock != 0));
+		} while (--count);
 		HMT_medium();
 	}
+out:
+#if defined(CONFIG_PPC_SPLPAR)
+	lock->holder = LOCK_TOKEN;
+#endif
+	barrier();	/* make sure nothing creeps before the lock is taken */
 }
 
 static inline
@@ -131,7 +204,7 @@ void arch_spin_lock_flags(arch_spinlock_t *lock, unsigned long flags)
 
 	CLEAR_IO_SYNC;
 	while (1) {
-		if (likely(__arch_spin_trylock(lock) == 0))
+		if (likely(__arch_spin_trylock(lock)))
 			break;
 		local_save_flags(flags_dis);
 		local_irq_restore(flags);
@@ -139,7 +212,7 @@ void arch_spin_lock_flags(arch_spinlock_t *lock, unsigned long flags)
 			HMT_low();
 			if (SHARED_PROCESSOR)
 				__spin_yield(lock);
-		} while (unlikely(lock->slock != 0));
+		} while (arch_spin_is_locked(lock));
 		HMT_medium();
 		local_irq_restore(flags_dis);
 	}
@@ -147,10 +220,21 @@ void arch_spin_lock_flags(arch_spinlock_t *lock, unsigned long flags)
 
 static inline void arch_spin_unlock(arch_spinlock_t *lock)
 {
+	arch_spinlock_t old, new;
+
+#if defined(CONFIG_PPC_SPLPAR)
+	lock->holder = 0;
+#endif
+	do {
+		old.tickets = ACCESS_ONCE(lock->tickets);
+		new.tickets.head = old.tickets.head + TICKET_LOCK_INC;
+		new.tickets.tail = old.tickets.tail;
+	} while (unlikely(__arch_spin_cmpxchg_eq(lock,
+						 old.head_tail,
+						 new.head_tail)));
 	SYNC_IO;
 	__asm__ __volatile__("# arch_spin_unlock\n\t"
 				PPC_RELEASE_BARRIER: : :"memory");
-	lock->slock = 0;
 }
 
 #ifdef CONFIG_PPC64
diff --git a/arch/powerpc/lib/locks.c b/arch/powerpc/lib/locks.c
index 0c9c8d7..4a57e32 100644
--- a/arch/powerpc/lib/locks.c
+++ b/arch/powerpc/lib/locks.c
@@ -27,7 +27,7 @@ void __spin_yield(arch_spinlock_t *lock)
 {
 	unsigned int lock_value, holder_cpu, yield_count;
 
-	lock_value = lock->slock;
+	lock_value = lock->holder;
 	if (lock_value == 0)
 		return;
 	holder_cpu = lock_value & 0xffff;
@@ -36,7 +36,7 @@ void __spin_yield(arch_spinlock_t *lock)
 	if ((yield_count & 1) == 0)
 		return;		/* virtual cpu is currently running */
 	rmb();
-	if (lock->slock != lock_value)
+	if (lock->holder != lock_value)
 		return;		/* something has changed */
 	plpar_hcall_norets(H_CONFER,
 		get_hard_smp_processor_id(holder_cpu), yield_count);
@@ -70,7 +70,7 @@ void __rw_yield(arch_rwlock_t *rw)
 
 void arch_spin_unlock_wait(arch_spinlock_t *lock)
 {
-	while (lock->slock) {
+	while (arch_spin_is_locked(lock)) {
 		HMT_low();
 		if (SHARED_PROCESSOR)
 			__spin_yield(lock);

^ permalink raw reply related

* Re: [PATCH] Convert powerpc simple spinlocks into ticket locks
From: Torsten Duwe @ 2014-02-07 17:08 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Tom Musta, linux-kernel, Paul Mackerras, Anton Blanchard,
	Scott Wood, Paul E. McKenney, linuxppc-dev, Ingo Molnar
In-Reply-To: <20140207151847.GB3104@twins.programming.kicks-ass.net>

On Fri, Feb 07, 2014 at 04:18:47PM +0100, Peter Zijlstra wrote:
> On Fri, Feb 07, 2014 at 01:28:37PM +0100, Peter Zijlstra wrote:
> > Anyway, you can do a version with lwarx/stwcx if you're looking get rid
> > of lharx.
> 
> the below seems to compile into relatively ok asm. It can be done better
> if you write the entire thing by hand though.

[...]

> 
> static inline unsigned int xadd(unsigned int *v, unsigned int i)
> {
> 	int t, ret;
> 	
> 	__asm__ __volatile__ (
> "1:	lwarx	%0, 0, %4\n"
> "	mr	%1, %0\n"
> "	add	%0, %3, %0\n"
> "	stwcx.	%0, %0, %4\n"
> "	bne-	1b\n"
> 	: "=&r" (t), "=&r" (ret), "+m" (*v)
> 	: "r" (i), "r" (v)
> 	: "cc");
> 
> 	return ret;
> }
> 
I don't like this xadd thing -- it's so x86 ;)
x86 has its LOCK prefix, ppc has ll/sc.
That should be reflected somehow IMHO.

Maybe if xadd became mandatory for some kernel library.

> 
> void ticket_unlock(tickets_t *lock)
> {
> 	ticket_t tail = lock->tail + 1;
> 
> 	/*
> 	 * The store is save against the xadd for it will make the ll/sc fail
> 	 * and try again. Aside from that PowerISA guarantees single-copy
> 	 * atomicy for half-word writes.
> 	 *
> 	 * And since only the lock owner will ever write the tail, we're good.
> 	 */
> 	smp_store_release(&lock->tail, tail);
> }

Yeah, let's try that on top of v2 (just posted).
First, I want to see v2 work as nicely as v1 --
compiling a debug kernel takes a while...

	Torsten

^ permalink raw reply

* Re: [PATCH v2] powerpc ticket locks
From: Peter Zijlstra @ 2014-02-07 17:12 UTC (permalink / raw)
  To: Torsten Duwe
  Cc: Tom Musta, linux-kernel, Paul Mackerras, Anton Blanchard,
	Scott Wood, Paul E. McKenney, linuxppc-dev, Ingo Molnar
In-Reply-To: <20140207165801.GC2107@lst.de>

On Fri, Feb 07, 2014 at 05:58:01PM +0100, Torsten Duwe wrote:
> +static __always_inline void arch_spin_lock(arch_spinlock_t *lock)
>  {
> +	register struct __raw_tickets old, tmp,
> +		inc = { .tail = TICKET_LOCK_INC };
> +
>  	CLEAR_IO_SYNC;
> +	__asm__ __volatile__(
> +"1:	lwarx	%0,0,%4		# arch_spin_lock\n"
> +"	add	%1,%3,%0\n"
> +	PPC405_ERR77(0, "%4")
> +"	stwcx.	%1,0,%4\n"
> +"	bne-	1b"
> +	: "=&r" (old), "=&r" (tmp), "+m" (lock->tickets)
> +	: "r" (inc), "r" (&lock->tickets)
> +	: "cc");
> +
> +	if (likely(old.head == old.tail))
> +		goto out;

I would have expected an lwsync someplace hereabouts.

> +	for (;;) {
> +		unsigned count = 100;
> +
>  		do {
> +			if (ACCESS_ONCE(lock->tickets.head) == old.tail)
> +				goto out;
>  			HMT_low();
>  			if (SHARED_PROCESSOR)
>  				__spin_yield(lock);
> +		} while (--count);
>  		HMT_medium();
>  	}
> +out:
> +#if defined(CONFIG_PPC_SPLPAR)
> +	lock->holder = LOCK_TOKEN;
> +#endif
> +	barrier();	/* make sure nothing creeps before the lock is taken */
>  }
>  
>  static inline

> @@ -147,10 +220,21 @@ void arch_spin_lock_flags(arch_spinlock_t *lock, unsigned long flags)
>  
>  static inline void arch_spin_unlock(arch_spinlock_t *lock)
>  {
> +	arch_spinlock_t old, new;
> +
> +#if defined(CONFIG_PPC_SPLPAR)
> +	lock->holder = 0;
> +#endif
> +	do {
> +		old.tickets = ACCESS_ONCE(lock->tickets);
> +		new.tickets.head = old.tickets.head + TICKET_LOCK_INC;
> +		new.tickets.tail = old.tickets.tail;
> +	} while (unlikely(__arch_spin_cmpxchg_eq(lock,
> +						 old.head_tail,
> +						 new.head_tail)));
>  	SYNC_IO;
>  	__asm__ __volatile__("# arch_spin_unlock\n\t"
>  				PPC_RELEASE_BARRIER: : :"memory");

Doens't your cmpxchg_eq not already imply a lwsync?

> -	lock->slock = 0;
>  }

I'm still failing to see why you need an ll/sc pair for unlock.

^ permalink raw reply

* Re: [PATCH] Convert powerpc simple spinlocks into ticket locks
From: Peter Zijlstra @ 2014-02-07 17:19 UTC (permalink / raw)
  To: Torsten Duwe
  Cc: Tom Musta, linux-kernel, Paul Mackerras, Anton Blanchard,
	Scott Wood, Paul E. McKenney, linuxppc-dev, Ingo Molnar
In-Reply-To: <20140207170845.GD2107@lst.de>

On Fri, Feb 07, 2014 at 06:08:45PM +0100, Torsten Duwe wrote:
> > static inline unsigned int xadd(unsigned int *v, unsigned int i)
> > {
> > 	int t, ret;
> > 	
> > 	__asm__ __volatile__ (
> > "1:	lwarx	%0, 0, %4\n"
> > "	mr	%1, %0\n"
> > "	add	%0, %3, %0\n"
> > "	stwcx.	%0, %0, %4\n"
> > "	bne-	1b\n"
> > 	: "=&r" (t), "=&r" (ret), "+m" (*v)
> > 	: "r" (i), "r" (v)
> > 	: "cc");
> > 
> > 	return ret;
> > }
> > 
> I don't like this xadd thing -- it's so x86 ;)
> x86 has its LOCK prefix, ppc has ll/sc.
> That should be reflected somehow IMHO.

Its the operational semantics I care about; this version is actually
nicer in that it doesn't actually imply all sorts of barriers :-)

> Maybe if xadd became mandatory for some kernel library.

call it fetch_add() its not an uncommon operation and many people
understand the semantics.

But you can simply include the asm bits in ticket_lock() and be done
with it. In that case you can also replace the add with an addi which
might be a little more efficient.

> > void ticket_unlock(tickets_t *lock)
> > {
> > 	ticket_t tail = lock->tail + 1;
> > 
> > 	/*
> > 	 * The store is save against the xadd for it will make the ll/sc fail
> > 	 * and try again. Aside from that PowerISA guarantees single-copy
> > 	 * atomicy for half-word writes.
> > 	 *
> > 	 * And since only the lock owner will ever write the tail, we're good.
> > 	 */
> > 	smp_store_release(&lock->tail, tail);
> > }
> 
> Yeah, let's try that on top of v2 (just posted).
> First, I want to see v2 work as nicely as v1 --
> compiling a debug kernel takes a while...

Use a faster machine... it can be done < 1 minute :-)

^ permalink raw reply

* Re: [RFC PATCH 3/3] slub: fallback to get_numa_mem() node if we want to allocate on memoryless node
From: Christoph Lameter @ 2014-02-07 17:49 UTC (permalink / raw)
  To: Joonsoo Kim
  Cc: Han Pingtian, Nishanth Aravamudan, mpm, penberg, linux-mm, paulus,
	Anton Blanchard, David Rientjes, linuxppc-dev, Wanpeng Li
In-Reply-To: <20140207054119.GA28952@lge.com>

On Fri, 7 Feb 2014, Joonsoo Kim wrote:

> > This check wouild need to be something that checks for other contigencies
> > in the page allocator as well. A simple solution would be to actually run
> > a GFP_THIS_NODE alloc to see if you can grab a page from the proper node.
> > If that fails then fallback. See how fallback_alloc() does it in slab.
> >
>
> Hello, Christoph.
>
> This !node_present_pages() ensure that allocation on this node cannot succeed.
> So we can directly use numa_mem_id() here.

Yes of course we can use numa_mem_id().

But the check is only for not having any memory at all on a node. There
are other reason for allocations to fail on a certain node. The node could
have memory that cannot be reclaimed, all dirty, beyond certain
thresholds, not in the current set of allowed nodes etc etc.

^ permalink raw reply

* Re: [RFC PATCH 2/3] topology: support node_numa_mem() for determining the fallback node
From: Christoph Lameter @ 2014-02-07 17:53 UTC (permalink / raw)
  To: Joonsoo Kim
  Cc: Han Pingtian, Nishanth Aravamudan, Matt Mackall, Pekka Enberg,
	Linux Memory Management List, Paul Mackerras, Anton Blanchard,
	David Rientjes, linuxppc-dev, Wanpeng Li
In-Reply-To: <20140207054819.GC28952@lge.com>

On Fri, 7 Feb 2014, Joonsoo Kim wrote:

> >
> > It seems like a better approach would be to do this when a node is brought
> > online and determine the fallback node based not on the zonelists as you
> > do here but rather on locality (such as through a SLIT if provided, see
> > node_distance()).
>
> Hmm...
> I guess that zonelist is base on locality. Zonelist is generated using
> node_distance(), so I think that it reflects locality. But, I'm not expert
> on NUMA, so please let me know what I am missing here :)

The next node can be found by going through the zonelist of a node and
checking for available memory. See fallback_alloc().

There is a function node_distance() that determines the relative
performance of a memory access from one to the other node.
The building of the fallback list for every node in build_zonelists()
relies on that.

^ permalink raw reply

* Re: [PATCH v2] powerpc ticket locks
From: Torsten Duwe @ 2014-02-07 17:55 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Tom Musta, linux-kernel, Paul Mackerras, Anton Blanchard,
	Scott Wood, Paul E. McKenney, linuxppc-dev, Ingo Molnar
In-Reply-To: <20140207171224.GR5002@laptop.programming.kicks-ass.net>

On Fri, Feb 07, 2014 at 06:12:24PM +0100, Peter Zijlstra wrote:
> On Fri, Feb 07, 2014 at 05:58:01PM +0100, Torsten Duwe wrote:
> > +static __always_inline void arch_spin_lock(arch_spinlock_t *lock)
> >  {
> > +	register struct __raw_tickets old, tmp,
> > +		inc = { .tail = TICKET_LOCK_INC };
> > +
> >  	CLEAR_IO_SYNC;
> > +	__asm__ __volatile__(
> > +"1:	lwarx	%0,0,%4		# arch_spin_lock\n"
> > +"	add	%1,%3,%0\n"
> > +	PPC405_ERR77(0, "%4")
> > +"	stwcx.	%1,0,%4\n"
> > +"	bne-	1b"
> > +	: "=&r" (old), "=&r" (tmp), "+m" (lock->tickets)
> > +	: "r" (inc), "r" (&lock->tickets)
> > +	: "cc");
> > +
> > +	if (likely(old.head == old.tail))
> > +		goto out;
> 
> I would have expected an lwsync someplace hereabouts.

Let me reconsider this. The v1 code worked on an 8 core,
maybe I didn't beat it enough.

> >  static inline void arch_spin_unlock(arch_spinlock_t *lock)
> >  {
> > +	arch_spinlock_t old, new;
> > +
> > +#if defined(CONFIG_PPC_SPLPAR)
> > +	lock->holder = 0;
> > +#endif
> > +	do {
> > +		old.tickets = ACCESS_ONCE(lock->tickets);
> > +		new.tickets.head = old.tickets.head + TICKET_LOCK_INC;
> > +		new.tickets.tail = old.tickets.tail;
> > +	} while (unlikely(__arch_spin_cmpxchg_eq(lock,
> > +						 old.head_tail,
> > +						 new.head_tail)));
> >  	SYNC_IO;
> >  	__asm__ __volatile__("# arch_spin_unlock\n\t"
> >  				PPC_RELEASE_BARRIER: : :"memory");
> 
> Doens't your cmpxchg_eq not already imply a lwsync?

Right.

> > -	lock->slock = 0;
> >  }
> 
> I'm still failing to see why you need an ll/sc pair for unlock.

Like so:
static inline void arch_spin_unlock(arch_spinlock_t *lock)
{
	arch_spinlock_t tmp;

#if defined(CONFIG_PPC_SPLPAR)
	lock->holder = 0;
#endif
	tmp.tickets = ACCESS_ONCE(lock->tickets);
	tmp.tickets.head += TICKET_LOCK_INC;
	lock->tickets.head = tmp.tickets.head;
	SYNC_IO;
	__asm__ __volatile__("# arch_spin_unlock\n\t"
				PPC_RELEASE_BARRIER: : :"memory");
}
?

I'll wrap it all up next week. I only wanted to post an updated v2
with the agreed-upon changes for BenH.

Thanks so far!

	Torsten

^ permalink raw reply

* Re: [RFC PATCH 2/3] topology: support node_numa_mem() for determining the fallback node
From: Christoph Lameter @ 2014-02-07 18:51 UTC (permalink / raw)
  To: Joonsoo Kim
  Cc: Han Pingtian, Nishanth Aravamudan, Matt Mackall, Pekka Enberg,
	Linux Memory Management List, Paul Mackerras, Anton Blanchard,
	David Rientjes, linuxppc-dev, Wanpeng Li
In-Reply-To: <alpine.DEB.2.10.1402071150090.15168@nuc>

Here is a draft of a patch to make this work with memoryless nodes.

The first thing is that we modify node_match to also match if we hit an
empty node. In that case we simply take the current slab if its there.

If there is no current slab then a regular allocation occurs with the
memoryless node. The page allocator will fallback to a possible node and
that will become the current slab. Next alloc from a memoryless node
will then use that slab.

For that we also add some tracking of allocations on nodes that were not
satisfied using the empty_node[] array. A successful alloc on a node
clears that flag.

I would rather avoid the empty_node[] array since its global and there may
be thread specific allocation restrictions but it would be expensive to do
an allocation attempt via the page allocator to make sure that there is
really no page available from the page allocator.

Index: linux/mm/slub.c
===================================================================
--- linux.orig/mm/slub.c	2014-02-03 13:19:22.896853227 -0600
+++ linux/mm/slub.c	2014-02-07 12:44:49.311494806 -0600
@@ -132,6 +132,8 @@ static inline bool kmem_cache_has_cpu_pa
 #endif
 }

+static int empty_node[MAX_NUMNODES];
+
 /*
  * Issues still to be resolved:
  *
@@ -1405,16 +1407,22 @@ static struct page *new_slab(struct kmem
 	void *last;
 	void *p;
 	int order;
+	int alloc_node;

 	BUG_ON(flags & GFP_SLAB_BUG_MASK);

 	page = allocate_slab(s,
 		flags & (GFP_RECLAIM_MASK | GFP_CONSTRAINT_MASK), node);
-	if (!page)
+	if (!page) {
+		if (node != NUMA_NO_NODE)
+			empty_node[node] = 1;
 		goto out;
+	}

 	order = compound_order(page);
-	inc_slabs_node(s, page_to_nid(page), page->objects);
+	alloc_node = page_to_nid(page);
+	empty_node[alloc_node] = 0;
+	inc_slabs_node(s, alloc_node, page->objects);
 	memcg_bind_pages(s, order);
 	page->slab_cache = s;
 	__SetPageSlab(page);
@@ -1712,7 +1720,7 @@ static void *get_partial(struct kmem_cac
 		struct kmem_cache_cpu *c)
 {
 	void *object;
-	int searchnode = (node == NUMA_NO_NODE) ? numa_node_id() : node;
+	int searchnode = (node == NUMA_NO_NODE) ? numa_mem_id() : node;

 	object = get_partial_node(s, get_node(s, searchnode), c, flags);
 	if (object || node != NUMA_NO_NODE)
@@ -2107,8 +2115,25 @@ static void flush_all(struct kmem_cache
 static inline int node_match(struct page *page, int node)
 {
 #ifdef CONFIG_NUMA
-	if (!page || (node != NUMA_NO_NODE && page_to_nid(page) != node))
+	int page_node;
+
+	/* No data means no match */
+	if (!page)
 		return 0;
+
+	/* Node does not matter. Therefore anything is a match */
+	if (node == NUMA_NO_NODE)
+		return 1;
+
+	/* Did we hit the requested node ? */
+	page_node = page_to_nid(page);
+	if (page_node == node)
+		return 1;
+
+	/* If the node has available data then we can use it. Mismatch */
+	return !empty_node[page_node];
+
+	/* Target node empty so just take anything */
 #endif
 	return 1;
 }

^ permalink raw reply

* [PATCH] powerpc: fix build failure in sysdev/mpic.c for MPIC_WEIRD=y
From: Paul Gortmaker @ 2014-02-07 19:50 UTC (permalink / raw)
  To: Benjamin Herrenschmidt, Paul Mackerras; +Cc: Paul Gortmaker, linuxppc-dev

Commit 446f6d06fab0b49c61887ecbe8286d6aaa796637 ("powerpc/mpic: Properly
set default triggers") breaks the mpc7447_hpc_defconfig as follows:

  CC      arch/powerpc/sysdev/mpic.o
arch/powerpc/sysdev/mpic.c: In function 'mpic_set_irq_type':
arch/powerpc/sysdev/mpic.c:886:9: error: case label does not reduce to an integer constant
arch/powerpc/sysdev/mpic.c:890:9: error: case label does not reduce to an integer constant
arch/powerpc/sysdev/mpic.c:894:9: error: case label does not reduce to an integer constant
arch/powerpc/sysdev/mpic.c:898:9: error: case label does not reduce to an integer constant

Looking at the cpp output (gcc 4.7.3), I see:

   case mpic->hw_set[MPIC_IDX_VECPRI_SENSE_EDGE] |
        mpic->hw_set[MPIC_IDX_VECPRI_POLARITY_POSITIVE]:

The pointer into an array appears because CONFIG_MPIC_WEIRD=y is set
for this platform, thus enabling the following:

  -------------------
  #ifdef CONFIG_MPIC_WEIRD
  static u32 mpic_infos[][MPIC_IDX_END] = {
        [0] = { /* Original OpenPIC compatible MPIC */

  [...]

  #define MPIC_INFO(name) mpic->hw_set[MPIC_IDX_##name]

  #else /* CONFIG_MPIC_WEIRD */

  #define MPIC_INFO(name) MPIC_##name

  #endif /* CONFIG_MPIC_WEIRD */
  -------------------

Here we convert the case section to if/else if, and also add
the equivalent of a default case to warn about unknown types.
Boot tested on sbc8548, build tested on all defconfigs.

Signed-off-by: Paul Gortmaker <paul.gortmaker@windriver.com>
---
 arch/powerpc/sysdev/mpic.c | 38 +++++++++++++++++++-------------------
 1 file changed, 19 insertions(+), 19 deletions(-)

diff --git a/arch/powerpc/sysdev/mpic.c b/arch/powerpc/sysdev/mpic.c
index 0e166ed4cd16..8209744b2829 100644
--- a/arch/powerpc/sysdev/mpic.c
+++ b/arch/powerpc/sysdev/mpic.c
@@ -886,25 +886,25 @@ int mpic_set_irq_type(struct irq_data *d, unsigned int flow_type)
 
 	/* Default: read HW settings */
 	if (flow_type == IRQ_TYPE_DEFAULT) {
-		switch(vold & (MPIC_INFO(VECPRI_POLARITY_MASK) |
-			       MPIC_INFO(VECPRI_SENSE_MASK))) {
-			case MPIC_INFO(VECPRI_SENSE_EDGE) |
-			     MPIC_INFO(VECPRI_POLARITY_POSITIVE):
-				flow_type = IRQ_TYPE_EDGE_RISING;
-				break;
-			case MPIC_INFO(VECPRI_SENSE_EDGE) |
-			     MPIC_INFO(VECPRI_POLARITY_NEGATIVE):
-				flow_type = IRQ_TYPE_EDGE_FALLING;
-				break;
-			case MPIC_INFO(VECPRI_SENSE_LEVEL) |
-			     MPIC_INFO(VECPRI_POLARITY_POSITIVE):
-				flow_type = IRQ_TYPE_LEVEL_HIGH;
-				break;
-			case MPIC_INFO(VECPRI_SENSE_LEVEL) |
-			     MPIC_INFO(VECPRI_POLARITY_NEGATIVE):
-				flow_type = IRQ_TYPE_LEVEL_LOW;
-				break;
-		}
+		int vold_ps;
+
+		vold_ps = vold & (MPIC_INFO(VECPRI_POLARITY_MASK) |
+				  MPIC_INFO(VECPRI_SENSE_MASK));
+
+		if (vold_ps == (MPIC_INFO(VECPRI_SENSE_EDGE) |
+				MPIC_INFO(VECPRI_POLARITY_POSITIVE)))
+			flow_type = IRQ_TYPE_EDGE_RISING;
+		else if	(vold_ps == (MPIC_INFO(VECPRI_SENSE_EDGE) |
+				     MPIC_INFO(VECPRI_POLARITY_NEGATIVE)))
+			flow_type = IRQ_TYPE_EDGE_FALLING;
+		else if (vold_ps == (MPIC_INFO(VECPRI_SENSE_LEVEL) |
+				     MPIC_INFO(VECPRI_POLARITY_POSITIVE)))
+			flow_type = IRQ_TYPE_LEVEL_HIGH;
+		else if (vold_ps == (MPIC_INFO(VECPRI_SENSE_LEVEL) |
+				     MPIC_INFO(VECPRI_POLARITY_NEGATIVE)))
+			flow_type = IRQ_TYPE_LEVEL_LOW;
+		else
+			WARN_ONCE(1, "mpic: unknown IRQ type %d\n", vold);
 	}
 
 	/* Apply to irq desc */
-- 
1.8.5.2

^ permalink raw reply related

* Re: arch/powerpc/math-emu/mtfsf.c - incorrect mask?
From: James Yang @ 2014-02-07 20:49 UTC (permalink / raw)
  To: Gabriel Paubert; +Cc: Chris Proctor, Stephen N Chivers, linuxppc-dev
In-Reply-To: <20140207101036.GA823@visitor2.iram.es>

On Fri, 7 Feb 2014, Gabriel Paubert wrote:

> 	Hi Stephen,
> 
> On Fri, Feb 07, 2014 at 11:27:57AM +1000, Stephen N Chivers wrote:
> > Gabriel Paubert <paubert@iram.es> wrote on 02/06/2014 07:26:37 PM:
> > 
> > > From: Gabriel Paubert <paubert@iram.es>
> > > To: Stephen N Chivers <schivers@csc.com.au>
> > > Cc: linuxppc-dev@lists.ozlabs.org, Chris Proctor <cproctor@csc.com.au>
> > > Date: 02/06/2014 07:26 PM
> > > Subject: Re: arch/powerpc/math-emu/mtfsf.c - incorrect mask?
> > > 
> > > On Thu, Feb 06, 2014 at 12:09:00PM +1000, Stephen N Chivers wrote:

> > > > 
> > > >                 mask = 0;
> > > >                 if (FM & (1 << 0))
> > > >                         mask |= 0x0000000f;
> > > >                 if (FM & (1 << 1))
> > > >                         mask |= 0x000000f0;
> > > >                 if (FM & (1 << 2))
> > > >                         mask |= 0x00000f00;
> > > >                 if (FM & (1 << 3))
> > > >                         mask |= 0x0000f000;
> > > >                 if (FM & (1 << 4))
> > > >                         mask |= 0x000f0000;
> > > >                 if (FM & (1 << 5))
> > > >                         mask |= 0x00f00000;
> > > >                 if (FM & (1 << 6))
> > > >                         mask |= 0x0f000000;
> > > >                 if (FM & (1 << 7))
> > > >                         mask |= 0x90000000;
> > > > 
> > > > With the above mask computation I get consistent results for 
> > > > both the MPC8548 and MPC7410 boards.
> > > > 
> > > > Am I missing something subtle?
> > > 
> > > No I think you are correct. This said, this code may probably be 
> > optimized 
> > > to eliminate a lot of the conditional branches. I think that:


If the compiler is enabled to generate isel instructions, it would not 
use a conditional branch for this code. (ignore the andi's values, 
this is an old compile)

c0037c2c <mtfsf>:
c0037c2c:       2c 03 00 00     cmpwi   r3,0
c0037c30:       41 82 01 1c     beq-    c0037d4c <mtfsf+0x120>
c0037c34:       2f 83 00 ff     cmpwi   cr7,r3,255
c0037c38:       41 9e 01 28     beq-    cr7,c0037d60 <mtfsf+0x134>
c0037c3c:       70 66 00 01     andi.   r6,r3,1
c0037c40:       3d 00 90 00     lis     r8,-28672
c0037c44:       7d 20 40 9e     iseleq  r9,r0,r8
c0037c48:       70 6a 00 02     andi.   r10,r3,2
c0037c4c:       65 28 0f 00     oris    r8,r9,3840
c0037c50:       7d 29 40 9e     iseleq  r9,r9,r8
c0037c54:       70 66 00 04     andi.   r6,r3,4
c0037c58:       65 28 00 f0     oris    r8,r9,240
c0037c5c:       7d 29 40 9e     iseleq  r9,r9,r8
c0037c60:       70 6a 00 08     andi.   r10,r3,8
c0037c64:       65 28 00 0f     oris    r8,r9,15
c0037c68:       7d 29 40 9e     iseleq  r9,r9,r8
c0037c6c:       70 66 00 10     andi.   r6,r3,16
c0037c70:       61 28 f0 00     ori     r8,r9,61440
c0037c74:       7d 29 40 9e     iseleq  r9,r9,r8
c0037c78:       70 6a 00 20     andi.   r10,r3,32
c0037c7c:       61 28 0f 00     ori     r8,r9,3840
c0037c80:       54 6a cf fe     rlwinm  r10,r3,25,31,31
c0037c84:       7d 29 40 9e     iseleq  r9,r9,r8
c0037c88:       2f 8a 00 00     cmpwi   cr7,r10,0
c0037c8c:       70 66 00 40     andi.   r6,r3,64
c0037c90:       61 28 00 f0     ori     r8,r9,240
c0037c94:       7d 29 40 9e     iseleq  r9,r9,r8
c0037c98:       41 9e 00 08     beq-    cr7,c0037ca0 <mtfsf+0x74>
c0037c9c:       61 29 00 0f     ori     r9,r9,15
	...
 
However, your other solutions are better.


> > > 
> > > mask = (FM & 1);
> > > mask |= (FM << 3) & 0x10;
> > > mask |= (FM << 6) & 0x100;
> > > mask |= (FM << 9) & 0x1000;
> > > mask |= (FM << 12) & 0x10000;
> > > mask |= (FM << 15) & 0x100000;
> > > mask |= (FM << 18) & 0x1000000;
> > > mask |= (FM << 21) & 0x10000000;
> > > mask *= 15;
> > > 
> > > should do the job, in less code space and without a single branch.
> > > 
> > > Each one of the "mask |=" lines should be translated into an
> > > rlwinm instruction followed by an "or". Actually it should be possible
> > > to transform each of these lines into a single "rlwimi" instruction
> > > but I don't know how to coerce gcc to reach this level of optimization.
> > > 
> > > Another way of optomizing this could be:
> > > 
> > > mask = (FM & 0x0f) | ((FM << 12) & 0x000f0000);
> > > mask = (mask & 0x00030003) | ((mask << 6) & 0x03030303);
> > > mask = (mask & 0x01010101) | ((mask << 3) & 0x10101010);
> > > mask *= 15;
> > > 


> Ok, I finally edited my sources and test compiled the suggestions
> I gave. I'd say that method2 is the best overall indeed. You can
> actually save one more instruction by setting mask to all ones in 
> the case FM=0xff, but that's about all in this area.


My measurements show method1 to be smaller and faster than method2 due 
to the number of instructions needed to generate the constant masks in 
method2, but it may depend upon your compiler and hardware.  Both are 
faster than the original with isel.

^ permalink raw reply

* Re: [RFC PATCH 2/3] topology: support node_numa_mem() for determining the fallback node
From: Nishanth Aravamudan @ 2014-02-07 21:38 UTC (permalink / raw)
  To: Christoph Lameter
  Cc: Han Pingtian, Matt Mackall, Pekka Enberg,
	Linux Memory Management List, Paul Mackerras, Anton Blanchard,
	David Rientjes, Joonsoo Kim, linuxppc-dev, Wanpeng Li
In-Reply-To: <alpine.DEB.2.10.1402071245040.20246@nuc>

On 07.02.2014 [12:51:07 -0600], Christoph Lameter wrote:
> Here is a draft of a patch to make this work with memoryless nodes.

Hi Christoph, this should be tested instead of Joonsoo's patch 2 (and 3)?

Thanks,
Nish

^ permalink raw reply

* Re: [RFC PATCH 2/3] topology: support node_numa_mem() for determining the fallback node
From: David Rientjes @ 2014-02-08  9:57 UTC (permalink / raw)
  To: Joonsoo Kim
  Cc: Han Pingtian, Nishanth Aravamudan, Pekka Enberg,
	Linux Memory Management List, Paul Mackerras, Anton Blanchard,
	Matt Mackall, Christoph Lameter, linuxppc-dev, Wanpeng Li
In-Reply-To: <20140207054819.GC28952@lge.com>

On Fri, 7 Feb 2014, Joonsoo Kim wrote:

> > It seems like a better approach would be to do this when a node is brought 
> > online and determine the fallback node based not on the zonelists as you 
> > do here but rather on locality (such as through a SLIT if provided, see 
> > node_distance()).
> 
> Hmm...
> I guess that zonelist is base on locality. Zonelist is generated using
> node_distance(), so I think that it reflects locality. But, I'm not expert
> on NUMA, so please let me know what I am missing here :)
> 

The zonelist is, yes, but I'm talking about memoryless and cpuless nodes.  
If your solution is going to become the generic kernel API that determines 
what node has local memory for a particular node, then it will have to 
support all definitions of node.  That includes nodes that consist solely 
of I/O, chipsets, networking, or storage devices.  These nodes may not 
have memory or cpus, so doing it as part of onlining cpus isn't going to 
be generic enough.  You want a node_to_mem_node() API for all possible 
node types (the possible node types listed above are straight from the 
ACPI spec).  For 99% of people, node_to_mem_node(X) is always going to be 
X and we can optimize for that, but any solution that relies on cpu online 
is probably shortsighted right now.

I think it would be much better to do this as a part of setting a node to 
be online.

^ permalink raw reply

* Re: [PATCH v2] powerpc/powernv: Platform dump interface
From: Anton Blanchard @ 2014-02-08 21:20 UTC (permalink / raw)
  To: Vasant Hegde; +Cc: linuxppc-dev
In-Reply-To: <20140116121411.624.55662.stgit@hegdevasant.in.ibm.com>


Hi Vasant,

> +static void free_dump_sg_list(struct opal_sg_list *list)
> +{
> +	struct opal_sg_list *sg1;
> +	while (list) {
> +		sg1 = list->next;
> +		kfree(list);
> +		list = sg1;
> +	}
> +	list = NULL;
> +}
> +
> +/*
> + * Build dump buffer scatter gather list
> + */
> +static struct opal_sg_list *dump_data_to_sglist(void)
> +{
> +	struct opal_sg_list *sg1, *list = NULL;
> +	void *addr;
> +	int64_t size;
> +
> +	addr = dump_record.buffer;
> +	size = dump_record.size;
> +
> +	sg1 = kzalloc(PAGE_SIZE, GFP_KERNEL);
> +	if (!sg1)
> +		goto nomem;
> +
> +	list = sg1;
> +	sg1->num_entries = 0;
> +	while (size > 0) {
> +		/* Translate virtual address to physical address */
> +		sg1->entry[sg1->num_entries].data =
> +			(void *)(vmalloc_to_pfn(addr) << PAGE_SHIFT);
> +
> +		if (size > PAGE_SIZE)
> +			sg1->entry[sg1->num_entries].length =
> PAGE_SIZE;
> +		else
> +			sg1->entry[sg1->num_entries].length = size;
> +
> +		sg1->num_entries++;
> +		if (sg1->num_entries >= SG_ENTRIES_PER_NODE) {
> +			sg1->next = kzalloc(PAGE_SIZE, GFP_KERNEL);
> +			if (!sg1->next)
> +				goto nomem;
> +
> +			sg1 = sg1->next;
> +			sg1->num_entries = 0;
> +		}
> +		addr += PAGE_SIZE;
> +		size -= PAGE_SIZE;
> +	}
> +	return list;
> +
> +nomem:
> +	pr_err("%s : Failed to allocate memory\n", __func__);
> +	free_dump_sg_list(list);
> +	return NULL;
> +}
> +
> +/*
> + * Translate sg list address to absolute
> + */
> +static void sglist_to_phy_addr(struct opal_sg_list *list)
> +{
> +	struct opal_sg_list *sg, *next;
> +
> +	for (sg = list; sg; sg = next) {
> +		next = sg->next;
> +		/* Don't translate NULL pointer for last entry */
> +		if (sg->next)
> +			sg->next = (struct opal_sg_list
> *)__pa(sg->next);
> +		else
> +			sg->next = NULL;
> +
> +		/* Convert num_entries to length */
> +		sg->num_entries =
> +			sg->num_entries * sizeof(struct
> opal_sg_entry) + 16;
> +	}
> +}
> +
> +static void free_dump_data_buf(void)
> +{
> +	vfree(dump_record.buffer);
> +	dump_record.size = 0;
> +}

This looks identical to the code in opal-flash.c. Considering how
complicated it is, can we put it somewhere common?

Anton

^ permalink raw reply

* [PATCH RFC/RFT v2 5/8] powerpc: move cacheinfo sysfs to generic cacheinfo infrastructure
From: Sudeep Holla @ 2014-02-07 16:49 UTC (permalink / raw)
  To: linux-kernel; +Cc: Paul Mackerras, linuxppc-dev, sudeep.holla
In-Reply-To: <1391791763-28518-1-git-send-email-sudeep.holla@arm.com>

From: Sudeep Holla <sudeep.holla@arm.com>

This patch removes the redundant sysfs cacheinfo code by making use of
the newly introduced generic cacheinfo infrastructure.

Signed-off-by: Sudeep Holla <sudeep.holla@arm.com>
Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Cc: Paul Mackerras <paulus@samba.org>
Cc: linuxppc-dev@lists.ozlabs.org
---
 arch/powerpc/kernel/cacheinfo.c | 828 ++++++----------------------------------
 arch/powerpc/kernel/cacheinfo.h |   8 -
 arch/powerpc/kernel/sysfs.c     |   4 -
 3 files changed, 109 insertions(+), 731 deletions(-)
 delete mode 100644 arch/powerpc/kernel/cacheinfo.h

diff --git a/arch/powerpc/kernel/cacheinfo.c b/arch/powerpc/kernel/cacheinfo.c
index abfa011..05b7580 100644
--- a/arch/powerpc/kernel/cacheinfo.c
+++ b/arch/powerpc/kernel/cacheinfo.c
@@ -10,38 +10,10 @@
  * 2 as published by the Free Software Foundation.
  */
 
+#include <linux/cacheinfo.h>
 #include <linux/cpu.h>
-#include <linux/cpumask.h>
 #include <linux/kernel.h>
-#include <linux/kobject.h>
-#include <linux/list.h>
-#include <linux/notifier.h>
 #include <linux/of.h>
-#include <linux/percpu.h>
-#include <linux/slab.h>
-#include <asm/prom.h>
-
-#include "cacheinfo.h"
-
-/* per-cpu object for tracking:
- * - a "cache" kobject for the top-level directory
- * - a list of "index" objects representing the cpu's local cache hierarchy
- */
-struct cache_dir {
-	struct kobject *kobj; /* bare (not embedded) kobject for cache
-			       * directory */
-	struct cache_index_dir *index; /* list of index objects */
-};
-
-/* "index" object: each cpu's cache directory has an index
- * subdirectory corresponding to a cache object associated with the
- * cpu.  This object's lifetime is managed via the embedded kobject.
- */
-struct cache_index_dir {
-	struct kobject kobj;
-	struct cache_index_dir *next; /* next index in parent directory */
-	struct cache *cache;
-};
 
 /* Template for determining which OF properties to query for a given
  * cache type */
@@ -60,11 +32,6 @@ struct cache_type_info {
 	const char *nr_sets_prop;
 };
 
-/* These are used to index the cache_type_info array. */
-#define CACHE_TYPE_UNIFIED     0
-#define CACHE_TYPE_INSTRUCTION 1
-#define CACHE_TYPE_DATA        2
-
 static const struct cache_type_info cache_type_info[] = {
 	{
 		/* PowerPC Processor binding says the [di]-cache-*
@@ -77,246 +44,115 @@ static const struct cache_type_info cache_type_info[] = {
 		.nr_sets_prop    = "d-cache-sets",
 	},
 	{
-		.name            = "Instruction",
-		.size_prop       = "i-cache-size",
-		.line_size_props = { "i-cache-line-size",
-				     "i-cache-block-size", },
-		.nr_sets_prop    = "i-cache-sets",
-	},
-	{
 		.name            = "Data",
 		.size_prop       = "d-cache-size",
 		.line_size_props = { "d-cache-line-size",
 				     "d-cache-block-size", },
 		.nr_sets_prop    = "d-cache-sets",
 	},
+	{
+		.name            = "Instruction",
+		.size_prop       = "i-cache-size",
+		.line_size_props = { "i-cache-line-size",
+				     "i-cache-block-size", },
+		.nr_sets_prop    = "i-cache-sets",
+	},
 };
 
-/* Cache object: each instance of this corresponds to a distinct cache
- * in the system.  There are separate objects for Harvard caches: one
- * each for instruction and data, and each refers to the same OF node.
- * The refcount of the OF node is elevated for the lifetime of the
- * cache object.  A cache object is released when its shared_cpu_map
- * is cleared (see cache_cpu_clear).
- *
- * A cache object is on two lists: an unsorted global list
- * (cache_list) of cache objects; and a singly-linked list
- * representing the local cache hierarchy, which is ordered by level
- * (e.g. L1d -> L1i -> L2 -> L3).
- */
-struct cache {
-	struct device_node *ofnode;    /* OF node for this cache, may be cpu */
-	struct cpumask shared_cpu_map; /* online CPUs using this cache */
-	int type;                      /* split cache disambiguation */
-	int level;                     /* level not explicit in device tree */
-	struct list_head list;         /* global list of cache objects */
-	struct cache *next_local;      /* next cache of >= level */
-};
-
-static DEFINE_PER_CPU(struct cache_dir *, cache_dir_pcpu);
-
-/* traversal/modification of this list occurs only at cpu hotplug time;
- * access is serialized by cpu hotplug locking
- */
-static LIST_HEAD(cache_list);
-
-static struct cache_index_dir *kobj_to_cache_index_dir(struct kobject *k)
-{
-	return container_of(k, struct cache_index_dir, kobj);
-}
-
-static const char *cache_type_string(const struct cache *cache)
+static inline int get_cacheinfo_idx(enum cache_type type)
 {
-	return cache_type_info[cache->type].name;
-}
-
-static void cache_init(struct cache *cache, int type, int level,
-		       struct device_node *ofnode)
-{
-	cache->type = type;
-	cache->level = level;
-	cache->ofnode = of_node_get(ofnode);
-	INIT_LIST_HEAD(&cache->list);
-	list_add(&cache->list, &cache_list);
-}
-
-static struct cache *new_cache(int type, int level, struct device_node *ofnode)
-{
-	struct cache *cache;
-
-	cache = kzalloc(sizeof(*cache), GFP_KERNEL);
-	if (cache)
-		cache_init(cache, type, level, ofnode);
-
-	return cache;
-}
-
-static void release_cache_debugcheck(struct cache *cache)
-{
-	struct cache *iter;
-
-	list_for_each_entry(iter, &cache_list, list)
-		WARN_ONCE(iter->next_local == cache,
-			  "cache for %s(%s) refers to cache for %s(%s)\n",
-			  iter->ofnode->full_name,
-			  cache_type_string(iter),
-			  cache->ofnode->full_name,
-			  cache_type_string(cache));
-}
-
-static void release_cache(struct cache *cache)
-{
-	if (!cache)
-		return;
-
-	pr_debug("freeing L%d %s cache for %s\n", cache->level,
-		 cache_type_string(cache), cache->ofnode->full_name);
-
-	release_cache_debugcheck(cache);
-	list_del(&cache->list);
-	of_node_put(cache->ofnode);
-	kfree(cache);
-}
-
-static void cache_cpu_set(struct cache *cache, int cpu)
-{
-	struct cache *next = cache;
-
-	while (next) {
-		WARN_ONCE(cpumask_test_cpu(cpu, &next->shared_cpu_map),
-			  "CPU %i already accounted in %s(%s)\n",
-			  cpu, next->ofnode->full_name,
-			  cache_type_string(next));
-		cpumask_set_cpu(cpu, &next->shared_cpu_map);
-		next = next->next_local;
-	}
+	if (type == CACHE_TYPE_UNIFIED)
+		return 0;
+	else
+		return type;
 }
 
-static int cache_size(const struct cache *cache, unsigned int *ret)
+static int cache_size(struct cache_info *this_leaf)
 {
 	const char *propname;
 	const __be32 *cache_size;
+	int ct_idx;
 
-	propname = cache_type_info[cache->type].size_prop;
-
-	cache_size = of_get_property(cache->ofnode, propname, NULL);
-	if (!cache_size)
-		return -ENODEV;
-
-	*ret = of_read_number(cache_size, 1);
-	return 0;
-}
-
-static int cache_size_kb(const struct cache *cache, unsigned int *ret)
-{
-	unsigned int size;
+	ct_idx = get_cacheinfo_idx(this_leaf->type);
+	propname = cache_type_info[ct_idx].size_prop;
 
-	if (cache_size(cache, &size))
+	cache_size = of_get_property(this_leaf->of_node, propname, NULL);
+	if (!cache_size) {
+		this_leaf->size = 0;
 		return -ENODEV;
-
-	*ret = size / 1024;
-	return 0;
+	} else {
+		this_leaf->size = of_read_number(cache_size, 1);
+		return 0;
+	}
 }
 
 /* not cache_line_size() because that's a macro in include/linux/cache.h */
-static int cache_get_line_size(const struct cache *cache, unsigned int *ret)
+static int cache_get_line_size(struct cache_info *this_leaf)
 {
 	const __be32 *line_size;
-	int i, lim;
+	int i, lim, ct_idx;
 
-	lim = ARRAY_SIZE(cache_type_info[cache->type].line_size_props);
+	ct_idx = get_cacheinfo_idx(this_leaf->type);
+	lim = ARRAY_SIZE(cache_type_info[ct_idx].line_size_props);
 
 	for (i = 0; i < lim; i++) {
 		const char *propname;
 
-		propname = cache_type_info[cache->type].line_size_props[i];
-		line_size = of_get_property(cache->ofnode, propname, NULL);
+		propname = cache_type_info[ct_idx].line_size_props[i];
+		line_size = of_get_property(this_leaf->of_node, propname, NULL);
 		if (line_size)
 			break;
 	}
 
-	if (!line_size)
+	if (!line_size) {
+		this_leaf->coherency_line_size = 0;
 		return -ENODEV;
-
-	*ret = of_read_number(line_size, 1);
-	return 0;
+	} else {
+		this_leaf->coherency_line_size = of_read_number(line_size, 1);
+		return 0;
+	}
 }
 
-static int cache_nr_sets(const struct cache *cache, unsigned int *ret)
+static int cache_nr_sets(struct cache_info *this_leaf)
 {
 	const char *propname;
 	const __be32 *nr_sets;
+	int ct_idx;
 
-	propname = cache_type_info[cache->type].nr_sets_prop;
+	ct_idx = get_cacheinfo_idx(this_leaf->type);
+	propname = cache_type_info[ct_idx].nr_sets_prop;
 
-	nr_sets = of_get_property(cache->ofnode, propname, NULL);
-	if (!nr_sets)
+	nr_sets = of_get_property(this_leaf->of_node, propname, NULL);
+	if (!nr_sets) {
+		this_leaf->number_of_sets = 0;
 		return -ENODEV;
-
-	*ret = of_read_number(nr_sets, 1);
-	return 0;
+	} else {
+		this_leaf->number_of_sets = of_read_number(nr_sets, 1);
+		return 0;
+	}
 }
 
-static int cache_associativity(const struct cache *cache, unsigned int *ret)
+static int cache_associativity(struct cache_info *this_leaf)
 {
-	unsigned int line_size;
-	unsigned int nr_sets;
-	unsigned int size;
-
-	if (cache_nr_sets(cache, &nr_sets))
-		goto err;
+	unsigned int line_size = this_leaf->coherency_line_size;
+	unsigned int nr_sets = this_leaf->number_of_sets;
+	unsigned int size = this_leaf->size;
 
 	/* If the cache is fully associative, there is no need to
 	 * check the other properties.
 	 */
 	if (nr_sets == 1) {
-		*ret = 0;
+		this_leaf->ways_of_associativity = 0;
 		return 0;
 	}
 
-	if (cache_get_line_size(cache, &line_size))
-		goto err;
-	if (cache_size(cache, &size))
-		goto err;
-
-	if (!(nr_sets > 0 && size > 0 && line_size > 0))
-		goto err;
-
-	*ret = (size / nr_sets) / line_size;
-	return 0;
-err:
-	return -ENODEV;
-}
-
-/* helper for dealing with split caches */
-static struct cache *cache_find_first_sibling(struct cache *cache)
-{
-	struct cache *iter;
-
-	if (cache->type == CACHE_TYPE_UNIFIED)
-		return cache;
-
-	list_for_each_entry(iter, &cache_list, list)
-		if (iter->ofnode == cache->ofnode && iter->next_local == cache)
-			return iter;
-
-	return cache;
-}
-
-/* return the first cache on a local list matching node */
-static struct cache *cache_lookup_by_node(const struct device_node *node)
-{
-	struct cache *cache = NULL;
-	struct cache *iter;
-
-	list_for_each_entry(iter, &cache_list, list) {
-		if (iter->ofnode != node)
-			continue;
-		cache = cache_find_first_sibling(iter);
-		break;
+	if (!(nr_sets > 0 && size > 0 && line_size > 0)) {
+		this_leaf->ways_of_associativity = 0;
+		return -ENODEV;
+	} else {
+		this_leaf->ways_of_associativity = (size / nr_sets) / line_size;
+		return 0;
 	}
-
-	return cache;
 }
 
 static bool cache_node_is_unified(const struct device_node *np)
@@ -324,520 +160,74 @@ static bool cache_node_is_unified(const struct device_node *np)
 	return of_get_property(np, "cache-unified", NULL);
 }
 
-static struct cache *cache_do_one_devnode_unified(struct device_node *node,
-						  int level)
-{
-	struct cache *cache;
-
-	pr_debug("creating L%d ucache for %s\n", level, node->full_name);
-
-	cache = new_cache(CACHE_TYPE_UNIFIED, level, node);
-
-	return cache;
-}
-
-static struct cache *cache_do_one_devnode_split(struct device_node *node,
-						int level)
+static void ci_leaf_init(struct cache_info *this_leaf,
+				enum cache_type type, unsigned int level)
 {
-	struct cache *dcache, *icache;
-
-	pr_debug("creating L%d dcache and icache for %s\n", level,
-		 node->full_name);
-
-	dcache = new_cache(CACHE_TYPE_DATA, level, node);
-	icache = new_cache(CACHE_TYPE_INSTRUCTION, level, node);
-
-	if (!dcache || !icache)
-		goto err;
-
-	dcache->next_local = icache;
-
-	return dcache;
-err:
-	release_cache(dcache);
-	release_cache(icache);
-	return NULL;
+	this_leaf->level = level;
+	this_leaf->type = type;
+	cache_size(this_leaf);
+	cache_get_line_size(this_leaf);
+	cache_nr_sets(this_leaf);
+	cache_associativity(this_leaf);
 }
 
-static struct cache *cache_do_one_devnode(struct device_node *node, int level)
+int init_cache_level(unsigned int cpu)
 {
-	struct cache *cache;
-
-	if (cache_node_is_unified(node))
-		cache = cache_do_one_devnode_unified(node, level);
-	else
-		cache = cache_do_one_devnode_split(node, level);
-
-	return cache;
-}
+	struct device_node *np;
+	struct device *cpu_dev = get_cpu_device(cpu);
+	struct cpu_cacheinfo *this_cpu_ci = get_cpu_cacheinfo(cpu);
+	unsigned int level = 0, leaves = 0;
 
-static struct cache *cache_lookup_or_instantiate(struct device_node *node,
-						 int level)
-{
-	struct cache *cache;
-
-	cache = cache_lookup_by_node(node);
-
-	WARN_ONCE(cache && cache->level != level,
-		  "cache level mismatch on lookup (got %d, expected %d)\n",
-		  cache->level, level);
-
-	if (!cache)
-		cache = cache_do_one_devnode(node, level);
-
-	return cache;
-}
-
-static void link_cache_lists(struct cache *smaller, struct cache *bigger)
-{
-	while (smaller->next_local) {
-		if (smaller->next_local == bigger)
-			return; /* already linked */
-		smaller = smaller->next_local;
-	}
-
-	smaller->next_local = bigger;
-}
-
-static void do_subsidiary_caches_debugcheck(struct cache *cache)
-{
-	WARN_ON_ONCE(cache->level != 1);
-	WARN_ON_ONCE(strcmp(cache->ofnode->type, "cpu"));
-}
-
-static void do_subsidiary_caches(struct cache *cache)
-{
-	struct device_node *subcache_node;
-	int level = cache->level;
-
-	do_subsidiary_caches_debugcheck(cache);
-
-	while ((subcache_node = of_find_next_cache_node(cache->ofnode))) {
-		struct cache *subcache;
-
-		level++;
-		subcache = cache_lookup_or_instantiate(subcache_node, level);
-		of_node_put(subcache_node);
-		if (!subcache)
-			break;
-
-		link_cache_lists(cache, subcache);
-		cache = subcache;
-	}
-}
-
-static struct cache *cache_chain_instantiate(unsigned int cpu_id)
-{
-	struct device_node *cpu_node;
-	struct cache *cpu_cache = NULL;
-
-	pr_debug("creating cache object(s) for CPU %i\n", cpu_id);
-
-	cpu_node = of_get_cpu_node(cpu_id, NULL);
-	WARN_ONCE(!cpu_node, "no OF node found for CPU %i\n", cpu_id);
-	if (!cpu_node)
-		goto out;
-
-	cpu_cache = cache_lookup_or_instantiate(cpu_node, 1);
-	if (!cpu_cache)
-		goto out;
-
-	do_subsidiary_caches(cpu_cache);
-
-	cache_cpu_set(cpu_cache, cpu_id);
-out:
-	of_node_put(cpu_node);
-
-	return cpu_cache;
-}
-
-static struct cache_dir *cacheinfo_create_cache_dir(unsigned int cpu_id)
-{
-	struct cache_dir *cache_dir;
-	struct device *dev;
-	struct kobject *kobj = NULL;
-
-	dev = get_cpu_device(cpu_id);
-	WARN_ONCE(!dev, "no dev for CPU %i\n", cpu_id);
-	if (!dev)
-		goto err;
-
-	kobj = kobject_create_and_add("cache", &dev->kobj);
-	if (!kobj)
-		goto err;
-
-	cache_dir = kzalloc(sizeof(*cache_dir), GFP_KERNEL);
-	if (!cache_dir)
-		goto err;
-
-	cache_dir->kobj = kobj;
-
-	WARN_ON_ONCE(per_cpu(cache_dir_pcpu, cpu_id) != NULL);
-
-	per_cpu(cache_dir_pcpu, cpu_id) = cache_dir;
-
-	return cache_dir;
-err:
-	kobject_put(kobj);
-	return NULL;
-}
-
-static void cache_index_release(struct kobject *kobj)
-{
-	struct cache_index_dir *index;
-
-	index = kobj_to_cache_index_dir(kobj);
-
-	pr_debug("freeing index directory for L%d %s cache\n",
-		 index->cache->level, cache_type_string(index->cache));
-
-	kfree(index);
-}
-
-static ssize_t cache_index_show(struct kobject *k, struct attribute *attr, char *buf)
-{
-	struct kobj_attribute *kobj_attr;
-
-	kobj_attr = container_of(attr, struct kobj_attribute, attr);
-
-	return kobj_attr->show(k, kobj_attr, buf);
-}
-
-static struct cache *index_kobj_to_cache(struct kobject *k)
-{
-	struct cache_index_dir *index;
-
-	index = kobj_to_cache_index_dir(k);
-
-	return index->cache;
-}
-
-static ssize_t size_show(struct kobject *k, struct kobj_attribute *attr, char *buf)
-{
-	unsigned int size_kb;
-	struct cache *cache;
-
-	cache = index_kobj_to_cache(k);
-
-	if (cache_size_kb(cache, &size_kb))
+	if (!cpu_dev) {
+		pr_err("No cpu device for CPU %d\n", cpu);
 		return -ENODEV;
-
-	return sprintf(buf, "%uK\n", size_kb);
-}
-
-static struct kobj_attribute cache_size_attr =
-	__ATTR(size, 0444, size_show, NULL);
-
-
-static ssize_t line_size_show(struct kobject *k, struct kobj_attribute *attr, char *buf)
-{
-	unsigned int line_size;
-	struct cache *cache;
-
-	cache = index_kobj_to_cache(k);
-
-	if (cache_get_line_size(cache, &line_size))
-		return -ENODEV;
-
-	return sprintf(buf, "%u\n", line_size);
-}
-
-static struct kobj_attribute cache_line_size_attr =
-	__ATTR(coherency_line_size, 0444, line_size_show, NULL);
-
-static ssize_t nr_sets_show(struct kobject *k, struct kobj_attribute *attr, char *buf)
-{
-	unsigned int nr_sets;
-	struct cache *cache;
-
-	cache = index_kobj_to_cache(k);
-
-	if (cache_nr_sets(cache, &nr_sets))
-		return -ENODEV;
-
-	return sprintf(buf, "%u\n", nr_sets);
-}
-
-static struct kobj_attribute cache_nr_sets_attr =
-	__ATTR(number_of_sets, 0444, nr_sets_show, NULL);
-
-static ssize_t associativity_show(struct kobject *k, struct kobj_attribute *attr, char *buf)
-{
-	unsigned int associativity;
-	struct cache *cache;
-
-	cache = index_kobj_to_cache(k);
-
-	if (cache_associativity(cache, &associativity))
-		return -ENODEV;
-
-	return sprintf(buf, "%u\n", associativity);
-}
-
-static struct kobj_attribute cache_assoc_attr =
-	__ATTR(ways_of_associativity, 0444, associativity_show, NULL);
-
-static ssize_t type_show(struct kobject *k, struct kobj_attribute *attr, char *buf)
-{
-	struct cache *cache;
-
-	cache = index_kobj_to_cache(k);
-
-	return sprintf(buf, "%s\n", cache_type_string(cache));
-}
-
-static struct kobj_attribute cache_type_attr =
-	__ATTR(type, 0444, type_show, NULL);
-
-static ssize_t level_show(struct kobject *k, struct kobj_attribute *attr, char *buf)
-{
-	struct cache_index_dir *index;
-	struct cache *cache;
-
-	index = kobj_to_cache_index_dir(k);
-	cache = index->cache;
-
-	return sprintf(buf, "%d\n", cache->level);
-}
-
-static struct kobj_attribute cache_level_attr =
-	__ATTR(level, 0444, level_show, NULL);
-
-static ssize_t shared_cpu_map_show(struct kobject *k, struct kobj_attribute *attr, char *buf)
-{
-	struct cache_index_dir *index;
-	struct cache *cache;
-	int len;
-	int n = 0;
-
-	index = kobj_to_cache_index_dir(k);
-	cache = index->cache;
-	len = PAGE_SIZE - 2;
-
-	if (len > 1) {
-		n = cpumask_scnprintf(buf, len, &cache->shared_cpu_map);
-		buf[n++] = '\n';
-		buf[n] = '\0';
 	}
-	return n;
-}
-
-static struct kobj_attribute cache_shared_cpu_map_attr =
-	__ATTR(shared_cpu_map, 0444, shared_cpu_map_show, NULL);
-
-/* Attributes which should always be created -- the kobject/sysfs core
- * does this automatically via kobj_type->default_attrs.  This is the
- * minimum data required to uniquely identify a cache.
- */
-static struct attribute *cache_index_default_attrs[] = {
-	&cache_type_attr.attr,
-	&cache_level_attr.attr,
-	&cache_shared_cpu_map_attr.attr,
-	NULL,
-};
-
-/* Attributes which should be created if the cache device node has the
- * right properties -- see cacheinfo_create_index_opt_attrs
- */
-static struct kobj_attribute *cache_index_opt_attrs[] = {
-	&cache_size_attr,
-	&cache_line_size_attr,
-	&cache_nr_sets_attr,
-	&cache_assoc_attr,
-};
-
-static const struct sysfs_ops cache_index_ops = {
-	.show = cache_index_show,
-};
-
-static struct kobj_type cache_index_type = {
-	.release = cache_index_release,
-	.sysfs_ops = &cache_index_ops,
-	.default_attrs = cache_index_default_attrs,
-};
-
-static void cacheinfo_create_index_opt_attrs(struct cache_index_dir *dir)
-{
-	const char *cache_name;
-	const char *cache_type;
-	struct cache *cache;
-	char *buf;
-	int i;
-
-	buf = kmalloc(PAGE_SIZE, GFP_KERNEL);
-	if (!buf)
-		return;
-
-	cache = dir->cache;
-	cache_name = cache->ofnode->full_name;
-	cache_type = cache_type_string(cache);
-
-	/* We don't want to create an attribute that can't provide a
-	 * meaningful value.  Check the return value of each optional
-	 * attribute's ->show method before registering the
-	 * attribute.
-	 */
-	for (i = 0; i < ARRAY_SIZE(cache_index_opt_attrs); i++) {
-		struct kobj_attribute *attr;
-		ssize_t rc;
-
-		attr = cache_index_opt_attrs[i];
-
-		rc = attr->show(&dir->kobj, attr, buf);
-		if (rc <= 0) {
-			pr_debug("not creating %s attribute for "
-				 "%s(%s) (rc = %zd)\n",
-				 attr->attr.name, cache_name,
-				 cache_type, rc);
-			continue;
-		}
-		if (sysfs_create_file(&dir->kobj, &attr->attr))
-			pr_debug("could not create %s attribute for %s(%s)\n",
-				 attr->attr.name, cache_name, cache_type);
+	np = cpu_dev->of_node;
+	if (!np) {
+		pr_err("Failed to find cpu%d device node\n", cpu);
+		return -ENOENT;
 	}
 
-	kfree(buf);
-}
-
-static void cacheinfo_create_index_dir(struct cache *cache, int index,
-				       struct cache_dir *cache_dir)
-{
-	struct cache_index_dir *index_dir;
-	int rc;
-
-	index_dir = kzalloc(sizeof(*index_dir), GFP_KERNEL);
-	if (!index_dir)
-		goto err;
-
-	index_dir->cache = cache;
-
-	rc = kobject_init_and_add(&index_dir->kobj, &cache_index_type,
-				  cache_dir->kobj, "index%d", index);
-	if (rc)
-		goto err;
-
-	index_dir->next = cache_dir->index;
-	cache_dir->index = index_dir;
-
-	cacheinfo_create_index_opt_attrs(index_dir);
-
-	return;
-err:
-	kfree(index_dir);
-}
-
-static void cacheinfo_sysfs_populate(unsigned int cpu_id,
-				     struct cache *cache_list)
-{
-	struct cache_dir *cache_dir;
-	struct cache *cache;
-	int index = 0;
-
-	cache_dir = cacheinfo_create_cache_dir(cpu_id);
-	if (!cache_dir)
-		return;
-
-	cache = cache_list;
-	while (cache) {
-		cacheinfo_create_index_dir(cache, index, cache_dir);
-		index++;
-		cache = cache->next_local;
+	while (np) {
+		leaves += cache_node_is_unified(np) ? 1 : 2;
+		level++;
+		of_node_put(np);
+		np = of_find_next_cache_node(np);
 	}
-}
-
-void cacheinfo_cpu_online(unsigned int cpu_id)
-{
-	struct cache *cache;
-
-	cache = cache_chain_instantiate(cpu_id);
-	if (!cache)
-		return;
-
-	cacheinfo_sysfs_populate(cpu_id, cache);
-}
-
-#ifdef CONFIG_HOTPLUG_CPU /* functions needed for cpu offline */
+	this_cpu_ci->num_levels = level;
+	this_cpu_ci->num_leaves = leaves;
 
-static struct cache *cache_lookup_by_cpu(unsigned int cpu_id)
-{
-	struct device_node *cpu_node;
-	struct cache *cache;
-
-	cpu_node = of_get_cpu_node(cpu_id, NULL);
-	WARN_ONCE(!cpu_node, "no OF node found for CPU %i\n", cpu_id);
-	if (!cpu_node)
-		return NULL;
-
-	cache = cache_lookup_by_node(cpu_node);
-	of_node_put(cpu_node);
-
-	return cache;
+	return 0;
 }
 
-static void remove_index_dirs(struct cache_dir *cache_dir)
+int populate_cache_leaves(unsigned int cpu)
 {
-	struct cache_index_dir *index;
+	struct cpu_cacheinfo *this_cpu_ci = get_cpu_cacheinfo(cpu);
+	struct cache_info *this_leaf = this_cpu_ci->info_list;
+	struct device *cpu_dev = get_cpu_device(cpu);
+	struct device_node *np;
+	unsigned int level, idx;
 
-	index = cache_dir->index;
-
-	while (index) {
-		struct cache_index_dir *next;
-
-		next = index->next;
-		kobject_put(&index->kobj);
-		index = next;
+	np = of_node_get(cpu_dev->of_node);
+	if (!np) {
+		pr_err("Failed to find cpu%d device node\n", cpu);
+		return -ENOENT;
 	}
-}
 
-static void remove_cache_dir(struct cache_dir *cache_dir)
-{
-	remove_index_dirs(cache_dir);
-
-	kobject_put(cache_dir->kobj);
-
-	kfree(cache_dir);
-}
-
-static void cache_cpu_clear(struct cache *cache, int cpu)
-{
-	while (cache) {
-		struct cache *next = cache->next_local;
-
-		WARN_ONCE(!cpumask_test_cpu(cpu, &cache->shared_cpu_map),
-			  "CPU %i not accounted in %s(%s)\n",
-			  cpu, cache->ofnode->full_name,
-			  cache_type_string(cache));
-
-		cpumask_clear_cpu(cpu, &cache->shared_cpu_map);
-
-		/* Release the cache object if all the cpus using it
-		 * are offline */
-		if (cpumask_empty(&cache->shared_cpu_map))
-			release_cache(cache);
-
-		cache = next;
+	for (idx = 0, level = 1; level <= this_cpu_ci->num_levels &&
+			idx < this_cpu_ci->num_leaves; idx++, level++) {
+		if (!this_leaf)
+			return -EINVAL;
+
+		this_leaf->of_node = np;
+		if (cache_node_is_unified(np)) {
+			ci_leaf_init(this_leaf++, CACHE_TYPE_UNIFIED, level);
+		} else {
+			ci_leaf_init(this_leaf++, CACHE_TYPE_DATA, level);
+			ci_leaf_init(this_leaf++, CACHE_TYPE_INST, level);
+		}
+		np = of_find_next_cache_node(np);
 	}
+	return 0;
 }
 
-void cacheinfo_cpu_offline(unsigned int cpu_id)
-{
-	struct cache_dir *cache_dir;
-	struct cache *cache;
-
-	/* Prevent userspace from seeing inconsistent state - remove
-	 * the sysfs hierarchy first */
-	cache_dir = per_cpu(cache_dir_pcpu, cpu_id);
-
-	/* careful, sysfs population may have failed */
-	if (cache_dir)
-		remove_cache_dir(cache_dir);
-
-	per_cpu(cache_dir_pcpu, cpu_id) = NULL;
-
-	/* clear the CPU's bit in its cache chain, possibly freeing
-	 * cache objects */
-	cache = cache_lookup_by_cpu(cpu_id);
-	if (cache)
-		cache_cpu_clear(cache, cpu_id);
-}
-#endif /* CONFIG_HOTPLUG_CPU */
diff --git a/arch/powerpc/kernel/cacheinfo.h b/arch/powerpc/kernel/cacheinfo.h
deleted file mode 100644
index a7b74d3..0000000
--- a/arch/powerpc/kernel/cacheinfo.h
+++ /dev/null
@@ -1,8 +0,0 @@
-#ifndef _PPC_CACHEINFO_H
-#define _PPC_CACHEINFO_H
-
-/* These are just hooks for sysfs.c to use. */
-extern void cacheinfo_cpu_online(unsigned int cpu_id);
-extern void cacheinfo_cpu_offline(unsigned int cpu_id);
-
-#endif /* _PPC_CACHEINFO_H */
diff --git a/arch/powerpc/kernel/sysfs.c b/arch/powerpc/kernel/sysfs.c
index d4a43e6..935929b 100644
--- a/arch/powerpc/kernel/sysfs.c
+++ b/arch/powerpc/kernel/sysfs.c
@@ -19,8 +19,6 @@
 #include <asm/pmc.h>
 #include <asm/firmware.h>
 
-#include "cacheinfo.h"
-
 #ifdef CONFIG_PPC64
 #include <asm/paca.h>
 #include <asm/lppaca.h>
@@ -732,7 +730,6 @@ static void register_cpu_online(unsigned int cpu)
 		device_create_file(s, &dev_attr_altivec_idle_wait_time);
 	}
 #endif
-	cacheinfo_cpu_online(cpu);
 }
 
 #ifdef CONFIG_HOTPLUG_CPU
@@ -813,7 +810,6 @@ static void unregister_cpu_online(unsigned int cpu)
 		device_remove_file(s, &dev_attr_altivec_idle_wait_time);
 	}
 #endif
-	cacheinfo_cpu_offline(cpu);
 }
 
 #ifdef CONFIG_ARCH_CPU_PROBE_RELEASE
-- 
1.8.3.2

^ permalink raw reply related

* Re: arch/powerpc/math-emu/mtfsf.c - incorrect mask?
From: Stephen N Chivers @ 2014-02-09 19:42 UTC (permalink / raw)
  To: James Yang; +Cc: Chris Proctor, linuxppc-dev, Stephen N Chivers
In-Reply-To: <alpine.LRH.2.00.1402071348380.10318@ra8135-ec1.am.freescale.net>

James Yang <James.Yang@freescale.com> wrote on 02/08/2014 07:49:40 AM:

> From: James Yang <James.Yang@freescale.com>
> To: Gabriel Paubert <paubert@iram.es>
> Cc: Stephen N Chivers <schivers@csc.com.au>, Chris Proctor 
> <cproctor@csc.com.au>, <linuxppc-dev@lists.ozlabs.org>
> Date: 02/08/2014 07:49 AM
> Subject: Re: arch/powerpc/math-emu/mtfsf.c - incorrect mask?
> 
> On Fri, 7 Feb 2014, Gabriel Paubert wrote:
> 
> >    Hi Stephen,
> > 
> > On Fri, Feb 07, 2014 at 11:27:57AM +1000, Stephen N Chivers wrote:
> > > Gabriel Paubert <paubert@iram.es> wrote on 02/06/2014 07:26:37 PM:
> > > 
> > > > From: Gabriel Paubert <paubert@iram.es>
> > > > To: Stephen N Chivers <schivers@csc.com.au>
> > > > Cc: linuxppc-dev@lists.ozlabs.org, Chris Proctor 
<cproctor@csc.com.au>
> > > > Date: 02/06/2014 07:26 PM
> > > > Subject: Re: arch/powerpc/math-emu/mtfsf.c - incorrect mask?
> > > > 
> > > > On Thu, Feb 06, 2014 at 12:09:00PM +1000, Stephen N Chivers wrote:
> 
> > > > > 
> > > > >                 mask = 0;
> > > > >                 if (FM & (1 << 0))
> > > > >                         mask |= 0x0000000f;
> > > > >                 if (FM & (1 << 1))
> > > > >                         mask |= 0x000000f0;
> > > > >                 if (FM & (1 << 2))
> > > > >                         mask |= 0x00000f00;
> > > > >                 if (FM & (1 << 3))
> > > > >                         mask |= 0x0000f000;
> > > > >                 if (FM & (1 << 4))
> > > > >                         mask |= 0x000f0000;
> > > > >                 if (FM & (1 << 5))
> > > > >                         mask |= 0x00f00000;
> > > > >                 if (FM & (1 << 6))
> > > > >                         mask |= 0x0f000000;
> > > > >                 if (FM & (1 << 7))
> > > > >                         mask |= 0x90000000;
> > > > > 
> > > > > With the above mask computation I get consistent results for 
> > > > > both the MPC8548 and MPC7410 boards.
> > > > > 
> > > > > Am I missing something subtle?
> > > > 
> > > > No I think you are correct. This said, this code may probably be 
> > > optimized 
> > > > to eliminate a lot of the conditional branches. I think that:
> 
> 
> If the compiler is enabled to generate isel instructions, it would not 
> use a conditional branch for this code. (ignore the andi's values, 
> this is an old compile)
> 
>From limited research, the 440GP is a processor
that doesn't implement the isel instruction and it does
not implement floating point.

The kernel emulates isel and so using that instruction
for the 440GP would have a double trap penalty.

Correct me if I am wrong, the isel instruction first appears
in PowerPC ISA v2.04 around mid 2007. 

Stephen Chivers,
CSC Australia Pty. Ltd. 

^ permalink raw reply

* Re: [RFC PATCH 2/3] topology: support node_numa_mem() for determining the fallback node
From: Joonsoo Kim @ 2014-02-10  1:09 UTC (permalink / raw)
  To: David Rientjes
  Cc: Han Pingtian, Nishanth Aravamudan, Pekka Enberg,
	Linux Memory Management List, Paul Mackerras, Anton Blanchard,
	Matt Mackall, Christoph Lameter, linuxppc-dev, Wanpeng Li
In-Reply-To: <alpine.DEB.2.02.1402080154140.9668@chino.kir.corp.google.com>

On Sat, Feb 08, 2014 at 01:57:39AM -0800, David Rientjes wrote:
> On Fri, 7 Feb 2014, Joonsoo Kim wrote:
> 
> > > It seems like a better approach would be to do this when a node is brought 
> > > online and determine the fallback node based not on the zonelists as you 
> > > do here but rather on locality (such as through a SLIT if provided, see 
> > > node_distance()).
> > 
> > Hmm...
> > I guess that zonelist is base on locality. Zonelist is generated using
> > node_distance(), so I think that it reflects locality. But, I'm not expert
> > on NUMA, so please let me know what I am missing here :)
> > 
> 
> The zonelist is, yes, but I'm talking about memoryless and cpuless nodes.  
> If your solution is going to become the generic kernel API that determines 
> what node has local memory for a particular node, then it will have to 
> support all definitions of node.  That includes nodes that consist solely 
> of I/O, chipsets, networking, or storage devices.  These nodes may not 
> have memory or cpus, so doing it as part of onlining cpus isn't going to 
> be generic enough.  You want a node_to_mem_node() API for all possible 
> node types (the possible node types listed above are straight from the 
> ACPI spec).  For 99% of people, node_to_mem_node(X) is always going to be 
> X and we can optimize for that, but any solution that relies on cpu online 
> is probably shortsighted right now.
> 
> I think it would be much better to do this as a part of setting a node to 
> be online.

Okay. I got your point.
I will change it to rely on node online if this patch is really needed.

Thanks!

^ permalink raw reply

* Re: [RFC PATCH 2/3] topology: support node_numa_mem() for determining the fallback node
From: Joonsoo Kim @ 2014-02-10  1:15 UTC (permalink / raw)
  To: Nishanth Aravamudan
  Cc: Han Pingtian, Matt Mackall, Pekka Enberg,
	Linux Memory Management List, Paul Mackerras, Anton Blanchard,
	David Rientjes, Christoph Lameter, linuxppc-dev, Wanpeng Li
In-Reply-To: <20140207213855.GA24989@linux.vnet.ibm.com>

On Fri, Feb 07, 2014 at 01:38:55PM -0800, Nishanth Aravamudan wrote:
> On 07.02.2014 [12:51:07 -0600], Christoph Lameter wrote:
> > Here is a draft of a patch to make this work with memoryless nodes.
> 
> Hi Christoph, this should be tested instead of Joonsoo's patch 2 (and 3)?

Hello,

I guess that your system has another problem that makes my patches inactive.
Maybe it will also affect to the Christoph's one. Could you confirm page_to_nid(),
numa_mem_id() and node_present_pages although I doubt mostly about page_to_nid()?

Thanks.

^ permalink raw reply

* Re: [RFC PATCH 3/3] slub: fallback to get_numa_mem() node if we want to allocate on memoryless node
From: Joonsoo Kim @ 2014-02-10  1:22 UTC (permalink / raw)
  To: Christoph Lameter
  Cc: Han Pingtian, Nishanth Aravamudan, mpm, penberg, linux-mm, paulus,
	Anton Blanchard, David Rientjes, linuxppc-dev, Wanpeng Li
In-Reply-To: <alpine.DEB.2.10.1402071147390.15168@nuc>

On Fri, Feb 07, 2014 at 11:49:57AM -0600, Christoph Lameter wrote:
> On Fri, 7 Feb 2014, Joonsoo Kim wrote:
> 
> > > This check wouild need to be something that checks for other contigencies
> > > in the page allocator as well. A simple solution would be to actually run
> > > a GFP_THIS_NODE alloc to see if you can grab a page from the proper node.
> > > If that fails then fallback. See how fallback_alloc() does it in slab.
> > >
> >
> > Hello, Christoph.
> >
> > This !node_present_pages() ensure that allocation on this node cannot succeed.
> > So we can directly use numa_mem_id() here.
> 
> Yes of course we can use numa_mem_id().
> 
> But the check is only for not having any memory at all on a node. There
> are other reason for allocations to fail on a certain node. The node could
> have memory that cannot be reclaimed, all dirty, beyond certain
> thresholds, not in the current set of allowed nodes etc etc.

Yes. There are many other cases, but I prefer that we think them separately.
Maybe they needs another approach. For now, to solve memoryless node problem,
my solution is enough and safe.

Thanks.

^ permalink raw reply

* Re: [PATCH 2/2] clocksource: Make clocksource register functions void
From: Yijing Wang @ 2014-02-10  1:13 UTC (permalink / raw)
  To: Thomas Gleixner, David Laight
  Cc: linux-mips@linux-mips.org, x86@kernel.org, Kevin Hilman,
	linux@lists.openrisc.net, Hanjun Guo, Sekhar Nori, Michal Simek,
	Paul Mackerras, Ralf Baechle, H. Peter Anvin, Daniel Walker,
	Hans-Christian Egtvedt, Jonas Bonn, Kukjin Kim, Russell King,
	Richard Weinberger, Daniel Lezcano, Tony Lindgren, Ingo Molnar,
	microblaze-uclinux@itee.uq.edu.au, David Brown,
	Haavard Skinnemoen, Mike Frysinger,
	user-mode-linux-devel@lists.sourceforge.net,
	linux-arm-msm@vger.kernel.org, Jeff Dike,
	davinci-linux-open-source@linux.davincidsp.com,
	linux-samsung-soc@vger.kernel.org, John Stultz,
	user-mode-linux-user@lists.sourceforge.net,
	linux-omap@vger.kernel.org, linux-arm-kernel@lists.infradead.org,
	Barry Song, Jim Cromie, linux-kernel@vger.kernel.org,
	Nicolas Ferre, 'Tony Prisk', Bryan Huntsman,
	uclinux-dist-devel@blackfin.uclinux.org,
	linuxppc-dev@lists.ozlabs.org
In-Reply-To: <alpine.DEB.2.02.1402052139560.24986@ionos.tec.linutronix.de>

On 2014/2/6 4:40, Thomas Gleixner wrote:
> Yijing,
> 
> On Thu, 23 Jan 2014, David Laight wrote:
> 
>> From: Linuxppc-dev Tony Prisk
>>> On 23/01/14 20:12, Yijing Wang wrote:
>>>> Currently, clocksource_register() and __clocksource_register_scale()
>>>> functions always return 0, it's pointless, make functions void.
>>>> And remove the dead code that check the clocksource_register_hz()
>>>> return value.
>>> ......
>>>> -static inline int clocksource_register_hz(struct clocksource *cs, u32 hz)
>>>> +static inline void clocksource_register_hz(struct clocksource *cs, u32 hz)
>>>>   {
>>>>   	return __clocksource_register_scale(cs, 1, hz);
>>>>   }
>>>
>>> This doesn't make sense - you are still returning a value on a function
>>> declared void, and the return is now from a function that doesn't return
>>> anything either ?!?!
>>> Doesn't this throw a compile-time warning??
>>
>> It depends on the compiler.
>> Recent gcc allow it.
>> I don't know if it is actually valid C though.
>>
>> There is no excuse for it on lines like the above though.
> 
> Can you please resend with that fixed against 3.14-rc1 ?

OK, I will resend later.

Thanks!
Yijing.


> 
> .
> 


-- 
Thanks!
Yijing

^ permalink raw reply

* Re: [RFC PATCH 2/3] topology: support node_numa_mem() for determining the fallback node
From: Joonsoo Kim @ 2014-02-10  1:29 UTC (permalink / raw)
  To: Christoph Lameter
  Cc: Han Pingtian, Nishanth Aravamudan, Matt Mackall, Pekka Enberg,
	Linux Memory Management List, Paul Mackerras, Anton Blanchard,
	David Rientjes, linuxppc-dev, Wanpeng Li
In-Reply-To: <alpine.DEB.2.10.1402071245040.20246@nuc>

On Fri, Feb 07, 2014 at 12:51:07PM -0600, Christoph Lameter wrote:
> Here is a draft of a patch to make this work with memoryless nodes.
> 
> The first thing is that we modify node_match to also match if we hit an
> empty node. In that case we simply take the current slab if its there.

Why not inspecting whether we can get the page on the best node such as
numa_mem_id() node?

> 
> If there is no current slab then a regular allocation occurs with the
> memoryless node. The page allocator will fallback to a possible node and
> that will become the current slab. Next alloc from a memoryless node
> will then use that slab.
> 
> For that we also add some tracking of allocations on nodes that were not
> satisfied using the empty_node[] array. A successful alloc on a node
> clears that flag.
> 
> I would rather avoid the empty_node[] array since its global and there may
> be thread specific allocation restrictions but it would be expensive to do
> an allocation attempt via the page allocator to make sure that there is
> really no page available from the page allocator.
> 
> Index: linux/mm/slub.c
> ===================================================================
> --- linux.orig/mm/slub.c	2014-02-03 13:19:22.896853227 -0600
> +++ linux/mm/slub.c	2014-02-07 12:44:49.311494806 -0600
> @@ -132,6 +132,8 @@ static inline bool kmem_cache_has_cpu_pa
>  #endif
>  }
> 
> +static int empty_node[MAX_NUMNODES];
> +
>  /*
>   * Issues still to be resolved:
>   *
> @@ -1405,16 +1407,22 @@ static struct page *new_slab(struct kmem
>  	void *last;
>  	void *p;
>  	int order;
> +	int alloc_node;
> 
>  	BUG_ON(flags & GFP_SLAB_BUG_MASK);
> 
>  	page = allocate_slab(s,
>  		flags & (GFP_RECLAIM_MASK | GFP_CONSTRAINT_MASK), node);
> -	if (!page)
> +	if (!page) {
> +		if (node != NUMA_NO_NODE)
> +			empty_node[node] = 1;
>  		goto out;
> +	}

empty_node cannot be set on memoryless node, since page allocation would
succeed on different node.

Thanks.

^ permalink raw reply

* [RESEND PATCH 0/3] powerpc: Free up an IPI message slot for tick broadcast IPIs
From: Preeti U Murthy @ 2014-02-10  2:37 UTC (permalink / raw)
  To: benh, tglx, linux-kernel, srivatsa.bhat
  Cc: deepthi, arnd, geoff, paul.gortmaker, paulus, linuxppc-dev

This patchset is a precursor for enabling deep idle states on powerpc,
when the local CPU timers stop. The tick broadcast framework in
the Linux Kernel today handles wakeup of such CPUs at their next timer event
by using an external clock device. At the expiry of this clock device, IPIs
are sent to the CPUs in deep idle states  so that they wakeup to handle their
respective timers. This patchset frees up one of the IPI slots on powerpc
so as to be used to handle the tick broadcast IPI.

On certain implementations of powerpc, such an external clock device is absent.
The support in the tick broadcast framework to handle wakeup of CPUs from
deep idle states on such implementations is currently in the tip tree.
https://lkml.org/lkml/2014/2/7/906
https://lkml.org/lkml/2014/2/7/876
https://lkml.org/lkml/2014/2/7/608

With the above support in place, this patchset is next in line to enable deep
idle states on powerpc.

The patchset has been appended by a RESEND tag since nothing has changed from
the previous post except for an added config condition around
tick_broadcast() which handles sending broadcast IPIs, and the update in the cover
letter.
---

Preeti U Murthy (1):
      cpuidle/ppc: Split timer_interrupt() into timer handling and interrupt handling routines

Srivatsa S. Bhat (2):
      powerpc: Free up the slot of PPC_MSG_CALL_FUNC_SINGLE IPI message
      powerpc: Implement tick broadcast IPI as a fixed IPI message

 arch/powerpc/include/asm/smp.h          |    2 -
 arch/powerpc/include/asm/time.h         |    1 
 arch/powerpc/kernel/smp.c               |   25 ++++++---
 arch/powerpc/kernel/time.c              |   86 ++++++++++++++++++-------------
 arch/powerpc/platforms/cell/interrupt.c |    2 -
 arch/powerpc/platforms/ps3/smp.c        |    2 -
 6 files changed, 73 insertions(+), 45 deletions(-)

-- 

^ permalink raw reply

page: next (older) | prev (newer) | latest
- recent:[subjects (threaded)|topics (new)|topics (active)]

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox