public inbox for linux-kernel@vger.kernel.org
* [tip:sched/core] sched/numa: Rewrite the CONFIG_NUMA sched domain support
@ 2012-05-09 14:29 tip-bot for Peter Zijlstra
  2012-05-10 17:30 ` Yinghai Lu
                   ` (2 more replies)
  0 siblings, 3 replies; 16+ messages in thread
From: tip-bot for Peter Zijlstra @ 2012-05-09 14:29 UTC (permalink / raw)
  To: linux-tip-commits
  Cc: mingo, torvalds, a.p.zijlstra, cmetcalf, tony.luck, sivanich,
	akpm, ralf, greg.pearson, ink, tglx, kamezawa.hiroyu, rth,
	linux-kernel, hpa, anton, paulus, lethal, davem, dhowells, benh,
	fenghua.yu, mattst88

Commit-ID:  cb83b629bae0327cf9f44f096adc38d150ceb913
Gitweb:     http://git.kernel.org/tip/cb83b629bae0327cf9f44f096adc38d150ceb913
Author:     Peter Zijlstra <a.p.zijlstra@chello.nl>
AuthorDate: Tue, 17 Apr 2012 15:49:36 +0200
Committer:  Ingo Molnar <mingo@kernel.org>
CommitDate: Wed, 9 May 2012 15:00:55 +0200

sched/numa: Rewrite the CONFIG_NUMA sched domain support

The current code groups up to 16 nodes in a level and then puts an
ALLNODES domain spanning the entire tree on top of that. This doesn't
reflect the NUMA topology, and especially for the smaller,
not-fully-connected machines out there today this can make a
difference.

Therefore, build a proper NUMA topology based on node_distance().

Since there are no fixed NUMA layers anymore, the static SD_NODE_INIT
and SD_ALLNODES_INIT initializers aren't usable anymore; the new code
constructs something similar and scales some values based on the
number of CPUs in the domain and/or the node_distance() ratio.

Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Cc: Anton Blanchard <anton@samba.org>
Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Cc: Chris Metcalf <cmetcalf@tilera.com>
Cc: David Howells <dhowells@redhat.com>
Cc: "David S. Miller" <davem@davemloft.net>
Cc: Fenghua Yu <fenghua.yu@intel.com>
Cc: "H. Peter Anvin" <hpa@zytor.com>
Cc: Ivan Kokshaysky <ink@jurassic.park.msu.ru>
Cc: linux-alpha@vger.kernel.org
Cc: linux-ia64@vger.kernel.org
Cc: linux-kernel@vger.kernel.org
Cc: linux-mips@linux-mips.org
Cc: linuxppc-dev@lists.ozlabs.org
Cc: linux-sh@vger.kernel.org
Cc: Matt Turner <mattst88@gmail.com>
Cc: Paul Mackerras <paulus@samba.org>
Cc: Paul Mundt <lethal@linux-sh.org>
Cc: Ralf Baechle <ralf@linux-mips.org>
Cc: Richard Henderson <rth@twiddle.net>
Cc: sparclinux@vger.kernel.org
Cc: Tony Luck <tony.luck@intel.com>
Cc: x86@kernel.org
Cc: Dimitri Sivanich <sivanich@sgi.com>
Cc: Greg Pearson <greg.pearson@hp.com>
Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Cc: bob.picco@oracle.com
Cc: chris.mason@oracle.com
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Andrew Morton <akpm@linux-foundation.org>
Link: http://lkml.kernel.org/n/tip-r74n3n8hhuc2ynbrnp3vt954@git.kernel.org
Signed-off-by: Ingo Molnar <mingo@kernel.org>
---
 arch/ia64/include/asm/topology.h           |   25 ---
 arch/mips/include/asm/mach-ip27/topology.h |   17 --
 arch/powerpc/include/asm/topology.h        |   36 ----
 arch/sh/include/asm/topology.h             |   25 ---
 arch/sparc/include/asm/topology_64.h       |   19 --
 arch/tile/include/asm/topology.h           |   26 ---
 arch/x86/include/asm/topology.h            |   38 ----
 include/linux/topology.h                   |   37 ----
 kernel/sched/core.c                        |  280 ++++++++++++++++++----------
 9 files changed, 185 insertions(+), 318 deletions(-)

diff --git a/arch/ia64/include/asm/topology.h b/arch/ia64/include/asm/topology.h
index 09f6467..a2496e4 100644
--- a/arch/ia64/include/asm/topology.h
+++ b/arch/ia64/include/asm/topology.h
@@ -70,31 +70,6 @@ void build_cpu_to_node_map(void);
 	.nr_balance_failed	= 0,			\
 }
 
-/* sched_domains SD_NODE_INIT for IA64 NUMA machines */
-#define SD_NODE_INIT (struct sched_domain) {		\
-	.parent			= NULL,			\
-	.child			= NULL,			\
-	.groups			= NULL,			\
-	.min_interval		= 8,			\
-	.max_interval		= 8*(min(num_online_cpus(), 32U)), \
-	.busy_factor		= 64,			\
-	.imbalance_pct		= 125,			\
-	.cache_nice_tries	= 2,			\
-	.busy_idx		= 3,			\
-	.idle_idx		= 2,			\
-	.newidle_idx		= 0,			\
-	.wake_idx		= 0,			\
-	.forkexec_idx		= 0,			\
-	.flags			= SD_LOAD_BALANCE	\
-				| SD_BALANCE_NEWIDLE	\
-				| SD_BALANCE_EXEC	\
-				| SD_BALANCE_FORK	\
-				| SD_SERIALIZE,		\
-	.last_balance		= jiffies,		\
-	.balance_interval	= 64,			\
-	.nr_balance_failed	= 0,			\
-}
-
 #endif /* CONFIG_NUMA */
 
 #ifdef CONFIG_SMP
diff --git a/arch/mips/include/asm/mach-ip27/topology.h b/arch/mips/include/asm/mach-ip27/topology.h
index 1b1a7d1..b2cf641 100644
--- a/arch/mips/include/asm/mach-ip27/topology.h
+++ b/arch/mips/include/asm/mach-ip27/topology.h
@@ -36,23 +36,6 @@ extern unsigned char __node_distances[MAX_COMPACT_NODES][MAX_COMPACT_NODES];
 
 #define node_distance(from, to)	(__node_distances[(from)][(to)])
 
-/* sched_domains SD_NODE_INIT for SGI IP27 machines */
-#define SD_NODE_INIT (struct sched_domain) {		\
-	.parent			= NULL,			\
-	.child			= NULL,			\
-	.groups			= NULL,			\
-	.min_interval		= 8,			\
-	.max_interval		= 32,			\
-	.busy_factor		= 32,			\
-	.imbalance_pct		= 125,			\
-	.cache_nice_tries	= 1,			\
-	.flags			= SD_LOAD_BALANCE |	\
-				  SD_BALANCE_EXEC,	\
-	.last_balance		= jiffies,		\
-	.balance_interval	= 1,			\
-	.nr_balance_failed	= 0,			\
-}
-
 #include <asm-generic/topology.h>
 
 #endif /* _ASM_MACH_TOPOLOGY_H */
diff --git a/arch/powerpc/include/asm/topology.h b/arch/powerpc/include/asm/topology.h
index c971858..852ed1b 100644
--- a/arch/powerpc/include/asm/topology.h
+++ b/arch/powerpc/include/asm/topology.h
@@ -18,12 +18,6 @@ struct device_node;
  */
 #define RECLAIM_DISTANCE 10
 
-/*
- * Avoid creating an extra level of balancing (SD_ALLNODES) on the largest
- * POWER7 boxes which have a maximum of 32 nodes.
- */
-#define SD_NODES_PER_DOMAIN 32
-
 #include <asm/mmzone.h>
 
 static inline int cpu_to_node(int cpu)
@@ -51,36 +45,6 @@ static inline int pcibus_to_node(struct pci_bus *bus)
 				 cpu_all_mask :				\
 				 cpumask_of_node(pcibus_to_node(bus)))
 
-/* sched_domains SD_NODE_INIT for PPC64 machines */
-#define SD_NODE_INIT (struct sched_domain) {				\
-	.min_interval		= 8,					\
-	.max_interval		= 32,					\
-	.busy_factor		= 32,					\
-	.imbalance_pct		= 125,					\
-	.cache_nice_tries	= 1,					\
-	.busy_idx		= 3,					\
-	.idle_idx		= 1,					\
-	.newidle_idx		= 0,					\
-	.wake_idx		= 0,					\
-	.forkexec_idx		= 0,					\
-									\
-	.flags			= 1*SD_LOAD_BALANCE			\
-				| 0*SD_BALANCE_NEWIDLE			\
-				| 1*SD_BALANCE_EXEC			\
-				| 1*SD_BALANCE_FORK			\
-				| 0*SD_BALANCE_WAKE			\
-				| 1*SD_WAKE_AFFINE			\
-				| 0*SD_PREFER_LOCAL			\
-				| 0*SD_SHARE_CPUPOWER			\
-				| 0*SD_POWERSAVINGS_BALANCE		\
-				| 0*SD_SHARE_PKG_RESOURCES		\
-				| 1*SD_SERIALIZE			\
-				| 0*SD_PREFER_SIBLING			\
-				,					\
-	.last_balance		= jiffies,				\
-	.balance_interval	= 1,					\
-}
-
 extern int __node_distance(int, int);
 #define node_distance(a, b) __node_distance(a, b)
 
diff --git a/arch/sh/include/asm/topology.h b/arch/sh/include/asm/topology.h
index 88e7340..b0a282d 100644
--- a/arch/sh/include/asm/topology.h
+++ b/arch/sh/include/asm/topology.h
@@ -3,31 +3,6 @@
 
 #ifdef CONFIG_NUMA
 
-/* sched_domains SD_NODE_INIT for sh machines */
-#define SD_NODE_INIT (struct sched_domain) {		\
-	.parent			= NULL,			\
-	.child			= NULL,			\
-	.groups			= NULL,			\
-	.min_interval		= 8,			\
-	.max_interval		= 32,			\
-	.busy_factor		= 32,			\
-	.imbalance_pct		= 125,			\
-	.cache_nice_tries	= 2,			\
-	.busy_idx		= 3,			\
-	.idle_idx		= 2,			\
-	.newidle_idx		= 0,			\
-	.wake_idx		= 0,			\
-	.forkexec_idx		= 0,			\
-	.flags			= SD_LOAD_BALANCE	\
-				| SD_BALANCE_FORK	\
-				| SD_BALANCE_EXEC	\
-				| SD_BALANCE_NEWIDLE	\
-				| SD_SERIALIZE,		\
-	.last_balance		= jiffies,		\
-	.balance_interval	= 1,			\
-	.nr_balance_failed	= 0,			\
-}
-
 #define cpu_to_node(cpu)	((void)(cpu),0)
 #define parent_node(node)	((void)(node),0)
 
diff --git a/arch/sparc/include/asm/topology_64.h b/arch/sparc/include/asm/topology_64.h
index 8b9c556..1754390 100644
--- a/arch/sparc/include/asm/topology_64.h
+++ b/arch/sparc/include/asm/topology_64.h
@@ -31,25 +31,6 @@ static inline int pcibus_to_node(struct pci_bus *pbus)
 	 cpu_all_mask : \
 	 cpumask_of_node(pcibus_to_node(bus)))
 
-#define SD_NODE_INIT (struct sched_domain) {		\
-	.min_interval		= 8,			\
-	.max_interval		= 32,			\
-	.busy_factor		= 32,			\
-	.imbalance_pct		= 125,			\
-	.cache_nice_tries	= 2,			\
-	.busy_idx		= 3,			\
-	.idle_idx		= 2,			\
-	.newidle_idx		= 0, 			\
-	.wake_idx		= 0,			\
-	.forkexec_idx		= 0,			\
-	.flags			= SD_LOAD_BALANCE	\
-				| SD_BALANCE_FORK	\
-				| SD_BALANCE_EXEC	\
-				| SD_SERIALIZE,		\
-	.last_balance		= jiffies,		\
-	.balance_interval	= 1,			\
-}
-
 #else /* CONFIG_NUMA */
 
 #include <asm-generic/topology.h>
diff --git a/arch/tile/include/asm/topology.h b/arch/tile/include/asm/topology.h
index 6fdd0c8..7a7ce39 100644
--- a/arch/tile/include/asm/topology.h
+++ b/arch/tile/include/asm/topology.h
@@ -78,32 +78,6 @@ static inline const struct cpumask *cpumask_of_node(int node)
 	.balance_interval	= 32,					\
 }
 
-/* sched_domains SD_NODE_INIT for TILE architecture */
-#define SD_NODE_INIT (struct sched_domain) {				\
-	.min_interval		= 16,					\
-	.max_interval		= 512,					\
-	.busy_factor		= 32,					\
-	.imbalance_pct		= 125,					\
-	.cache_nice_tries	= 1,					\
-	.busy_idx		= 3,					\
-	.idle_idx		= 1,					\
-	.newidle_idx		= 2,					\
-	.wake_idx		= 1,					\
-	.flags			= 1*SD_LOAD_BALANCE			\
-				| 1*SD_BALANCE_NEWIDLE			\
-				| 1*SD_BALANCE_EXEC			\
-				| 1*SD_BALANCE_FORK			\
-				| 0*SD_BALANCE_WAKE			\
-				| 0*SD_WAKE_AFFINE			\
-				| 0*SD_PREFER_LOCAL			\
-				| 0*SD_SHARE_CPUPOWER			\
-				| 0*SD_SHARE_PKG_RESOURCES		\
-				| 1*SD_SERIALIZE			\
-				,					\
-	.last_balance		= jiffies,				\
-	.balance_interval	= 128,					\
-}
-
 /* By definition, we create nodes based on online memory. */
 #define node_has_online_mem(nid) 1
 
diff --git a/arch/x86/include/asm/topology.h b/arch/x86/include/asm/topology.h
index b9676ae..095b215 100644
--- a/arch/x86/include/asm/topology.h
+++ b/arch/x86/include/asm/topology.h
@@ -92,44 +92,6 @@ extern void setup_node_to_cpumask_map(void);
 
 #define pcibus_to_node(bus) __pcibus_to_node(bus)
 
-#ifdef CONFIG_X86_32
-# define SD_CACHE_NICE_TRIES	1
-# define SD_IDLE_IDX		1
-#else
-# define SD_CACHE_NICE_TRIES	2
-# define SD_IDLE_IDX		2
-#endif
-
-/* sched_domains SD_NODE_INIT for NUMA machines */
-#define SD_NODE_INIT (struct sched_domain) {				\
-	.min_interval		= 8,					\
-	.max_interval		= 32,					\
-	.busy_factor		= 32,					\
-	.imbalance_pct		= 125,					\
-	.cache_nice_tries	= SD_CACHE_NICE_TRIES,			\
-	.busy_idx		= 3,					\
-	.idle_idx		= SD_IDLE_IDX,				\
-	.newidle_idx		= 0,					\
-	.wake_idx		= 0,					\
-	.forkexec_idx		= 0,					\
-									\
-	.flags			= 1*SD_LOAD_BALANCE			\
-				| 1*SD_BALANCE_NEWIDLE			\
-				| 1*SD_BALANCE_EXEC			\
-				| 1*SD_BALANCE_FORK			\
-				| 0*SD_BALANCE_WAKE			\
-				| 1*SD_WAKE_AFFINE			\
-				| 0*SD_PREFER_LOCAL			\
-				| 0*SD_SHARE_CPUPOWER			\
-				| 0*SD_POWERSAVINGS_BALANCE		\
-				| 0*SD_SHARE_PKG_RESOURCES		\
-				| 1*SD_SERIALIZE			\
-				| 0*SD_PREFER_SIBLING			\
-				,					\
-	.last_balance		= jiffies,				\
-	.balance_interval	= 1,					\
-}
-
 extern int __node_distance(int, int);
 #define node_distance(a, b) __node_distance(a, b)
 
diff --git a/include/linux/topology.h b/include/linux/topology.h
index e26db03..4f59bf3 100644
--- a/include/linux/topology.h
+++ b/include/linux/topology.h
@@ -70,7 +70,6 @@ int arch_update_cpu_topology(void);
  * Below are the 3 major initializers used in building sched_domains:
  * SD_SIBLING_INIT, for SMT domains
  * SD_CPU_INIT, for SMP domains
- * SD_NODE_INIT, for NUMA domains
  *
  * Any architecture that cares to do any tuning to these values should do so
  * by defining their own arch-specific initializer in include/asm/topology.h.
@@ -176,48 +175,12 @@ int arch_update_cpu_topology(void);
 }
 #endif
 
-/* sched_domains SD_ALLNODES_INIT for NUMA machines */
-#define SD_ALLNODES_INIT (struct sched_domain) {			\
-	.min_interval		= 64,					\
-	.max_interval		= 64*num_online_cpus(),			\
-	.busy_factor		= 128,					\
-	.imbalance_pct		= 133,					\
-	.cache_nice_tries	= 1,					\
-	.busy_idx		= 3,					\
-	.idle_idx		= 3,					\
-	.flags			= 1*SD_LOAD_BALANCE			\
-				| 1*SD_BALANCE_NEWIDLE			\
-				| 0*SD_BALANCE_EXEC			\
-				| 0*SD_BALANCE_FORK			\
-				| 0*SD_BALANCE_WAKE			\
-				| 0*SD_WAKE_AFFINE			\
-				| 0*SD_SHARE_CPUPOWER			\
-				| 0*SD_POWERSAVINGS_BALANCE		\
-				| 0*SD_SHARE_PKG_RESOURCES		\
-				| 1*SD_SERIALIZE			\
-				| 0*SD_PREFER_SIBLING			\
-				,					\
-	.last_balance		= jiffies,				\
-	.balance_interval	= 64,					\
-}
-
-#ifndef SD_NODES_PER_DOMAIN
-#define SD_NODES_PER_DOMAIN 16
-#endif
-
 #ifdef CONFIG_SCHED_BOOK
 #ifndef SD_BOOK_INIT
 #error Please define an appropriate SD_BOOK_INIT in include/asm/topology.h!!!
 #endif
 #endif /* CONFIG_SCHED_BOOK */
 
-#ifdef CONFIG_NUMA
-#ifndef SD_NODE_INIT
-#error Please define an appropriate SD_NODE_INIT in include/asm/topology.h!!!
-#endif
-
-#endif /* CONFIG_NUMA */
-
 #ifdef CONFIG_USE_PERCPU_NUMA_NODE_ID
 DECLARE_PER_CPU(int, numa_node);
 
diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index 6001e5c..b4f2096 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -5560,7 +5560,8 @@ static int sched_domain_debug_one(struct sched_domain *sd, int cpu, int level,
 			break;
 		}
 
-		if (cpumask_intersects(groupmask, sched_group_cpus(group))) {
+		if (!(sd->flags & SD_OVERLAP) &&
+		    cpumask_intersects(groupmask, sched_group_cpus(group))) {
 			printk(KERN_CONT "\n");
 			printk(KERN_ERR "ERROR: repeated CPUs\n");
 			break;
@@ -5898,92 +5899,6 @@ static int __init isolated_cpu_setup(char *str)
 
 __setup("isolcpus=", isolated_cpu_setup);
 
-#ifdef CONFIG_NUMA
-
-/**
- * find_next_best_node - find the next node to include in a sched_domain
- * @node: node whose sched_domain we're building
- * @used_nodes: nodes already in the sched_domain
- *
- * Find the next node to include in a given scheduling domain. Simply
- * finds the closest node not already in the @used_nodes map.
- *
- * Should use nodemask_t.
- */
-static int find_next_best_node(int node, nodemask_t *used_nodes)
-{
-	int i, n, val, min_val, best_node = -1;
-
-	min_val = INT_MAX;
-
-	for (i = 0; i < nr_node_ids; i++) {
-		/* Start at @node */
-		n = (node + i) % nr_node_ids;
-
-		if (!nr_cpus_node(n))
-			continue;
-
-		/* Skip already used nodes */
-		if (node_isset(n, *used_nodes))
-			continue;
-
-		/* Simple min distance search */
-		val = node_distance(node, n);
-
-		if (val < min_val) {
-			min_val = val;
-			best_node = n;
-		}
-	}
-
-	if (best_node != -1)
-		node_set(best_node, *used_nodes);
-	return best_node;
-}
-
-/**
- * sched_domain_node_span - get a cpumask for a node's sched_domain
- * @node: node whose cpumask we're constructing
- * @span: resulting cpumask
- *
- * Given a node, construct a good cpumask for its sched_domain to span. It
- * should be one that prevents unnecessary balancing, but also spreads tasks
- * out optimally.
- */
-static void sched_domain_node_span(int node, struct cpumask *span)
-{
-	nodemask_t used_nodes;
-	int i;
-
-	cpumask_clear(span);
-	nodes_clear(used_nodes);
-
-	cpumask_or(span, span, cpumask_of_node(node));
-	node_set(node, used_nodes);
-
-	for (i = 1; i < SD_NODES_PER_DOMAIN; i++) {
-		int next_node = find_next_best_node(node, &used_nodes);
-		if (next_node < 0)
-			break;
-		cpumask_or(span, span, cpumask_of_node(next_node));
-	}
-}
-
-static const struct cpumask *cpu_node_mask(int cpu)
-{
-	lockdep_assert_held(&sched_domains_mutex);
-
-	sched_domain_node_span(cpu_to_node(cpu), sched_domains_tmpmask);
-
-	return sched_domains_tmpmask;
-}
-
-static const struct cpumask *cpu_allnodes_mask(int cpu)
-{
-	return cpu_possible_mask;
-}
-#endif /* CONFIG_NUMA */
-
 static const struct cpumask *cpu_cpu_mask(int cpu)
 {
 	return cpumask_of_node(cpu_to_node(cpu));
@@ -6020,6 +5935,7 @@ struct sched_domain_topology_level {
 	sched_domain_init_f init;
 	sched_domain_mask_f mask;
 	int		    flags;
+	int		    numa_level;
 	struct sd_data      data;
 };
 
@@ -6213,10 +6129,6 @@ sd_init_##type(struct sched_domain_topology_level *tl, int cpu) 	\
 }
 
 SD_INIT_FUNC(CPU)
-#ifdef CONFIG_NUMA
- SD_INIT_FUNC(ALLNODES)
- SD_INIT_FUNC(NODE)
-#endif
 #ifdef CONFIG_SCHED_SMT
  SD_INIT_FUNC(SIBLING)
 #endif
@@ -6338,15 +6250,191 @@ static struct sched_domain_topology_level default_topology[] = {
 	{ sd_init_BOOK, cpu_book_mask, },
 #endif
 	{ sd_init_CPU, cpu_cpu_mask, },
-#ifdef CONFIG_NUMA
-	{ sd_init_NODE, cpu_node_mask, SDTL_OVERLAP, },
-	{ sd_init_ALLNODES, cpu_allnodes_mask, },
-#endif
 	{ NULL, },
 };
 
 static struct sched_domain_topology_level *sched_domain_topology = default_topology;
 
+#ifdef CONFIG_NUMA
+
+static int sched_domains_numa_levels;
+static int sched_domains_numa_scale;
+static int *sched_domains_numa_distance;
+static struct cpumask ***sched_domains_numa_masks;
+static int sched_domains_curr_level;
+
+static inline unsigned long numa_scale(unsigned long x, int level)
+{
+	return x * sched_domains_numa_distance[level] / sched_domains_numa_scale;
+}
+
+static inline int sd_local_flags(int level)
+{
+	if (sched_domains_numa_distance[level] > REMOTE_DISTANCE)
+		return 0;
+
+	return SD_BALANCE_EXEC | SD_BALANCE_FORK | SD_WAKE_AFFINE;
+}
+
+static struct sched_domain *
+sd_numa_init(struct sched_domain_topology_level *tl, int cpu)
+{
+	struct sched_domain *sd = *per_cpu_ptr(tl->data.sd, cpu);
+	int level = tl->numa_level;
+	int sd_weight = cpumask_weight(
+			sched_domains_numa_masks[level][cpu_to_node(cpu)]);
+
+	*sd = (struct sched_domain){
+		.min_interval		= sd_weight,
+		.max_interval		= 2*sd_weight,
+		.busy_factor		= 32,
+		.imbalance_pct		= 100 + numa_scale(25, level),
+		.cache_nice_tries	= 2,
+		.busy_idx		= 3,
+		.idle_idx		= 2,
+		.newidle_idx		= 0,
+		.wake_idx		= 0,
+		.forkexec_idx		= 0,
+
+		.flags			= 1*SD_LOAD_BALANCE
+					| 1*SD_BALANCE_NEWIDLE
+					| 0*SD_BALANCE_EXEC
+					| 0*SD_BALANCE_FORK
+					| 0*SD_BALANCE_WAKE
+					| 0*SD_WAKE_AFFINE
+					| 0*SD_PREFER_LOCAL
+					| 0*SD_SHARE_CPUPOWER
+					| 0*SD_POWERSAVINGS_BALANCE
+					| 0*SD_SHARE_PKG_RESOURCES
+					| 1*SD_SERIALIZE
+					| 0*SD_PREFER_SIBLING
+					| sd_local_flags(level)
+					,
+		.last_balance		= jiffies,
+		.balance_interval	= sd_weight,
+	};
+	SD_INIT_NAME(sd, NUMA);
+	sd->private = &tl->data;
+
+	/*
+	 * Ugly hack to pass state to sd_numa_mask()...
+	 */
+	sched_domains_curr_level = tl->numa_level;
+
+	return sd;
+}
+
+static const struct cpumask *sd_numa_mask(int cpu)
+{
+	return sched_domains_numa_masks[sched_domains_curr_level][cpu_to_node(cpu)];
+}
+
+static void sched_init_numa(void)
+{
+	int next_distance, curr_distance = node_distance(0, 0);
+	struct sched_domain_topology_level *tl;
+	int level = 0;
+	int i, j, k;
+
+	sched_domains_numa_scale = curr_distance;
+	sched_domains_numa_distance = kzalloc(sizeof(int) * nr_node_ids, GFP_KERNEL);
+	if (!sched_domains_numa_distance)
+		return;
+
+	/*
+	 * O(nr_nodes^2) deduplicating selection sort -- in order to find the
+	 * unique distances in the node_distance() table.
+	 *
+	 * Assumes node_distance(0,j) includes all distances in
+	 * node_distance(i,j) in order to avoid cubic time.
+	 *
+	 * XXX: could be optimized to O(n log n) by using sort()
+	 */
+	next_distance = curr_distance;
+	for (i = 0; i < nr_node_ids; i++) {
+		for (j = 0; j < nr_node_ids; j++) {
+			int distance = node_distance(0, j);
+			if (distance > curr_distance &&
+					(distance < next_distance ||
+					 next_distance == curr_distance))
+				next_distance = distance;
+		}
+		if (next_distance != curr_distance) {
+			sched_domains_numa_distance[level++] = next_distance;
+			sched_domains_numa_levels = level;
+			curr_distance = next_distance;
+		} else break;
+	}
+	/*
+	 * 'level' contains the number of unique distances, excluding the
+	 * identity distance node_distance(i,i).
+	 *
+	 * The sched_domains_numa_distance[] array includes the actual distance
+	 * numbers.
+	 */
+
+	sched_domains_numa_masks = kzalloc(sizeof(void *) * level, GFP_KERNEL);
+	if (!sched_domains_numa_masks)
+		return;
+
+	/*
+	 * Now for each level, construct a mask per node which contains all
+	 * cpus of nodes that are that many hops away from us.
+	 */
+	for (i = 0; i < level; i++) {
+		sched_domains_numa_masks[i] =
+			kzalloc(nr_node_ids * sizeof(void *), GFP_KERNEL);
+		if (!sched_domains_numa_masks[i])
+			return;
+
+		for (j = 0; j < nr_node_ids; j++) {
+			struct cpumask *mask = kzalloc_node(cpumask_size(), GFP_KERNEL, j);
+			if (!mask)
+				return;
+
+			sched_domains_numa_masks[i][j] = mask;
+
+			for (k = 0; k < nr_node_ids; k++) {
+				if (node_distance(cpu_to_node(j), k) >
+						sched_domains_numa_distance[i])
+					continue;
+
+				cpumask_or(mask, mask, cpumask_of_node(k));
+			}
+		}
+	}
+
+	tl = kzalloc((ARRAY_SIZE(default_topology) + level) *
+			sizeof(struct sched_domain_topology_level), GFP_KERNEL);
+	if (!tl)
+		return;
+
+	/*
+	 * Copy the default topology bits..
+	 */
+	for (i = 0; default_topology[i].init; i++)
+		tl[i] = default_topology[i];
+
+	/*
+	 * .. and append 'j' levels of NUMA goodness.
+	 */
+	for (j = 0; j < level; i++, j++) {
+		tl[i] = (struct sched_domain_topology_level){
+			.init = sd_numa_init,
+			.mask = sd_numa_mask,
+			.flags = SDTL_OVERLAP,
+			.numa_level = j,
+		};
+	}
+
+	sched_domain_topology = tl;
+}
+#else
+static inline void sched_init_numa(void)
+{
+}
+#endif /* CONFIG_NUMA */
+
 static int __sdt_alloc(const struct cpumask *cpu_map)
 {
 	struct sched_domain_topology_level *tl;
@@ -6840,6 +6928,8 @@ void __init sched_init_smp(void)
 	alloc_cpumask_var(&non_isolated_cpus, GFP_KERNEL);
 	alloc_cpumask_var(&fallback_doms, GFP_KERNEL);
 
+	sched_init_numa();
+
 	get_online_cpus();
 	mutex_lock(&sched_domains_mutex);
 	init_sched_domains(cpu_active_mask);

^ permalink raw reply related	[flat|nested] 16+ messages in thread

* Re: [tip:sched/core] sched/numa: Rewrite the CONFIG_NUMA sched domain support
  2012-05-09 14:29 [tip:sched/core] sched/numa: Rewrite the CONFIG_NUMA sched domain support tip-bot for Peter Zijlstra
@ 2012-05-10 17:30 ` Yinghai Lu
  2012-05-10 17:44   ` Peter Zijlstra
  2012-05-24 21:23 ` Tony Luck
  2012-06-06  7:43 ` Alex Shi
  2 siblings, 1 reply; 16+ messages in thread
From: Yinghai Lu @ 2012-05-10 17:30 UTC (permalink / raw)
  To: mingo, a.p.zijlstra, torvalds, cmetcalf, tony.luck, sivanich,
	akpm, ralf, greg.pearson, ink, tglx, rth, kamezawa.hiroyu, paulus,
	linux-kernel, hpa, anton, lethal, davem, benh, dhowells, mattst88,
	fenghua.yu
  Cc: linux-tip-commits

On Wed, May 9, 2012 at 7:29 AM, tip-bot for Peter Zijlstra
<a.p.zijlstra@chello.nl> wrote:
> Commit-ID:  cb83b629bae0327cf9f44f096adc38d150ceb913
> Gitweb:     http://git.kernel.org/tip/cb83b629bae0327cf9f44f096adc38d150ceb913
> Author:     Peter Zijlstra <a.p.zijlstra@chello.nl>
> AuthorDate: Tue, 17 Apr 2012 15:49:36 +0200
> Committer:  Ingo Molnar <mingo@kernel.org>
> CommitDate: Wed, 9 May 2012 15:00:55 +0200
>
> sched/numa: Rewrite the CONFIG_NUMA sched domain support
>
> The current code groups up to 16 nodes in a level and then puts an
> ALLNODES domain spanning the entire tree on top of that. This doesn't
> reflect the numa topology and esp for the smaller not-fully-connected
> machines out there today this might make a difference.
>
> Therefore, build a proper numa topology based on node_distance().
>
> Since there's no fixed numa layers anymore, the static SD_NODE_INIT
> and SD_ALLNODES_INIT aren't usable anymore, the new code tries to
> construct something similar and scales some values either on the
> number of cpus in the domain and/or the node_distance() ratio.
>


not sure if this commit or some other one is related....

got this from an 8-socket Nehalem-EX box.

[   25.549259] mtrr_aps_init() done
[   25.554298] ------------[ cut here ]------------
[   25.554549] WARNING: at kernel/sched/core.c:6086 build_sched_domains+0x1a9/0x2d0()
[   25.565131] Hardware name: unknown
[   25.565318] Modules linked in:
[   25.584922] Pid: 1, comm: swapper/0 Not tainted 3.4.0-rc6-yh-03548-gecc3211-dirty #312
[   25.585308] Call Trace:
[   25.585464]  [<ffffffff8106a7d1>] warn_slowpath_common+0x83/0x9b
[   25.605128]  [<ffffffff8106a803>] warn_slowpath_null+0x1a/0x1c
[   25.624828]  [<ffffffff81097628>] build_sched_domains+0x1a9/0x2d0
[   25.625154]  [<ffffffff8113db34>] ? __kmalloc+0x82/0x15c
[   25.644820]  [<ffffffff828e9151>] sched_init_smp+0x7f/0x194
[   25.645080]  [<ffffffff828d0fdc>] kernel_init+0xa7/0x19f
[   25.664792]  [<ffffffff81dd0954>] kernel_thread_helper+0x4/0x10
[   25.665094]  [<ffffffff81dc8a59>] ? retint_restore_args+0xe/0xe
[   25.684762]  [<ffffffff828d0f35>] ? do_initcalls+0xc9/0xc9
[   25.685019]  [<ffffffff81dd0950>] ? gs_change+0xb/0xb
[   25.704713] ---[ end trace 5003353dd8ff0030 ]---
[   25.704967] BUG: unable to handle kernel NULL pointer dereference at 0000000000000020
[   25.724721] IP: [<ffffffff813cf408>] __bitmap_weight+0x1a/0x67
[   25.725011] PGD 0
[   25.725107] Oops: 0000 [#1] SMP
[   25.749960] CPU 0
[   25.750088] Modules linked in:
[   25.750224]
[   25.750301] Pid: 1, comm: swapper/0 Tainted: G        W    3.4.0-rc6-yh-03548-gecc3211-dirty #312 Oracle Corporation  unknown  /
[   25.765035] RIP: 0010:[<ffffffff813cf408>]  [<ffffffff813cf408>] __bitmap_weight+0x1a/0x67
[   25.784842] RSP: 0018:ffff8810374c1e70  EFLAGS: 00010206
[   25.804557] RAX: 0000000000000003 RBX: 000000000000007f RCX: 0000000000000003
[   25.804940] RDX: 0000000000000000 RSI: 00000000000000ff RDI: 0000000000000020
[   25.824665] RBP: ffff8810374c1e70 R08: 0000000000000020 R09: 0000000000000000
[   25.844504] R10: 0000000000000000 R11: 0000000000000082 R12: ffff8880373bcfc0
[   25.844882] R13: 0000000000000000 R14: ffff8880373eae00 R15: fffffffffffffc08
[   25.864512] FS:  0000000000000000(0000) GS:ffff88103de00000(0000) knlGS:0000000000000000
[   25.884400] CS:  0010 DS: 0000 ES: 0000 CR0: 000000008005003b
[   25.884695] CR2: 0000000000000020 CR3: 00000000025af000 CR4: 00000000000007f0
[   25.904389] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
[   25.904753] DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400
[   25.924501] Process swapper/0 (pid: 1, threadinfo ffff8810374c0000, task ffff8810374b8000)
[   25.944730] Stack:
[   25.944856]  ffff8810374c1ee0 ffffffff81097636 ffff8810374c1ed0 ffffffff8113db34
[   25.964506]  2222222222222222 ffff8880373ebe00 00000000001d6828 ffff88803706a000
[   25.964870]  ffff8810374b85c8 ffffffff829c73f8 ffff8810374b85c8 00000000000000ff
[   25.984495] Call Trace:
[   25.984624]  [<ffffffff81097636>] build_sched_domains+0x1b7/0x2d0
[   26.004343]  [<ffffffff8113db34>] ? __kmalloc+0x82/0x15c
[   26.004607]  [<ffffffff828e9151>] sched_init_smp+0x7f/0x194
[   26.024288]  [<ffffffff828d0fdc>] kernel_init+0xa7/0x19f
[   26.024560]  [<ffffffff81dd0954>] kernel_thread_helper+0x4/0x10
[   26.044222]  [<ffffffff81dc8a59>] ? retint_restore_args+0xe/0xe
[   26.044539]  [<ffffffff828d0f35>] ? do_initcalls+0xc9/0xc9
[   26.064134]  [<ffffffff81dd0950>] ? gs_change+0xb/0xb
[   26.064410] Code: 48 8b 0c d6 48 89 0c d7 48 ff c2 39 d0 7f f1 5d c3 89 f0 b9 40 00 00 00 55 99 49 89 f8 45 31 c9 f7 f9 48 89 e5 31 d2 89 c1 eb 0f <49> 8b 3c d0 48 ff c2 f3 48 0f b8 c7 41 01 c1 39 d1 7f ed 45 31
[   26.104070] RIP  [<ffffffff813cf408>] __bitmap_weight+0x1a/0x67
[   26.123783]  RSP <ffff8810374c1e70>
[   26.123947] CR2: 0000000000000020
[   26.124143] ---[ end trace 5003353dd8ff0031 ]---
[   26.143813] Kernel panic - not syncing: Attempted to kill init! exitcode=0x00000009


* Re: [tip:sched/core] sched/numa: Rewrite the CONFIG_NUMA sched domain support
  2012-05-10 17:30 ` Yinghai Lu
@ 2012-05-10 17:44   ` Peter Zijlstra
  2012-05-10 17:54     ` Yinghai Lu
  0 siblings, 1 reply; 16+ messages in thread
From: Peter Zijlstra @ 2012-05-10 17:44 UTC (permalink / raw)
  To: Yinghai Lu
  Cc: mingo, torvalds, cmetcalf, tony.luck, sivanich, akpm, ralf,
	greg.pearson, ink, tglx, rth, kamezawa.hiroyu, paulus,
	linux-kernel, hpa, anton, lethal, davem, benh, dhowells, mattst88,
	fenghua.yu, linux-tip-commits

On Thu, 2012-05-10 at 10:30 -0700, Yinghai Lu wrote:
> not sure if this one or other is related....
> 
> got this from 8 socket Nehalem-ex box.
> 
> [   25.549259] mtrr_aps_init() done
> [   25.554298] ------------[ cut here ]------------
> [   25.554549] WARNING: at kernel/sched/core.c:6086 build_sched_domains+0x1a9/0x2d0() 

oops,.. could you get me the output of:

 cat /sys/devices/system/node/node*/distance

for that machine? I'll see if I can reproduce using numa=fake.



* Re: [tip:sched/core] sched/numa: Rewrite the CONFIG_NUMA sched domain support
  2012-05-10 17:44   ` Peter Zijlstra
@ 2012-05-10 17:54     ` Yinghai Lu
  2012-05-29  0:32       ` Jiang Liu
  0 siblings, 1 reply; 16+ messages in thread
From: Yinghai Lu @ 2012-05-10 17:54 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: mingo, torvalds, cmetcalf, tony.luck, sivanich, akpm, ralf,
	greg.pearson, ink, tglx, rth, kamezawa.hiroyu, paulus,
	linux-kernel, hpa, anton, lethal, davem, benh, dhowells, mattst88,
	fenghua.yu, linux-tip-commits

On Thu, May 10, 2012 at 10:44 AM, Peter Zijlstra <a.p.zijlstra@chello.nl> wrote:
> On Thu, 2012-05-10 at 10:30 -0700, Yinghai Lu wrote:
>> not sure if this one or other is related....
>>
>> got this from 8 socket Nehalem-ex box.
>>
>> [   25.549259] mtrr_aps_init() done
>> [   25.554298] ------------[ cut here ]------------
>> [   25.554549] WARNING: at kernel/sched/core.c:6086 build_sched_domains+0x1a9/0x2d0()
>
> oops,.. could you get me the output of:
>
>  cat /sys/devices/system/node/node*/distance
>
> for that machine? I'll see if I can reproduce using numa=fake.

[    0.000000] ACPI: SLIT: nodes = 8
[    0.000000]    10 15 20 15 15 20 20 20
[    0.000000]    15 10 15 20 20 15 20 20
[    0.000000]    20 15 10 15 20 20 15 20
[    0.000000]    15 20 15 10 20 20 20 15
[    0.000000]    15 20 20 20 10 15 15 20
[    0.000000]    20 15 20 20 15 10 20 15
[    0.000000]    20 20 15 20 15 20 10 15
[    0.000000]    20 20 20 15 20 15 15 10


[root@yhlu-pc2 ~]# cat /sys/devices/system/node/node*/distance
10 15 15 20 15 20 20 20
15 10 20 15 20 15 20 20
15 20 10 15 20 20 15 20
20 15 15 10 20 20 20 15
15 20 20 20 10 15 20 15
20 15 20 20 15 10 15 20
20 20 15 20 20 15 10 15
20 20 20 15 15 20 15 10


* Re: [tip:sched/core] sched/numa: Rewrite the CONFIG_NUMA sched domain support
  2012-05-09 14:29 [tip:sched/core] sched/numa: Rewrite the CONFIG_NUMA sched domain support tip-bot for Peter Zijlstra
  2012-05-10 17:30 ` Yinghai Lu
@ 2012-05-24 21:23 ` Tony Luck
  2012-05-25  7:31   ` Peter Zijlstra
  2012-06-06  7:43 ` Alex Shi
  2 siblings, 1 reply; 16+ messages in thread
From: Tony Luck @ 2012-05-24 21:23 UTC (permalink / raw)
  To: mingo, a.p.zijlstra, torvalds, cmetcalf, tony.luck, sivanich,
	akpm, ralf, greg.pearson, ink, tglx, rth, kamezawa.hiroyu, paulus,
	linux-kernel, hpa, anton, lethal, davem, benh, dhowells, mattst88,
	fenghua.yu

On Wed, May 9, 2012 at 7:29 AM, tip-bot for Peter Zijlstra
<a.p.zijlstra@chello.nl> wrote:
> Commit-ID:  cb83b629bae0327cf9f44f096adc38d150ceb913
> Gitweb:     http://git.kernel.org/tip/cb83b629bae0327cf9f44f096adc38d150ceb913
> Author:     Peter Zijlstra <a.p.zijlstra@chello.nl>
> AuthorDate: Tue, 17 Apr 2012 15:49:36 +0200
> Committer:  Ingo Molnar <mingo@kernel.org>
> CommitDate: Wed, 9 May 2012 15:00:55 +0200
>
> sched/numa: Rewrite the CONFIG_NUMA sched domain support

This is upstream in Linus' tree now - and seems to be the cause of
an ia64 boot failure. The zonelist that arrives at __alloc_pages_nodemask
is garbage. Changing both the kzalloc_node() calls in sched_init_numa()
into plain kzalloc() calls seems to fix things. So it looks like we are trying
to allocate on a node before the node has been fully set up.

Call Trace:
 [<a0000001000165e0>] show_stack+0x80/0xa0
                                sp=e000000301b7f6f0 bsp=e000000301b71348
 [<a000000100016c40>] show_regs+0x640/0x920
                                sp=e000000301b7f8c0 bsp=e000000301b712f0
 [<a0000001000417f0>] die+0x190/0x2c0
                                sp=e000000301b7f8d0 bsp=e000000301b712b0
 [<a000000100074a90>] ia64_do_page_fault+0x6b0/0xac0
                                sp=e000000301b7f8d0 bsp=e000000301b71258
 [<a00000010000c100>] ia64_native_leave_kernel+0x0/0x270
                                sp=e000000301b7f960 bsp=e000000301b71258
 [<a00000010016b3a0>] __alloc_pages_nodemask+0x140/0xce0
                                sp=e000000301b7fb30 bsp=e000000301b710f0
 [<a0000001001ec970>] allocate_slab+0x130/0x3c0
                                sp=e000000301b7fb50 bsp=e000000301b71098
 [<a0000001001ecc40>] new_slab+0x40/0x680
                                sp=e000000301b7fb50 bsp=e000000301b71040
 [<a0000001001ed960>] __slab_alloc+0x6e0/0x8e0
                                sp=e000000301b7fb50 bsp=e000000301b70fa8
 [<a0000001001ef9a0>] kmem_cache_alloc_node+0xc0/0x3a0
                                sp=e000000301b7fb90 bsp=e000000301b70f70
 [<a0000001000df8a0>] sched_init_numa+0x360/0x780
                                sp=e000000301b7fb90 bsp=e000000301b70ed0
 [<a000000100d6be80>] sched_init_smp+0x30/0x300
                                sp=e000000301b7fbb0 bsp=e000000301b70eb0
 [<a000000100d50760>] kernel_init+0x230/0x340
                                sp=e000000301b7fdb0 bsp=e000000301b70e88
 [<a0000001000145f0>] kernel_thread_helper+0x30/0x60
                                sp=e000000301b7fe30 bsp=e000000301b70e60
 [<a00000010000a0c0>] start_kernel_thread+0x20/0x40
                                sp=e000000301b7fe30 bsp=e000000301b70e60
Disabling lock debugging due to kernel taint

-Tony


* Re: [tip:sched/core] sched/numa: Rewrite the CONFIG_NUMA sched domain support
  2012-05-24 21:23 ` Tony Luck
@ 2012-05-25  7:31   ` Peter Zijlstra
  2012-05-25 14:24     ` Tony Luck
                       ` (2 more replies)
  0 siblings, 3 replies; 16+ messages in thread
From: Peter Zijlstra @ 2012-05-25  7:31 UTC (permalink / raw)
  To: Tony Luck
  Cc: mingo, torvalds, cmetcalf, sivanich, akpm, ralf, greg.pearson,
	ink, tglx, rth, kamezawa.hiroyu, paulus, linux-kernel, hpa, anton,
	lethal, davem, benh, dhowells, mattst88, fenghua.yu

On Thu, 2012-05-24 at 14:23 -0700, Tony Luck wrote:
> Changing both the kzalloc_node() calls in sched_init_numa()
> into plain kzalloc() calls seems to fix things. So it looks like we are trying
> to allocate on a node before the node has been fully set up. 

Right, and it's not too important either, so let's just use regular
allocations.

That said, I can only find the one kzalloc_node() call in sched_init_numa()


---
Subject: sched: Don't try allocating memory from offline nodes
From: Peter Zijlstra <a.p.zijlstra@chello.nl>
Date: Fri May 25 09:26:43 CEST 2012

Allocators don't appreciate it when you try and allocate memory from
offline nodes.

Reported-by: Tony Luck <tony.luck@intel.com>
Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
---
 kernel/sched/core.c |    6 +++---
 1 file changed, 3 insertions(+), 3 deletions(-)

Index: linux-2.6/kernel/sched/core.c
===================================================================
--- linux-2.6.orig/kernel/sched/core.c
+++ linux-2.6/kernel/sched/core.c
@@ -6449,7 +6449,7 @@ static void sched_init_numa(void)
 			return;
 
 		for (j = 0; j < nr_node_ids; j++) {
-			struct cpumask *mask = kzalloc_node(cpumask_size(), GFP_KERNEL, j);
+			struct cpumask *mask = kzalloc(cpumask_size(), GFP_KERNEL);
 			if (!mask)
 				return;
 




* Re: [tip:sched/core] sched/numa: Rewrite the CONFIG_NUMA sched domain support
  2012-05-25  7:31   ` Peter Zijlstra
@ 2012-05-25 14:24     ` Tony Luck
  2012-05-25 16:26       ` Tony Luck
  2012-05-29  0:19     ` Anton Blanchard
  2012-06-05  7:16     ` Alex Shi
  2 siblings, 1 reply; 16+ messages in thread
From: Tony Luck @ 2012-05-25 14:24 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: mingo, torvalds, cmetcalf, sivanich, akpm, ralf, greg.pearson,
	ink, tglx, rth, kamezawa.hiroyu, paulus, linux-kernel, hpa, anton,
	lethal, davem, benh, dhowells, mattst88, fenghua.yu

On Fri, May 25, 2012 at 12:31 AM, Peter Zijlstra <a.p.zijlstra@chello.nl> wrote:
> Right, and it's not too important either, so let's just use regular
> allocations.

Thanks.

> That said, I can only find the one kzalloc_node() call in sched_init_numa()

Doh - I must have searched for the next match, and not noticed that
I had skipped into a different function.
-Tony


* Re: [tip:sched/core] sched/numa: Rewrite the CONFIG_NUMA sched domain support
  2012-05-25 14:24     ` Tony Luck
@ 2012-05-25 16:26       ` Tony Luck
  0 siblings, 0 replies; 16+ messages in thread
From: Tony Luck @ 2012-05-25 16:26 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: mingo, torvalds, cmetcalf, sivanich, akpm, ralf, greg.pearson,
	ink, tglx, rth, kamezawa.hiroyu, paulus, linux-kernel, hpa, anton,
	lethal, davem, benh, dhowells, mattst88, fenghua.yu

On Fri, May 25, 2012 at 7:24 AM, Tony Luck <tony.luck@intel.com> wrote:
>> That said, I can only find the one kzalloc_node() call in sched_init_numa()

Just to complete the loop - your patch is good ... it isn't necessary to
also change another random kzalloc_node() in an unrelated function
that just happens to be where "n" in vi jumps to :-)

Tested-by: Tony Luck <tony.luck@intel.com>


* Re: [tip:sched/core] sched/numa: Rewrite the CONFIG_NUMA sched domain support
  2012-05-25  7:31   ` Peter Zijlstra
  2012-05-25 14:24     ` Tony Luck
@ 2012-05-29  0:19     ` Anton Blanchard
  2012-06-05  7:16     ` Alex Shi
  2 siblings, 0 replies; 16+ messages in thread
From: Anton Blanchard @ 2012-05-29  0:19 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Tony Luck, mingo, torvalds, cmetcalf, sivanich, akpm, ralf,
	greg.pearson, ink, tglx, rth, kamezawa.hiroyu, paulus,
	linux-kernel, hpa, lethal, davem, benh, dhowells, mattst88,
	fenghua.yu, shangw


Hi Peter,

We have a number of ppc64 boxes that are hitting this and have
verified that the patch fixes it.

Tested-by: Anton Blanchard <anton@samba.org>

Thanks!
Anton

---
Subject: sched: Don't try allocating memory from offline nodes
From: Peter Zijlstra <a.p.zijlstra@chello.nl>
Date: Fri May 25 09:26:43 CEST 2012

Allocators don't appreciate it when you try and allocate memory from
offline nodes.

Reported-by: Tony Luck <tony.luck@intel.com>
Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
---
 kernel/sched/core.c |    6 +++---
 1 file changed, 3 insertions(+), 3 deletions(-)

Index: linux-2.6/kernel/sched/core.c
===================================================================
--- linux-2.6.orig/kernel/sched/core.c
+++ linux-2.6/kernel/sched/core.c
@@ -6449,7 +6449,7 @@ static void sched_init_numa(void)
 			return;
 
 		for (j = 0; j < nr_node_ids; j++) {
-			struct cpumask *mask = kzalloc_node(cpumask_size(), GFP_KERNEL, j);
+			struct cpumask *mask = kzalloc(cpumask_size(), GFP_KERNEL);
 			if (!mask)
 				return;
 




* Re: [tip:sched/core] sched/numa: Rewrite the CONFIG_NUMA sched domain support
  2012-05-10 17:54     ` Yinghai Lu
@ 2012-05-29  0:32       ` Jiang Liu
  2012-05-29 12:13         ` Peter Zijlstra
  2012-05-29 17:12         ` Yinghai Lu
  0 siblings, 2 replies; 16+ messages in thread
From: Jiang Liu @ 2012-05-29  0:32 UTC (permalink / raw)
  To: Yinghai Lu
  Cc: Peter Zijlstra, mingo, torvalds, cmetcalf, tony.luck, sivanich,
	akpm, ralf, greg.pearson, ink, tglx, rth, kamezawa.hiroyu, paulus,
	linux-kernel, hpa, anton, lethal, davem, benh, dhowells, mattst88,
	fenghua.yu, linux-tip-commits

Hi Yinghai,
	Does this patch fix your issue? https://lkml.org/lkml/2012/5/9/183
I have encountered a similar issue on an IA64 platform, and the patch above
works around it. But the root cause is a BIOS bug: the order of CPUs
in the MADT table doesn't conform to the ACPI specification, and the first CPU
in the MADT is not the BSP, which breaks some assumptions of the boot code
and causes the core dump.
	Thanks!

On 05/11/2012 01:54 AM, Yinghai Lu wrote:
> On Thu, May 10, 2012 at 10:44 AM, Peter Zijlstra <a.p.zijlstra@chello.nl> wrote:
>> On Thu, 2012-05-10 at 10:30 -0700, Yinghai Lu wrote:
>>> not sure if this one or other is related....
>>>
>>> got this from 8 socket Nehalem-ex box.
>>>
>>> [   25.549259] mtrr_aps_init() done
>>> [   25.554298] ------------[ cut here ]------------
>>> [   25.554549] WARNING: at kernel/sched/core.c:6086 build_sched_domains+0x1a9/0x2d0()
>>
>> oops,.. could you get me the output of:
>>
>>  cat /sys/devices/system/node/node*/distance
>>
>> for that machine? I'll see if I can reproduce using numa=fake.
> 
> [    0.000000] ACPI: SLIT: nodes = 8
> [    0.000000]    10 15 20 15 15 20 20 20
> [    0.000000]    15 10 15 20 20 15 20 20
> [    0.000000]    20 15 10 15 20 20 15 20
> [    0.000000]    15 20 15 10 20 20 20 15
> [    0.000000]    15 20 20 20 10 15 15 20
> [    0.000000]    20 15 20 20 15 10 20 15
> [    0.000000]    20 20 15 20 15 20 10 15
> [    0.000000]    20 20 20 15 20 15 15 10
> 
> 
> [root@yhlu-pc2 ~]# cat /sys/devices/system/node/node*/distance
> 10 15 15 20 15 20 20 20
> 15 10 20 15 20 15 20 20
> 15 20 10 15 20 20 15 20
> 20 15 15 10 20 20 20 15
> 15 20 20 20 10 15 20 15
> 20 15 20 20 15 10 15 20
> 20 20 15 20 20 15 10 15
> 20 20 20 15 15 20 15 10



* Re: [tip:sched/core] sched/numa: Rewrite the CONFIG_NUMA sched domain support
  2012-05-29  0:32       ` Jiang Liu
@ 2012-05-29 12:13         ` Peter Zijlstra
  2012-05-29 17:12         ` Yinghai Lu
  1 sibling, 0 replies; 16+ messages in thread
From: Peter Zijlstra @ 2012-05-29 12:13 UTC (permalink / raw)
  To: Jiang Liu
  Cc: Yinghai Lu, mingo, torvalds, cmetcalf, tony.luck, sivanich, akpm,
	ralf, greg.pearson, ink, tglx, rth, kamezawa.hiroyu, paulus,
	linux-kernel, hpa, anton, lethal, davem, benh, dhowells, mattst88,
	fenghua.yu, linux-tip-commits

On Tue, 2012-05-29 at 08:32 +0800, Jiang Liu wrote:
>         Does this patch fix your issue? https://lkml.org/lkml/2012/5/9/183
> I have encountered a similar issue on an IA64 platform, and the patch above
> works around it. But the root cause is a BIOS bug: the order of CPUs
> in the MADT table doesn't conform to the ACPI specification, and the first CPU
> in the MADT is not the BSP, which breaks some assumptions of the boot code
> and causes the core dump.

Is it IA64 arch code that contains those false assumptions, or is it
generic (sched) code that contains them? Especially in the latter case I'd be
very interested to hear where these are so we can fix them.




* Re: [tip:sched/core] sched/numa: Rewrite the CONFIG_NUMA sched domain support
  2012-05-29  0:32       ` Jiang Liu
  2012-05-29 12:13         ` Peter Zijlstra
@ 2012-05-29 17:12         ` Yinghai Lu
  1 sibling, 0 replies; 16+ messages in thread
From: Yinghai Lu @ 2012-05-29 17:12 UTC (permalink / raw)
  To: Jiang Liu
  Cc: Peter Zijlstra, mingo, torvalds, cmetcalf, tony.luck, sivanich,
	akpm, ralf, greg.pearson, ink, tglx, rth, kamezawa.hiroyu, paulus,
	linux-kernel, hpa, anton, lethal, davem, benh, dhowells, mattst88,
	fenghua.yu, linux-tip-commits

On Mon, May 28, 2012 at 5:32 PM, Jiang Liu <liuj97@gmail.com> wrote:
> Hi Yinghai,
>        Does this patch fix your issue? https://lkml.org/lkml/2012/5/9/183
> I have encountered a similar issue on an IA64 platform, and the patch above
> works around it. But the root cause is a BIOS bug: the order of CPUs
> in the MADT table doesn't conform to the ACPI specification, and the first CPU
> in the MADT is not the BSP, which breaks some assumptions of the boot code
> and causes the core dump.

Yes, with another patch from PeterZ.

---
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -6396,8 +6396,7 @@ static void sched_init_numa(void)
 			sched_domains_numa_masks[i][j] = mask;
 
 			for (k = 0; k < nr_node_ids; k++) {
-				if (node_distance(cpu_to_node(j), k) >
-						sched_domains_numa_distance[i])
+				if (node_distance(j, k) > sched_domains_numa_distance[i])
 					continue;
 
 				cpumask_or(mask, mask, cpumask_of_node(k));


* Re: [tip:sched/core] sched/numa: Rewrite the CONFIG_NUMA sched domain support
  2012-05-25  7:31   ` Peter Zijlstra
  2012-05-25 14:24     ` Tony Luck
  2012-05-29  0:19     ` Anton Blanchard
@ 2012-06-05  7:16     ` Alex Shi
  2 siblings, 0 replies; 16+ messages in thread
From: Alex Shi @ 2012-06-05  7:16 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Tony Luck, mingo, torvalds, cmetcalf, sivanich, akpm, ralf,
	greg.pearson, ink, tglx, rth, kamezawa.hiroyu, paulus,
	linux-kernel, hpa, anton, lethal, davem, benh, dhowells, mattst88,
	fenghua.yu, Alex Shi

LKP performance testing sets 'mem=2g' for some benchmarks; that cmdline
hit a kernel panic in __alloc_pages_nodemask on 3.5-rc1, and this patch
fixes it.
Thanks!

Reported-and-tested-by: Alex Shi <alex.shi@intel.com>


On Fri, May 25, 2012 at 3:31 PM, Peter Zijlstra <a.p.zijlstra@chello.nl> wrote:
> On Thu, 2012-05-24 at 14:23 -0700, Tony Luck wrote:
>> Changing both the kzalloc_node() calls in sched_init_numa()
>> into plain kzalloc() calls seems to fix things. So it looks like we are trying
>> to allocate on a node before the node has been fully set up.
>
> Right,.. and its not too important either, so lets just use regular
> allocations.
>
> That said, I can only find the 1 alloc_node() in sched_init_numa()
>
>
> ---
> Subject: sched: Don't try allocating memory from offline nodes
> From: Peter Zijlstra <a.p.zijlstra@chello.nl>
> Date: Fri May 25 09:26:43 CEST 2012
>
> Allocators don't appreciate it when you try and allocate memory from
> offline nodes.
>
> Reported-by: Tony Luck <tony.luck@intel.com>
> Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
> ---
>  kernel/sched/core.c |    6 +++---
>  1 file changed, 3 insertions(+), 3 deletions(-)
>
> Index: linux-2.6/kernel/sched/core.c
> ===================================================================
> --- linux-2.6.orig/kernel/sched/core.c
> +++ linux-2.6/kernel/sched/core.c
> @@ -6449,7 +6449,7 @@ static void sched_init_numa(void)
>                        return;
>
>                for (j = 0; j < nr_node_ids; j++) {
> -                       struct cpumask *mask = kzalloc_node(cpumask_size(), GFP_KERNEL, j);
> +                       struct cpumask *mask = kzalloc(cpumask_size(), GFP_KERNEL);
>                        if (!mask)
>                                return;
>
>
>


* Re: [tip:sched/core] sched/numa: Rewrite the CONFIG_NUMA sched domain support
  2012-05-09 14:29 [tip:sched/core] sched/numa: Rewrite the CONFIG_NUMA sched domain support tip-bot for Peter Zijlstra
  2012-05-10 17:30 ` Yinghai Lu
  2012-05-24 21:23 ` Tony Luck
@ 2012-06-06  7:43 ` Alex Shi
  2012-06-06  9:15   ` Peter Zijlstra
  2 siblings, 1 reply; 16+ messages in thread
From: Alex Shi @ 2012-06-06  7:43 UTC (permalink / raw)
  To: mingo, a.p.zijlstra, torvalds, cmetcalf, tony.luck, sivanich,
	akpm, ralf, greg.pearson, ink, tglx, rth, kamezawa.hiroyu, paulus,
	linux-kernel, hpa, anton, lethal, davem, benh, dhowells, mattst88,
	fenghua.yu, Alex Shi
  Cc: linux-tip-commits

> +       /*
> +        * O(nr_nodes^2) deduplicating selection sort -- in order to find the
> +        * unique distances in the node_distance() table.
> +        *
> +        * Assumes node_distance(0,j) includes all distances in
> +        * node_distance(i,j) in order to avoid cubic time.

Curious about the node_distance numbers on other platforms: this
assumption holds for the Intel platforms I have seen, but it doesn't
match the example in the ACPI 5.0 spec (acpispec50.pdf):

Table 6-152 Example Relative Distances Between Proximity Domains

Proximity Domain   0   1   2   3
               0  10  15  20  18
               1  15  10  16  24
               2  20  16  10  12
               3  18  24  12  10


Alex


* Re: [tip:sched/core] sched/numa: Rewrite the CONFIG_NUMA sched domain support
  2012-06-06  7:43 ` Alex Shi
@ 2012-06-06  9:15   ` Peter Zijlstra
  2012-06-07  0:34     ` Alex Shi
  0 siblings, 1 reply; 16+ messages in thread
From: Peter Zijlstra @ 2012-06-06  9:15 UTC (permalink / raw)
  To: Alex Shi
  Cc: mingo, torvalds, cmetcalf, tony.luck, sivanich, akpm, ralf,
	greg.pearson, ink, tglx, rth, kamezawa.hiroyu, paulus,
	linux-kernel, hpa, anton, lethal, davem, benh, dhowells, mattst88,
	fenghua.yu, Alex Shi, linux-tip-commits

On Wed, 2012-06-06 at 15:43 +0800, Alex Shi wrote:
> > +       /*
> > +        * O(nr_nodes^2) deduplicating selection sort -- in order to find the
> > +        * unique distances in the node_distance() table.
> > +        *
> > +        * Assumes node_distance(0,j) includes all distances in
> > +        * node_distance(i,j) in order to avoid cubic time.
> 
> Curious about the node_distance numbers on other platforms: this
> assumption holds for the Intel platforms I have seen, but it doesn't
> match the example in the ACPI 5.0 spec (acpispec50.pdf):
> 
> Table 6-152 Example Relative Distances Between Proximity Domains
> 
> Proximity Domain   0   1   2   3
>                0  10  15  20  18
>                1  15  10  16  24
>                2  20  16  10  12
>                3  18  24  12  10

Yes, I know it's allowed, I just haven't seen it in practice.

I've got a patch that validates this assumption if you boot with
"sched_debug". If we ever run into such a setup we might need to fix
this -- it shouldn't be too hard, just expensive.


* Re: [tip:sched/core] sched/numa: Rewrite the CONFIG_NUMA sched domain support
  2012-06-06  9:15   ` Peter Zijlstra
@ 2012-06-07  0:34     ` Alex Shi
  0 siblings, 0 replies; 16+ messages in thread
From: Alex Shi @ 2012-06-07  0:34 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Alex Shi, mingo, torvalds, cmetcalf, tony.luck, sivanich, akpm,
	ralf, greg.pearson, ink, tglx, rth, kamezawa.hiroyu, paulus,
	linux-kernel, hpa, anton, lethal, davem, benh, dhowells, mattst88,
	fenghua.yu, linux-tip-commits

On 06/06/2012 05:15 PM, Peter Zijlstra wrote:

> On Wed, 2012-06-06 at 15:43 +0800, Alex Shi wrote:
>>> +       /*
>>> +        * O(nr_nodes^2) deduplicating selection sort -- in order to find the
>>> +        * unique distances in the node_distance() table.
>>> +        *
>>> +        * Assumes node_distance(0,j) includes all distances in
>>> +        * node_distance(i,j) in order to avoid cubic time.
>>
>> Curious for other platforms node_distance number, actually, this
>> assumption is right for what I saw Intel platforms. but it is not
>> match acpispec50.pdf:
>>
>> Table 6-152 Example Relative Distances Between Proximity Domains
>> Proximity Domain 0 1 2 3
>> 0 10 15 20 18
>> 1 15 10 16 24
>> 2 20 16 10 12
>> 3 18 24 12 10
> 
> Yes I know its allowed, I just haven't seen it in practice.


I see. Thanks.

> 
> I've got a patch that validates this assumption if you boot with
> "sched_debug". If we ever run into such a setup we might need to fix
> this -- it shouldn't be too hard, just expensive.


Sure.


Thread overview: 16+ messages
2012-05-09 14:29 [tip:sched/core] sched/numa: Rewrite the CONFIG_NUMA sched domain support tip-bot for Peter Zijlstra
2012-05-10 17:30 ` Yinghai Lu
2012-05-10 17:44   ` Peter Zijlstra
2012-05-10 17:54     ` Yinghai Lu
2012-05-29  0:32       ` Jiang Liu
2012-05-29 12:13         ` Peter Zijlstra
2012-05-29 17:12         ` Yinghai Lu
2012-05-24 21:23 ` Tony Luck
2012-05-25  7:31   ` Peter Zijlstra
2012-05-25 14:24     ` Tony Luck
2012-05-25 16:26       ` Tony Luck
2012-05-29  0:19     ` Anton Blanchard
2012-06-05  7:16     ` Alex Shi
2012-06-06  7:43 ` Alex Shi
2012-06-06  9:15   ` Peter Zijlstra
2012-06-07  0:34     ` Alex Shi
