* [PATCH/RFC 0/5] sched: add new 'book' scheduling domain
@ 2010-08-12 17:25 Heiko Carstens
2010-08-12 17:25 ` [PATCH/RFC 1/5] [PATCH] sched: merge cpu_to_core_group functions Heiko Carstens
` (5 more replies)
0 siblings, 6 replies; 17+ messages in thread
From: Heiko Carstens @ 2010-08-12 17:25 UTC (permalink / raw)
To: Peter Zijlstra, Mike Galbraith, Ingo Molnar, Suresh Siddha,
Andreas Herrmann
Cc: linux-kernel, Martin Schwidefsky
This patch set adds (yet) another scheduling domain to the scheduler. The
reason for this is that the recent (s390) z196 architecture has four cache
levels and uniform memory access (sort of -- see below).
The cpu/cache/memory hierarchy is as follows:
Each cpu has its private L1 (64KB I-cache + 128KB D-cache) and L2 (1.5MB)
cache.
A core consists of four cpus with a 24MB shared L3 cache.
A book consists of six cores with a 192MB shared L4 cache.
The z196 architecture has no SMT.
Also, the statement that we have uniform memory access is not entirely
correct. Actually the machine uses memory striping, so it "looks" like
we have UMA until the next slice of memory gets accessed.
However, there is no interface which tells us which piece of memory is local
or remote. So we (have to) simplify and assume that the cost of each memory
access that misses the L4 cache is the same.
In order to make use of the information about the cache hierarchy, so that
the scheduler can make decisions that improve cache hit rates, I added the
'BOOK' scheduling domain between the MC and CPU domains.
First performance measurements, however, show no effect - neither good nor
bad. So it might be that the workloads aren't good enough, or that the
implementation is simply wrong.
Either way, since it's currently very hard to get machine time for additional
measurements, I thought it might be a good idea to post the patches as an RFC
even if we do not have any convincing arguments yet.
Also please note that the scheduling domain initializers certainly need some
tuning:
The line
#define SD_BOOK_INIT SD_CPU_INIT
within the arch support patch is just there so the code compiles, until we
have something that really works.
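For illustration only, a tuned initializer would presumably mirror the
structure of the other SD_*_INIT initializers, with values adjusted for the
large shared L4 cache. Every value below is a hypothetical placeholder, not
the result of any measurement or tuning:

```c
/* Hypothetical SD_BOOK_INIT sketch -- all field values are guesses
 * for illustration only and would need tuning on real hardware.
 * The structure follows the existing SD_*_INIT initializers;
 * several fields (busy_idx, idle_idx, ...) are omitted here. */
#define SD_BOOK_INIT (struct sched_domain) {			\
	.min_interval		= 2,				\
	.max_interval		= 8,				\
	.busy_factor		= 64,				\
	.imbalance_pct		= 125,				\
	.cache_nice_tries	= 2,				\
	.flags			= SD_LOAD_BALANCE		\
				| SD_BALANCE_NEWIDLE		\
				| SD_BALANCE_EXEC		\
				| SD_BALANCE_FORK,		\
	.last_balance		= jiffies,			\
	.balance_interval	= 64,				\
}
```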
As for the patches, I think the first two could be merged anytime since
they are only cleanup/preparation patches.
Patch three adds the new scheduling domain and patch four the code needed
to represent books via the cpu topology sysfs interface.
Patch five is just the architecture backend.
A boot of a logical partition with 20 cpus, spread over two books, gives this
initialization output on the console:
Brought up 20 CPUs
CPU0 attaching sched-domain:
domain 0: span 0-5 level BOOK
groups: 0 1-3 (cpu_power = 3072) 4-5 (cpu_power = 2048)
domain 1: span 0-19 level CPU
groups: 0-5 (cpu_power = 6144) 6-19 (cpu_power = 14336)
CPU1 attaching sched-domain:
domain 0: span 1-3 level MC
groups: 1 2 3
domain 1: span 0-5 level BOOK
groups: 1-3 (cpu_power = 3072) 4-5 (cpu_power = 2048) 0
domain 2: span 0-19 level CPU
groups: 0-5 (cpu_power = 6144) 6-19 (cpu_power = 14336)
CPU2 attaching sched-domain:
domain 0: span 1-3 level MC
groups: 2 3 1
domain 1: span 0-5 level BOOK
groups: 1-3 (cpu_power = 3072) 4-5 (cpu_power = 2048) 0
domain 2: span 0-19 level CPU
groups: 0-5 (cpu_power = 6144) 6-19 (cpu_power = 14336)
CPU3 attaching sched-domain:
domain 0: span 1-3 level MC
groups: 3 1 2
domain 1: span 0-5 level BOOK
groups: 1-3 (cpu_power = 3072) 4-5 (cpu_power = 2048) 0
domain 2: span 0-19 level CPU
groups: 0-5 (cpu_power = 6144) 6-19 (cpu_power = 14336)
CPU4 attaching sched-domain:
domain 0: span 4-5 level MC
groups: 4 5
domain 1: span 0-5 level BOOK
groups: 4-5 (cpu_power = 2048) 0 1-3 (cpu_power = 3072)
domain 2: span 0-19 level CPU
groups: 0-5 (cpu_power = 6144) 6-19 (cpu_power = 14336)
CPU5 attaching sched-domain:
domain 0: span 4-5 level MC
groups: 5 4
domain 1: span 0-5 level BOOK
groups: 4-5 (cpu_power = 2048) 0 1-3 (cpu_power = 3072)
domain 2: span 0-19 level CPU
groups: 0-5 (cpu_power = 6144) 6-19 (cpu_power = 14336)
CPU6 attaching sched-domain:
domain 0: span 6-9 level MC
groups: 6 7 8 9
domain 1: span 6-19 level BOOK
groups: 6-9 (cpu_power = 4096) 10-11 (cpu_power = 2048) 12-13 (cpu_power = 2048) 14-16 (cpu_power = 3072) 17-19 (cpu_power = 3072)
domain 2: span 0-19 level CPU
groups: 6-19 (cpu_power = 14336) 0-5 (cpu_power = 6144)
CPU7 attaching sched-domain:
domain 0: span 6-9 level MC
groups: 7 8 9 6
domain 1: span 6-19 level BOOK
groups: 6-9 (cpu_power = 4096) 10-11 (cpu_power = 2048) 12-13 (cpu_power = 2048) 14-16 (cpu_power = 3072) 17-19 (cpu_power = 3072)
domain 2: span 0-19 level CPU
groups: 6-19 (cpu_power = 14336) 0-5 (cpu_power = 6144)
CPU8 attaching sched-domain:
domain 0: span 6-9 level MC
groups: 8 9 6 7
domain 1: span 6-19 level BOOK
groups: 6-9 (cpu_power = 4096) 10-11 (cpu_power = 2048) 12-13 (cpu_power = 2048) 14-16 (cpu_power = 3072) 17-19 (cpu_power = 3072)
domain 2: span 0-19 level CPU
groups: 6-19 (cpu_power = 14336) 0-5 (cpu_power = 6144)
CPU9 attaching sched-domain:
domain 0: span 6-9 level MC
groups: 9 6 7 8
domain 1: span 6-19 level BOOK
groups: 6-9 (cpu_power = 4096) 10-11 (cpu_power = 2048) 12-13 (cpu_power = 2048) 14-16 (cpu_power = 3072) 17-19 (cpu_power = 3072)
domain 2: span 0-19 level CPU
groups: 6-19 (cpu_power = 14336) 0-5 (cpu_power = 6144)
CPU10 attaching sched-domain:
domain 0: span 10-11 level MC
groups: 10 11
domain 1: span 6-19 level BOOK
groups: 10-11 (cpu_power = 2048) 12-13 (cpu_power = 2048) 14-16 (cpu_power = 3072) 17-19 (cpu_power = 3072) 6-9 (cpu_power = 4096)
domain 2: span 0-19 level CPU
groups: 6-19 (cpu_power = 14336) 0-5 (cpu_power = 6144)
CPU11 attaching sched-domain:
domain 0: span 10-11 level MC
groups: 11 10
domain 1: span 6-19 level BOOK
groups: 10-11 (cpu_power = 2048) 12-13 (cpu_power = 2048) 14-16 (cpu_power = 3072) 17-19 (cpu_power = 3072) 6-9 (cpu_power = 4096)
domain 2: span 0-19 level CPU
groups: 6-19 (cpu_power = 14336) 0-5 (cpu_power = 6144)
CPU12 attaching sched-domain:
domain 0: span 12-13 level MC
groups: 12 13
domain 1: span 6-19 level BOOK
groups: 12-13 (cpu_power = 2048) 14-16 (cpu_power = 3072) 17-19 (cpu_power = 3072) 6-9 (cpu_power = 4096) 10-11 (cpu_power = 2048)
domain 2: span 0-19 level CPU
groups: 6-19 (cpu_power = 14336) 0-5 (cpu_power = 6144)
CPU13 attaching sched-domain:
domain 0: span 12-13 level MC
groups: 13 12
domain 1: span 6-19 level BOOK
groups: 12-13 (cpu_power = 2048) 14-16 (cpu_power = 3072) 17-19 (cpu_power = 3072) 6-9 (cpu_power = 4096) 10-11 (cpu_power = 2048)
domain 2: span 0-19 level CPU
groups: 6-19 (cpu_power = 14336) 0-5 (cpu_power = 6144)
CPU14 attaching sched-domain:
domain 0: span 14-16 level MC
groups: 14 15 16
domain 1: span 6-19 level BOOK
groups: 14-16 (cpu_power = 3072) 17-19 (cpu_power = 3072) 6-9 (cpu_power = 4096) 10-11 (cpu_power = 2048) 12-13 (cpu_power = 2048)
domain 2: span 0-19 level CPU
groups: 6-19 (cpu_power = 14336) 0-5 (cpu_power = 6144)
CPU15 attaching sched-domain:
domain 0: span 14-16 level MC
groups: 15 16 14
domain 1: span 6-19 level BOOK
groups: 14-16 (cpu_power = 3072) 17-19 (cpu_power = 3072) 6-9 (cpu_power = 4096) 10-11 (cpu_power = 2048) 12-13 (cpu_power = 2048)
domain 2: span 0-19 level CPU
groups: 6-19 (cpu_power = 14336) 0-5 (cpu_power = 6144)
CPU16 attaching sched-domain:
domain 0: span 14-16 level MC
groups: 16 14 15
domain 1: span 6-19 level BOOK
groups: 14-16 (cpu_power = 3072) 17-19 (cpu_power = 3072) 6-9 (cpu_power = 4096) 10-11 (cpu_power = 2048) 12-13 (cpu_power = 2048)
domain 2: span 0-19 level CPU
groups: 6-19 (cpu_power = 14336) 0-5 (cpu_power = 6144)
CPU17 attaching sched-domain:
domain 0: span 17-19 level MC
groups: 17 18 19
domain 1: span 6-19 level BOOK
groups: 17-19 (cpu_power = 3072) 6-9 (cpu_power = 4096) 10-11 (cpu_power = 2048) 12-13 (cpu_power = 2048) 14-16 (cpu_power = 3072)
domain 2: span 0-19 level CPU
groups: 6-19 (cpu_power = 14336) 0-5 (cpu_power = 6144)
CPU18 attaching sched-domain:
domain 0: span 17-19 level MC
groups: 18 19 17
domain 1: span 6-19 level BOOK
groups: 17-19 (cpu_power = 3072) 6-9 (cpu_power = 4096) 10-11 (cpu_power = 2048) 12-13 (cpu_power = 2048) 14-16 (cpu_power = 3072)
domain 2: span 0-19 level CPU
groups: 6-19 (cpu_power = 14336) 0-5 (cpu_power = 6144)
CPU19 attaching sched-domain:
domain 0: span 17-19 level MC
groups: 19 17 18
domain 1: span 6-19 level BOOK
groups: 17-19 (cpu_power = 3072) 6-9 (cpu_power = 4096) 10-11 (cpu_power = 2048) 12-13 (cpu_power = 2048) 14-16 (cpu_power = 3072)
domain 2: span 0-19 level CPU
groups: 6-19 (cpu_power = 14336) 0-5 (cpu_power = 6144)
^ permalink raw reply [flat|nested] 17+ messages in thread
* [PATCH/RFC 1/5] [PATCH] sched: merge cpu_to_core_group functions
2010-08-12 17:25 [PATCH/RFC 0/5] sched: add new 'book' scheduling domain Heiko Carstens
@ 2010-08-12 17:25 ` Heiko Carstens
2010-08-13 21:11 ` Suresh Siddha
2010-08-12 17:25 ` [PATCH/RFC 2/5] [PATCH] sched: pass sched_domain_level to sched_power_savings_store Heiko Carstens
` (4 subsequent siblings)
5 siblings, 1 reply; 17+ messages in thread
From: Heiko Carstens @ 2010-08-12 17:25 UTC (permalink / raw)
To: Peter Zijlstra, Mike Galbraith, Ingo Molnar, Suresh Siddha,
Andreas Herrmann
Cc: linux-kernel, Martin Schwidefsky, Heiko Carstens
[-- Attachment #1: 01-sched-cputocore.diff --]
[-- Type: text/plain, Size: 1634 bytes --]
From: Heiko Carstens <heiko.carstens@de.ibm.com>
Merge and simplify the two cpu_to_core_group variants so that the
resulting function follows the same pattern as cpu_to_phys_group.
Signed-off-by: Heiko Carstens <heiko.carstens@de.ibm.com>
---
kernel/sched.c | 18 +++++-------------
1 file changed, 5 insertions(+), 13 deletions(-)
diff -urpN linux-2.6/kernel/sched.c linux-2.6-patched/kernel/sched.c
--- linux-2.6/kernel/sched.c 2010-08-11 13:47:16.000000000 +0200
+++ linux-2.6-patched/kernel/sched.c 2010-08-11 13:47:22.000000000 +0200
@@ -6546,31 +6546,23 @@ cpu_to_cpu_group(int cpu, const struct c
#ifdef CONFIG_SCHED_MC
static DEFINE_PER_CPU(struct static_sched_domain, core_domains);
static DEFINE_PER_CPU(struct static_sched_group, sched_group_core);
-#endif /* CONFIG_SCHED_MC */
-#if defined(CONFIG_SCHED_MC) && defined(CONFIG_SCHED_SMT)
static int
cpu_to_core_group(int cpu, const struct cpumask *cpu_map,
struct sched_group **sg, struct cpumask *mask)
{
int group;
-
+#ifdef CONFIG_SCHED_SMT
cpumask_and(mask, topology_thread_cpumask(cpu), cpu_map);
group = cpumask_first(mask);
+#else
+ group = cpu;
+#endif
if (sg)
*sg = &per_cpu(sched_group_core, group).sg;
return group;
}
-#elif defined(CONFIG_SCHED_MC)
-static int
-cpu_to_core_group(int cpu, const struct cpumask *cpu_map,
- struct sched_group **sg, struct cpumask *unused)
-{
- if (sg)
- *sg = &per_cpu(sched_group_core, cpu).sg;
- return cpu;
-}
-#endif
+#endif /* CONFIG_SCHED_MC */
static DEFINE_PER_CPU(struct static_sched_domain, phys_domains);
static DEFINE_PER_CPU(struct static_sched_group, sched_group_phys);
* [PATCH/RFC 2/5] [PATCH] sched: pass sched_domain_level to sched_power_savings_store
2010-08-12 17:25 [PATCH/RFC 0/5] sched: add new 'book' scheduling domain Heiko Carstens
2010-08-12 17:25 ` [PATCH/RFC 1/5] [PATCH] sched: merge cpu_to_core_group functions Heiko Carstens
@ 2010-08-12 17:25 ` Heiko Carstens
2010-08-13 21:13 ` Suresh Siddha
2010-08-16 8:29 ` Peter Zijlstra
2010-08-12 17:25 ` [PATCH/RFC 3/5] [PATCH] sched: add book scheduling domain Heiko Carstens
` (3 subsequent siblings)
5 siblings, 2 replies; 17+ messages in thread
From: Heiko Carstens @ 2010-08-12 17:25 UTC (permalink / raw)
To: Peter Zijlstra, Mike Galbraith, Ingo Molnar, Suresh Siddha,
Andreas Herrmann
Cc: linux-kernel, Martin Schwidefsky, Heiko Carstens
[-- Attachment #1: 02-sched-powersavings.diff --]
[-- Type: text/plain, Size: 2037 bytes --]
From: Heiko Carstens <heiko.carstens@de.ibm.com>
Pass the corresponding sched domain level to sched_power_savings_store instead
of a yes/no flag which indicates whether the level is SMT or MC.
This is needed to easily extend the function so it can be used for a third
level.
Signed-off-by: Heiko Carstens <heiko.carstens@de.ibm.com>
---
kernel/sched.c | 17 ++++++++++++-----
1 file changed, 12 insertions(+), 5 deletions(-)
diff -urpN linux-2.6/kernel/sched.c linux-2.6-patched/kernel/sched.c
--- linux-2.6/kernel/sched.c 2010-08-11 13:47:22.000000000 +0200
+++ linux-2.6-patched/kernel/sched.c 2010-08-11 13:47:22.000000000 +0200
@@ -7380,7 +7380,8 @@ static void arch_reinit_sched_domains(vo
put_online_cpus();
}
-static ssize_t sched_power_savings_store(const char *buf, size_t count, int smt)
+static ssize_t sched_power_savings_store(const char *buf, size_t count,
+ enum sched_domain_level sd_level)
{
unsigned int level = 0;
@@ -7397,10 +7398,16 @@ static ssize_t sched_power_savings_store
if (level >= MAX_POWERSAVINGS_BALANCE_LEVELS)
return -EINVAL;
- if (smt)
+ switch (sd_level) {
+ case SD_LV_SIBLING:
sched_smt_power_savings = level;
- else
+ break;
+ case SD_LV_MC:
sched_mc_power_savings = level;
+ break;
+ default:
+ break;
+ }
arch_reinit_sched_domains();
@@ -7418,7 +7425,7 @@ static ssize_t sched_mc_power_savings_st
struct sysdev_class_attribute *attr,
const char *buf, size_t count)
{
- return sched_power_savings_store(buf, count, 0);
+ return sched_power_savings_store(buf, count, SD_LV_MC);
}
static SYSDEV_CLASS_ATTR(sched_mc_power_savings, 0644,
sched_mc_power_savings_show,
@@ -7436,7 +7443,7 @@ static ssize_t sched_smt_power_savings_s
struct sysdev_class_attribute *attr,
const char *buf, size_t count)
{
- return sched_power_savings_store(buf, count, 1);
+ return sched_power_savings_store(buf, count, SD_LV_SIBLING);
}
static SYSDEV_CLASS_ATTR(sched_smt_power_savings, 0644,
sched_smt_power_savings_show,
* [PATCH/RFC 3/5] [PATCH] sched: add book scheduling domain
2010-08-12 17:25 [PATCH/RFC 0/5] sched: add new 'book' scheduling domain Heiko Carstens
2010-08-12 17:25 ` [PATCH/RFC 1/5] [PATCH] sched: merge cpu_to_core_group functions Heiko Carstens
2010-08-12 17:25 ` [PATCH/RFC 2/5] [PATCH] sched: pass sched_domain_level to sched_power_savings_store Heiko Carstens
@ 2010-08-12 17:25 ` Heiko Carstens
2010-08-13 21:22 ` Suresh Siddha
2010-08-12 17:25 ` [PATCH/RFC 4/5] [PATCH] topology/sysfs: provide book id and siblings attributes Heiko Carstens
` (2 subsequent siblings)
5 siblings, 1 reply; 17+ messages in thread
From: Heiko Carstens @ 2010-08-12 17:25 UTC (permalink / raw)
To: Peter Zijlstra, Mike Galbraith, Ingo Molnar, Suresh Siddha,
Andreas Herrmann
Cc: linux-kernel, Martin Schwidefsky, Heiko Carstens
[-- Attachment #1: 03-sched-book.diff --]
[-- Type: text/plain, Size: 12431 bytes --]
From: Heiko Carstens <heiko.carstens@de.ibm.com>
On top of the SMT and MC scheduling domains this adds the BOOK scheduling
domain. This is useful for machines that have a four-level cache hierarchy
but do not fall into the NUMA category.
Signed-off-by: Heiko Carstens <heiko.carstens@de.ibm.com>
---
arch/s390/defconfig | 1
include/linux/sched.h | 19 +++++++
include/linux/topology.h | 6 ++
kernel/sched.c | 112 ++++++++++++++++++++++++++++++++++++++++++++---
kernel/sched_fair.c | 11 ++--
5 files changed, 137 insertions(+), 12 deletions(-)
diff -urpN linux-2.6/arch/s390/defconfig linux-2.6-patched/arch/s390/defconfig
--- linux-2.6/arch/s390/defconfig 2010-08-02 00:11:14.000000000 +0200
+++ linux-2.6-patched/arch/s390/defconfig 2010-08-11 13:47:23.000000000 +0200
@@ -248,6 +248,7 @@ CONFIG_64BIT=y
CONFIG_SMP=y
CONFIG_NR_CPUS=32
CONFIG_HOTPLUG_CPU=y
+# CONFIG_SCHED_BOOK is not set
CONFIG_COMPAT=y
CONFIG_SYSVIPC_COMPAT=y
CONFIG_AUDIT_ARCH=y
diff -urpN linux-2.6/include/linux/sched.h linux-2.6-patched/include/linux/sched.h
--- linux-2.6/include/linux/sched.h 2010-08-11 13:47:16.000000000 +0200
+++ linux-2.6-patched/include/linux/sched.h 2010-08-11 13:47:23.000000000 +0200
@@ -807,7 +807,9 @@ enum powersavings_balance_level {
MAX_POWERSAVINGS_BALANCE_LEVELS
};
-extern int sched_mc_power_savings, sched_smt_power_savings;
+extern int sched_smt_power_savings;
+extern int sched_mc_power_savings;
+extern int sched_book_power_savings;
static inline int sd_balance_for_mc_power(void)
{
@@ -820,11 +822,23 @@ static inline int sd_balance_for_mc_powe
return 0;
}
-static inline int sd_balance_for_package_power(void)
+static inline int sd_balance_for_book_power(void)
{
if (sched_mc_power_savings | sched_smt_power_savings)
return SD_POWERSAVINGS_BALANCE;
+ if (!sched_book_power_savings)
+ return SD_PREFER_SIBLING;
+
+ return 0;
+}
+
+static inline int sd_balance_for_package_power(void)
+{
+ if (sched_book_power_savings | sched_mc_power_savings |
+ sched_smt_power_savings)
+ return SD_POWERSAVINGS_BALANCE;
+
return SD_PREFER_SIBLING;
}
@@ -875,6 +889,7 @@ enum sched_domain_level {
SD_LV_NONE = 0,
SD_LV_SIBLING,
SD_LV_MC,
+ SD_LV_BOOK,
SD_LV_CPU,
SD_LV_NODE,
SD_LV_ALLNODES,
diff -urpN linux-2.6/include/linux/topology.h linux-2.6-patched/include/linux/topology.h
--- linux-2.6/include/linux/topology.h 2010-08-11 13:47:16.000000000 +0200
+++ linux-2.6-patched/include/linux/topology.h 2010-08-11 13:47:23.000000000 +0200
@@ -201,6 +201,12 @@ int arch_update_cpu_topology(void);
.balance_interval = 64, \
}
+#ifdef CONFIG_SCHED_BOOK
+#ifndef SD_BOOK_INIT
+#error Please define an appropriate SD_BOOK_INIT in include/asm/topology.h!!!
+#endif
+#endif /* CONFIG_SCHED_BOOK */
+
#ifdef CONFIG_NUMA
#ifndef SD_NODE_INIT
#error Please define an appropriate SD_NODE_INIT in include/asm/topology.h!!!
diff -urpN linux-2.6/kernel/sched.c linux-2.6-patched/kernel/sched.c
--- linux-2.6/kernel/sched.c 2010-08-11 13:47:23.000000000 +0200
+++ linux-2.6-patched/kernel/sched.c 2010-08-11 13:47:23.000000000 +0200
@@ -6472,7 +6472,9 @@ static void sched_domain_node_span(int n
}
#endif /* CONFIG_NUMA */
-int sched_smt_power_savings = 0, sched_mc_power_savings = 0;
+int sched_smt_power_savings;
+int sched_mc_power_savings;
+int sched_book_power_savings;
/*
* The cpus mask in sched_group and sched_domain hangs off the end.
@@ -6500,6 +6502,7 @@ struct s_data {
cpumask_var_t nodemask;
cpumask_var_t this_sibling_map;
cpumask_var_t this_core_map;
+ cpumask_var_t this_book_map;
cpumask_var_t send_covered;
cpumask_var_t tmpmask;
struct sched_group **sched_group_nodes;
@@ -6511,6 +6514,7 @@ enum s_alloc {
sa_rootdomain,
sa_tmpmask,
sa_send_covered,
+ sa_this_book_map,
sa_this_core_map,
sa_this_sibling_map,
sa_nodemask,
@@ -6564,6 +6568,31 @@ cpu_to_core_group(int cpu, const struct
}
#endif /* CONFIG_SCHED_MC */
+/*
+ * book sched-domains:
+ */
+#ifdef CONFIG_SCHED_BOOK
+static DEFINE_PER_CPU(struct static_sched_domain, book_domains);
+static DEFINE_PER_CPU(struct static_sched_group, sched_group_book);
+
+static int
+cpu_to_book_group(int cpu, const struct cpumask *cpu_map,
+ struct sched_group **sg, struct cpumask *mask)
+{
+ int group = cpu;
+#ifdef CONFIG_SCHED_MC
+ cpumask_and(mask, cpu_coregroup_mask(cpu), cpu_map);
+ group = cpumask_first(mask);
+#elif defined(CONFIG_SCHED_SMT)
+ cpumask_and(mask, topology_thread_cpumask(cpu), cpu_map);
+ group = cpumask_first(mask);
+#endif
+ if (sg)
+ *sg = &per_cpu(sched_group_book, group).sg;
+ return group;
+}
+#endif /* CONFIG_SCHED_BOOK */
+
static DEFINE_PER_CPU(struct static_sched_domain, phys_domains);
static DEFINE_PER_CPU(struct static_sched_group, sched_group_phys);
@@ -6572,7 +6601,10 @@ cpu_to_phys_group(int cpu, const struct
struct sched_group **sg, struct cpumask *mask)
{
int group;
-#ifdef CONFIG_SCHED_MC
+#ifdef CONFIG_SCHED_BOOK
+ cpumask_and(mask, cpu_book_mask(cpu), cpu_map);
+ group = cpumask_first(mask);
+#elif defined(CONFIG_SCHED_MC)
cpumask_and(mask, cpu_coregroup_mask(cpu), cpu_map);
group = cpumask_first(mask);
#elif defined(CONFIG_SCHED_SMT)
@@ -6833,6 +6865,9 @@ SD_INIT_FUNC(CPU)
#ifdef CONFIG_SCHED_MC
SD_INIT_FUNC(MC)
#endif
+#ifdef CONFIG_SCHED_BOOK
+ SD_INIT_FUNC(BOOK)
+#endif
static int default_relax_domain_level = -1;
@@ -6882,6 +6917,8 @@ static void __free_domain_allocs(struct
free_cpumask_var(d->tmpmask); /* fall through */
case sa_send_covered:
free_cpumask_var(d->send_covered); /* fall through */
+ case sa_this_book_map:
+ free_cpumask_var(d->this_book_map); /* fall through */
case sa_this_core_map:
free_cpumask_var(d->this_core_map); /* fall through */
case sa_this_sibling_map:
@@ -6928,8 +6965,10 @@ static enum s_alloc __visit_domain_alloc
return sa_nodemask;
if (!alloc_cpumask_var(&d->this_core_map, GFP_KERNEL))
return sa_this_sibling_map;
- if (!alloc_cpumask_var(&d->send_covered, GFP_KERNEL))
+ if (!alloc_cpumask_var(&d->this_book_map, GFP_KERNEL))
return sa_this_core_map;
+ if (!alloc_cpumask_var(&d->send_covered, GFP_KERNEL))
+ return sa_this_book_map;
if (!alloc_cpumask_var(&d->tmpmask, GFP_KERNEL))
return sa_send_covered;
d->rd = alloc_rootdomain();
@@ -6987,6 +7026,23 @@ static struct sched_domain *__build_cpu_
return sd;
}
+static struct sched_domain *__build_book_sched_domain(struct s_data *d,
+ const struct cpumask *cpu_map, struct sched_domain_attr *attr,
+ struct sched_domain *parent, int i)
+{
+ struct sched_domain *sd = parent;
+#ifdef CONFIG_SCHED_BOOK
+ sd = &per_cpu(book_domains, i).sd;
+ SD_INIT(sd, BOOK);
+ set_domain_attribute(sd, attr);
+ cpumask_and(sched_domain_span(sd), cpu_map, cpu_book_mask(i));
+ sd->parent = parent;
+ parent->child = sd;
+ cpu_to_book_group(i, cpu_map, &sd->groups, d->tmpmask);
+#endif
+ return sd;
+}
+
static struct sched_domain *__build_mc_sched_domain(struct s_data *d,
const struct cpumask *cpu_map, struct sched_domain_attr *attr,
struct sched_domain *parent, int i)
@@ -7044,6 +7100,15 @@ static void build_sched_groups(struct s_
d->send_covered, d->tmpmask);
break;
#endif
+#ifdef CONFIG_SCHED_BOOK
+ case SD_LV_BOOK: /* set up book groups */
+ cpumask_and(d->this_book_map, cpu_map, cpu_book_mask(cpu));
+ if (cpu == cpumask_first(d->this_book_map))
+ init_sched_build_groups(d->this_book_map, cpu_map,
+ &cpu_to_book_group,
+ d->send_covered, d->tmpmask);
+ break;
+#endif
case SD_LV_CPU: /* set up physical groups */
cpumask_and(d->nodemask, cpumask_of_node(cpu), cpu_map);
if (!cpumask_empty(d->nodemask))
@@ -7091,12 +7156,14 @@ static int __build_sched_domains(const s
sd = __build_numa_sched_domains(&d, cpu_map, attr, i);
sd = __build_cpu_sched_domain(&d, cpu_map, attr, sd, i);
+ sd = __build_book_sched_domain(&d, cpu_map, attr, sd, i);
sd = __build_mc_sched_domain(&d, cpu_map, attr, sd, i);
sd = __build_smt_sched_domain(&d, cpu_map, attr, sd, i);
}
for_each_cpu(i, cpu_map) {
build_sched_groups(&d, SD_LV_SIBLING, cpu_map, i);
+ build_sched_groups(&d, SD_LV_BOOK, cpu_map, i);
build_sched_groups(&d, SD_LV_MC, cpu_map, i);
}
@@ -7127,6 +7194,12 @@ static int __build_sched_domains(const s
init_sched_groups_power(i, sd);
}
#endif
+#ifdef CONFIG_SCHED_BOOK
+ for_each_cpu(i, cpu_map) {
+ sd = &per_cpu(book_domains, i).sd;
+ init_sched_groups_power(i, sd);
+ }
+#endif
for_each_cpu(i, cpu_map) {
sd = &per_cpu(phys_domains, i).sd;
@@ -7152,6 +7225,8 @@ static int __build_sched_domains(const s
sd = &per_cpu(cpu_domains, i).sd;
#elif defined(CONFIG_SCHED_MC)
sd = &per_cpu(core_domains, i).sd;
+#elif defined(CONFIG_SCHED_BOOK)
+ sd = &per_cpu(book_domains, i).sd;
#else
sd = &per_cpu(phys_domains, i).sd;
#endif
@@ -7368,7 +7443,8 @@ match2:
mutex_unlock(&sched_domains_mutex);
}
-#if defined(CONFIG_SCHED_MC) || defined(CONFIG_SCHED_SMT)
+#if defined(CONFIG_SCHED_BOOK) || defined(CONFIG_SCHED_MC) || \
+ defined(CONFIG_SCHED_SMT)
static void arch_reinit_sched_domains(void)
{
get_online_cpus();
@@ -7405,6 +7481,9 @@ static ssize_t sched_power_savings_store
case SD_LV_MC:
sched_mc_power_savings = level;
break;
+ case SD_LV_BOOK:
+ sched_book_power_savings = level;
+ break;
default:
break;
}
@@ -7414,6 +7493,24 @@ static ssize_t sched_power_savings_store
return count;
}
+#ifdef CONFIG_SCHED_BOOK
+static ssize_t sched_book_power_savings_show(struct sysdev_class *class,
+ struct sysdev_class_attribute *attr,
+ char *page)
+{
+ return sprintf(page, "%u\n", sched_book_power_savings);
+}
+static ssize_t sched_book_power_savings_store(struct sysdev_class *class,
+ struct sysdev_class_attribute *attr,
+ const char *buf, size_t count)
+{
+ return sched_power_savings_store(buf, count, SD_LV_BOOK);
+}
+static SYSDEV_CLASS_ATTR(sched_book_power_savings, 0644,
+ sched_book_power_savings_show,
+ sched_book_power_savings_store);
+#endif
+
#ifdef CONFIG_SCHED_MC
static ssize_t sched_mc_power_savings_show(struct sysdev_class *class,
struct sysdev_class_attribute *attr,
@@ -7464,9 +7561,14 @@ int __init sched_create_sysfs_power_savi
err = sysfs_create_file(&cls->kset.kobj,
&attr_sched_mc_power_savings.attr);
#endif
+#ifdef CONFIG_SCHED_BOOK
+ if (!err && book_capable())
+ err = sysfs_create_file(&cls->kset.kobj,
+ &attr_sched_book_power_savings.attr);
+#endif
return err;
}
-#endif /* CONFIG_SCHED_MC || CONFIG_SCHED_SMT */
+#endif /* CONFIG_SCHED_BOOK || CONFIG_SCHED_MC || CONFIG_SCHED_SMT */
/*
* Update cpusets according to cpu_active mask. If cpusets are
diff -urpN linux-2.6/kernel/sched_fair.c linux-2.6-patched/kernel/sched_fair.c
--- linux-2.6/kernel/sched_fair.c 2010-08-11 13:47:16.000000000 +0200
+++ linux-2.6-patched/kernel/sched_fair.c 2010-08-11 13:47:23.000000000 +0200
@@ -2039,7 +2039,8 @@ struct sd_lb_stats {
unsigned long busiest_group_capacity;
int group_imb; /* Is there imbalance in this sd */
-#if defined(CONFIG_SCHED_MC) || defined(CONFIG_SCHED_SMT)
+#if defined(CONFIG_SCHED_BOOK) || defined(CONFIG_SCHED_MC) || \
+ defined(CONFIG_SCHED_SMT)
int power_savings_balance; /* Is powersave balance needed for this sd */
struct sched_group *group_min; /* Least loaded group in sd */
struct sched_group *group_leader; /* Group which relieves group_min */
@@ -2096,8 +2097,8 @@ static inline int get_sd_load_idx(struct
return load_idx;
}
-
-#if defined(CONFIG_SCHED_MC) || defined(CONFIG_SCHED_SMT)
+#if defined(CONFIG_SCHED_BOOK) || defined(CONFIG_SCHED_MC) || \
+ defined(CONFIG_SCHED_SMT)
/**
* init_sd_power_savings_stats - Initialize power savings statistics for
* the given sched_domain, during load balancing.
@@ -2217,7 +2218,7 @@ static inline int check_power_save_busie
return 1;
}
-#else /* CONFIG_SCHED_MC || CONFIG_SCHED_SMT */
+#else /* CONFIG_SCHED_BOOK || CONFIG_SCHED_MC || CONFIG_SCHED_SMT */
static inline void init_sd_power_savings_stats(struct sched_domain *sd,
struct sd_lb_stats *sds, enum cpu_idle_type idle)
{
@@ -2235,7 +2236,7 @@ static inline int check_power_save_busie
{
return 0;
}
-#endif /* CONFIG_SCHED_MC || CONFIG_SCHED_SMT */
+#endif /* CONFIG_SCHED_BOOK || CONFIG_SCHED_MC || CONFIG_SCHED_SMT */
unsigned long default_scale_freq_power(struct sched_domain *sd, int cpu)
* [PATCH/RFC 4/5] [PATCH] topology/sysfs: provide book id and siblings attributes
2010-08-12 17:25 [PATCH/RFC 0/5] sched: add new 'book' scheduling domain Heiko Carstens
` (2 preceding siblings ...)
2010-08-12 17:25 ` [PATCH/RFC 3/5] [PATCH] sched: add book scheduling domain Heiko Carstens
@ 2010-08-12 17:25 ` Heiko Carstens
2010-08-12 17:25 ` [PATCH/RFC 5/5] [PATCH] topology: add z196 cpu topology support Heiko Carstens
2010-08-19 12:22 ` [PATCH/RFC 0/5] sched: add new 'book' scheduling domain Andreas Herrmann
5 siblings, 0 replies; 17+ messages in thread
From: Heiko Carstens @ 2010-08-12 17:25 UTC (permalink / raw)
To: Peter Zijlstra, Mike Galbraith, Ingo Molnar, Suresh Siddha,
Andreas Herrmann
Cc: linux-kernel, Martin Schwidefsky, Heiko Carstens
[-- Attachment #1: 04-topology-sysfs-book.diff --]
[-- Type: text/plain, Size: 4483 bytes --]
From: Heiko Carstens <heiko.carstens@de.ibm.com>
Create attributes
/sys/devices/system/cpu/cpuX/topology/book_id
/sys/devices/system/cpu/cpuX/topology/book_siblings
which show the book id and the book siblings of a cpu.
Unlike the attributes for SMT and MC, these attributes are only present if
CONFIG_SCHED_BOOK is set. There is no reason to pollute sysfs for every
architecture with unused attributes.
Signed-off-by: Heiko Carstens <heiko.carstens@de.ibm.com>
---
Documentation/cputopology.txt | 23 ++++++++++++++++++++---
drivers/base/topology.c | 16 +++++++++++++++-
2 files changed, 35 insertions(+), 4 deletions(-)
diff -urpN linux-2.6/Documentation/cputopology.txt linux-2.6-patched/Documentation/cputopology.txt
--- linux-2.6/Documentation/cputopology.txt 2010-08-02 00:11:14.000000000 +0200
+++ linux-2.6-patched/Documentation/cputopology.txt 2010-08-11 13:47:23.000000000 +0200
@@ -14,25 +14,39 @@ to /proc/cpuinfo.
identifier (rather than the kernel's). The actual value is
architecture and platform dependent.
-3) /sys/devices/system/cpu/cpuX/topology/thread_siblings:
+3) /sys/devices/system/cpu/cpuX/topology/book_id:
+
+ the book ID of cpuX. Typically it is the hardware platform's
+ identifier (rather than the kernel's). The actual value is
+ architecture and platform dependent.
+
+4) /sys/devices/system/cpu/cpuX/topology/thread_siblings:
internel kernel map of cpuX's hardware threads within the same
core as cpuX
-4) /sys/devices/system/cpu/cpuX/topology/core_siblings:
+5) /sys/devices/system/cpu/cpuX/topology/core_siblings:
internal kernel map of cpuX's hardware threads within the same
physical_package_id.
+6) /sys/devices/system/cpu/cpuX/topology/book_siblings:
+
+ internal kernel map of cpuX's hardware threads within the same
+ book_id.
+
To implement it in an architecture-neutral way, a new source file,
-drivers/base/topology.c, is to export the 4 attributes.
+drivers/base/topology.c, is to export the 4 or 6 attributes. The two book
+related sysfs files will only be created if CONFIG_SCHED_BOOK is selected.
For an architecture to support this feature, it must define some of
these macros in include/asm-XXX/topology.h:
#define topology_physical_package_id(cpu)
#define topology_core_id(cpu)
+#define topology_book_id(cpu)
#define topology_thread_cpumask(cpu)
#define topology_core_cpumask(cpu)
+#define topology_book_cpumask(cpu)
The type of **_id is int.
The type of siblings is (const) struct cpumask *.
@@ -45,6 +59,9 @@ not defined by include/asm-XXX/topology.
3) thread_siblings: just the given CPU
4) core_siblings: just the given CPU
+For architectures that don't support books (CONFIG_SCHED_BOOK) there are no
+default definitions for topology_book_id() and topology_book_cpumask().
+
Additionally, CPU topology information is provided under
/sys/devices/system/cpu and includes these files. The internal
source for the output is in brackets ("[]").
diff -urpN linux-2.6/drivers/base/topology.c linux-2.6-patched/drivers/base/topology.c
--- linux-2.6/drivers/base/topology.c 2010-08-02 00:11:14.000000000 +0200
+++ linux-2.6-patched/drivers/base/topology.c 2010-08-11 13:47:23.000000000 +0200
@@ -45,7 +45,8 @@ static ssize_t show_##name(struct sys_de
return sprintf(buf, "%d\n", topology_##name(cpu)); \
}
-#if defined(topology_thread_cpumask) || defined(topology_core_cpumask)
+#if defined(topology_thread_cpumask) || defined(topology_core_cpumask) || \
+ defined(topology_book_cpumask)
static ssize_t show_cpumap(int type, const struct cpumask *mask, char *buf)
{
ptrdiff_t len = PTR_ALIGN(buf + PAGE_SIZE - 1, PAGE_SIZE) - buf;
@@ -114,6 +115,14 @@ define_siblings_show_func(core_cpumask);
define_one_ro_named(core_siblings, show_core_cpumask);
define_one_ro_named(core_siblings_list, show_core_cpumask_list);
+#ifdef CONFIG_SCHED_BOOK
+define_id_show_func(book_id);
+define_one_ro(book_id);
+define_siblings_show_func(book_cpumask);
+define_one_ro_named(book_siblings, show_book_cpumask);
+define_one_ro_named(book_siblings_list, show_book_cpumask_list);
+#endif
+
static struct attribute *default_attrs[] = {
&attr_physical_package_id.attr,
&attr_core_id.attr,
@@ -121,6 +130,11 @@ static struct attribute *default_attrs[]
&attr_thread_siblings_list.attr,
&attr_core_siblings.attr,
&attr_core_siblings_list.attr,
+#ifdef CONFIG_SCHED_BOOK
+ &attr_book_id.attr,
+ &attr_book_siblings.attr,
+ &attr_book_siblings_list.attr,
+#endif
NULL
};
* [PATCH/RFC 5/5] [PATCH] topology: add z196 cpu topology support
2010-08-12 17:25 [PATCH/RFC 0/5] sched: add new 'book' scheduling domain Heiko Carstens
` (3 preceding siblings ...)
2010-08-12 17:25 ` [PATCH/RFC 4/5] [PATCH] topology/sysfs: provide book id and siblings attributes Heiko Carstens
@ 2010-08-12 17:25 ` Heiko Carstens
2010-08-19 12:22 ` [PATCH/RFC 0/5] sched: add new 'book' scheduling domain Andreas Herrmann
5 siblings, 0 replies; 17+ messages in thread
From: Heiko Carstens @ 2010-08-12 17:25 UTC (permalink / raw)
To: Peter Zijlstra, Mike Galbraith, Ingo Molnar, Suresh Siddha,
Andreas Herrmann
Cc: linux-kernel, Martin Schwidefsky, Heiko Carstens
[-- Attachment #1: 05-topology-z196.diff --]
[-- Type: text/plain, Size: 9114 bytes --]
From: Heiko Carstens <heiko.carstens@de.ibm.com>
Use the extended cpu topology information that z196 machines provide
in order to make use of the new book scheduling domain.
Signed-off-by: Heiko Carstens <heiko.carstens@de.ibm.com>
---
arch/s390/Kconfig | 7 +
arch/s390/include/asm/topology.h | 28 ++++++-
arch/s390/kernel/topology.c | 150 ++++++++++++++++++++++++---------------
3 files changed, 124 insertions(+), 61 deletions(-)
diff -urpN linux-2.6/arch/s390/include/asm/topology.h linux-2.6-patched/arch/s390/include/asm/topology.h
--- linux-2.6/arch/s390/include/asm/topology.h 2010-08-11 13:47:13.000000000 +0200
+++ linux-2.6-patched/arch/s390/include/asm/topology.h 2010-08-11 13:47:24.000000000 +0200
@@ -3,15 +3,33 @@
#include <linux/cpumask.h>
-#define mc_capable() (1)
-
-const struct cpumask *cpu_coregroup_mask(unsigned int cpu);
-
extern unsigned char cpu_core_id[NR_CPUS];
extern cpumask_t cpu_core_map[NR_CPUS];
+static inline const struct cpumask *cpu_coregroup_mask(unsigned int cpu)
+{
+ return &cpu_core_map[cpu];
+}
+
#define topology_core_id(cpu) (cpu_core_id[cpu])
#define topology_core_cpumask(cpu) (&cpu_core_map[cpu])
+#define mc_capable() (1)
+
+#ifdef CONFIG_SCHED_BOOK
+
+extern unsigned char cpu_book_id[NR_CPUS];
+extern cpumask_t cpu_book_map[NR_CPUS];
+
+static inline const struct cpumask *cpu_book_mask(unsigned int cpu)
+{
+ return &cpu_book_map[cpu];
+}
+
+#define topology_book_id(cpu) (cpu_book_id[cpu])
+#define topology_book_cpumask(cpu) (&cpu_book_map[cpu])
+#define book_capable() (1)
+
+#endif /* CONFIG_SCHED_BOOK */
int topology_set_cpu_management(int fc);
void topology_schedule_update(void);
@@ -30,6 +48,8 @@ static inline void s390_init_cpu_topolog
};
#endif
+#define SD_BOOK_INIT SD_CPU_INIT
+
#include <asm-generic/topology.h>
#endif /* _ASM_S390_TOPOLOGY_H */
diff -urpN linux-2.6/arch/s390/Kconfig linux-2.6-patched/arch/s390/Kconfig
--- linux-2.6/arch/s390/Kconfig 2010-08-11 13:47:13.000000000 +0200
+++ linux-2.6-patched/arch/s390/Kconfig 2010-08-11 13:47:24.000000000 +0200
@@ -198,6 +198,13 @@ config HOTPLUG_CPU
can be controlled through /sys/devices/system/cpu/cpu#.
Say N if you want to disable CPU hotplug.
+config SCHED_BOOK
+ bool "Book scheduler support"
+ depends on SMP
+ help
+ Book scheduler support improves the CPU scheduler's decision making
+ when dealing with machines that have several books.
+
config MATHEMU
bool "IEEE FPU emulation"
depends on MARCH_G5
diff -urpN linux-2.6/arch/s390/kernel/topology.c linux-2.6-patched/arch/s390/kernel/topology.c
--- linux-2.6/arch/s390/kernel/topology.c 2010-08-02 00:11:14.000000000 +0200
+++ linux-2.6-patched/arch/s390/kernel/topology.c 2010-08-11 13:47:24.000000000 +0200
@@ -57,8 +57,8 @@ struct tl_info {
union tl_entry tle[0];
};
-struct core_info {
- struct core_info *next;
+struct mask_info {
+ struct mask_info *next;
unsigned char id;
cpumask_t mask;
};
@@ -66,7 +66,6 @@ struct core_info {
static int topology_enabled;
static void topology_work_fn(struct work_struct *work);
static struct tl_info *tl_info;
-static struct core_info core_info;
static int machine_has_topology;
static struct timer_list topology_timer;
static void set_topology_timer(void);
@@ -74,38 +73,37 @@ static DECLARE_WORK(topology_work, topol
/* topology_lock protects the core linked list */
static DEFINE_SPINLOCK(topology_lock);
+static struct mask_info core_info;
cpumask_t cpu_core_map[NR_CPUS];
unsigned char cpu_core_id[NR_CPUS];
-static cpumask_t cpu_coregroup_map(unsigned int cpu)
+#ifdef CONFIG_SCHED_BOOK
+static struct mask_info book_info;
+cpumask_t cpu_book_map[NR_CPUS];
+unsigned char cpu_book_id[NR_CPUS];
+#endif
+
+static cpumask_t cpu_group_map(struct mask_info *info, unsigned int cpu)
{
- struct core_info *core = &core_info;
- unsigned long flags;
cpumask_t mask;
cpus_clear(mask);
if (!topology_enabled || !machine_has_topology)
return cpu_possible_map;
- spin_lock_irqsave(&topology_lock, flags);
- while (core) {
- if (cpu_isset(cpu, core->mask)) {
- mask = core->mask;
+ while (info) {
+ if (cpu_isset(cpu, info->mask)) {
+ mask = info->mask;
break;
}
- core = core->next;
+ info = info->next;
}
- spin_unlock_irqrestore(&topology_lock, flags);
if (cpus_empty(mask))
mask = cpumask_of_cpu(cpu);
return mask;
}
-const struct cpumask *cpu_coregroup_mask(unsigned int cpu)
-{
- return &cpu_core_map[cpu];
-}
-
-static void add_cpus_to_core(struct tl_cpu *tl_cpu, struct core_info *core)
+static void add_cpus_to_mask(struct tl_cpu *tl_cpu, struct mask_info *book,
+ struct mask_info *core)
{
unsigned int cpu;
@@ -117,23 +115,35 @@ static void add_cpus_to_core(struct tl_c
rcpu = CPU_BITS - 1 - cpu + tl_cpu->origin;
for_each_present_cpu(lcpu) {
- if (cpu_logical_map(lcpu) == rcpu) {
- cpu_set(lcpu, core->mask);
- cpu_core_id[lcpu] = core->id;
- smp_cpu_polarization[lcpu] = tl_cpu->pp;
- }
+ if (cpu_logical_map(lcpu) != rcpu)
+ continue;
+#ifdef CONFIG_SCHED_BOOK
+ cpu_set(lcpu, book->mask);
+ cpu_book_id[lcpu] = book->id;
+#endif
+ cpu_set(lcpu, core->mask);
+ cpu_core_id[lcpu] = core->id;
+ smp_cpu_polarization[lcpu] = tl_cpu->pp;
}
}
}
-static void clear_cores(void)
+static void clear_masks(void)
{
- struct core_info *core = &core_info;
+ struct mask_info *info;
- while (core) {
- cpus_clear(core->mask);
- core = core->next;
+ info = &core_info;
+ while (info) {
+ cpus_clear(info->mask);
+ info = info->next;
+ }
+#ifdef CONFIG_SCHED_BOOK
+ info = &book_info;
+ while (info) {
+ cpus_clear(info->mask);
+ info = info->next;
}
+#endif
}
static union tl_entry *next_tle(union tl_entry *tle)
@@ -146,29 +156,36 @@ static union tl_entry *next_tle(union tl
static void tl_to_cores(struct tl_info *info)
{
+#ifdef CONFIG_SCHED_BOOK
+ struct mask_info *book = &book_info;
+#else
+ struct mask_info *book = NULL;
+#endif
+ struct mask_info *core = &core_info;
union tl_entry *tle, *end;
- struct core_info *core = &core_info;
+
spin_lock_irq(&topology_lock);
- clear_cores();
+ clear_masks();
tle = info->tle;
end = (union tl_entry *)((unsigned long)info + info->length);
while (tle < end) {
switch (tle->nl) {
- case 5:
- case 4:
- case 3:
+#ifdef CONFIG_SCHED_BOOK
case 2:
+ book = book->next;
+ book->id = tle->container.id;
break;
+#endif
case 1:
core = core->next;
core->id = tle->container.id;
break;
case 0:
- add_cpus_to_core(&tle->cpu, core);
+ add_cpus_to_mask(&tle->cpu, book, core);
break;
default:
- clear_cores();
+ clear_masks();
machine_has_topology = 0;
goto out;
}
@@ -221,10 +238,29 @@ int topology_set_cpu_management(int fc)
static void update_cpu_core_map(void)
{
+ unsigned long flags;
int cpu;
- for_each_possible_cpu(cpu)
- cpu_core_map[cpu] = cpu_coregroup_map(cpu);
+ spin_lock_irqsave(&topology_lock, flags);
+ for_each_possible_cpu(cpu) {
+ cpu_core_map[cpu] = cpu_group_map(&core_info, cpu);
+#ifdef CONFIG_SCHED_BOOK
+ cpu_book_map[cpu] = cpu_group_map(&book_info, cpu);
+#endif
+ }
+ spin_unlock_irqrestore(&topology_lock, flags);
+}
+
+static void store_topology(struct tl_info *info)
+{
+#ifdef CONFIG_SCHED_BOOK
+ int rc;
+
+ rc = stsi(info, 15, 1, 3);
+ if (rc != -ENOSYS)
+ return;
+#endif
+ stsi(info, 15, 1, 2);
}
int arch_update_cpu_topology(void)
@@ -238,7 +274,7 @@ int arch_update_cpu_topology(void)
topology_update_polarization_simple();
return 0;
}
- stsi(info, 15, 1, 2);
+ store_topology(info);
tl_to_cores(info);
update_cpu_core_map();
for_each_online_cpu(cpu) {
@@ -299,12 +335,24 @@ out:
}
__initcall(init_topology_update);
+static void alloc_masks(struct tl_info *info, struct mask_info *mask, int offset)
+{
+ int i, nr_masks;
+
+ nr_masks = info->mag[NR_MAG - offset];
+ for (i = 0; i < info->mnest - offset; i++)
+ nr_masks *= info->mag[NR_MAG - offset - 1 - i];
+ nr_masks = max(nr_masks, 1);
+ for (i = 0; i < nr_masks; i++) {
+ mask->next = alloc_bootmem(sizeof(struct mask_info));
+ mask = mask->next;
+ }
+}
+
void __init s390_init_cpu_topology(void)
{
unsigned long long facility_bits;
struct tl_info *info;
- struct core_info *core;
- int nr_cores;
int i;
if (stfle(&facility_bits, 1) <= 0)
@@ -315,25 +363,13 @@ void __init s390_init_cpu_topology(void)
tl_info = alloc_bootmem_pages(PAGE_SIZE);
info = tl_info;
- stsi(info, 15, 1, 2);
-
- nr_cores = info->mag[NR_MAG - 2];
- for (i = 0; i < info->mnest - 2; i++)
- nr_cores *= info->mag[NR_MAG - 3 - i];
-
+ store_topology(info);
pr_info("The CPU configuration topology of the machine is:");
for (i = 0; i < NR_MAG; i++)
printk(" %d", info->mag[i]);
printk(" / %d\n", info->mnest);
-
- core = &core_info;
- for (i = 0; i < nr_cores; i++) {
- core->next = alloc_bootmem(sizeof(struct core_info));
- core = core->next;
- if (!core)
- goto error;
- }
- return;
-error:
- machine_has_topology = 0;
+ alloc_masks(info, &core_info, 2);
+#ifdef CONFIG_SCHED_BOOK
+ alloc_masks(info, &book_info, 3);
+#endif
}
* Re: [PATCH/RFC 1/5] [PATCH] sched: merge cpu_to_core_group functions
2010-08-12 17:25 ` [PATCH/RFC 1/5] [PATCH] sched: merge cpu_to_core_group functions Heiko Carstens
@ 2010-08-13 21:11 ` Suresh Siddha
2010-08-31 8:26 ` Heiko Carstens
0 siblings, 1 reply; 17+ messages in thread
From: Suresh Siddha @ 2010-08-13 21:11 UTC (permalink / raw)
To: Heiko Carstens
Cc: Peter Zijlstra, Mike Galbraith, Ingo Molnar, Andreas Herrmann,
linux-kernel@vger.kernel.org, Martin Schwidefsky
On Thu, 2010-08-12 at 10:25 -0700, Heiko Carstens wrote:
> From: Heiko Carstens <heiko.carstens@de.ibm.com>
>
> Merge and simplify the two cpu_to_core_group variants so that the
> resulting function follows the same pattern like cpu_to_phys_group.
>
> Signed-off-by: Heiko Carstens <heiko.carstens@de.ibm.com>
> ---
>
> kernel/sched.c | 18 +++++-------------
> 1 file changed, 5 insertions(+), 13 deletions(-)
>
> diff -urpN linux-2.6/kernel/sched.c linux-2.6-patched/kernel/sched.c
> --- linux-2.6/kernel/sched.c 2010-08-11 13:47:16.000000000 +0200
> +++ linux-2.6-patched/kernel/sched.c 2010-08-11 13:47:22.000000000 +0200
> @@ -6546,31 +6546,23 @@ cpu_to_cpu_group(int cpu, const struct c
> #ifdef CONFIG_SCHED_MC
> static DEFINE_PER_CPU(struct static_sched_domain, core_domains);
> static DEFINE_PER_CPU(struct static_sched_group, sched_group_core);
> -#endif /* CONFIG_SCHED_MC */
>
> -#if defined(CONFIG_SCHED_MC) && defined(CONFIG_SCHED_SMT)
> static int
> cpu_to_core_group(int cpu, const struct cpumask *cpu_map,
> struct sched_group **sg, struct cpumask *mask)
> {
> int group;
> -
> +#ifdef CONFIG_SCHED_SMT
> cpumask_and(mask, topology_thread_cpumask(cpu), cpu_map);
> group = cpumask_first(mask);
> +#else
> + group = cpu;
> +#endif
> if (sg)
> *sg = &per_cpu(sched_group_core, group).sg;
> return group;
> }
> -#elif defined(CONFIG_SCHED_MC)
> -static int
> -cpu_to_core_group(int cpu, const struct cpumask *cpu_map,
> - struct sched_group **sg, struct cpumask *unused)
> -{
> - if (sg)
> - *sg = &per_cpu(sched_group_core, cpu).sg;
> - return cpu;
> -}
> -#endif
> +#endif /* CONFIG_SCHED_MC */
>
> static DEFINE_PER_CPU(struct static_sched_domain, phys_domains);
> static DEFINE_PER_CPU(struct static_sched_group, sched_group_phys);
Reason why this code was structured like this was because of the
feedback from Andrew Morton. http://lkml.org/lkml/2006/1/27/308
Maybe we can further clean all this code up as part of your new
proposal. I can help with some of this. Thanks.
* Re: [PATCH/RFC 2/5] [PATCH] sched: pass sched_domain_level to sched_power_savings_store
2010-08-12 17:25 ` [PATCH/RFC 2/5] [PATCH] sched: pass sched_domain_level to sched_power_savings_store Heiko Carstens
@ 2010-08-13 21:13 ` Suresh Siddha
2010-08-19 11:36 ` Andreas Herrmann
2010-08-16 8:29 ` Peter Zijlstra
1 sibling, 1 reply; 17+ messages in thread
From: Suresh Siddha @ 2010-08-13 21:13 UTC (permalink / raw)
To: Heiko Carstens
Cc: Peter Zijlstra, Mike Galbraith, Ingo Molnar, Andreas Herrmann,
linux-kernel@vger.kernel.org, Martin Schwidefsky
On Thu, 2010-08-12 at 10:25 -0700, Heiko Carstens wrote:
> From: Heiko Carstens <heiko.carstens@de.ibm.com>
>
> Pass the corresponding sched domain level to sched_power_savings_store instead
> of a yes/no flag which indicates if the level is SMT or MC.
> This is needed to easily extend the function so it can be used for a third
> level.
>
> Signed-off-by: Heiko Carstens <heiko.carstens@de.ibm.com>
Acked-by: Suresh Siddha <suresh.b.siddha@intel.com>
> ---
>
> kernel/sched.c | 17 ++++++++++++-----
> 1 file changed, 12 insertions(+), 5 deletions(-)
>
> diff -urpN linux-2.6/kernel/sched.c linux-2.6-patched/kernel/sched.c
> --- linux-2.6/kernel/sched.c 2010-08-11 13:47:22.000000000 +0200
> +++ linux-2.6-patched/kernel/sched.c 2010-08-11 13:47:22.000000000 +0200
> @@ -7380,7 +7380,8 @@ static void arch_reinit_sched_domains(vo
> put_online_cpus();
> }
>
> -static ssize_t sched_power_savings_store(const char *buf, size_t count, int smt)
> +static ssize_t sched_power_savings_store(const char *buf, size_t count,
> + enum sched_domain_level sd_level)
> {
> unsigned int level = 0;
>
> @@ -7397,10 +7398,16 @@ static ssize_t sched_power_savings_store
> if (level >= MAX_POWERSAVINGS_BALANCE_LEVELS)
> return -EINVAL;
>
> - if (smt)
> + switch (sd_level) {
> + case SD_LV_SIBLING:
> sched_smt_power_savings = level;
> - else
> + break;
> + case SD_LV_MC:
> sched_mc_power_savings = level;
> + break;
> + default:
> + break;
> + }
>
> arch_reinit_sched_domains();
>
> @@ -7418,7 +7425,7 @@ static ssize_t sched_mc_power_savings_st
> struct sysdev_class_attribute *attr,
> const char *buf, size_t count)
> {
> - return sched_power_savings_store(buf, count, 0);
> + return sched_power_savings_store(buf, count, SD_LV_MC);
> }
> static SYSDEV_CLASS_ATTR(sched_mc_power_savings, 0644,
> sched_mc_power_savings_show,
> @@ -7436,7 +7443,7 @@ static ssize_t sched_smt_power_savings_s
> struct sysdev_class_attribute *attr,
> const char *buf, size_t count)
> {
> - return sched_power_savings_store(buf, count, 1);
> + return sched_power_savings_store(buf, count, SD_LV_SIBLING);
> }
> static SYSDEV_CLASS_ATTR(sched_smt_power_savings, 0644,
> sched_smt_power_savings_show,
>
* Re: [PATCH/RFC 3/5] [PATCH] sched: add book scheduling domain
2010-08-12 17:25 ` [PATCH/RFC 3/5] [PATCH] sched: add book scheduling domain Heiko Carstens
@ 2010-08-13 21:22 ` Suresh Siddha
2010-08-16 8:48 ` Peter Zijlstra
0 siblings, 1 reply; 17+ messages in thread
From: Suresh Siddha @ 2010-08-13 21:22 UTC (permalink / raw)
To: Heiko Carstens
Cc: Peter Zijlstra, Mike Galbraith, Ingo Molnar, Andreas Herrmann,
linux-kernel@vger.kernel.org, Martin Schwidefsky
On Thu, 2010-08-12 at 10:25 -0700, Heiko Carstens wrote:
> From: Heiko Carstens <heiko.carstens@de.ibm.com>
>
> On top of the SMT and MC scheduling domains this adds the BOOK scheduling
> domain. This is useful for machines that have a four level cache hierarchy
> but do not fall into the NUMA category.
>
> Signed-off-by: Heiko Carstens <heiko.carstens@de.ibm.com>
PeterZ had some ideas in cleaning up the sched domain setup to avoid
this maze of #ifdef's. I will let him comment on this.
thanks,
suresh
> ---
>
> arch/s390/defconfig | 1
> include/linux/sched.h | 19 +++++++
> include/linux/topology.h | 6 ++
> kernel/sched.c | 112 ++++++++++++++++++++++++++++++++++++++++++++---
> kernel/sched_fair.c | 11 ++--
> 5 files changed, 137 insertions(+), 12 deletions(-)
>
> diff -urpN linux-2.6/arch/s390/defconfig linux-2.6-patched/arch/s390/defconfig
> --- linux-2.6/arch/s390/defconfig 2010-08-02 00:11:14.000000000 +0200
> +++ linux-2.6-patched/arch/s390/defconfig 2010-08-11 13:47:23.000000000 +0200
> @@ -248,6 +248,7 @@ CONFIG_64BIT=y
> CONFIG_SMP=y
> CONFIG_NR_CPUS=32
> CONFIG_HOTPLUG_CPU=y
> +# CONFIG_SCHED_BOOK is not set
> CONFIG_COMPAT=y
> CONFIG_SYSVIPC_COMPAT=y
> CONFIG_AUDIT_ARCH=y
> diff -urpN linux-2.6/include/linux/sched.h linux-2.6-patched/include/linux/sched.h
> --- linux-2.6/include/linux/sched.h 2010-08-11 13:47:16.000000000 +0200
> +++ linux-2.6-patched/include/linux/sched.h 2010-08-11 13:47:23.000000000 +0200
> @@ -807,7 +807,9 @@ enum powersavings_balance_level {
> MAX_POWERSAVINGS_BALANCE_LEVELS
> };
>
> -extern int sched_mc_power_savings, sched_smt_power_savings;
> +extern int sched_smt_power_savings;
> +extern int sched_mc_power_savings;
> +extern int sched_book_power_savings;
>
> static inline int sd_balance_for_mc_power(void)
> {
> @@ -820,11 +822,23 @@ static inline int sd_balance_for_mc_powe
> return 0;
> }
>
> -static inline int sd_balance_for_package_power(void)
> +static inline int sd_balance_for_book_power(void)
> {
> if (sched_mc_power_savings | sched_smt_power_savings)
> return SD_POWERSAVINGS_BALANCE;
>
> + if (!sched_book_power_savings)
> + return SD_PREFER_SIBLING;
> +
> + return 0;
> +}
> +
> +static inline int sd_balance_for_package_power(void)
> +{
> + if (sched_book_power_savings | sched_mc_power_savings |
> + sched_smt_power_savings)
> + return SD_POWERSAVINGS_BALANCE;
> +
> return SD_PREFER_SIBLING;
> }
>
> @@ -875,6 +889,7 @@ enum sched_domain_level {
> SD_LV_NONE = 0,
> SD_LV_SIBLING,
> SD_LV_MC,
> + SD_LV_BOOK,
> SD_LV_CPU,
> SD_LV_NODE,
> SD_LV_ALLNODES,
> diff -urpN linux-2.6/include/linux/topology.h linux-2.6-patched/include/linux/topology.h
> --- linux-2.6/include/linux/topology.h 2010-08-11 13:47:16.000000000 +0200
> +++ linux-2.6-patched/include/linux/topology.h 2010-08-11 13:47:23.000000000 +0200
> @@ -201,6 +201,12 @@ int arch_update_cpu_topology(void);
> .balance_interval = 64, \
> }
>
> +#ifdef CONFIG_SCHED_BOOK
> +#ifndef SD_BOOK_INIT
> +#error Please define an appropriate SD_BOOK_INIT in include/asm/topology.h!!!
> +#endif
> +#endif /* CONFIG_SCHED_BOOK */
> +
> #ifdef CONFIG_NUMA
> #ifndef SD_NODE_INIT
> #error Please define an appropriate SD_NODE_INIT in include/asm/topology.h!!!
> diff -urpN linux-2.6/kernel/sched.c linux-2.6-patched/kernel/sched.c
> --- linux-2.6/kernel/sched.c 2010-08-11 13:47:23.000000000 +0200
> +++ linux-2.6-patched/kernel/sched.c 2010-08-11 13:47:23.000000000 +0200
> @@ -6472,7 +6472,9 @@ static void sched_domain_node_span(int n
> }
> #endif /* CONFIG_NUMA */
>
> -int sched_smt_power_savings = 0, sched_mc_power_savings = 0;
> +int sched_smt_power_savings;
> +int sched_mc_power_savings;
> +int sched_book_power_savings;
>
> /*
> * The cpus mask in sched_group and sched_domain hangs off the end.
> @@ -6500,6 +6502,7 @@ struct s_data {
> cpumask_var_t nodemask;
> cpumask_var_t this_sibling_map;
> cpumask_var_t this_core_map;
> + cpumask_var_t this_book_map;
> cpumask_var_t send_covered;
> cpumask_var_t tmpmask;
> struct sched_group **sched_group_nodes;
> @@ -6511,6 +6514,7 @@ enum s_alloc {
> sa_rootdomain,
> sa_tmpmask,
> sa_send_covered,
> + sa_this_book_map,
> sa_this_core_map,
> sa_this_sibling_map,
> sa_nodemask,
> @@ -6564,6 +6568,31 @@ cpu_to_core_group(int cpu, const struct
> }
> #endif /* CONFIG_SCHED_MC */
>
> +/*
> + * book sched-domains:
> + */
> +#ifdef CONFIG_SCHED_BOOK
> +static DEFINE_PER_CPU(struct static_sched_domain, book_domains);
> +static DEFINE_PER_CPU(struct static_sched_group, sched_group_book);
> +
> +static int
> +cpu_to_book_group(int cpu, const struct cpumask *cpu_map,
> + struct sched_group **sg, struct cpumask *mask)
> +{
> + int group = cpu;
> +#ifdef CONFIG_SCHED_MC
> + cpumask_and(mask, cpu_coregroup_mask(cpu), cpu_map);
> + group = cpumask_first(mask);
> +#elif defined(CONFIG_SCHED_SMT)
> + cpumask_and(mask, topology_thread_cpumask(cpu), cpu_map);
> + group = cpumask_first(mask);
> +#endif
> + if (sg)
> + *sg = &per_cpu(sched_group_book, group).sg;
> + return group;
> +}
> +#endif /* CONFIG_SCHED_BOOK */
> +
> static DEFINE_PER_CPU(struct static_sched_domain, phys_domains);
> static DEFINE_PER_CPU(struct static_sched_group, sched_group_phys);
>
> @@ -6572,7 +6601,10 @@ cpu_to_phys_group(int cpu, const struct
> struct sched_group **sg, struct cpumask *mask)
> {
> int group;
> -#ifdef CONFIG_SCHED_MC
> +#ifdef CONFIG_SCHED_BOOK
> + cpumask_and(mask, cpu_book_mask(cpu), cpu_map);
> + group = cpumask_first(mask);
> +#elif defined(CONFIG_SCHED_MC)
> cpumask_and(mask, cpu_coregroup_mask(cpu), cpu_map);
> group = cpumask_first(mask);
> #elif defined(CONFIG_SCHED_SMT)
> @@ -6833,6 +6865,9 @@ SD_INIT_FUNC(CPU)
> #ifdef CONFIG_SCHED_MC
> SD_INIT_FUNC(MC)
> #endif
> +#ifdef CONFIG_SCHED_BOOK
> + SD_INIT_FUNC(BOOK)
> +#endif
>
> static int default_relax_domain_level = -1;
>
> @@ -6882,6 +6917,8 @@ static void __free_domain_allocs(struct
> free_cpumask_var(d->tmpmask); /* fall through */
> case sa_send_covered:
> free_cpumask_var(d->send_covered); /* fall through */
> + case sa_this_book_map:
> + free_cpumask_var(d->this_book_map); /* fall through */
> case sa_this_core_map:
> free_cpumask_var(d->this_core_map); /* fall through */
> case sa_this_sibling_map:
> @@ -6928,8 +6965,10 @@ static enum s_alloc __visit_domain_alloc
> return sa_nodemask;
> if (!alloc_cpumask_var(&d->this_core_map, GFP_KERNEL))
> return sa_this_sibling_map;
> - if (!alloc_cpumask_var(&d->send_covered, GFP_KERNEL))
> + if (!alloc_cpumask_var(&d->this_book_map, GFP_KERNEL))
> return sa_this_core_map;
> + if (!alloc_cpumask_var(&d->send_covered, GFP_KERNEL))
> + return sa_this_book_map;
> if (!alloc_cpumask_var(&d->tmpmask, GFP_KERNEL))
> return sa_send_covered;
> d->rd = alloc_rootdomain();
> @@ -6987,6 +7026,23 @@ static struct sched_domain *__build_cpu_
> return sd;
> }
>
> +static struct sched_domain *__build_book_sched_domain(struct s_data *d,
> + const struct cpumask *cpu_map, struct sched_domain_attr *attr,
> + struct sched_domain *parent, int i)
> +{
> + struct sched_domain *sd = parent;
> +#ifdef CONFIG_SCHED_BOOK
> + sd = &per_cpu(book_domains, i).sd;
> + SD_INIT(sd, BOOK);
> + set_domain_attribute(sd, attr);
> + cpumask_and(sched_domain_span(sd), cpu_map, cpu_book_mask(i));
> + sd->parent = parent;
> + parent->child = sd;
> + cpu_to_book_group(i, cpu_map, &sd->groups, d->tmpmask);
> +#endif
> + return sd;
> +}
> +
> static struct sched_domain *__build_mc_sched_domain(struct s_data *d,
> const struct cpumask *cpu_map, struct sched_domain_attr *attr,
> struct sched_domain *parent, int i)
> @@ -7044,6 +7100,15 @@ static void build_sched_groups(struct s_
> d->send_covered, d->tmpmask);
> break;
> #endif
> +#ifdef CONFIG_SCHED_BOOK
> + case SD_LV_BOOK: /* set up book groups */
> + cpumask_and(d->this_book_map, cpu_map, cpu_book_mask(cpu));
> + if (cpu == cpumask_first(d->this_book_map))
> + init_sched_build_groups(d->this_book_map, cpu_map,
> + &cpu_to_book_group,
> + d->send_covered, d->tmpmask);
> + break;
> +#endif
> case SD_LV_CPU: /* set up physical groups */
> cpumask_and(d->nodemask, cpumask_of_node(cpu), cpu_map);
> if (!cpumask_empty(d->nodemask))
> @@ -7091,12 +7156,14 @@ static int __build_sched_domains(const s
>
> sd = __build_numa_sched_domains(&d, cpu_map, attr, i);
> sd = __build_cpu_sched_domain(&d, cpu_map, attr, sd, i);
> + sd = __build_book_sched_domain(&d, cpu_map, attr, sd, i);
> sd = __build_mc_sched_domain(&d, cpu_map, attr, sd, i);
> sd = __build_smt_sched_domain(&d, cpu_map, attr, sd, i);
> }
>
> for_each_cpu(i, cpu_map) {
> build_sched_groups(&d, SD_LV_SIBLING, cpu_map, i);
> + build_sched_groups(&d, SD_LV_BOOK, cpu_map, i);
> build_sched_groups(&d, SD_LV_MC, cpu_map, i);
> }
>
> @@ -7127,6 +7194,12 @@ static int __build_sched_domains(const s
> init_sched_groups_power(i, sd);
> }
> #endif
> +#ifdef CONFIG_SCHED_BOOK
> + for_each_cpu(i, cpu_map) {
> + sd = &per_cpu(book_domains, i).sd;
> + init_sched_groups_power(i, sd);
> + }
> +#endif
>
> for_each_cpu(i, cpu_map) {
> sd = &per_cpu(phys_domains, i).sd;
> @@ -7152,6 +7225,8 @@ static int __build_sched_domains(const s
> sd = &per_cpu(cpu_domains, i).sd;
> #elif defined(CONFIG_SCHED_MC)
> sd = &per_cpu(core_domains, i).sd;
> +#elif defined(CONFIG_SCHED_BOOK)
> + sd = &per_cpu(book_domains, i).sd;
> #else
> sd = &per_cpu(phys_domains, i).sd;
> #endif
> @@ -7368,7 +7443,8 @@ match2:
> mutex_unlock(&sched_domains_mutex);
> }
>
> -#if defined(CONFIG_SCHED_MC) || defined(CONFIG_SCHED_SMT)
> +#if defined(CONFIG_SCHED_BOOK) || defined(CONFIG_SCHED_MC) || \
> + defined(CONFIG_SCHED_SMT)
> static void arch_reinit_sched_domains(void)
> {
> get_online_cpus();
> @@ -7405,6 +7481,9 @@ static ssize_t sched_power_savings_store
> case SD_LV_MC:
> sched_mc_power_savings = level;
> break;
> + case SD_LV_BOOK:
> + sched_book_power_savings = level;
> + break;
> default:
> break;
> }
> @@ -7414,6 +7493,24 @@ static ssize_t sched_power_savings_store
> return count;
> }
>
> +#ifdef CONFIG_SCHED_BOOK
> +static ssize_t sched_book_power_savings_show(struct sysdev_class *class,
> + struct sysdev_class_attribute *attr,
> + char *page)
> +{
> + return sprintf(page, "%u\n", sched_book_power_savings);
> +}
> +static ssize_t sched_book_power_savings_store(struct sysdev_class *class,
> + struct sysdev_class_attribute *attr,
> + const char *buf, size_t count)
> +{
> + return sched_power_savings_store(buf, count, SD_LV_BOOK);
> +}
> +static SYSDEV_CLASS_ATTR(sched_book_power_savings, 0644,
> + sched_book_power_savings_show,
> + sched_book_power_savings_store);
> +#endif
> +
> #ifdef CONFIG_SCHED_MC
> static ssize_t sched_mc_power_savings_show(struct sysdev_class *class,
> struct sysdev_class_attribute *attr,
> @@ -7464,9 +7561,14 @@ int __init sched_create_sysfs_power_savi
> err = sysfs_create_file(&cls->kset.kobj,
> &attr_sched_mc_power_savings.attr);
> #endif
> +#ifdef CONFIG_SCHED_BOOK
> + if (!err && book_capable())
> + err = sysfs_create_file(&cls->kset.kobj,
> + &attr_sched_book_power_savings.attr);
> +#endif
> return err;
> }
> -#endif /* CONFIG_SCHED_MC || CONFIG_SCHED_SMT */
> +#endif /* CONFIG_SCHED_BOOK || CONFIG_SCHED_MC || CONFIG_SCHED_SMT */
>
> /*
> * Update cpusets according to cpu_active mask. If cpusets are
> diff -urpN linux-2.6/kernel/sched_fair.c linux-2.6-patched/kernel/sched_fair.c
> --- linux-2.6/kernel/sched_fair.c 2010-08-11 13:47:16.000000000 +0200
> +++ linux-2.6-patched/kernel/sched_fair.c 2010-08-11 13:47:23.000000000 +0200
> @@ -2039,7 +2039,8 @@ struct sd_lb_stats {
> unsigned long busiest_group_capacity;
>
> int group_imb; /* Is there imbalance in this sd */
> -#if defined(CONFIG_SCHED_MC) || defined(CONFIG_SCHED_SMT)
> +#if defined(CONFIG_SCHED_BOOK) || defined(CONFIG_SCHED_MC) || \
> + defined(CONFIG_SCHED_SMT)
> int power_savings_balance; /* Is powersave balance needed for this sd */
> struct sched_group *group_min; /* Least loaded group in sd */
> struct sched_group *group_leader; /* Group which relieves group_min */
> @@ -2096,8 +2097,8 @@ static inline int get_sd_load_idx(struct
> return load_idx;
> }
>
> -
> -#if defined(CONFIG_SCHED_MC) || defined(CONFIG_SCHED_SMT)
> +#if defined(CONFIG_SCHED_BOOK) || defined(CONFIG_SCHED_MC) || \
> + defined(CONFIG_SCHED_SMT)
> /**
> * init_sd_power_savings_stats - Initialize power savings statistics for
> * the given sched_domain, during load balancing.
> @@ -2217,7 +2218,7 @@ static inline int check_power_save_busie
> return 1;
>
> }
> -#else /* CONFIG_SCHED_MC || CONFIG_SCHED_SMT */
> +#else /* CONFIG_SCHED_BOOK || CONFIG_SCHED_MC || CONFIG_SCHED_SMT */
> static inline void init_sd_power_savings_stats(struct sched_domain *sd,
> struct sd_lb_stats *sds, enum cpu_idle_type idle)
> {
> @@ -2235,7 +2236,7 @@ static inline int check_power_save_busie
> {
> return 0;
> }
> -#endif /* CONFIG_SCHED_MC || CONFIG_SCHED_SMT */
> +#endif /* CONFIG_SCHED_BOOK || CONFIG_SCHED_MC || CONFIG_SCHED_SMT */
>
>
> unsigned long default_scale_freq_power(struct sched_domain *sd, int cpu)
>
* Re: [PATCH/RFC 2/5] [PATCH] sched: pass sched_domain_level to sched_power_savings_store
2010-08-12 17:25 ` [PATCH/RFC 2/5] [PATCH] sched: pass sched_domain_level to sched_power_savings_store Heiko Carstens
2010-08-13 21:13 ` Suresh Siddha
@ 2010-08-16 8:29 ` Peter Zijlstra
2010-08-19 11:41 ` Andreas Herrmann
1 sibling, 1 reply; 17+ messages in thread
From: Peter Zijlstra @ 2010-08-16 8:29 UTC (permalink / raw)
To: Heiko Carstens
Cc: Mike Galbraith, Ingo Molnar, Suresh Siddha, Andreas Herrmann,
linux-kernel, Martin Schwidefsky
On Thu, 2010-08-12 at 19:25 +0200, Heiko Carstens wrote:
> Pass the corresponding sched domain level to sched_power_savings_store instead
> of a yes/no flag which indicates if the level is SMT or MC.
> This is needed to easily extend the function so it can be used for a third
> level.
Ah, so the plan is to reduce the number of knobs, not create more.
Sysadmins really aren't interested in having a powersavings knob per
topology level.
* Re: [PATCH/RFC 3/5] [PATCH] sched: add book scheduling domain
2010-08-13 21:22 ` Suresh Siddha
@ 2010-08-16 8:48 ` Peter Zijlstra
0 siblings, 0 replies; 17+ messages in thread
From: Peter Zijlstra @ 2010-08-16 8:48 UTC (permalink / raw)
To: Suresh Siddha
Cc: Heiko Carstens, Mike Galbraith, Ingo Molnar, Andreas Herrmann,
linux-kernel@vger.kernel.org, Martin Schwidefsky
On Fri, 2010-08-13 at 14:22 -0700, Suresh Siddha wrote:
> On Thu, 2010-08-12 at 10:25 -0700, Heiko Carstens wrote:
> > From: Heiko Carstens <heiko.carstens@de.ibm.com>
> >
> > On top of the SMT and MC scheduling domains this adds the BOOK scheduling
> > domain. This is useful for machines that have a four level cache hierarchy
> > but do not fall into the NUMA category.
> >
> > Signed-off-by: Heiko Carstens <heiko.carstens@de.ibm.com>
>
> PeterZ had some ideas in cleaning up the sched domain setup to avoid
> this maze of #ifdef's. I will let him comment on this.
http://lkml.org/lkml/2009/8/18/169
More information in this thread: http://lkml.org/lkml/2009/8/20/190
* Re: [PATCH/RFC 2/5] [PATCH] sched: pass sched_domain_level to sched_power_savings_store
2010-08-13 21:13 ` Suresh Siddha
@ 2010-08-19 11:36 ` Andreas Herrmann
0 siblings, 0 replies; 17+ messages in thread
From: Andreas Herrmann @ 2010-08-19 11:36 UTC (permalink / raw)
To: Suresh Siddha
Cc: Heiko Carstens, Peter Zijlstra, Mike Galbraith, Ingo Molnar,
linux-kernel@vger.kernel.org, Martin Schwidefsky
On Fri, Aug 13, 2010 at 05:13:40PM -0400, Suresh Siddha wrote:
> On Thu, 2010-08-12 at 10:25 -0700, Heiko Carstens wrote:
> > From: Heiko Carstens <heiko.carstens@de.ibm.com>
> >
> > Pass the corresponding sched domain level to sched_power_savings_store instead
> > of a yes/no flag which indicates if the level is SMT or MC.
> > This is needed to easily extend the function so it can be used for a third
> > level.
> >
> > Signed-off-by: Heiko Carstens <heiko.carstens@de.ibm.com>
>
> Acked-by: Suresh Siddha <suresh.b.siddha@intel.com>
Acked-by: Andreas Herrmann <andreas.herrmann3@amd.com>
* Re: [PATCH/RFC 2/5] [PATCH] sched: pass sched_domain_level to sched_power_savings_store
2010-08-16 8:29 ` Peter Zijlstra
@ 2010-08-19 11:41 ` Andreas Herrmann
2010-08-19 12:35 ` Peter Zijlstra
0 siblings, 1 reply; 17+ messages in thread
From: Andreas Herrmann @ 2010-08-19 11:41 UTC (permalink / raw)
To: Peter Zijlstra
Cc: Heiko Carstens, Mike Galbraith, Ingo Molnar, Suresh Siddha,
linux-kernel@vger.kernel.org, Martin Schwidefsky
On Mon, Aug 16, 2010 at 04:29:57AM -0400, Peter Zijlstra wrote:
> On Thu, 2010-08-12 at 19:25 +0200, Heiko Carstens wrote:
> > Pass the corresponding sched domain level to sched_power_savings_store instead
> > of a yes/no flag which indicates if the level is SMT or MC.
> > This is needed to easily extend the function so it can be used for a third
> > level.
>
> Ah, so the plan is to reduce the number of knobs, not create more.
Don't think so.
> Sysadmins really aren't interested in having a powersavings knob per
> topology level.
It just allows using the same store function for three instead of
two different knobs.
Andreas
* Re: [PATCH/RFC 0/5] sched: add new 'book' scheduling domain
2010-08-12 17:25 [PATCH/RFC 0/5] sched: add new 'book' scheduling domain Heiko Carstens
` (4 preceding siblings ...)
2010-08-12 17:25 ` [PATCH/RFC 5/5] [PATCH] topology: add z196 cpu topology support Heiko Carstens
@ 2010-08-19 12:22 ` Andreas Herrmann
5 siblings, 0 replies; 17+ messages in thread
From: Andreas Herrmann @ 2010-08-19 12:22 UTC (permalink / raw)
To: Heiko Carstens
Cc: Peter Zijlstra, Mike Galbraith, Ingo Molnar, Suresh Siddha,
linux-kernel@vger.kernel.org, Martin Schwidefsky
On Thu, Aug 12, 2010 at 01:25:44PM -0400, Heiko Carstens wrote:
> This patch set adds (yet) another scheduling domain to the scheduler.
All that stuff reminds me of quite similar patches to introduce a
multi-node scheduling domain for Magny-Cours CPUs.
I am afraid that this stuff won't make it upstream and that we both have
to revisit Peter's suggestions from last year to come up with a more
generalized/flexible way to handle different scheduling domains.
> The reason for this is that the recent (s390) z196 architecture has
> four cache levels and uniform memory access (sort of -- see below).
> The cpu/cache/memory hierarchy is as follows:
> Each cpu has its private L1 (64KB I-cache + 128KB D-cache) and L2 (1.5MB)
> cache.
> A core consists of four cpus with a 24MB shared L3 cache.
> A book consists of six cores with a 192MB shared L4 cache.
> The z196 architecture has no SMT.
[...]
> A boot of a logical partition with 20 cpus, shared on two books, gives this
> initialization output on the console:
The output below shows an odd distribution of your CPUs across the
different domain levels. Is this caused by the fact that not all
CPUs of a core and book were assigned to your logical partition?
For better understanding: is the following CPU-to-core/book mapping correct
for your example?
Book | Core | CPU
------+--------+---------
0 | 0 | 0,1,2,3
0 | 1 | 4,5
1 | 0 | 6,9
1 | 1 | 10,11
1 | 2 | 12,13
1 | 3 | 14,15,16
1 | 4 | 17,18,19
> Brought up 20 CPUs
> CPU0 attaching sched-domain:
> domain 0: span 0-5 level BOOK
> groups: 0 1-3 (cpu_power = 3072) 4-5 (cpu_power = 2048)
Why isn't there a range 0-3 instead of "0 1-3"?
And why isn't cpu_power=4096?
Ah, I think that for CPU 0 just the power information is
missing, so we have 3 groups:
0 (cpu_power=1024) 1-3 (cpu_power=3072) 4-5 (cpu_power=2048)
And the MC level is folded because it doesn't add anything in this
case.
So the mapping is in fact
Book | Core | CPU
------+--------+---------
0 | 0 | 0
0 | 1 | 1,2,3
0 | 2 | 4,5
1 | 0 | 6,9
1 | 1 | 10,11
1 | 2 | 12,13
1 | 3 | 14,15,16
1 | 4 | 17,18,19
> domain 1: span 0-19 level CPU
> groups: 0-5 (cpu_power = 6144) 6-19 (cpu_power = 14336)
> CPU1 attaching sched-domain:
> domain 0: span 1-3 level MC
> groups: 1 2 3
> domain 1: span 0-5 level BOOK
> groups: 1-3 (cpu_power = 3072) 4-5 (cpu_power = 2048) 0
> domain 2: span 0-19 level CPU
> groups: 0-5 (cpu_power = 6144) 6-19 (cpu_power = 14336)
It's odd that for CPU 1 the BOOK domain groups differ from those shown
for CPU0.
> CPU2 attaching sched-domain:
> domain 0: span 1-3 level MC
> groups: 2 3 1
> domain 1: span 0-5 level BOOK
> groups: 1-3 (cpu_power = 3072) 4-5 (cpu_power = 2048) 0
Again for CPU 0 the cpu_power is missing. I think that is confusing.
For better readability that should also be displayed (if a group
consists of only 1 CPU).
> domain 2: span 0-19 level CPU
> groups: 0-5 (cpu_power = 6144) 6-19 (cpu_power = 14336)
[snip the rest]
Andreas
--
Operating | Advanced Micro Devices GmbH
System | Einsteinring 24, 85609 Dornach b. München, Germany
Research | Geschäftsführer: Alberto Bozzo, Andrew Bowd
Center | Sitz: Dornach, Gemeinde Aschheim, Landkreis München
(OSRC) | Registergericht München, HRB Nr. 43632
* Re: [PATCH/RFC 2/5] [PATCH] sched: pass sched_domain_level to sched_power_savings_store
2010-08-19 12:35 ` Peter Zijlstra
@ 2010-08-19 12:32 ` Andreas Herrmann
0 siblings, 0 replies; 17+ messages in thread
From: Andreas Herrmann @ 2010-08-19 12:32 UTC (permalink / raw)
To: Peter Zijlstra
Cc: Heiko Carstens, Mike Galbraith, Ingo Molnar, Suresh Siddha,
linux-kernel@vger.kernel.org, Martin Schwidefsky
On Thu, Aug 19, 2010 at 08:35:11AM -0400, Peter Zijlstra wrote:
> On Thu, 2010-08-19 at 13:41 +0200, Andreas Herrmann wrote:
> > It just allows using the same store function for three instead of
> > two different knobs.
>
> Creating more knobs for powersave scheduling is a fail.
>
> We already have 2^3 powersave scheduling states, it should be decreased
> to 2 (namely on/off), not increased to 3^3.
I think it should be possible to select a domain level at which power
saving scheduling should happen (this would result in 3 states in the
z196 case).
Andreas
* Re: [PATCH/RFC 2/5] [PATCH] sched: pass sched_domain_level to sched_power_savings_store
2010-08-19 11:41 ` Andreas Herrmann
@ 2010-08-19 12:35 ` Peter Zijlstra
2010-08-19 12:32 ` Andreas Herrmann
0 siblings, 1 reply; 17+ messages in thread
From: Peter Zijlstra @ 2010-08-19 12:35 UTC (permalink / raw)
To: Andreas Herrmann
Cc: Heiko Carstens, Mike Galbraith, Ingo Molnar, Suresh Siddha,
linux-kernel@vger.kernel.org, Martin Schwidefsky
On Thu, 2010-08-19 at 13:41 +0200, Andreas Herrmann wrote:
> It just allows using the same store function for three instead of
> two different knobs.
Creating more knobs for powersave scheduling is a fail.
We already have 2^3 powersave scheduling states, it should be decreased
to 2 (namely on/off), not increased to 3^3.
* Re: [PATCH/RFC 1/5] [PATCH] sched: merge cpu_to_core_group functions
2010-08-13 21:11 ` Suresh Siddha
@ 2010-08-31 8:26 ` Heiko Carstens
0 siblings, 0 replies; 17+ messages in thread
From: Heiko Carstens @ 2010-08-31 8:26 UTC (permalink / raw)
To: Suresh Siddha
Cc: Peter Zijlstra, Mike Galbraith, Ingo Molnar, Andreas Herrmann,
linux-kernel@vger.kernel.org, Martin Schwidefsky
On Fri, Aug 13, 2010 at 02:11:54PM -0700, Suresh Siddha wrote:
> On Thu, 2010-08-12 at 10:25 -0700, Heiko Carstens wrote:
> > diff -urpN linux-2.6/kernel/sched.c linux-2.6-patched/kernel/sched.c
> > --- linux-2.6/kernel/sched.c 2010-08-11 13:47:16.000000000 +0200
> > +++ linux-2.6-patched/kernel/sched.c 2010-08-11 13:47:22.000000000 +0200
> > @@ -6546,31 +6546,23 @@ cpu_to_cpu_group(int cpu, const struct c
> > #ifdef CONFIG_SCHED_MC
> > static DEFINE_PER_CPU(struct static_sched_domain, core_domains);
> > static DEFINE_PER_CPU(struct static_sched_group, sched_group_core);
> > -#endif /* CONFIG_SCHED_MC */
> >
> > -#if defined(CONFIG_SCHED_MC) && defined(CONFIG_SCHED_SMT)
> > static int
> > cpu_to_core_group(int cpu, const struct cpumask *cpu_map,
> > struct sched_group **sg, struct cpumask *mask)
> > {
> > int group;
> > -
> > +#ifdef CONFIG_SCHED_SMT
> > cpumask_and(mask, topology_thread_cpumask(cpu), cpu_map);
> > group = cpumask_first(mask);
> > +#else
> > + group = cpu;
> > +#endif
> > if (sg)
> > *sg = &per_cpu(sched_group_core, group).sg;
> > return group;
> > }
> > -#elif defined(CONFIG_SCHED_MC)
> > -static int
> > -cpu_to_core_group(int cpu, const struct cpumask *cpu_map,
> > - struct sched_group **sg, struct cpumask *unused)
> > -{
> > - if (sg)
> > - *sg = &per_cpu(sched_group_core, cpu).sg;
> > - return cpu;
> > -}
> > -#endif
> > +#endif /* CONFIG_SCHED_MC */
> >
> > static DEFINE_PER_CPU(struct static_sched_domain, phys_domains);
> > static DEFINE_PER_CPU(struct static_sched_group, sched_group_phys);
>
> Reason why this code was structured like this was because of the
> feedback from Andrew Morton. http://lkml.org/lkml/2006/1/27/308
Well, if I didn't merge this, the upcoming cpu_to_book_group
function would be horribly long and unreadable. I think merging
this so it looks the same as cpu_to_phys_group is the right thing
to do; otherwise cpu_to_book_group would be a real mess instead of
a quite simple function.
end of thread
Thread overview: 17+ messages
2010-08-12 17:25 [PATCH/RFC 0/5] sched: add new 'book' scheduling domain Heiko Carstens
2010-08-12 17:25 ` [PATCH/RFC 1/5] [PATCH] sched: merge cpu_to_core_group functions Heiko Carstens
2010-08-13 21:11 ` Suresh Siddha
2010-08-31 8:26 ` Heiko Carstens
2010-08-12 17:25 ` [PATCH/RFC 2/5] [PATCH] sched: pass sched_domain_level to sched_power_savings_store Heiko Carstens
2010-08-13 21:13 ` Suresh Siddha
2010-08-19 11:36 ` Andreas Herrmann
2010-08-16 8:29 ` Peter Zijlstra
2010-08-19 11:41 ` Andreas Herrmann
2010-08-19 12:35 ` Peter Zijlstra
2010-08-19 12:32 ` Andreas Herrmann
2010-08-12 17:25 ` [PATCH/RFC 3/5] [PATCH] sched: add book scheduling domain Heiko Carstens
2010-08-13 21:22 ` Suresh Siddha
2010-08-16 8:48 ` Peter Zijlstra
2010-08-12 17:25 ` [PATCH/RFC 4/5] [PATCH] topology/sysfs: provide book id and siblings attributes Heiko Carstens
2010-08-12 17:25 ` [PATCH/RFC 5/5] [PATCH] topology: add z196 cpu topology support Heiko Carstens
2010-08-19 12:22 ` [PATCH/RFC 0/5] sched: add new 'book' scheduling domain Andreas Herrmann