* [PATCH 1/2] sched/topology: Set correct numa topology type
[not found] <reply-to=<20180808081942.GA37418@linux.vnet.ibm.com>
@ 2018-08-10 17:00 ` Srikar Dronamraju
2018-08-10 17:00 ` [PATCH 2/2] sched/topology: Expose numa_mask set/clear functions to arch Srikar Dronamraju
` (2 more replies)
0 siblings, 3 replies; 12+ messages in thread
From: Srikar Dronamraju @ 2018-08-10 17:00 UTC (permalink / raw)
To: Peter Zijlstra, Ingo Molnar
Cc: Srikar Dronamraju, LKML, Mel Gorman, Rik van Riel,
Thomas Gleixner, Michael Ellerman, Heiko Carstens,
Suravee Suthikulpanit, Andre Wild, linuxppc-dev
With commit 051f3ca02e46 ("sched/topology: Introduce NUMA identity node
sched domain") the scheduler introduced a new NUMA level. However, this
causes the NUMA topology on 2-node systems to no longer be marked as
NUMA_DIRECT: after this commit it gets reported as NUMA_BACKPLANE,
because sched_domains_numa_levels is now 2 on 2-node systems.

Fix this by classifying systems that have up to 2 NUMA levels as
NUMA_DIRECT.

While here, remove code that assumes the level can be 0.
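For illustration (a minimal userspace sketch, not part of the patch): the
unique-distance counting that sched_init_numa() performs yields two
levels on a plain 2-node system once the identity distance is included.
The distance table below is the typical one for a direct 2-node system
and is an assumption of this sketch:

#include <stdbool.h>
#include <stdio.h>

int main(void)
{
	/* typical SLIT for a direct 2-node system, flattened */
	int dist[4] = { 10, 20, 20, 10 };
	int distances[4];
	int level = 0;

	for (int i = 0; i < 4; i++) {
		bool seen = false;

		for (int k = 0; k < level; k++)
			if (distances[k] == dist[i])
				seen = true;
		if (!seen)
			distances[level++] = dist[i];
	}

	/* prints 2: the identity distance plus the remote distance,
	 * which is why NUMA_DIRECT must allow levels <= 2 */
	printf("numa levels: %d\n", level);
	return 0;
}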
Fixes: 051f3ca02e46 ("sched/topology: Introduce NUMA identity node sched domain")
Signed-off-by: Srikar Dronamraju <srikar@linux.vnet.ibm.com>
---
kernel/sched/topology.c | 5 +----
1 file changed, 1 insertion(+), 4 deletions(-)
diff --git a/kernel/sched/topology.c b/kernel/sched/topology.c
index a6e6b855ba81..cec3ee3ed320 100644
--- a/kernel/sched/topology.c
+++ b/kernel/sched/topology.c
@@ -1315,7 +1315,7 @@ static void init_numa_topology_type(void)
n = sched_max_numa_distance;
- if (sched_domains_numa_levels <= 1) {
+ if (sched_domains_numa_levels <= 2) {
sched_numa_topology_type = NUMA_DIRECT;
return;
}
@@ -1400,9 +1400,6 @@ void sched_init_numa(void)
break;
}
- if (!level)
- return;
-
/*
* 'level' contains the number of unique distances
*
--
2.12.3
* [PATCH 2/2] sched/topology: Expose numa_mask set/clear functions to arch
2018-08-10 17:00 ` [PATCH 1/2] sched/topology: Set correct numa topology type Srikar Dronamraju
@ 2018-08-10 17:00 ` Srikar Dronamraju
2018-08-29 8:02 ` Peter Zijlstra
2018-08-21 11:02 ` [PATCH 1/2] sched/topology: Set correct numa topology type Srikar Dronamraju
2018-09-10 10:06 ` [tip:sched/core] sched/topology: Set correct NUMA " tip-bot for Srikar Dronamraju
2 siblings, 1 reply; 12+ messages in thread
From: Srikar Dronamraju @ 2018-08-10 17:00 UTC (permalink / raw)
To: Peter Zijlstra, Ingo Molnar
Cc: Srikar Dronamraju, LKML, Mel Gorman, Rik van Riel,
Thomas Gleixner, Michael Ellerman, Heiko Carstens,
Suravee Suthikulpanit, Andre Wild, linuxppc-dev
With commit 051f3ca02e46 ("sched/topology: Introduce NUMA identity node
sched domain") the scheduler introduced a new NUMA level. However, on
shared LPARs such as on powerpc, this extra sched domain creation can
lead to repeated RCU stalls, sometimes even leaving systems unresponsive
at boot. During such stalls, it was noticed that
init_sched_groups_capacity() loops forever, because sg != sd->groups
always remains true.
INFO: rcu_sched self-detected stall on CPU
1-....: (240039 ticks this GP) idle=c32/1/4611686018427387906 softirq=782/782 fqs=80012
(t=240039 jiffies g=6272 c=6271 q=263040)
NMI backtrace for cpu 1
CPU: 1 PID: 1576 Comm: kworker/1:1 Kdump: loaded Tainted: G E 4.18.0-rc7-master+ #42
Workqueue: events topology_work_fn
Call Trace:
[c00000832132f190] [c0000000009557ac] dump_stack+0xb0/0xf4 (unreliable)
[c00000832132f1d0] [c00000000095ed54] nmi_cpu_backtrace+0x1b4/0x230
[c00000832132f270] [c00000000095efac] nmi_trigger_cpumask_backtrace+0x1dc/0x220
[c00000832132f310] [c00000000005f77c] arch_trigger_cpumask_backtrace+0x2c/0x40
[c00000832132f330] [c0000000001a32d4] rcu_dump_cpu_stacks+0x100/0x15c
[c00000832132f380] [c0000000001a2024] rcu_check_callbacks+0x894/0xaa0
[c00000832132f4a0] [c0000000001ace9c] update_process_times+0x4c/0xa0
[c00000832132f4d0] [c0000000001c5400] tick_sched_handle.isra.13+0x50/0x80
[c00000832132f4f0] [c0000000001c549c] tick_sched_timer+0x6c/0xd0
[c00000832132f530] [c0000000001ae044] __hrtimer_run_queues+0x134/0x360
[c00000832132f5b0] [c0000000001aeea4] hrtimer_interrupt+0x124/0x300
[c00000832132f660] [c000000000024a04] timer_interrupt+0x114/0x2f0
[c00000832132f6c0] [c0000000000090f4] decrementer_common+0x114/0x120
--- interrupt: 901 at __bitmap_weight+0x70/0x100
LR = __bitmap_weight+0x78/0x100
[c00000832132f9b0] [c0000000009bb738] __func__.61127+0x0/0x20 (unreliable)
[c00000832132fa00] [c00000000016c178] build_sched_domains+0xf98/0x13f0
[c00000832132fb30] [c00000000016d73c] partition_sched_domains+0x26c/0x440
[c00000832132fc20] [c0000000001ee284] rebuild_sched_domains_locked+0x64/0x80
[c00000832132fc50] [c0000000001f11ec] rebuild_sched_domains+0x3c/0x60
[c00000832132fc80] [c00000000007e1c4] topology_work_fn+0x24/0x40
[c00000832132fca0] [c000000000126704] process_one_work+0x1a4/0x470
[c00000832132fd30] [c000000000126a68] worker_thread+0x98/0x540
[c00000832132fdc0] [c00000000012f078] kthread+0x168/0x1b0
[c00000832132fe30] [c00000000000b65c] ret_from_kernel_thread+0x5c/0x80
A similar problem was also reported earlier at
https://lwn.net/ml/linux-kernel/20180512100233.GB3738@osiris/

Allow the arch to set and clear the masks corresponding to the NUMA sched
domains.
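A rough sketch of the intended arch-side use (illustrative only: the
handler name is made up, and set_cpu_numa_node() is just one way an arch
might update its cpu-to-node bookkeeping):

/* hypothetical arch handler for a hypervisor topology-update event */
static void arch_numa_move_cpu(unsigned int cpu, int new_node)
{
	/* drop the CPU from the NUMA masks of its current node */
	sched_domains_numa_masks_clear(cpu);

	/* switch the arch's cpu-to-node mapping */
	set_cpu_numa_node(cpu, new_node);

	/* re-add the CPU under its new node, so that a later
	 * rebuild_sched_domains() sees consistent masks */
	sched_domains_numa_masks_set(cpu);
}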
Cc: linuxppc-dev <linuxppc-dev@lists.ozlabs.org>
Cc: Michael Ellerman <mpe@ellerman.id.au>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Heiko Carstens <heiko.carstens@de.ibm.com>
Cc: Ingo Molnar <mingo@kernel.org>
Cc: LKML <linux-kernel@vger.kernel.org>
Fixes: 051f3ca02e46 ("sched/topology: Introduce NUMA identity node sched domain")
Signed-off-by: Srikar Dronamraju <srikar@linux.vnet.ibm.com>
---
include/linux/sched/topology.h | 6 ++++++
kernel/sched/sched.h | 4 ----
2 files changed, 6 insertions(+), 4 deletions(-)
diff --git a/include/linux/sched/topology.h b/include/linux/sched/topology.h
index 26347741ba50..13c7baeb7789 100644
--- a/include/linux/sched/topology.h
+++ b/include/linux/sched/topology.h
@@ -52,6 +52,12 @@ static inline int cpu_numa_flags(void)
{
return SD_NUMA;
}
+
+extern void sched_domains_numa_masks_set(unsigned int cpu);
+extern void sched_domains_numa_masks_clear(unsigned int cpu);
+#else
+static inline void sched_domains_numa_masks_set(unsigned int cpu) { }
+static inline void sched_domains_numa_masks_clear(unsigned int cpu) { }
#endif
extern int arch_asym_cpu_priority(int cpu);
diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
index c7742dcc136c..1028f3df8777 100644
--- a/kernel/sched/sched.h
+++ b/kernel/sched/sched.h
@@ -1057,12 +1057,8 @@ extern bool find_numa_distance(int distance);
#ifdef CONFIG_NUMA
extern void sched_init_numa(void);
-extern void sched_domains_numa_masks_set(unsigned int cpu);
-extern void sched_domains_numa_masks_clear(unsigned int cpu);
#else
static inline void sched_init_numa(void) { }
-static inline void sched_domains_numa_masks_set(unsigned int cpu) { }
-static inline void sched_domains_numa_masks_clear(unsigned int cpu) { }
#endif
#ifdef CONFIG_NUMA_BALANCING
--
2.12.3
* Re: [PATCH 1/2] sched/topology: Set correct numa topology type
2018-08-10 17:00 ` [PATCH 1/2] sched/topology: Set correct numa topology type Srikar Dronamraju
2018-08-10 17:00 ` [PATCH 2/2] sched/topology: Expose numa_mask set/clear functions to arch Srikar Dronamraju
@ 2018-08-21 11:02 ` Srikar Dronamraju
2018-08-21 13:59 ` Peter Zijlstra
2018-09-10 10:06 ` [tip:sched/core] sched/topology: Set correct NUMA " tip-bot for Srikar Dronamraju
2 siblings, 1 reply; 12+ messages in thread
From: Srikar Dronamraju @ 2018-08-21 11:02 UTC (permalink / raw)
To: Peter Zijlstra, Ingo Molnar
Cc: LKML, Mel Gorman, Rik van Riel, Thomas Gleixner, Michael Ellerman,
Heiko Carstens, Suravee Suthikulpanit, Andre Wild, linuxppc-dev
* Srikar Dronamraju <srikar@linux.vnet.ibm.com> [2018-08-10 22:30:18]:
> With commit 051f3ca02e46 ("sched/topology: Introduce NUMA identity node
> sched domain") the scheduler introduced a new NUMA level. However, this
> causes the NUMA topology on 2-node systems to no longer be marked as
> NUMA_DIRECT: after this commit it gets reported as NUMA_BACKPLANE,
> because sched_domains_numa_levels is now 2 on 2-node systems.
>
> Fix this by classifying systems that have up to 2 NUMA levels as
> NUMA_DIRECT.
>
> While here, remove code that assumes the level can be 0.
>
> Fixes: 051f3ca02e46 ("sched/topology: Introduce NUMA identity node sched domain")
> Signed-off-by: Srikar Dronamraju <srikar@linux.vnet.ibm.com>
> ---
Hey Peter,
Did you look at these two patches?
--
Thanks and Regards
Srikar Dronamraju
* Re: [PATCH 1/2] sched/topology: Set correct numa topology type
2018-08-21 11:02 ` [PATCH 1/2] sched/topology: Set correct numa topology type Srikar Dronamraju
@ 2018-08-21 13:59 ` Peter Zijlstra
0 siblings, 0 replies; 12+ messages in thread
From: Peter Zijlstra @ 2018-08-21 13:59 UTC (permalink / raw)
To: Srikar Dronamraju
Cc: Ingo Molnar, LKML, Mel Gorman, Rik van Riel, Thomas Gleixner,
Michael Ellerman, Heiko Carstens, Suravee Suthikulpanit,
Andre Wild, linuxppc-dev
On Tue, Aug 21, 2018 at 04:02:58AM -0700, Srikar Dronamraju wrote:
> * Srikar Dronamraju <srikar@linux.vnet.ibm.com> [2018-08-10 22:30:18]:
>
> > With commit 051f3ca02e46 ("sched/topology: Introduce NUMA identity node
> > sched domain") the scheduler introduced a new NUMA level. However, this
> > causes the NUMA topology on 2-node systems to no longer be marked as
> > NUMA_DIRECT: after this commit it gets reported as NUMA_BACKPLANE,
> > because sched_domains_numa_levels is now 2 on 2-node systems.
> >
> > Fix this by classifying systems that have up to 2 NUMA levels as
> > NUMA_DIRECT.
> >
> > While here, remove code that assumes the level can be 0.
> >
> > Fixes: 051f3ca02e46 ("sched/topology: Introduce NUMA identity node sched domain")
> > Signed-off-by: Srikar Dronamraju <srikar@linux.vnet.ibm.com>
> > ---
>
> Hey Peter,
>
> Did you look at these two patches?
Nope, been on holidays and the inbox is an even bigger mess than normal.
I'll get to it, eventually :/
* Re: [PATCH 2/2] sched/topology: Expose numa_mask set/clear functions to arch
2018-08-10 17:00 ` [PATCH 2/2] sched/topology: Expose numa_mask set/clear functions to arch Srikar Dronamraju
@ 2018-08-29 8:02 ` Peter Zijlstra
2018-08-31 10:27 ` Srikar Dronamraju
0 siblings, 1 reply; 12+ messages in thread
From: Peter Zijlstra @ 2018-08-29 8:02 UTC (permalink / raw)
To: Srikar Dronamraju
Cc: Ingo Molnar, LKML, Mel Gorman, Rik van Riel, Thomas Gleixner,
Michael Ellerman, Heiko Carstens, Suravee Suthikulpanit,
Andre Wild, linuxppc-dev
On Fri, Aug 10, 2018 at 10:30:19PM +0530, Srikar Dronamraju wrote:
> With commit 051f3ca02e46 ("sched/topology: Introduce NUMA identity node
> sched domain") the scheduler introduced a new NUMA level. However, on
> shared LPARs such as on powerpc, this extra sched domain creation can
> lead to repeated RCU stalls, sometimes even leaving systems unresponsive
> at boot. During such stalls, it was noticed that
> init_sched_groups_capacity() loops forever, because sg != sd->groups
> always remains true.
>
> INFO: rcu_sched self-detected stall on CPU
> 1-....: (240039 ticks this GP) idle=c32/1/4611686018427387906 softirq=782/782 fqs=80012
> (t=240039 jiffies g=6272 c=6271 q=263040)
> NMI backtrace for cpu 1
> --- interrupt: 901 at __bitmap_weight+0x70/0x100
> LR = __bitmap_weight+0x78/0x100
> [c00000832132f9b0] [c0000000009bb738] __func__.61127+0x0/0x20 (unreliable)
> [c00000832132fa00] [c00000000016c178] build_sched_domains+0xf98/0x13f0
> [c00000832132fb30] [c00000000016d73c] partition_sched_domains+0x26c/0x440
> [c00000832132fc20] [c0000000001ee284] rebuild_sched_domains_locked+0x64/0x80
> [c00000832132fc50] [c0000000001f11ec] rebuild_sched_domains+0x3c/0x60
> [c00000832132fc80] [c00000000007e1c4] topology_work_fn+0x24/0x40
> [c00000832132fca0] [c000000000126704] process_one_work+0x1a4/0x470
> [c00000832132fd30] [c000000000126a68] worker_thread+0x98/0x540
> [c00000832132fdc0] [c00000000012f078] kthread+0x168/0x1b0
> [c00000832132fe30] [c00000000000b65c] ret_from_kernel_thread+0x5c/0x80
>
> A similar problem was also reported earlier at
> https://lwn.net/ml/linux-kernel/20180512100233.GB3738@osiris/
>
> Allow the arch to set and clear the masks corresponding to the NUMA
> sched domains.
What this Changelog fails to do is explain the problem and motivate why
this is the right solution.
As-is, this reads like, something's buggered, I changed this random thing
and it now works.
So what is causing that domain construction error?
* Re: [PATCH 2/2] sched/topology: Expose numa_mask set/clear functions to arch
2018-08-29 8:02 ` Peter Zijlstra
@ 2018-08-31 10:27 ` Srikar Dronamraju
2018-08-31 11:12 ` Peter Zijlstra
0 siblings, 1 reply; 12+ messages in thread
From: Srikar Dronamraju @ 2018-08-31 10:27 UTC (permalink / raw)
To: Peter Zijlstra
Cc: Ingo Molnar, LKML, Mel Gorman, Rik van Riel, Thomas Gleixner,
Michael Ellerman, Heiko Carstens, Suravee Suthikulpanit,
linuxppc-dev, Benjamin Herrenschmidt
* Peter Zijlstra <peterz@infradead.org> [2018-08-29 10:02:19]:
> On Fri, Aug 10, 2018 at 10:30:19PM +0530, Srikar Dronamraju wrote:
> > With commit 051f3ca02e46 ("sched/topology: Introduce NUMA identity node
> > sched domain") the scheduler introduced a new NUMA level. However, on
> > shared LPARs such as on powerpc, this extra sched domain creation can
> > lead to repeated RCU stalls, sometimes even leaving systems unresponsive
> > at boot. During such stalls, it was noticed that
> > init_sched_groups_capacity() loops forever, because sg != sd->groups
> > always remains true.
> >
> > INFO: rcu_sched self-detected stall on CPU
> > 1-....: (240039 ticks this GP) idle=c32/1/4611686018427387906 softirq=782/782 fqs=80012
> > (t=240039 jiffies g=6272 c=6271 q=263040)
> > NMI backtrace for cpu 1
>
> > --- interrupt: 901 at __bitmap_weight+0x70/0x100
> > LR = __bitmap_weight+0x78/0x100
> > [c00000832132f9b0] [c0000000009bb738] __func__.61127+0x0/0x20 (unreliable)
> > [c00000832132fa00] [c00000000016c178] build_sched_domains+0xf98/0x13f0
> > [c00000832132fb30] [c00000000016d73c] partition_sched_domains+0x26c/0x440
> > [c00000832132fc20] [c0000000001ee284] rebuild_sched_domains_locked+0x64/0x80
> > [c00000832132fc50] [c0000000001f11ec] rebuild_sched_domains+0x3c/0x60
> > [c00000832132fc80] [c00000000007e1c4] topology_work_fn+0x24/0x40
> > [c00000832132fca0] [c000000000126704] process_one_work+0x1a4/0x470
> > [c00000832132fd30] [c000000000126a68] worker_thread+0x98/0x540
> > [c00000832132fdc0] [c00000000012f078] kthread+0x168/0x1b0
> > [c00000832132fe30] [c00000000000b65c] ret_from_kernel_thread+0x5c/0x80
> >
> > A similar problem was also reported earlier at
> > https://lwn.net/ml/linux-kernel/20180512100233.GB3738@osiris/
> >
> > Allow the arch to set and clear the masks corresponding to the NUMA
> > sched domains.
>
> What this Changelog fails to do is explain the problem and motivate why
> this is the right solution.
>
> As-is, this reads like, something's buggered, I changed this random thing
> and it now works.
>
> So what is causing that domain construction error?
>
PowerPC LPARs running on Phyp have two modes: dedicated and shared.

Dedicated LPARs are similar to a KVM guest with vcpupin.

Shared LPARs are similar to a KVM guest without any pinning. When running
in shared LPAR mode, Phyp allows overcommitting. As LPARs are
created/destroyed, Phyp will internally move/consolidate the cores. The
objective is similar to what autonuma tries to achieve on the host, but
with a different approach (consolidating onto the optimal nodes to
achieve the best possible output). This means that the actual underlying
cpu-to-node mapping has changed.

Phyp propagates an event up to the LPAR; the LPAR/OS can choose to
ignore it or act on it.

We have found that acting on the event provides up to a 40% improvement
over ignoring it. Acting on the event means moving a CPU from one node to
the other, which is exactly what topology_work_fn() does.
In the case where we didn't have the NODE sched domain, we would build
independent (aka overlapping) sched_groups. With the introduction of the
NODE sched domain, we try to reuse sched_groups (aka non-overlapping).
This results in the above, which I tried to explain in
https://lwn.net/ml/linux-kernel/20180810164533.GB42350@linux.vnet.ibm.com
In the typical case above, let's take a 2-node system with 8 cores, each
core having 8 SMT threads. Initially all 8 cores might come from node 0.
Hence sched_domains_numa_masks[NODE][node1] and
sched_domains_numa_masks[NUMA][node1], which are set up at
sched_init_numa() time, will have empty cpumasks.
Let's say Phyp decides to move some of the load to another node, node 1,
which till now has 0 CPUs. Hence we will see

"BUG: arch topology borken \n the DIE domain not a subset of the NODE
domain", which is probably okay. This problem was present even before the
NODE domain was created, and systems still booted and ran.
However, with the introduction of the NODE sched_domain,
init_sched_groups_capacity() gets called for the non-overlapping
sched_domains, which gets us into even worse problems. Here we end up in
a situation where the circular group list sgA->sgB->sgC->sgD->sgA gets
converted into sgA->sgB->sgC->sgB, which ends up causing CPU stalls.
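To illustrate why that stalls (a minimal standalone sketch, not the
actual kernel code): init_sched_groups_capacity() walks the group list
assuming it is circular, so once the list degenerates it never gets back
to sd->groups:

struct sched_group { struct sched_group *next; };

static void walk_groups(struct sched_group *head)
{
	struct sched_group *sg = head;

	do {
		/* update this group's capacity ... */
		sg = sg->next;
	} while (sg != head);	/* never becomes false once the list
				 * no longer closes back on 'head' */
}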
So the request is to expose sched_domains_numa_masks_set() /
sched_domains_numa_masks_clear() to the arch, so that on a topology
update, i.e. an event from Phyp, the arch can set the masks correctly.
The scheduler seems to take care of everything else.
--
Thanks and Regards
Srikar Dronamraju
* Re: [PATCH 2/2] sched/topology: Expose numa_mask set/clear functions to arch
2018-08-31 10:27 ` Srikar Dronamraju
@ 2018-08-31 11:12 ` Peter Zijlstra
2018-08-31 11:26 ` Peter Zijlstra
0 siblings, 1 reply; 12+ messages in thread
From: Peter Zijlstra @ 2018-08-31 11:12 UTC (permalink / raw)
To: Srikar Dronamraju
Cc: Ingo Molnar, LKML, Mel Gorman, Rik van Riel, Thomas Gleixner,
Michael Ellerman, Heiko Carstens, Suravee Suthikulpanit,
linuxppc-dev, Benjamin Herrenschmidt
On Fri, Aug 31, 2018 at 03:27:24AM -0700, Srikar Dronamraju wrote:
> * Peter Zijlstra <peterz@infradead.org> [2018-08-29 10:02:19]:
> PowerPC LPARs running on Phyp have two modes: dedicated and shared.
>
> Dedicated LPARs are similar to a KVM guest with vcpupin.
Like I know what that means... I'm not big on virt. I suppose you're
saying it has a fixed virt-to-phys mapping.
> Shared LPARs are similar to a KVM guest without any pinning. When running
> in shared LPAR mode, Phyp allows overcommitting. As LPARs are
> created/destroyed, Phyp will internally move/consolidate the cores. The
> objective is similar to what autonuma tries to achieve on the host, but
> with a different approach (consolidating onto the optimal nodes to
> achieve the best possible output). This means that the actual underlying
> cpu-to-node mapping has changed.
AFAIK Linux can _not_ handle cpu:node relations changing. And I'm pretty
sure I told you that before.
> Phyp propagates an event up to the LPAR; the LPAR/OS can choose to
> ignore it or act on it.
>
> We have found that acting on the event provides up to a 40% improvement
> over ignoring it. Acting on the event means moving a CPU from one node to
> the other, which is exactly what topology_work_fn() does.
How? Last time I checked there was a ton of code that relies on
cpu_to_node() not changing during the runtime of the kernel.
Stuff like the per-cpu memory allocations are done using the boot time
cpu_to_node() map for instance. Similarly, kthread creation uses the
cpu_to_node() map at the time of creation.
A lot of stuff is not re-evaluated. If you're dynamically changing the
node map, you're in for a world of hurt.
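(A minimal illustration of that creation-time sampling; worker_fn and the
surrounding context are made up:)

/* the node argument is evaluated once, here at creation time; if
 * cpu_to_node(cpu) changes later, nothing revisits this placement */
struct task_struct *t = kthread_create_on_node(worker_fn, NULL,
					       cpu_to_node(cpu),
					       "worker/%u", cpu);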
> In the case where we didn't have the NODE sched domain, we would build
> independent (aka overlapping) sched_groups. With the introduction of the
> NODE sched domain, we try to reuse sched_groups (aka non-overlapping).
> This results in the above, which I tried to explain in
> https://lwn.net/ml/linux-kernel/20180810164533.GB42350@linux.vnet.ibm.com
That email was a ton of confusion; you show an error and you don't
explain how you get there.
> In the typical case above, let's take a 2-node system with 8 cores, each
> core having 8 SMT threads. Initially all 8 cores might come from node 0.
> Hence sched_domains_numa_masks[NODE][node1] and
> sched_domains_numa_masks[NUMA][node1], which are set up at
> sched_init_numa() time, will have empty cpumasks.
>
> Let's say Phyp decides to move some of the load to another node, node 1,
> which till now has 0 CPUs. Hence we will see
>
> "BUG: arch topology borken \n the DIE domain not a subset of the NODE
> domain", which is probably okay. This problem was present even before the
> NODE domain was created, and systems still booted and ran.
No, that is _NOT_ OKAY. The fact that it boots and runs just means we
cope with it, but it violates a base assumption when building domains.
> However, with the introduction of the NODE sched_domain,
> init_sched_groups_capacity() gets called for the non-overlapping
> sched_domains, which gets us into even worse problems. Here we end up in
> a situation where the circular group list sgA->sgB->sgC->sgD->sgA gets
> converted into sgA->sgB->sgC->sgB, which ends up causing CPU stalls.
>
> So the request is to expose sched_domains_numa_masks_set() /
> sched_domains_numa_masks_clear() to the arch, so that on a topology
> update, i.e. an event from Phyp, the arch can set the masks correctly.
> The scheduler seems to take care of everything else.
NAK, not until you've fixed every cpu_to_node() user in the kernel to
deal with that mask changing.
This is absolutely insane.
* Re: [PATCH 2/2] sched/topology: Expose numa_mask set/clear functions to arch
2018-08-31 11:12 ` Peter Zijlstra
@ 2018-08-31 11:26 ` Peter Zijlstra
2018-08-31 11:53 ` Srikar Dronamraju
0 siblings, 1 reply; 12+ messages in thread
From: Peter Zijlstra @ 2018-08-31 11:26 UTC (permalink / raw)
To: Srikar Dronamraju
Cc: Ingo Molnar, LKML, Mel Gorman, Rik van Riel, Thomas Gleixner,
Michael Ellerman, Heiko Carstens, Suravee Suthikulpanit,
linuxppc-dev, Benjamin Herrenschmidt
On Fri, Aug 31, 2018 at 01:12:53PM +0200, Peter Zijlstra wrote:
> NAK, not until you've fixed every cpu_to_node() user in the kernel to
> deal with that mask changing.
Also, what happens if userspace reads that information; uses libnuma and
then you go and shift the world underneath their feet?
> This is absolutely insane.
* Re: [PATCH 2/2] sched/topology: Expose numa_mask set/clear functions to arch
2018-08-31 11:26 ` Peter Zijlstra
@ 2018-08-31 11:53 ` Srikar Dronamraju
2018-08-31 12:05 ` Peter Zijlstra
2018-08-31 12:08 ` Peter Zijlstra
0 siblings, 2 replies; 12+ messages in thread
From: Srikar Dronamraju @ 2018-08-31 11:53 UTC (permalink / raw)
To: Peter Zijlstra
Cc: Ingo Molnar, LKML, Mel Gorman, Rik van Riel, Thomas Gleixner,
Michael Ellerman, Heiko Carstens, Suravee Suthikulpanit,
linuxppc-dev, Benjamin Herrenschmidt
* Peter Zijlstra <peterz@infradead.org> [2018-08-31 13:26:39]:
> On Fri, Aug 31, 2018 at 01:12:53PM +0200, Peter Zijlstra wrote:
> > NAK, not until you've fixed every cpu_to_node() user in the kernel to
> > deal with that mask changing.
>
> Also, what happens if userspace reads that information; uses libnuma and
> then you go and shift the world underneath their feet?
>
> > This is absolutely insane.
>
The topology events are supposed to be very rare.

From whatever small experiments I have done till now, unless tasks are
bound to both CPU and memory, they seem to cope well with topology
updates. I know things weren't optimal after a topology change, but they
worked. Now, after 051f3ca02e46 "Introduce NUMA identity node sched
domain", systems stall. I am only exploring ways to keep them working as
well as they did before that commit.
--
Thanks and Regards
Srikar Dronamraju
* Re: [PATCH 2/2] sched/topology: Expose numa_mask set/clear functions to arch
2018-08-31 11:53 ` Srikar Dronamraju
@ 2018-08-31 12:05 ` Peter Zijlstra
2018-08-31 12:08 ` Peter Zijlstra
1 sibling, 0 replies; 12+ messages in thread
From: Peter Zijlstra @ 2018-08-31 12:05 UTC (permalink / raw)
To: Srikar Dronamraju
Cc: Ingo Molnar, LKML, Mel Gorman, Rik van Riel, Thomas Gleixner,
Michael Ellerman, Heiko Carstens, Suravee Suthikulpanit,
linuxppc-dev, Benjamin Herrenschmidt
On Fri, Aug 31, 2018 at 04:53:50AM -0700, Srikar Dronamraju wrote:
> * Peter Zijlstra <peterz@infradead.org> [2018-08-31 13:26:39]:
>
> > On Fri, Aug 31, 2018 at 01:12:53PM +0200, Peter Zijlstra wrote:
> > > NAK, not until you've fixed every cpu_to_node() user in the kernel to
> > > deal with that mask changing.
> >
> > Also, what happens if userspace reads that information; uses libnuma and
> > then you go and shift the world underneath their feet?
> >
> > > This is absolutely insane.
> >
>
> The topology events are supposed to be very rare.
>
> From whatever small experiments I have done till now, unless tasks are
> bound to both CPU and memory, they seem to cope well with topology
> updates. I know things weren't optimal after a topology change, but they
> worked. Now, after 051f3ca02e46 "Introduce NUMA identity node sched
> domain", systems stall. I am only exploring ways to keep them working as
> well as they did before that commit.
I'm saying things were fundamentally buggered and this just made it show.
If you cannot guarantee cpu:node relations, you do not have NUMA, end of
story.
* Re: [PATCH 2/2] sched/topology: Expose numa_mask set/clear functions to arch
2018-08-31 11:53 ` Srikar Dronamraju
2018-08-31 12:05 ` Peter Zijlstra
@ 2018-08-31 12:08 ` Peter Zijlstra
1 sibling, 0 replies; 12+ messages in thread
From: Peter Zijlstra @ 2018-08-31 12:08 UTC (permalink / raw)
To: Srikar Dronamraju
Cc: Ingo Molnar, LKML, Mel Gorman, Rik van Riel, Thomas Gleixner,
Michael Ellerman, Heiko Carstens, Suravee Suthikulpanit,
linuxppc-dev, Benjamin Herrenschmidt
On Fri, Aug 31, 2018 at 04:53:50AM -0700, Srikar Dronamraju wrote:
> The topology events are supposed to be very rare.
>
> From whatever small experiments I have done till now, unless tasks are
> bound to both CPU and memory, they seem to cope well with topology
> updates.
IOW, if you're not using NUMA, it works if you change the NUMA setup.
You don't see anything wrong with that?!
Those programs would work as well if you didn't expose the NUMA stuff,
because they're not using it anyway.
* [tip:sched/core] sched/topology: Set correct NUMA topology type
2018-08-10 17:00 ` [PATCH 1/2] sched/topology: Set correct numa topology type Srikar Dronamraju
2018-08-10 17:00 ` [PATCH 2/2] sched/topology: Expose numa_mask set/clear functions to arch Srikar Dronamraju
2018-08-21 11:02 ` [PATCH 1/2] sched/topology: Set correct numa topology type Srikar Dronamraju
@ 2018-09-10 10:06 ` tip-bot for Srikar Dronamraju
2 siblings, 0 replies; 12+ messages in thread
From: tip-bot for Srikar Dronamraju @ 2018-09-10 10:06 UTC (permalink / raw)
To: linux-tip-commits
Cc: linux-kernel, linuxppc-dev, torvalds, wild, suravee.suthikulpanit,
hpa, mingo, mpe, heiko.carstens, tglx, peterz, riel, srikar,
mgorman
Commit-ID: e5e96fafd9028b1478b165db78c52d981c14f471
Gitweb: https://git.kernel.org/tip/e5e96fafd9028b1478b165db78c52d981c14f471
Author: Srikar Dronamraju <srikar@linux.vnet.ibm.com>
AuthorDate: Fri, 10 Aug 2018 22:30:18 +0530
Committer: Ingo Molnar <mingo@kernel.org>
CommitDate: Mon, 10 Sep 2018 10:13:45 +0200
sched/topology: Set correct NUMA topology type
With the following commit:
051f3ca02e46 ("sched/topology: Introduce NUMA identity node sched domain")
the scheduler introduced a new NUMA level. However, this causes the NUMA
topology on 2-node systems to no longer be marked as NUMA_DIRECT.

After this commit, it gets reported as NUMA_BACKPLANE, because
sched_domains_numa_levels is now 2 on 2-node systems.

Fix this by classifying systems that have up to 2 NUMA levels as
NUMA_DIRECT.

While here, remove code that assumes the level can be 0.
Signed-off-by: Srikar Dronamraju <srikar@linux.vnet.ibm.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Andre Wild <wild@linux.vnet.ibm.com>
Cc: Heiko Carstens <heiko.carstens@de.ibm.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Mel Gorman <mgorman@techsingularity.net>
Cc: Michael Ellerman <mpe@ellerman.id.au>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Rik van Riel <riel@surriel.com>
Cc: Suravee Suthikulpanit <suravee.suthikulpanit@amd.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: linuxppc-dev <linuxppc-dev@lists.ozlabs.org>
Fixes: 051f3ca02e46 ("sched/topology: Introduce NUMA identity node sched domain")
Link: http://lkml.kernel.org/r/1533920419-17410-1-git-send-email-srikar@linux.vnet.ibm.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
---
kernel/sched/topology.c | 5 +----
1 file changed, 1 insertion(+), 4 deletions(-)
diff --git a/kernel/sched/topology.c b/kernel/sched/topology.c
index 56a0fed30c0a..505a41c42b96 100644
--- a/kernel/sched/topology.c
+++ b/kernel/sched/topology.c
@@ -1295,7 +1295,7 @@ static void init_numa_topology_type(void)
n = sched_max_numa_distance;
- if (sched_domains_numa_levels <= 1) {
+ if (sched_domains_numa_levels <= 2) {
sched_numa_topology_type = NUMA_DIRECT;
return;
}
@@ -1380,9 +1380,6 @@ void sched_init_numa(void)
break;
}
- if (!level)
- return;
-
/*
* 'level' contains the number of unique distances
*