* [PATCH] Update sched_domains_numa_masks when new cpus are onlined.
@ 2012-09-18 10:12 Tang Chen
2012-09-24 5:52 ` Tang Chen
2012-09-24 9:38 ` Peter Zijlstra
0 siblings, 2 replies; 13+ messages in thread
From: Tang Chen @ 2012-09-18 10:12 UTC (permalink / raw)
To: linux-kernel, x86, linux-numa; +Cc: wency, tangchen
Once the array sched_domains_numa_masks is initialized, it is never updated.
When a new cpu on a new node is onlined, the corresponding member in
sched_domains_numa_masks is not initialized, and all the masks are 0.
As a result, build_overlap_sched_groups() will initialize a NULL
sched_group for the new cpu on the new node, which will lead to a kernel panic.
[ 3189.403280] Call Trace:
[ 3189.403286] [<ffffffff8106c36f>] warn_slowpath_common+0x7f/0xc0
[ 3189.403289] [<ffffffff8106c3ca>] warn_slowpath_null+0x1a/0x20
[ 3189.403292] [<ffffffff810b1d57>] build_sched_domains+0x467/0x470
[ 3189.403296] [<ffffffff810b2067>] partition_sched_domains+0x307/0x510
[ 3189.403299] [<ffffffff810b1ea2>] ? partition_sched_domains+0x142/0x510
[ 3189.403305] [<ffffffff810fcc93>] cpuset_update_active_cpus+0x83/0x90
[ 3189.403308] [<ffffffff810b22a8>] cpuset_cpu_active+0x38/0x70
[ 3189.403316] [<ffffffff81674b87>] notifier_call_chain+0x67/0x150
[ 3189.403320] [<ffffffff81664647>] ? native_cpu_up+0x18a/0x1b5
[ 3189.403328] [<ffffffff810a044e>] __raw_notifier_call_chain+0xe/0x10
[ 3189.403333] [<ffffffff81070470>] __cpu_notify+0x20/0x40
[ 3189.403337] [<ffffffff8166663e>] _cpu_up+0xe9/0x131
[ 3189.403340] [<ffffffff81666761>] cpu_up+0xdb/0xee
[ 3189.403348] [<ffffffff8165667c>] store_online+0x9c/0xd0
[ 3189.403355] [<ffffffff81437640>] dev_attr_store+0x20/0x30
[ 3189.403361] [<ffffffff8124aa63>] sysfs_write_file+0xa3/0x100
[ 3189.403368] [<ffffffff811ccbe0>] vfs_write+0xd0/0x1a0
[ 3189.403371] [<ffffffff811ccdb4>] sys_write+0x54/0xa0
[ 3189.403375] [<ffffffff81679c69>] system_call_fastpath+0x16/0x1b
[ 3189.403377] ---[ end trace 1e6cf85d0859c941 ]---
[ 3189.403398] BUG: unable to handle kernel NULL pointer dereference at 0000000000000018
This patch registers a new notifier on the cpu hotplug notify chain, and
updates sched_domains_numa_masks every time a cpu is onlined or offlined.
Signed-off-by: Tang Chen <tangchen@cn.fujitsu.com>
---
kernel/sched/core.c | 62 +++++++++++++++++++++++++++++++++++++++++++++++++++
1 files changed, 62 insertions(+), 0 deletions(-)
diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index fbf1fd0..66b36ab 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -6711,6 +6711,14 @@ static void sched_init_numa(void)
* numbers.
*/
+ /*
+ * Since sched_domains_numa_levels is also used in other functions as
+ * an index for sched_domains_numa_masks[][], we should reset it here in
+ * case sched_domains_numa_masks[][] fails to be initialized. And set it
+ * to 'level' when sched_domains_numa_masks[][] is fully initialized.
+ */
+ sched_domains_numa_levels = 0;
+
sched_domains_numa_masks = kzalloc(sizeof(void *) * level, GFP_KERNEL);
if (!sched_domains_numa_masks)
return;
@@ -6765,11 +6773,64 @@ static void sched_init_numa(void)
}
sched_domain_topology = tl;
+
+ sched_domains_numa_levels = level;
+}
+
+static void sched_domains_numa_masks_set(int cpu)
+{
+ int i, j;
+ int node = cpu_to_node(cpu);
+
+ for (i = 0; i < sched_domains_numa_levels; i++)
+ for (j = 0; j < nr_node_ids; j++)
+ if (node_distance(j, node) <= sched_domains_numa_distance[i])
+ cpumask_set_cpu(cpu, sched_domains_numa_masks[i][j]);
+}
+
+static void sched_domains_numa_masks_clear(int cpu)
+{
+ int i, j;
+ for (i = 0; i < sched_domains_numa_levels; i++)
+ for (j = 0; j < nr_node_ids; j++)
+ cpumask_clear_cpu(cpu, sched_domains_numa_masks[i][j]);
+}
+
+/*
+ * Update sched_domains_numa_masks[level][node] array when new cpus
+ * are onlined.
+ */
+static int sched_domains_numa_masks_update(struct notifier_block *nfb,
+ unsigned long action,
+ void *hcpu)
+{
+ int cpu = (int)hcpu;
+
+ switch (action & ~CPU_TASKS_FROZEN) {
+ case CPU_ONLINE:
+ sched_domains_numa_masks_set(cpu);
+ break;
+
+ case CPU_DEAD:
+ sched_domains_numa_masks_clear(cpu);
+ break;
+
+ default:
+ return NOTIFY_DONE;
+ }
+ return NOTIFY_OK;
}
#else
static inline void sched_init_numa(void)
{
}
+
+static int sched_domains_numa_masks_update(struct notifier_block *nfb,
+ unsigned long action,
+ void *hcpu)
+{
+ return 0;
+}
#endif /* CONFIG_NUMA */
static int __sdt_alloc(const struct cpumask *cpu_map)
@@ -7218,6 +7279,7 @@ void __init sched_init_smp(void)
mutex_unlock(&sched_domains_mutex);
put_online_cpus();
+ hotcpu_notifier(sched_domains_numa_masks_update, CPU_PRI_SCHED_ACTIVE);
hotcpu_notifier(cpuset_cpu_active, CPU_PRI_CPUSET_ACTIVE);
hotcpu_notifier(cpuset_cpu_inactive, CPU_PRI_CPUSET_INACTIVE);
--
1.7.1
* Re: [PATCH] Update sched_domains_numa_masks when new cpus are onlined.
2012-09-18 10:12 [PATCH] Update sched_domains_numa_masks when new cpus are onlined Tang Chen
@ 2012-09-24 5:52 ` Tang Chen
2012-09-24 9:38 ` Peter Zijlstra
1 sibling, 0 replies; 13+ messages in thread
From: Tang Chen @ 2012-09-24 5:52 UTC (permalink / raw)
To: mingo, peterz; +Cc: linux-kernel, x86, linux-numa, wency, Tang Chen
Hi,
Would you please help review this patch?
Thanks. :)
On 09/18/2012 06:12 PM, Tang Chen wrote:
> Once array sched_domains_numa_masks is defined, it is never updated.
> When a new cpu on a new node is onlined, the coincident member in
> sched_domains_numa_masks is not initialized, and all the masks are 0.
> As a result, the build_overlap_sched_groups() will initialize a NULL
> sched_group for the new cpu on the new node, which will lead to kernel panic.
>
> [ 3189.403280] Call Trace:
> [ 3189.403286] [<ffffffff8106c36f>] warn_slowpath_common+0x7f/0xc0
> [ 3189.403289] [<ffffffff8106c3ca>] warn_slowpath_null+0x1a/0x20
> [ 3189.403292] [<ffffffff810b1d57>] build_sched_domains+0x467/0x470
> [ 3189.403296] [<ffffffff810b2067>] partition_sched_domains+0x307/0x510
> [ 3189.403299] [<ffffffff810b1ea2>] ? partition_sched_domains+0x142/0x510
> [ 3189.403305] [<ffffffff810fcc93>] cpuset_update_active_cpus+0x83/0x90
> [ 3189.403308] [<ffffffff810b22a8>] cpuset_cpu_active+0x38/0x70
> [ 3189.403316] [<ffffffff81674b87>] notifier_call_chain+0x67/0x150
> [ 3189.403320] [<ffffffff81664647>] ? native_cpu_up+0x18a/0x1b5
> [ 3189.403328] [<ffffffff810a044e>] __raw_notifier_call_chain+0xe/0x10
> [ 3189.403333] [<ffffffff81070470>] __cpu_notify+0x20/0x40
> [ 3189.403337] [<ffffffff8166663e>] _cpu_up+0xe9/0x131
> [ 3189.403340] [<ffffffff81666761>] cpu_up+0xdb/0xee
> [ 3189.403348] [<ffffffff8165667c>] store_online+0x9c/0xd0
> [ 3189.403355] [<ffffffff81437640>] dev_attr_store+0x20/0x30
> [ 3189.403361] [<ffffffff8124aa63>] sysfs_write_file+0xa3/0x100
> [ 3189.403368] [<ffffffff811ccbe0>] vfs_write+0xd0/0x1a0
> [ 3189.403371] [<ffffffff811ccdb4>] sys_write+0x54/0xa0
> [ 3189.403375] [<ffffffff81679c69>] system_call_fastpath+0x16/0x1b
> [ 3189.403377] ---[ end trace 1e6cf85d0859c941 ]---
> [ 3189.403398] BUG: unable to handle kernel NULL pointer dereference at 0000000000000018
>
> This patch registers a new notifier for cpu hotplug notify chain, and
> updates sched_domains_numa_masks every time a new cpu is onlined or offlined.
>
> Signed-off-by: Tang Chen<tangchen@cn.fujitsu.com>
> ---
> kernel/sched/core.c | 62 +++++++++++++++++++++++++++++++++++++++++++++++++++
> 1 files changed, 62 insertions(+), 0 deletions(-)
>
> diff --git a/kernel/sched/core.c b/kernel/sched/core.c
> index fbf1fd0..66b36ab 100644
> --- a/kernel/sched/core.c
> +++ b/kernel/sched/core.c
> @@ -6711,6 +6711,14 @@ static void sched_init_numa(void)
> * numbers.
> */
>
> + /*
> + * Since sched_domains_numa_levels is also used in other functions as
> + * an index for sched_domains_numa_masks[][], we should reset it here in
> + * case sched_domains_numa_masks[][] fails to be initialized. And set it
> + * to 'level' when sched_domains_numa_masks[][] is fully initialized.
> + */
> + sched_domains_numa_levels = 0;
> +
> sched_domains_numa_masks = kzalloc(sizeof(void *) * level, GFP_KERNEL);
> if (!sched_domains_numa_masks)
> return;
> @@ -6765,11 +6773,64 @@ static void sched_init_numa(void)
> }
>
> sched_domain_topology = tl;
> +
> + sched_domains_numa_levels = level;
> +}
> +
> +static void sched_domains_numa_masks_set(int cpu)
> +{
> + int i, j;
> + int node = cpu_to_node(cpu);
> +
> + for (i = 0; i< sched_domains_numa_levels; i++)
> + for (j = 0; j< nr_node_ids; j++)
> + if (node_distance(j, node)<= sched_domains_numa_distance[i])
> + cpumask_set_cpu(cpu, sched_domains_numa_masks[i][j]);
> +}
> +
> +static void sched_domains_numa_masks_clear(int cpu)
> +{
> + int i, j;
> + for (i = 0; i< sched_domains_numa_levels; i++)
> + for (j = 0; j< nr_node_ids; j++)
> + cpumask_clear_cpu(cpu, sched_domains_numa_masks[i][j]);
> +}
> +
> +/*
> + * Update sched_domains_numa_masks[level][node] array when new cpus
> + * are onlined.
> + */
> +static int sched_domains_numa_masks_update(struct notifier_block *nfb,
> + unsigned long action,
> + void *hcpu)
> +{
> + int cpu = (int)hcpu;
> +
> + switch (action& ~CPU_TASKS_FROZEN) {
> + case CPU_ONLINE:
> + sched_domains_numa_masks_set(cpu);
> + break;
> +
> + case CPU_DEAD:
> + sched_domains_numa_masks_clear(cpu);
> + break;
> +
> + default:
> + return NOTIFY_DONE;
> + }
> + return NOTIFY_OK;
> }
> #else
> static inline void sched_init_numa(void)
> {
> }
> +
> +static int sched_domains_numa_masks_update(struct notifier_block *nfb,
> + unsigned long action,
> + void *hcpu)
> +{
> + return 0;
> +}
> #endif /* CONFIG_NUMA */
>
> static int __sdt_alloc(const struct cpumask *cpu_map)
> @@ -7218,6 +7279,7 @@ void __init sched_init_smp(void)
> mutex_unlock(&sched_domains_mutex);
> put_online_cpus();
>
> + hotcpu_notifier(sched_domains_numa_masks_update, CPU_PRI_SCHED_ACTIVE);
> hotcpu_notifier(cpuset_cpu_active, CPU_PRI_CPUSET_ACTIVE);
> hotcpu_notifier(cpuset_cpu_inactive, CPU_PRI_CPUSET_INACTIVE);
>
* Re: [PATCH] Update sched_domains_numa_masks when new cpus are onlined.
2012-09-18 10:12 [PATCH] Update sched_domains_numa_masks when new cpus are onlined Tang Chen
2012-09-24 5:52 ` Tang Chen
@ 2012-09-24 9:38 ` Peter Zijlstra
2012-09-24 9:57 ` Srivatsa S. Bhat
2012-09-25 2:39 ` Tang Chen
1 sibling, 2 replies; 13+ messages in thread
From: Peter Zijlstra @ 2012-09-24 9:38 UTC (permalink / raw)
To: Tang Chen; +Cc: linux-kernel, x86, linux-numa, wency, mingo, Thomas Gleixner
Why are you cc'ing x86 and numa folks but not a single scheduler person
when you're patching scheduler stuff?
On Tue, 2012-09-18 at 18:12 +0800, Tang Chen wrote:
> Once array sched_domains_numa_masks is defined, it is never updated.
> When a new cpu on a new node is onlined,
Hmm, so there's hardware where you can boot with smaller nr_node_ids
than possible.. I guess that makes sense.
> the coincident member in
> sched_domains_numa_masks is not initialized, and all the masks are 0.
> As a result, the build_overlap_sched_groups() will initialize a NULL
> sched_group for the new cpu on the new node, which will lead to kernel panic.
<snip>
> This patch registers a new notifier for cpu hotplug notify chain, and
> updates sched_domains_numa_masks every time a new cpu is onlined or offlined.
Urgh, more hotplug notifiers.. ah well.
> Signed-off-by: Tang Chen <tangchen@cn.fujitsu.com>
> ---
> kernel/sched/core.c | 62 +++++++++++++++++++++++++++++++++++++++++++++++++++
> 1 files changed, 62 insertions(+), 0 deletions(-)
>
> diff --git a/kernel/sched/core.c b/kernel/sched/core.c
> index fbf1fd0..66b36ab 100644
> --- a/kernel/sched/core.c
> +++ b/kernel/sched/core.c
> @@ -6711,6 +6711,14 @@ static void sched_init_numa(void)
> * numbers.
> */
>
> + /*
> + * Since sched_domains_numa_levels is also used in other functions as
> + * an index for sched_domains_numa_masks[][], we should reset it here in
> + * case sched_domains_numa_masks[][] fails to be initialized. And set it
> + * to 'level' when sched_domains_numa_masks[][] is fully initialized.
> + */
> + sched_domains_numa_levels = 0;
This isn't strictly needed for this patch right? I don't see anybody
calling sched_init_numa() a second time (although they should)..
> sched_domains_numa_masks = kzalloc(sizeof(void *) * level, GFP_KERNEL);
> if (!sched_domains_numa_masks)
> return;
> @@ -6765,11 +6773,64 @@ static void sched_init_numa(void)
> }
>
> sched_domain_topology = tl;
> +
> + sched_domains_numa_levels = level;
> +}
> +
> +static void sched_domains_numa_masks_set(int cpu)
> +{
> + int i, j;
> + int node = cpu_to_node(cpu);
> +
> + for (i = 0; i < sched_domains_numa_levels; i++)
> + for (j = 0; j < nr_node_ids; j++)
> + if (node_distance(j, node) <= sched_domains_numa_distance[i])
> + cpumask_set_cpu(cpu, sched_domains_numa_masks[i][j]);
> +}
> +
> +static void sched_domains_numa_masks_clear(int cpu)
> +{
> + int i, j;
> + for (i = 0; i < sched_domains_numa_levels; i++)
> + for (j = 0; j < nr_node_ids; j++)
> + cpumask_clear_cpu(cpu, sched_domains_numa_masks[i][j]);
> +}
Aside from the coding style nit of wanting braces over multi-line
statements even though not strictly required, I really don't see how
this could possibly be right..
We do this because nr_node_ids changed, right? This means the entire
distance table grew/shrunk, which means we should do the level scan
again.
> @@ -7218,6 +7279,7 @@ void __init sched_init_smp(void)
> mutex_unlock(&sched_domains_mutex);
> put_online_cpus();
>
> + hotcpu_notifier(sched_domains_numa_masks_update, CPU_PRI_SCHED_ACTIVE);
> hotcpu_notifier(cpuset_cpu_active, CPU_PRI_CPUSET_ACTIVE);
> hotcpu_notifier(cpuset_cpu_inactive, CPU_PRI_CPUSET_INACTIVE);
OK, so you really want your notifier to run before cpuset_cpu_active
because otherwise you get that crash, yet you fail with the whole order
thing.. You should not _ever_ rely on registration order.
* Re: [PATCH] Update sched_domains_numa_masks when new cpus are onlined.
2012-09-24 9:38 ` Peter Zijlstra
@ 2012-09-24 9:57 ` Srivatsa S. Bhat
2012-09-24 10:12 ` Peter Zijlstra
2012-09-25 2:39 ` Tang Chen
2012-09-25 2:39 ` Tang Chen
1 sibling, 2 replies; 13+ messages in thread
From: Srivatsa S. Bhat @ 2012-09-24 9:57 UTC (permalink / raw)
To: Peter Zijlstra
Cc: Tang Chen, linux-kernel, x86, linux-numa, wency, mingo,
Thomas Gleixner
On 09/24/2012 03:08 PM, Peter Zijlstra wrote:
>> + hotcpu_notifier(sched_domains_numa_masks_update, CPU_PRI_SCHED_ACTIVE);
>> hotcpu_notifier(cpuset_cpu_active, CPU_PRI_CPUSET_ACTIVE);
>> hotcpu_notifier(cpuset_cpu_inactive, CPU_PRI_CPUSET_INACTIVE);
>
> OK, so you really want your notifier to run before cpuset_cpu_active
> because otherwise you get that crash, yet you fail with the whole order
> thing.. You should not _ever_ rely on registration order.
>
IMHO he isn't relying on registration order.. He uses the CPU_PRI_SCHED_ACTIVE
priority to ensure that the ordering of callbacks is right, isn't it?
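For reference, the ordering comes from the notifier priority, not from when
hotcpu_notifier() happens to be called. A rough sketch of the relevant
constants (paraphrased from include/linux/cpu.h of about this kernel version;
the exact values and layout are an assumption and may differ):

	enum {
		/* a higher priority value runs earlier on the chain */
		CPU_PRI_SCHED_ACTIVE	= INT_MAX,
		CPU_PRI_CPUSET_ACTIVE	= INT_MAX - 1,
		CPU_PRI_SCHED_INACTIVE	= INT_MIN + 1,
		CPU_PRI_CPUSET_INACTIVE	= INT_MIN,
	};

So sched_domains_numa_masks_update, registered with CPU_PRI_SCHED_ACTIVE,
should be called before cpuset_cpu_active no matter which hotcpu_notifier()
line comes first.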
Regards,
Srivatsa S. Bhat
* Re: [PATCH] Update sched_domains_numa_masks when new cpus are onlined.
2012-09-24 9:57 ` Srivatsa S. Bhat
@ 2012-09-24 10:12 ` Peter Zijlstra
2012-09-24 10:17 ` Srivatsa S. Bhat
2012-09-25 2:39 ` Tang Chen
1 sibling, 1 reply; 13+ messages in thread
From: Peter Zijlstra @ 2012-09-24 10:12 UTC (permalink / raw)
To: Srivatsa S. Bhat
Cc: Tang Chen, linux-kernel, x86, linux-numa, wency, mingo,
Thomas Gleixner
On Mon, 2012-09-24 at 15:27 +0530, Srivatsa S. Bhat wrote:
> On 09/24/2012 03:08 PM, Peter Zijlstra wrote:
> >> + hotcpu_notifier(sched_domains_numa_masks_update, CPU_PRI_SCHED_ACTIVE);
> >> hotcpu_notifier(cpuset_cpu_active, CPU_PRI_CPUSET_ACTIVE);
> >> hotcpu_notifier(cpuset_cpu_inactive, CPU_PRI_CPUSET_INACTIVE);
> >
> > OK, so you really want your notifier to run before cpuset_cpu_active
> > because otherwise you get that crash, yet you fail with the whole order
> > thing.. You should not _ever_ rely on registration order.
> >
>
> IMHO he isn't relying on registration order.. He uses the CPU_PRI_SCHED_ACTIVE
> priority to ensure that the ordering of callbacks is right, isn't it?
Oh argh indeed. I can't read :/
* Re: [PATCH] Update sched_domains_numa_masks when new cpus are onlined.
2012-09-24 10:12 ` Peter Zijlstra
@ 2012-09-24 10:17 ` Srivatsa S. Bhat
0 siblings, 0 replies; 13+ messages in thread
From: Srivatsa S. Bhat @ 2012-09-24 10:17 UTC (permalink / raw)
To: Peter Zijlstra
Cc: Tang Chen, linux-kernel, x86, linux-numa, wency, mingo,
Thomas Gleixner
On 09/24/2012 03:42 PM, Peter Zijlstra wrote:
> On Mon, 2012-09-24 at 15:27 +0530, Srivatsa S. Bhat wrote:
>> On 09/24/2012 03:08 PM, Peter Zijlstra wrote:
>>>> + hotcpu_notifier(sched_domains_numa_masks_update, CPU_PRI_SCHED_ACTIVE);
>>>> hotcpu_notifier(cpuset_cpu_active, CPU_PRI_CPUSET_ACTIVE);
>>>> hotcpu_notifier(cpuset_cpu_inactive, CPU_PRI_CPUSET_INACTIVE);
>>>
>>> OK, so you really want your notifier to run before cpuset_cpu_active
>>> because otherwise you get that crash, yet you fail with the whole order
>>> thing.. You should not _ever_ rely on registration order.
>>>
>>
>> IMHO he isn't relying on registration order.. He uses the CPU_PRI_SCHED_ACTIVE
>> priority to ensure that the ordering of callbacks is right, isn't it?
>
> Oh argh indeed. I can't read :/
>
;-)
Regards,
Srivatsa S. Bhat
* Re: [PATCH] Update sched_domains_numa_masks when new cpus are onlined.
2012-09-24 9:38 ` Peter Zijlstra
2012-09-24 9:57 ` Srivatsa S. Bhat
@ 2012-09-25 2:39 ` Tang Chen
2012-09-25 10:33 ` Peter Zijlstra
2012-09-25 11:07 ` Peter Zijlstra
1 sibling, 2 replies; 13+ messages in thread
From: Tang Chen @ 2012-09-25 2:39 UTC (permalink / raw)
To: Peter Zijlstra, Srivatsa S. Bhat
Cc: linux-kernel, x86, linux-numa, wency, mingo, Thomas Gleixner
Hi Peter:
Thanks for commenting on this patch. :)
On 09/24/2012 05:38 PM, Peter Zijlstra wrote:
> Why are you cc'ing x86 and numa folks but not a single scheduler person
> when you're patching scheduler stuff?
First of all, I'm sorry for this. I thought it was a NUMA or memory-related
problem, and I got your address from the get_maintainer.pl script.
>
> On Tue, 2012-09-18 at 18:12 +0800, Tang Chen wrote:
>> Once array sched_domains_numa_masks is defined, it is never updated.
>> When a new cpu on a new node is onlined,
>
> Hmm, so there's hardware where you can boot with smaller nr_node_ids
> than possible.. I guess that makes sense.
Yes. nr_node_ids represents the max number of nodes the system supports.
And usually, we don't have that many.
>
>> the coincident member in
>> sched_domains_numa_masks is not initialized, and all the masks are 0.
>> As a result, the build_overlap_sched_groups() will initialize a NULL
>> sched_group for the new cpu on the new node, which will lead to kernel panic.
>
> <snip>
>
>> This patch registers a new notifier for cpu hotplug notify chain, and
>> updates sched_domains_numa_masks every time a new cpu is onlined or offlined.
>
> Urgh, more hotplug notifiers.. a well.
>
>> Signed-off-by: Tang Chen<tangchen@cn.fujitsu.com>
>> ---
>> kernel/sched/core.c | 62 +++++++++++++++++++++++++++++++++++++++++++++++++++
>> 1 files changed, 62 insertions(+), 0 deletions(-)
>>
>> diff --git a/kernel/sched/core.c b/kernel/sched/core.c
>> index fbf1fd0..66b36ab 100644
>> --- a/kernel/sched/core.c
>> +++ b/kernel/sched/core.c
>> @@ -6711,6 +6711,14 @@ static void sched_init_numa(void)
>> * numbers.
>> */
>>
>> + /*
>> + * Since sched_domains_numa_levels is also used in other functions as
>> + * an index for sched_domains_numa_masks[][], we should reset it here in
>> + * case sched_domains_numa_masks[][] fails to be initialized. And set it
>> + * to 'level' when sched_domains_numa_masks[][] is fully initialized.
>> + */
>> + sched_domains_numa_levels = 0;
>
> This isn't strictly needed for this patch right? I don't see anybody
> calling sched_init_numa() a second time (although they should)..
Indeed, sched_init_numa() won't be called by anyone else. But I clear it
here in case the following memory allocation fails.
Actually, I want to use sched_domains_numa_levels to iterate all the
numa levels, such as in sched_domains_numa_masks_set().
Suppose sched_domains_numa_levels is 10. If allocating memory for
sched_domains_numa_masks[5] fails in sched_init_numa(), it will just
return, but sched_domains_numa_masks[] will only have 5 valid members. It
could be dangerous.
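To illustrate, here is a hypothetical sketch of that failure mode (not code
from this patch, and the numbers are made up):

	/* level was 10, but the kzalloc for sched_domains_numa_masks[5]
	 * failed, so only masks[0..4] were ever allocated. */
	for (i = 0; i < sched_domains_numa_levels; i++)	/* still runs to 9 */
		for (j = 0; j < nr_node_ids; j++)
			/* dereferences the NULL sched_domains_numa_masks[5] */
			cpumask_set_cpu(cpu, sched_domains_numa_masks[i][j]);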
So I set sched_domains_numa_levels to 0, and reset it to level at the
end of sched_init_numa(), see below.
>
>> sched_domains_numa_masks = kzalloc(sizeof(void *) * level, GFP_KERNEL);
>> if (!sched_domains_numa_masks)
>> return;
>> @@ -6765,11 +6773,64 @@ static void sched_init_numa(void)
>> }
>>
>> sched_domain_topology = tl;
>> +
>> + sched_domains_numa_levels = level;
And I set it to level here again.
>> +}
>> +
>> +static void sched_domains_numa_masks_set(int cpu)
>> +{
>> + int i, j;
>> + int node = cpu_to_node(cpu);
>> +
>> + for (i = 0; i< sched_domains_numa_levels; i++)
>> + for (j = 0; j< nr_node_ids; j++)
>> + if (node_distance(j, node)<= sched_domains_numa_distance[i])
>> + cpumask_set_cpu(cpu, sched_domains_numa_masks[i][j]);
>> +}
>> +
>> +static void sched_domains_numa_masks_clear(int cpu)
>> +{
>> + int i, j;
>> + for (i = 0; i< sched_domains_numa_levels; i++)
>> + for (j = 0; j< nr_node_ids; j++)
>> + cpumask_clear_cpu(cpu, sched_domains_numa_masks[i][j]);
>> +}
>
> Aside from the coding style nit of wanting braces over multi-line
> statements even though not strictly required, I really don't see how
> this could possibly be right..
Thanks for telling me that. I'll fix the coding style. :)
>
> We do this because nr_node_ids changed, right? This means the entire
> distance table grew/shrunk, which means we should do the level scan
> again.
It seems that nr_node_ids will not change once the system is up.
I'm not quite sure. If I am wrong, please tell me. :)
I found that nr_node_ids is defined as MAX_NUMNODES in mm/page_alloc.c:

	int nr_node_ids __read_mostly = MAX_NUMNODES;

And all the functions that change this value are __init functions. So I think
nr_node_ids is just the number of possible nodes in the system, and it won't
be changed after system initialization has finished.
>
>> @@ -7218,6 +7279,7 @@ void __init sched_init_smp(void)
>> mutex_unlock(&sched_domains_mutex);
>> put_online_cpus();
>>
>> + hotcpu_notifier(sched_domains_numa_masks_update, CPU_PRI_SCHED_ACTIVE);
>> hotcpu_notifier(cpuset_cpu_active, CPU_PRI_CPUSET_ACTIVE);
>> hotcpu_notifier(cpuset_cpu_inactive, CPU_PRI_CPUSET_INACTIVE);
>
> OK, so you really want your notifier to run before cpuset_cpu_active
> because otherwise you get that crash, yet you fail with the whole order
> thing.. You should not _ever_ rely on registration order.
>
>
* Re: [PATCH] Update sched_domains_numa_masks when new cpus are onlined.
2012-09-24 9:57 ` Srivatsa S. Bhat
2012-09-24 10:12 ` Peter Zijlstra
@ 2012-09-25 2:39 ` Tang Chen
1 sibling, 0 replies; 13+ messages in thread
From: Tang Chen @ 2012-09-25 2:39 UTC (permalink / raw)
To: Srivatsa S. Bhat
Cc: Peter Zijlstra, linux-kernel, x86, linux-numa, wency, mingo,
Thomas Gleixner
On 09/24/2012 05:57 PM, Srivatsa S. Bhat wrote:
> On 09/24/2012 03:08 PM, Peter Zijlstra wrote:
>>> + hotcpu_notifier(sched_domains_numa_masks_update, CPU_PRI_SCHED_ACTIVE);
>>> hotcpu_notifier(cpuset_cpu_active, CPU_PRI_CPUSET_ACTIVE);
>>> hotcpu_notifier(cpuset_cpu_inactive, CPU_PRI_CPUSET_INACTIVE);
>>
>> OK, so you really want your notifier to run before cpuset_cpu_active
>> because otherwise you get that crash, yet you fail with the whole order
>> thing.. You should not _ever_ rely on registration order.
>>
>
> IMHO he isn't relying on registration order.. He uses the CPU_PRI_SCHED_ACTIVE
> priority to ensure that the ordering of callbacks is right, isn't it?
Yes, that's right. Thanks. :)
>
> Regards,
> Srivatsa S. Bhat
>
* Re: [PATCH] Update sched_domains_numa_masks when new cpus are onlined.
2012-09-25 2:39 ` Tang Chen
@ 2012-09-25 10:33 ` Peter Zijlstra
2012-09-25 11:45 ` Tang Chen
2012-09-25 11:07 ` Peter Zijlstra
1 sibling, 1 reply; 13+ messages in thread
From: Peter Zijlstra @ 2012-09-25 10:33 UTC (permalink / raw)
To: Tang Chen
Cc: Srivatsa S. Bhat, linux-kernel, x86, linux-numa, wency, mingo,
Thomas Gleixner
On Tue, 2012-09-25 at 10:39 +0800, Tang Chen wrote:
> >> @@ -6765,11 +6773,64 @@ static void sched_init_numa(void)
> >> }
> >>
> >> sched_domain_topology = tl;
> >> +
> >> + sched_domains_numa_levels = level;
>
> And I set it to level here again.
>
But it's already set there.. it's set every time we find a new level.
* Re: [PATCH] Update sched_domains_numa_masks when new cpus are onlined.
2012-09-25 2:39 ` Tang Chen
2012-09-25 10:33 ` Peter Zijlstra
@ 2012-09-25 11:07 ` Peter Zijlstra
1 sibling, 0 replies; 13+ messages in thread
From: Peter Zijlstra @ 2012-09-25 11:07 UTC (permalink / raw)
To: Tang Chen
Cc: Srivatsa S. Bhat, linux-kernel, x86, linux-numa, wency, mingo,
Thomas Gleixner
On Tue, 2012-09-25 at 10:39 +0800, Tang Chen wrote:
> > We do this because nr_node_ids changed, right? This means the entire
> > distance table grew/shrunk, which means we should do the level scan
> > again.
>
> It seems that nr_node_ids will not change once the system is up.
> I'm not quite sure. If I am wrong, please tell me. :)
Ah, right you are..
So the problem is that cpumask_of_node() doesn't contain offline cpus,
and we might not boot the machine with all cpus present.
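For reference, the boot-time loop in sched_init_numa() looks roughly like this
(a simplified sketch from memory of this era of the code; the details are an
assumption and may differ):

	for (i = 0; i < level; i++) {
		...
		for (j = 0; j < nr_node_ids; j++) {
			struct cpumask *mask =
				kzalloc_node(cpumask_size(), GFP_KERNEL, j);

			sched_domains_numa_masks[i][j] = mask;

			for (k = 0; k < nr_node_ids; k++) {
				if (node_distance(j, k) >
				    sched_domains_numa_distance[i])
					continue;
				/* only cpus present in cpumask_of_node(k)
				 * at this point make it into the mask */
				cpumask_or(mask, mask, cpumask_of_node(k));
			}
		}
	}

so a cpu that only appears after boot is simply missing from every mask.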
In that case I guess the suggested hotplug hooks are fine. It's just that
your changelog was confusing.
* Re: [PATCH] Update sched_domains_numa_masks when new cpus are onlined.
2012-09-25 10:33 ` Peter Zijlstra
@ 2012-09-25 11:45 ` Tang Chen
2012-09-25 11:50 ` Peter Zijlstra
0 siblings, 1 reply; 13+ messages in thread
From: Tang Chen @ 2012-09-25 11:45 UTC (permalink / raw)
To: Peter Zijlstra
Cc: Srivatsa S. Bhat, linux-kernel, x86, linux-numa, wency, mingo,
Thomas Gleixner
Hi Peter~
Sorry about the confusing changelog, and thanks for your patience. :)
Here, I want to say something more about sched_domains_numa_levels
to make myself clearer. :)
Let's have an example here.
sched_init_numa()
{
	...
	// A loop sets sched_domains_numa_levels to level.-------------1

	// I set sched_domains_numa_levels to 0.
	sched_domains_numa_levels = 0;--------------------------------2

	// A loop allocating memory for sched_domains_numa_masks[][]
	for (i = 0; i < level; i++) {
		......
		// Allocate memory for sched_domains_numa_masks[i]----3
		......
	}
	......

	// I reset sched_domains_numa_levels to level.
	sched_domains_numa_levels = level;----------------------------4
}

// A new function I added.
static void sched_domains_numa_masks_clear(int cpu)
{
	int i, j;

	for (i = 0; i < sched_domains_numa_levels; i++)---------------5
		for (j = 0; j < nr_node_ids; j++)
			cpumask_clear_cpu(cpu,
					  sched_domains_numa_masks[i][j]);
}
Suppose level is 10, and at step 1, sched_domains_numa_levels is 10.
If I didn't set sched_domains_numa_levels to 0 at step 2, it will be 10
all the time.
If memory allocation at step 3 fails when i = 5, the array
sched_domains_numa_masks[][] will only have 5 members, and
sched_domains_numa_levels is 10.
As you can see, I added 2 functions that use sched_domains_numa_levels to
iterate over sched_domains_numa_masks[][], such as at step 5.
In this case, the iteration would have to stop at i = 5, but
sched_domains_numa_levels still says 10, so it would walk off the end of the
allocated array. This could be dangerous.
So, I set sched_domains_numa_levels to 0 at step 2. This way, even if
any memory allocation at step 3 fails and sched_init_numa() returns early,
anyone who uses sched_domains_numa_levels (which is still 0) won't go wrong.
I'm not sure if this is the best way to solve this problem.
If you have a better idea, please tell me. Thanks. :)
On 09/25/2012 06:33 PM, Peter Zijlstra wrote:
> On Tue, 2012-09-25 at 10:39 +0800, Tang Chen wrote:
>>>> @@ -6765,11 +6773,64 @@ static void sched_init_numa(void)
>>>> }
>>>>
>>>> sched_domain_topology = tl;
>>>> +
>>>> + sched_domains_numa_levels = level;
>>
>> And I set it to level here again.
>>
> But its already set there.. its set every time we find a new level.
>
>
* Re: [PATCH] Update sched_domains_numa_masks when new cpus are onlined.
2012-09-25 11:45 ` Tang Chen
@ 2012-09-25 11:50 ` Peter Zijlstra
2012-09-25 12:02 ` Tang Chen
0 siblings, 1 reply; 13+ messages in thread
From: Peter Zijlstra @ 2012-09-25 11:50 UTC (permalink / raw)
To: Tang Chen
Cc: Srivatsa S. Bhat, linux-kernel, x86, linux-numa, wency, mingo,
Thomas Gleixner
On Tue, 2012-09-25 at 19:45 +0800, Tang Chen wrote:
> Let's have an example here.
>
> sched_init_numa()
> {
> ...
> // A loop set sched_domains_numa_levels to level.-------------1
>
> // I set sched_domains_numa_levels to 0.
> sched_domains_numa_levels = 0;--------------------------------2
Ah, ok, so this 'idiot in a hurry' (aka. me) placed the hunk wrong and
thought it was at the start of the function.
> // A loop allocating memory for sched_domains_numa_masks[][]
> for (i = 0; i < level; i++) {
> ......
> // Allocate memory for sched_domains_numa_masks[i]----3
> ......
> }
> ......
>
> // I reset sched_domains_numa_levels to level.
> sched_domains_numa_levels = level;----------------------------4
> }
Yes this makes sense, it might be best to have this as a separate patch.
* Re: [PATCH] Update sched_domains_numa_masks when new cpus are onlined.
2012-09-25 11:50 ` Peter Zijlstra
@ 2012-09-25 12:02 ` Tang Chen
0 siblings, 0 replies; 13+ messages in thread
From: Tang Chen @ 2012-09-25 12:02 UTC (permalink / raw)
To: Peter Zijlstra
Cc: Srivatsa S. Bhat, linux-kernel, x86, linux-numa, wency, mingo,
Thomas Gleixner
Hi Peter~
I will make a v2 patch-set following your comments
and resend it soon. :)
Thanks. :)
On 09/25/2012 07:50 PM, Peter Zijlstra wrote:
> On Tue, 2012-09-25 at 19:45 +0800, Tang Chen wrote:
>
>
>> Let's have an example here.
>>
>> sched_init_numa()
>> {
>> ...
>> // A loop set sched_domains_numa_levels to level.-------------1
>>
>> // I set sched_domains_numa_levels to 0.
>> sched_domains_numa_levels = 0;--------------------------------2
>
> Ah, ok, so this 'idiot in a hurry' (aka. me) placed the hunk wrong and
> thought it was at the start of the function.
>
>> // A loop allocating memory for sched_domains_numa_masks[][]
>> for (i = 0; i< level; i++) {
>> ......
>> // Allocate memory for sched_domains_numa_masks[i]----3
>> ......
>> }
>> ......
>>
>> // I reset sched_domains_numa_levels to level.
>> sched_domains_numa_levels = level;----------------------------4
>> }
>
> Yes this makes sense, it might be best to have this as a separate patch.