* [PATCH] Update sched_domains_numa_masks when new cpus are onlined.

From: Tang Chen @ 2012-09-18 10:12 UTC
To: linux-kernel, x86, linux-numa; +Cc: wency, tangchen

Once the array sched_domains_numa_masks is defined, it is never updated.
When a new cpu on a new node is onlined, the corresponding member in
sched_domains_numa_masks is not initialized, and all the masks are 0.
As a result, build_overlap_sched_groups() will initialize a NULL
sched_group for the new cpu on the new node, which will lead to a kernel panic:

[ 3189.403280] Call Trace:
[ 3189.403286]  [<ffffffff8106c36f>] warn_slowpath_common+0x7f/0xc0
[ 3189.403289]  [<ffffffff8106c3ca>] warn_slowpath_null+0x1a/0x20
[ 3189.403292]  [<ffffffff810b1d57>] build_sched_domains+0x467/0x470
[ 3189.403296]  [<ffffffff810b2067>] partition_sched_domains+0x307/0x510
[ 3189.403299]  [<ffffffff810b1ea2>] ? partition_sched_domains+0x142/0x510
[ 3189.403305]  [<ffffffff810fcc93>] cpuset_update_active_cpus+0x83/0x90
[ 3189.403308]  [<ffffffff810b22a8>] cpuset_cpu_active+0x38/0x70
[ 3189.403316]  [<ffffffff81674b87>] notifier_call_chain+0x67/0x150
[ 3189.403320]  [<ffffffff81664647>] ? native_cpu_up+0x18a/0x1b5
[ 3189.403328]  [<ffffffff810a044e>] __raw_notifier_call_chain+0xe/0x10
[ 3189.403333]  [<ffffffff81070470>] __cpu_notify+0x20/0x40
[ 3189.403337]  [<ffffffff8166663e>] _cpu_up+0xe9/0x131
[ 3189.403340]  [<ffffffff81666761>] cpu_up+0xdb/0xee
[ 3189.403348]  [<ffffffff8165667c>] store_online+0x9c/0xd0
[ 3189.403355]  [<ffffffff81437640>] dev_attr_store+0x20/0x30
[ 3189.403361]  [<ffffffff8124aa63>] sysfs_write_file+0xa3/0x100
[ 3189.403368]  [<ffffffff811ccbe0>] vfs_write+0xd0/0x1a0
[ 3189.403371]  [<ffffffff811ccdb4>] sys_write+0x54/0xa0
[ 3189.403375]  [<ffffffff81679c69>] system_call_fastpath+0x16/0x1b
[ 3189.403377] ---[ end trace 1e6cf85d0859c941 ]---
[ 3189.403398] BUG: unable to handle kernel NULL pointer dereference at 0000000000000018

This patch registers a new notifier with the cpu hotplug notify chain, and
updates sched_domains_numa_masks every time a cpu is onlined or offlined.

Signed-off-by: Tang Chen <tangchen@cn.fujitsu.com>
---
 kernel/sched/core.c |   62 +++++++++++++++++++++++++++++++++++++++++++++++++++
 1 files changed, 62 insertions(+), 0 deletions(-)

diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index fbf1fd0..66b36ab 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -6711,6 +6711,14 @@ static void sched_init_numa(void)
 	 * numbers.
 	 */
 
+	/*
+	 * Since sched_domains_numa_levels is also used in other functions as
+	 * an index for sched_domains_numa_masks[][], we should reset it here in
+	 * case sched_domains_numa_masks[][] fails to be initialized. And set it
+	 * to 'level' when sched_domains_numa_masks[][] is fully initialized.
+	 */
+	sched_domains_numa_levels = 0;
+
 	sched_domains_numa_masks = kzalloc(sizeof(void *) * level, GFP_KERNEL);
 	if (!sched_domains_numa_masks)
 		return;
@@ -6765,11 +6773,64 @@ static void sched_init_numa(void)
 	}
 
 	sched_domain_topology = tl;
+
+	sched_domains_numa_levels = level;
+}
+
+static void sched_domains_numa_masks_set(int cpu)
+{
+	int i, j;
+	int node = cpu_to_node(cpu);
+
+	for (i = 0; i < sched_domains_numa_levels; i++)
+		for (j = 0; j < nr_node_ids; j++)
+			if (node_distance(j, node) <= sched_domains_numa_distance[i])
+				cpumask_set_cpu(cpu, sched_domains_numa_masks[i][j]);
+}
+
+static void sched_domains_numa_masks_clear(int cpu)
+{
+	int i, j;
+	for (i = 0; i < sched_domains_numa_levels; i++)
+		for (j = 0; j < nr_node_ids; j++)
+			cpumask_clear_cpu(cpu, sched_domains_numa_masks[i][j]);
+}
+
+/*
+ * Update sched_domains_numa_masks[level][node] array when new cpus
+ * are onlined.
+ */
+static int sched_domains_numa_masks_update(struct notifier_block *nfb,
+					   unsigned long action,
+					   void *hcpu)
+{
+	int cpu = (int)hcpu;
+
+	switch (action & ~CPU_TASKS_FROZEN) {
+	case CPU_ONLINE:
+		sched_domains_numa_masks_set(cpu);
+		break;
+
+	case CPU_DEAD:
+		sched_domains_numa_masks_clear(cpu);
+		break;
+
+	default:
+		return NOTIFY_DONE;
+	}
+	return NOTIFY_OK;
+}
 #else
 static inline void sched_init_numa(void)
 {
 }
+
+static int sched_domains_numa_masks_update(struct notifier_block *nfb,
+					   unsigned long action,
+					   void *hcpu)
+{
+	return 0;
+}
 #endif /* CONFIG_NUMA */
 
 static int __sdt_alloc(const struct cpumask *cpu_map)
@@ -7218,6 +7279,7 @@ void __init sched_init_smp(void)
 	mutex_unlock(&sched_domains_mutex);
 	put_online_cpus();
 
+	hotcpu_notifier(sched_domains_numa_masks_update, CPU_PRI_SCHED_ACTIVE);
 	hotcpu_notifier(cpuset_cpu_active, CPU_PRI_CPUSET_ACTIVE);
 	hotcpu_notifier(cpuset_cpu_inactive, CPU_PRI_CPUSET_INACTIVE);

--
1.7.1
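To make the failure mode in the changelog concrete: sched_domains_numa_masks[i][j]
is meant to collect every cpu whose node lies within sched_domains_numa_distance[i]
of node j, and at init time each entry is built by OR-ing in the per-node cpu
masks of all nodes in range. The userspace sketch below (illustrative only; the
node count, distances, and the bitmask stand-in for struct cpumask are made up
for the example, not kernel code) mimics that indexing for a hypothetical 2-node,
4-cpu box where node 1 is still offline at init. Its cpus end up in no mask at
all, and since nothing revisits the array, they stay missing after they come
online; that is the hole the patch plugs.

	/*
	 * Userspace illustration of how the init-time loop fills
	 * sched_domains_numa_masks[level][node]. Hypothetical box:
	 * 2 nodes, 4 cpus, distances 10 (local) / 20 (remote),
	 * hence 2 levels. cpumasks shrunk to plain unsigned ints.
	 */
	#include <stdio.h>

	#define NR_NODES 2

	static const int node_distance[NR_NODES][NR_NODES] = {
		{ 10, 20 },
		{ 20, 10 },
	};
	static const int numa_distance[2] = { 10, 20 };	/* the two "levels" */

	int main(void)
	{
		/* bit n = cpu n; pretend node 1 was offline at init: its map is 0 */
		unsigned int node_cpumask[NR_NODES] = { 0x3 /* cpus 0,1 */, 0x0 };
		unsigned int masks[2][NR_NODES] = { { 0 } };
		int i, j, k;

		for (i = 0; i < 2; i++)
			for (j = 0; j < NR_NODES; j++)
				for (k = 0; k < NR_NODES; k++)
					if (node_distance[j][k] <= numa_distance[i])
						masks[i][j] |= node_cpumask[k];

		/* cpus 2,3 (node 1) appear nowhere: when they online later,
		 * the zeroed entries are what build_overlap_sched_groups()
		 * trips over */
		for (i = 0; i < 2; i++)
			for (j = 0; j < NR_NODES; j++)
				printf("level %d, node %d: 0x%x\n", i, j, masks[i][j]);
		return 0;
	}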
* Re: [PATCH] Update sched_domains_numa_masks when new cpus are onlined.

From: Tang Chen @ 2012-09-24 5:52 UTC
To: mingo, peterz; +Cc: linux-kernel, x86, linux-numa, wency, Tang Chen

Hi,

Would you please help to review this patch? Thanks. :)

On 09/18/2012 06:12 PM, Tang Chen wrote:
> Once the array sched_domains_numa_masks is defined, it is never updated.
> When a new cpu on a new node is onlined, the corresponding member in
> sched_domains_numa_masks is not initialized, and all the masks are 0.
> As a result, build_overlap_sched_groups() will initialize a NULL
> sched_group for the new cpu on the new node, which will lead to a kernel panic.
>
> <snip>
>
> This patch registers a new notifier with the cpu hotplug notify chain, and
> updates sched_domains_numa_masks every time a cpu is onlined or offlined.
>
> Signed-off-by: Tang Chen <tangchen@cn.fujitsu.com>
>
> <snip>
* Re: [PATCH] Update sched_domains_numa_masks when new cpus are onlined.

From: Peter Zijlstra @ 2012-09-24 9:38 UTC
To: Tang Chen; +Cc: linux-kernel, x86, linux-numa, wency, mingo, Thomas Gleixner

Why are you cc'ing x86 and numa folks but not a single scheduler person
when you're patching scheduler stuff?

On Tue, 2012-09-18 at 18:12 +0800, Tang Chen wrote:
> Once the array sched_domains_numa_masks is defined, it is never updated.
> When a new cpu on a new node is onlined,

Hmm, so there's hardware where you can boot with smaller nr_node_ids
than possible.. I guess that makes sense.

> the corresponding member in
> sched_domains_numa_masks is not initialized, and all the masks are 0.
> As a result, build_overlap_sched_groups() will initialize a NULL
> sched_group for the new cpu on the new node, which will lead to a kernel panic.

<snip>

> This patch registers a new notifier with the cpu hotplug notify chain, and
> updates sched_domains_numa_masks every time a cpu is onlined or offlined.

Urgh, more hotplug notifiers.. ah well.

> Signed-off-by: Tang Chen <tangchen@cn.fujitsu.com>
> ---
>  kernel/sched/core.c |   62 +++++++++++++++++++++++++++++++++++++++++++++++++++
>  1 files changed, 62 insertions(+), 0 deletions(-)
>
> diff --git a/kernel/sched/core.c b/kernel/sched/core.c
> index fbf1fd0..66b36ab 100644
> --- a/kernel/sched/core.c
> +++ b/kernel/sched/core.c
> @@ -6711,6 +6711,14 @@ static void sched_init_numa(void)
>  	 * numbers.
>  	 */
>
> +	/*
> +	 * Since sched_domains_numa_levels is also used in other functions as
> +	 * an index for sched_domains_numa_masks[][], we should reset it here in
> +	 * case sched_domains_numa_masks[][] fails to be initialized. And set it
> +	 * to 'level' when sched_domains_numa_masks[][] is fully initialized.
> +	 */
> +	sched_domains_numa_levels = 0;

This isn't strictly needed for this patch right? I don't see anybody
calling sched_init_numa() a second time (although they should)..

>  	sched_domains_numa_masks = kzalloc(sizeof(void *) * level, GFP_KERNEL);
>  	if (!sched_domains_numa_masks)
>  		return;
> @@ -6765,11 +6773,64 @@ static void sched_init_numa(void)
>  	}
>
>  	sched_domain_topology = tl;
> +
> +	sched_domains_numa_levels = level;
> +}
> +
> +static void sched_domains_numa_masks_set(int cpu)
> +{
> +	int i, j;
> +	int node = cpu_to_node(cpu);
> +
> +	for (i = 0; i < sched_domains_numa_levels; i++)
> +		for (j = 0; j < nr_node_ids; j++)
> +			if (node_distance(j, node) <= sched_domains_numa_distance[i])
> +				cpumask_set_cpu(cpu, sched_domains_numa_masks[i][j]);
> +}
> +
> +static void sched_domains_numa_masks_clear(int cpu)
> +{
> +	int i, j;
> +	for (i = 0; i < sched_domains_numa_levels; i++)
> +		for (j = 0; j < nr_node_ids; j++)
> +			cpumask_clear_cpu(cpu, sched_domains_numa_masks[i][j]);
> +}

Aside from the coding style nit of wanting braces over multi-line
statements even though not strictly required, I really don't see how
this could possibly be right..

We do this because nr_node_ids changed, right? This means the entire
distance table grew/shrunk, which means we should do the level scan
again.

> @@ -7218,6 +7279,7 @@ void __init sched_init_smp(void)
>  	mutex_unlock(&sched_domains_mutex);
>  	put_online_cpus();
>
> +	hotcpu_notifier(sched_domains_numa_masks_update, CPU_PRI_SCHED_ACTIVE);
>  	hotcpu_notifier(cpuset_cpu_active, CPU_PRI_CPUSET_ACTIVE);
>  	hotcpu_notifier(cpuset_cpu_inactive, CPU_PRI_CPUSET_INACTIVE);

OK, so you really want your notifier to run before cpuset_cpu_active
because otherwise you get that crash, yet you fail with the whole order
thing.. You should not _ever_ rely on registration order.
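For readers following along, the "level scan" Peter refers to is the pass
at the top of sched_init_numa() that collects the distinct node_distance()
values; each distinct distance becomes one level. A simplified paraphrase
of its effect (the in-tree loop of that era is organized differently and
also yields the distances sorted in ascending order):

	/*
	 * Simplified paraphrase of the sched_init_numa() level scan;
	 * illustrative, not the in-tree code. Record each distinct
	 * distance once; 'level' ends up as the number of NUMA levels.
	 */
	int i, j, k, level = 0;

	for (i = 0; i < nr_node_ids; i++) {
		for (j = 0; j < nr_node_ids; j++) {
			int d = node_distance(i, j);
			bool seen = false;

			for (k = 0; k < level; k++)
				if (sched_domains_numa_distance[k] == d)
					seen = true;

			if (!seen)
				sched_domains_numa_distance[level++] = d;
		}
	}

His point is that if the distance table could change after boot, this scan
(not just the per-cpu mask bits) would have to be redone.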
* Re: [PATCH] Update sched_domains_numa_masks when new cpus are onlined.

From: Srivatsa S. Bhat @ 2012-09-24 9:57 UTC
To: Peter Zijlstra
Cc: Tang Chen, linux-kernel, x86, linux-numa, wency, mingo, Thomas Gleixner

On 09/24/2012 03:08 PM, Peter Zijlstra wrote:
>> +	hotcpu_notifier(sched_domains_numa_masks_update, CPU_PRI_SCHED_ACTIVE);
>>  	hotcpu_notifier(cpuset_cpu_active, CPU_PRI_CPUSET_ACTIVE);
>>  	hotcpu_notifier(cpuset_cpu_inactive, CPU_PRI_CPUSET_INACTIVE);
>
> OK, so you really want your notifier to run before cpuset_cpu_active
> because otherwise you get that crash, yet you fail with the whole order
> thing.. You should not _ever_ rely on registration order.
>

IMHO he isn't relying on registration order.. He uses the CPU_PRI_SCHED_ACTIVE
priority to ensure that the ordering of callbacks is right, doesn't he?

Regards,
Srivatsa S. Bhat
* Re: [PATCH] Update sched_domains_numa_masks when new cpus are onlined.

From: Peter Zijlstra @ 2012-09-24 10:12 UTC
To: Srivatsa S. Bhat
Cc: Tang Chen, linux-kernel, x86, linux-numa, wency, mingo, Thomas Gleixner

On Mon, 2012-09-24 at 15:27 +0530, Srivatsa S. Bhat wrote:
> On 09/24/2012 03:08 PM, Peter Zijlstra wrote:
>>> +	hotcpu_notifier(sched_domains_numa_masks_update, CPU_PRI_SCHED_ACTIVE);
>>>  	hotcpu_notifier(cpuset_cpu_active, CPU_PRI_CPUSET_ACTIVE);
>>>  	hotcpu_notifier(cpuset_cpu_inactive, CPU_PRI_CPUSET_INACTIVE);
>>
>> OK, so you really want your notifier to run before cpuset_cpu_active
>> because otherwise you get that crash, yet you fail with the whole order
>> thing.. You should not _ever_ rely on registration order.
>>
>
> IMHO he isn't relying on registration order.. He uses the CPU_PRI_SCHED_ACTIVE
> priority to ensure that the ordering of callbacks is right, doesn't he?

Oh argh indeed. I can't read :/
* Re: [PATCH] Update sched_domains_numa_masks when new cpus are onlined.

From: Srivatsa S. Bhat @ 2012-09-24 10:17 UTC
To: Peter Zijlstra
Cc: Tang Chen, linux-kernel, x86, linux-numa, wency, mingo, Thomas Gleixner

On 09/24/2012 03:42 PM, Peter Zijlstra wrote:
> On Mon, 2012-09-24 at 15:27 +0530, Srivatsa S. Bhat wrote:
>> On 09/24/2012 03:08 PM, Peter Zijlstra wrote:
>>>> +	hotcpu_notifier(sched_domains_numa_masks_update, CPU_PRI_SCHED_ACTIVE);
>>>>  	hotcpu_notifier(cpuset_cpu_active, CPU_PRI_CPUSET_ACTIVE);
>>>>  	hotcpu_notifier(cpuset_cpu_inactive, CPU_PRI_CPUSET_INACTIVE);
>>>
>>> OK, so you really want your notifier to run before cpuset_cpu_active
>>> because otherwise you get that crash, yet you fail with the whole order
>>> thing.. You should not _ever_ rely on registration order.
>>>
>>
>> IMHO he isn't relying on registration order.. He uses the CPU_PRI_SCHED_ACTIVE
>> priority to ensure that the ordering of callbacks is right, doesn't he?
>
> Oh argh indeed. I can't read :/
>

;-)

Regards,
Srivatsa S. Bhat
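For reference, the mechanism Srivatsa points at: hotcpu_notifier() bakes the
given priority into a notifier_block, and notifier chains are kept sorted so
higher-priority callbacks run first. Paraphrased from the v3.6-era headers
(illustrative, not verbatim; see include/linux/cpu.h and kernel/notifier.c of
that kernel for the exact definitions):

	/*
	 * Paraphrased from v3.6-era <linux/cpu.h>; illustrative, not
	 * verbatim. notifier_chain_register() keeps the chain sorted by
	 * .priority in descending order, so on CPU_ONLINE the NUMA-mask
	 * update (INT_MAX) runs before cpuset_cpu_active() (INT_MAX - 1),
	 * no matter which hotcpu_notifier() call happened first.
	 */
	enum {
		CPU_PRI_SCHED_ACTIVE	= INT_MAX,
		CPU_PRI_CPUSET_ACTIVE	= INT_MAX - 1,
		CPU_PRI_SCHED_INACTIVE	= INT_MIN + 1,
		CPU_PRI_CPUSET_INACTIVE	= INT_MIN,
	};

	#define hotcpu_notifier(fn, pri) do {				\
		static struct notifier_block fn##_nb =			\
			{ .notifier_call = fn, .priority = pri };	\
		register_cpu_notifier(&fn##_nb);			\
	} while (0)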
* Re: [PATCH] Update sched_domains_numa_masks when new cpus are onlined.

From: Tang Chen @ 2012-09-25 2:39 UTC
To: Srivatsa S. Bhat
Cc: Peter Zijlstra, linux-kernel, x86, linux-numa, wency, mingo, Thomas Gleixner

On 09/24/2012 05:57 PM, Srivatsa S. Bhat wrote:
> On 09/24/2012 03:08 PM, Peter Zijlstra wrote:
>>> +	hotcpu_notifier(sched_domains_numa_masks_update, CPU_PRI_SCHED_ACTIVE);
>>>  	hotcpu_notifier(cpuset_cpu_active, CPU_PRI_CPUSET_ACTIVE);
>>>  	hotcpu_notifier(cpuset_cpu_inactive, CPU_PRI_CPUSET_INACTIVE);
>>
>> OK, so you really want your notifier to run before cpuset_cpu_active
>> because otherwise you get that crash, yet you fail with the whole order
>> thing.. You should not _ever_ rely on registration order.
>>
>
> IMHO he isn't relying on registration order.. He uses the CPU_PRI_SCHED_ACTIVE
> priority to ensure that the ordering of callbacks is right, doesn't he?

Yes, that's right. Thanks. :)

> Regards,
> Srivatsa S. Bhat
* Re: [PATCH] Update sched_domains_numa_masks when new cpus are onlined.

From: Tang Chen @ 2012-09-25 2:39 UTC
To: Peter Zijlstra, Srivatsa S. Bhat
Cc: linux-kernel, x86, linux-numa, wency, mingo, Thomas Gleixner

Hi Peter,

Thanks for commenting on this patch. :)

On 09/24/2012 05:38 PM, Peter Zijlstra wrote:
> Why are you cc'ing x86 and numa folks but not a single scheduler person
> when you're patching scheduler stuff?

First of all, I'm sorry for this. I thought it was a NUMA or memory
related problem, and I got your address from the get_maintainer.pl script.

> On Tue, 2012-09-18 at 18:12 +0800, Tang Chen wrote:
>> Once the array sched_domains_numa_masks is defined, it is never updated.
>> When a new cpu on a new node is onlined,
>
> Hmm, so there's hardware where you can boot with smaller nr_node_ids
> than possible.. I guess that makes sense.

Yes. nr_node_ids represents the max number of nodes the system supports,
and usually we don't have that many.

>> the corresponding member in
>> sched_domains_numa_masks is not initialized, and all the masks are 0.
>> As a result, build_overlap_sched_groups() will initialize a NULL
>> sched_group for the new cpu on the new node, which will lead to a kernel panic.
>
> <snip>
>
>> This patch registers a new notifier with the cpu hotplug notify chain, and
>> updates sched_domains_numa_masks every time a cpu is onlined or offlined.
>
> Urgh, more hotplug notifiers.. ah well.
>
>> Signed-off-by: Tang Chen <tangchen@cn.fujitsu.com>
>> ---
>>  kernel/sched/core.c |   62 +++++++++++++++++++++++++++++++++++++++++++++++++++
>>  1 files changed, 62 insertions(+), 0 deletions(-)
>>
>> diff --git a/kernel/sched/core.c b/kernel/sched/core.c
>> index fbf1fd0..66b36ab 100644
>> --- a/kernel/sched/core.c
>> +++ b/kernel/sched/core.c
>> @@ -6711,6 +6711,14 @@ static void sched_init_numa(void)
>>  	 * numbers.
>>  	 */
>>
>> +	/*
>> +	 * Since sched_domains_numa_levels is also used in other functions as
>> +	 * an index for sched_domains_numa_masks[][], we should reset it here in
>> +	 * case sched_domains_numa_masks[][] fails to be initialized. And set it
>> +	 * to 'level' when sched_domains_numa_masks[][] is fully initialized.
>> +	 */
>> +	sched_domains_numa_levels = 0;
>
> This isn't strictly needed for this patch right? I don't see anybody
> calling sched_init_numa() a second time (although they should)..

Indeed, sched_init_numa() won't be called by anyone else. But I clear it
here in case the following memory allocation fails.

Actually, I want to use sched_domains_numa_levels to iterate all the numa
levels, such as in sched_domains_numa_masks_set().

Suppose sched_domains_numa_levels is 10. If allocating memory for
sched_domains_numa_masks[5] fails in sched_init_numa(), it will just
return, but only the first 5 members of sched_domains_numa_masks[] will
be valid. It could be dangerous.

So I set sched_domains_numa_levels to 0, and reset it to level at the end
of sched_init_numa(); see below.

>>  	sched_domains_numa_masks = kzalloc(sizeof(void *) * level, GFP_KERNEL);
>>  	if (!sched_domains_numa_masks)
>>  		return;
>> @@ -6765,11 +6773,64 @@ static void sched_init_numa(void)
>>  	}
>>
>>  	sched_domain_topology = tl;
>> +
>> +	sched_domains_numa_levels = level;

And I set it to level here again.

>> +}
>> +
>> +static void sched_domains_numa_masks_set(int cpu)
>> +{
>> +	int i, j;
>> +	int node = cpu_to_node(cpu);
>> +
>> +	for (i = 0; i < sched_domains_numa_levels; i++)
>> +		for (j = 0; j < nr_node_ids; j++)
>> +			if (node_distance(j, node) <= sched_domains_numa_distance[i])
>> +				cpumask_set_cpu(cpu, sched_domains_numa_masks[i][j]);
>> +}
>> +
>> +static void sched_domains_numa_masks_clear(int cpu)
>> +{
>> +	int i, j;
>> +	for (i = 0; i < sched_domains_numa_levels; i++)
>> +		for (j = 0; j < nr_node_ids; j++)
>> +			cpumask_clear_cpu(cpu, sched_domains_numa_masks[i][j]);
>> +}
>
> Aside from the coding style nit of wanting braces over multi-line
> statements even though not strictly required, I really don't see how
> this could possibly be right..

Thanks for telling me that. I'll fix the coding style. :)

> We do this because nr_node_ids changed, right? This means the entire
> distance table grew/shrunk, which means we should do the level scan
> again.

It seems that nr_node_ids will not change once the system is up. I'm not
quite sure; if I am wrong, please tell me. :)

I found that nr_node_ids is defined as MAX_NUMNODES in mm/page_alloc.c:

	int nr_node_ids __read_mostly = MAX_NUMNODES;

And all the functions that change this value are __init functions. So I
think nr_node_ids is just the number of possible nodes in the system, and
it won't change after system initialization has finished.

>> @@ -7218,6 +7279,7 @@ void __init sched_init_smp(void)
>>  	mutex_unlock(&sched_domains_mutex);
>>  	put_online_cpus();
>>
>> +	hotcpu_notifier(sched_domains_numa_masks_update, CPU_PRI_SCHED_ACTIVE);
>>  	hotcpu_notifier(cpuset_cpu_active, CPU_PRI_CPUSET_ACTIVE);
>>  	hotcpu_notifier(cpuset_cpu_inactive, CPU_PRI_CPUSET_INACTIVE);
>
> OK, so you really want your notifier to run before cpuset_cpu_active
> because otherwise you get that crash, yet you fail with the whole order
> thing.. You should not _ever_ rely on registration order.
* Re: [PATCH] Update sched_domains_numa_masks when new cpus are onlined.

From: Peter Zijlstra @ 2012-09-25 10:33 UTC
To: Tang Chen
Cc: Srivatsa S. Bhat, linux-kernel, x86, linux-numa, wency, mingo, Thomas Gleixner

On Tue, 2012-09-25 at 10:39 +0800, Tang Chen wrote:
> >> @@ -6765,11 +6773,64 @@ static void sched_init_numa(void)
> >>  	}
> >>
> >>  	sched_domain_topology = tl;
> >> +
> >> +	sched_domains_numa_levels = level;
>
> And I set it to level here again.
>

But it's already set there.. it's set every time we find a new level.
* Re: [PATCH] Update sched_domains_numa_masks when new cpus are onlined.

From: Tang Chen @ 2012-09-25 11:45 UTC
To: Peter Zijlstra
Cc: Srivatsa S. Bhat, linux-kernel, x86, linux-numa, wency, mingo, Thomas Gleixner

Hi Peter,

Sorry about the confusing log, and thanks for your patience. :)

Here, I want to say something more about sched_domains_numa_levels to
make myself clearer. :)

Let's have an example here.

sched_init_numa()
{
	...
	// A loop sets sched_domains_numa_levels to level.------------1

	// I set sched_domains_numa_levels to 0.
	sched_domains_numa_levels = 0;--------------------------------2

	// A loop allocating memory for sched_domains_numa_masks[][]
	for (i = 0; i < level; i++) {
		......
		// Allocate memory for sched_domains_numa_masks[i]----3
		......
	}
	......

	// I reset sched_domains_numa_levels to level.
	sched_domains_numa_levels = level;----------------------------4
}

// A new function I added.
static void sched_domains_numa_masks_clear(int cpu)
{
	int i, j;
	for (i = 0; i < sched_domains_numa_levels; i++)---------------5
		for (j = 0; j < nr_node_ids; j++)
			cpumask_clear_cpu(cpu, sched_domains_numa_masks[i][j]);
}

Suppose level is 10; at step 1, sched_domains_numa_levels is 10. If I
didn't set sched_domains_numa_levels to 0 at step 2, it would be 10 all
the time.

If the memory allocation at step 3 fails when i = 5, the array
sched_domains_numa_masks[][] will only have 5 valid members, while
sched_domains_numa_levels is still 10.

As you see, I added 2 functions that use sched_domains_numa_levels to
iterate sched_domains_numa_masks[][], such as at step 5. In this case,
the iteration would run past the 5 allocated members and access
unallocated memory. This could be dangerous.

So, I set sched_domains_numa_levels to 0 at step 2. This way, even if
any memory allocation at step 3 fails and sched_init_numa() returns,
anyone who uses sched_domains_numa_levels (which is 0) won't go wrong.

I'm not sure if this is the best way to settle this problem. If you have
a better idea, please tell me. Thanks. :)

On 09/25/2012 06:33 PM, Peter Zijlstra wrote:
> On Tue, 2012-09-25 at 10:39 +0800, Tang Chen wrote:
>>>> @@ -6765,11 +6773,64 @@ static void sched_init_numa(void)
>>>>  	}
>>>>
>>>>  	sched_domain_topology = tl;
>>>> +
>>>> +	sched_domains_numa_levels = level;
>>
>> And I set it to level here again.
>>
> But it's already set there.. it's set every time we find a new level.
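Condensed, the idiom Tang describes is "publish the array size only after
every element exists", so a half-built sched_domains_numa_masks[][] is never
reachable through sched_domains_numa_levels. A sketch of that control flow
(illustrative; alloc_level_masks() is a hypothetical stand-in for the real
per-level allocation loop):

	static void sched_init_numa(void)
	{
		int i, level;

		/* ... distance scan computes 'level' ... */

		sched_domains_numa_levels = 0;	/* readers now see an empty array */

		sched_domains_numa_masks = kzalloc(sizeof(void *) * level, GFP_KERNEL);
		if (!sched_domains_numa_masks)
			return;			/* levels stays 0: iterators are safe */

		for (i = 0; i < level; i++) {
			/* hypothetical helper standing in for the real allocation */
			sched_domains_numa_masks[i] = alloc_level_masks(i);
			if (!sched_domains_numa_masks[i])
				return;		/* partial array, but still unpublished */
		}

		/* ... build the topology list ... */

		sched_domains_numa_levels = level;	/* publish: fully initialized */
	}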
* Re: [PATCH] Update sched_domains_numa_masks when new cpus are onlined.

From: Peter Zijlstra @ 2012-09-25 11:50 UTC
To: Tang Chen
Cc: Srivatsa S. Bhat, linux-kernel, x86, linux-numa, wency, mingo, Thomas Gleixner

On Tue, 2012-09-25 at 19:45 +0800, Tang Chen wrote:

> Let's have an example here.
>
> sched_init_numa()
> {
> 	...
> 	// A loop sets sched_domains_numa_levels to level.------------1
>
> 	// I set sched_domains_numa_levels to 0.
> 	sched_domains_numa_levels = 0;--------------------------------2

Ah, ok, so this 'idiot in a hurry' (aka. me) placed the hunk wrong and
thought it was at the start of the function.

> 	// A loop allocating memory for sched_domains_numa_masks[][]
> 	for (i = 0; i < level; i++) {
> 		......
> 		// Allocate memory for sched_domains_numa_masks[i]----3
> 		......
> 	}
> 	......
>
> 	// I reset sched_domains_numa_levels to level.
> 	sched_domains_numa_levels = level;----------------------------4
> }

Yes, this makes sense. It might be best to have this as a separate patch.
* Re: [PATCH] Update sched_domains_numa_masks when new cpus are onlined.

From: Tang Chen @ 2012-09-25 12:02 UTC
To: Peter Zijlstra
Cc: Srivatsa S. Bhat, linux-kernel, x86, linux-numa, wency, mingo, Thomas Gleixner

Hi Peter,

I will make a v2 patch-set following your comments and resend it soon. :)

Thanks. :)

On 09/25/2012 07:50 PM, Peter Zijlstra wrote:
> On Tue, 2012-09-25 at 19:45 +0800, Tang Chen wrote:
>
> <snip>
>
> Yes, this makes sense. It might be best to have this as a separate patch.
* Re: [PATCH] Update sched_domains_numa_masks when new cpus are onlined.

From: Peter Zijlstra @ 2012-09-25 11:07 UTC
To: Tang Chen
Cc: Srivatsa S. Bhat, linux-kernel, x86, linux-numa, wency, mingo, Thomas Gleixner

On Tue, 2012-09-25 at 10:39 +0800, Tang Chen wrote:
> > We do this because nr_node_ids changed, right? This means the entire
> > distance table grew/shrunk, which means we should do the level scan
> > again.
>
> It seems that nr_node_ids will not change once the system is up. I'm not
> quite sure; if I am wrong, please tell me. :)

Ah, right you are.. So the problem is that cpumask_of_node() doesn't
contain offline cpus, and we might not boot the machine with all cpus
present. In that case I guess the suggested hotplug hooks are fine. It's
just that your changelog was confusing.
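That crux is visible in the mask-building loop itself: each entry is the
union of cpumask_of_node() over all nodes in range, and cpumask_of_node()
reflects only the cpus set up at that moment. Schematically (paraphrased
from the v3.6-era sched_init_numa(); not verbatim):

	/* Paraphrased from v3.6-era sched_init_numa(); illustrative. */
	for (k = 0; k < nr_node_ids; k++) {
		if (node_distance(j, k) > sched_domains_numa_distance[i])
			continue;
		/* cpus that come online later are missing from this map,
		 * hence the hotplug notifier added by this patch */
		cpumask_or(mask, mask, cpumask_of_node(k));
	}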