* [PATCH] Update sched_domains_numa_masks when new cpus are onlined.

From: Tang Chen @ 2012-09-18 10:12 UTC
To: linux-kernel, x86, linux-numa; +Cc: wency, tangchen

Once the array sched_domains_numa_masks is defined, it is never updated.
When a new cpu on a new node is onlined, the corresponding member in
sched_domains_numa_masks is not initialized, and all the masks are 0.
As a result, build_overlap_sched_groups() will initialize a NULL
sched_group for the new cpu on the new node, which will lead to a kernel panic:

[ 3189.403280] Call Trace:
[ 3189.403286]  [<ffffffff8106c36f>] warn_slowpath_common+0x7f/0xc0
[ 3189.403289]  [<ffffffff8106c3ca>] warn_slowpath_null+0x1a/0x20
[ 3189.403292]  [<ffffffff810b1d57>] build_sched_domains+0x467/0x470
[ 3189.403296]  [<ffffffff810b2067>] partition_sched_domains+0x307/0x510
[ 3189.403299]  [<ffffffff810b1ea2>] ? partition_sched_domains+0x142/0x510
[ 3189.403305]  [<ffffffff810fcc93>] cpuset_update_active_cpus+0x83/0x90
[ 3189.403308]  [<ffffffff810b22a8>] cpuset_cpu_active+0x38/0x70
[ 3189.403316]  [<ffffffff81674b87>] notifier_call_chain+0x67/0x150
[ 3189.403320]  [<ffffffff81664647>] ? native_cpu_up+0x18a/0x1b5
[ 3189.403328]  [<ffffffff810a044e>] __raw_notifier_call_chain+0xe/0x10
[ 3189.403333]  [<ffffffff81070470>] __cpu_notify+0x20/0x40
[ 3189.403337]  [<ffffffff8166663e>] _cpu_up+0xe9/0x131
[ 3189.403340]  [<ffffffff81666761>] cpu_up+0xdb/0xee
[ 3189.403348]  [<ffffffff8165667c>] store_online+0x9c/0xd0
[ 3189.403355]  [<ffffffff81437640>] dev_attr_store+0x20/0x30
[ 3189.403361]  [<ffffffff8124aa63>] sysfs_write_file+0xa3/0x100
[ 3189.403368]  [<ffffffff811ccbe0>] vfs_write+0xd0/0x1a0
[ 3189.403371]  [<ffffffff811ccdb4>] sys_write+0x54/0xa0
[ 3189.403375]  [<ffffffff81679c69>] system_call_fastpath+0x16/0x1b
[ 3189.403377] ---[ end trace 1e6cf85d0859c941 ]---
[ 3189.403398] BUG: unable to handle kernel NULL pointer dereference at 0000000000000018

This patch registers a new notifier with the cpu hotplug notify chain, and
updates sched_domains_numa_masks every time a cpu is onlined or offlined.

Signed-off-by: Tang Chen <tangchen@cn.fujitsu.com>
---
 kernel/sched/core.c |   62 +++++++++++++++++++++++++++++++++++++++++++++++++++
 1 files changed, 62 insertions(+), 0 deletions(-)

diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index fbf1fd0..66b36ab 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -6711,6 +6711,14 @@ static void sched_init_numa(void)
 	 * numbers.
 	 */
 
+	/*
+	 * Since sched_domains_numa_levels is also used in other functions as
+	 * an index for sched_domains_numa_masks[][], we should reset it here in
+	 * case sched_domains_numa_masks[][] fails to be initialized. And set it
+	 * to 'level' when sched_domains_numa_masks[][] is fully initialized.
+	 */
+	sched_domains_numa_levels = 0;
+
 	sched_domains_numa_masks = kzalloc(sizeof(void *) * level, GFP_KERNEL);
 	if (!sched_domains_numa_masks)
 		return;
@@ -6765,11 +6773,64 @@ static void sched_init_numa(void)
 	}
 
 	sched_domain_topology = tl;
+
+	sched_domains_numa_levels = level;
+}
+
+static void sched_domains_numa_masks_set(int cpu)
+{
+	int i, j;
+	int node = cpu_to_node(cpu);
+
+	for (i = 0; i < sched_domains_numa_levels; i++)
+		for (j = 0; j < nr_node_ids; j++)
+			if (node_distance(j, node) <= sched_domains_numa_distance[i])
+				cpumask_set_cpu(cpu, sched_domains_numa_masks[i][j]);
+}
+
+static void sched_domains_numa_masks_clear(int cpu)
+{
+	int i, j;
+	for (i = 0; i < sched_domains_numa_levels; i++)
+		for (j = 0; j < nr_node_ids; j++)
+			cpumask_clear_cpu(cpu, sched_domains_numa_masks[i][j]);
+}
+
+/*
+ * Update sched_domains_numa_masks[level][node] array when new cpus
+ * are onlined.
+ */
+static int sched_domains_numa_masks_update(struct notifier_block *nfb,
+					   unsigned long action,
+					   void *hcpu)
+{
+	int cpu = (int)hcpu;
+
+	switch (action & ~CPU_TASKS_FROZEN) {
+	case CPU_ONLINE:
+		sched_domains_numa_masks_set(cpu);
+		break;
+
+	case CPU_DEAD:
+		sched_domains_numa_masks_clear(cpu);
+		break;
+
+	default:
+		return NOTIFY_DONE;
+	}
+	return NOTIFY_OK;
+}
 #else
 static inline void sched_init_numa(void)
 {
 }
+
+static int sched_domains_numa_masks_update(struct notifier_block *nfb,
+					   unsigned long action,
+					   void *hcpu)
+{
+	return 0;
+}
 #endif /* CONFIG_NUMA */
 
 static int __sdt_alloc(const struct cpumask *cpu_map)
@@ -7218,6 +7279,7 @@ void __init sched_init_smp(void)
 	mutex_unlock(&sched_domains_mutex);
 	put_online_cpus();
 
+	hotcpu_notifier(sched_domains_numa_masks_update, CPU_PRI_SCHED_ACTIVE);
 	hotcpu_notifier(cpuset_cpu_active, CPU_PRI_CPUSET_ACTIVE);
 	hotcpu_notifier(cpuset_cpu_inactive, CPU_PRI_CPUSET_INACTIVE);

--
1.7.1
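To make the failure mode in the changelog concrete: sched_domains_numa_masks[i][j]
is meant to collect every cpu whose node lies within sched_domains_numa_distance[i]
of node j, and at init time each entry is built by OR-ing in the per-node cpu
masks of all nodes in range. The userspace sketch below (illustrative only; the
node count, distances, and the bitmask stand-in for struct cpumask are made up
for the example, not kernel code) mimics that indexing for a hypothetical 2-node,
4-cpu box where node 1 is still offline at init. Its cpus end up in no mask at
all, and since nothing revisits the array, they stay missing after they come
online; that is the hole the patch plugs.

	/*
	 * Userspace illustration of how the init-time loop fills
	 * sched_domains_numa_masks[level][node]. Hypothetical box:
	 * 2 nodes, 4 cpus, distances 10 (local) / 20 (remote),
	 * hence 2 levels. cpumasks shrunk to plain unsigned ints.
	 */
	#include <stdio.h>

	#define NR_NODES 2

	static const int node_distance[NR_NODES][NR_NODES] = {
		{ 10, 20 },
		{ 20, 10 },
	};
	static const int numa_distance[2] = { 10, 20 };	/* the two "levels" */

	int main(void)
	{
		/* bit n = cpu n; pretend node 1 was offline at init: its map is 0 */
		unsigned int node_cpumask[NR_NODES] = { 0x3 /* cpus 0,1 */, 0x0 };
		unsigned int masks[2][NR_NODES] = { { 0 } };
		int i, j, k;

		for (i = 0; i < 2; i++)
			for (j = 0; j < NR_NODES; j++)
				for (k = 0; k < NR_NODES; k++)
					if (node_distance[j][k] <= numa_distance[i])
						masks[i][j] |= node_cpumask[k];

		/* cpus 2,3 (node 1) appear nowhere: when they online later,
		 * the zeroed entries are what build_overlap_sched_groups()
		 * trips over */
		for (i = 0; i < 2; i++)
			for (j = 0; j < NR_NODES; j++)
				printf("level %d, node %d: 0x%x\n", i, j, masks[i][j]);
		return 0;
	}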
* Re: [PATCH] Update sched_domains_numa_masks when new cpus are onlined.

From: Tang Chen @ 2012-09-24 5:52 UTC
To: mingo, peterz; +Cc: linux-kernel, x86, linux-numa, wency, Tang Chen

Hi,

Would you please help to review this patch? Thanks. :)

On 09/18/2012 06:12 PM, Tang Chen wrote:
> Once the array sched_domains_numa_masks is defined, it is never updated.
> When a new cpu on a new node is onlined, the corresponding member in
> sched_domains_numa_masks is not initialized, and all the masks are 0.
> As a result, build_overlap_sched_groups() will initialize a NULL
> sched_group for the new cpu on the new node, which will lead to a kernel panic.
>
> <snip>
>
> This patch registers a new notifier with the cpu hotplug notify chain, and
> updates sched_domains_numa_masks every time a cpu is onlined or offlined.
>
> Signed-off-by: Tang Chen <tangchen@cn.fujitsu.com>
>
> <snip>
* Re: [PATCH] Update sched_domains_numa_masks when new cpus are onlined.

From: Peter Zijlstra @ 2012-09-24 9:38 UTC
To: Tang Chen; +Cc: linux-kernel, x86, linux-numa, wency, mingo, Thomas Gleixner

Why are you cc'ing x86 and numa folks but not a single scheduler person
when you're patching scheduler stuff?

On Tue, 2012-09-18 at 18:12 +0800, Tang Chen wrote:
> Once the array sched_domains_numa_masks is defined, it is never updated.
> When a new cpu on a new node is onlined,

Hmm, so there's hardware where you can boot with smaller nr_node_ids
than possible.. I guess that makes sense.

> the corresponding member in
> sched_domains_numa_masks is not initialized, and all the masks are 0.
> As a result, build_overlap_sched_groups() will initialize a NULL
> sched_group for the new cpu on the new node, which will lead to a kernel panic.

<snip>

> This patch registers a new notifier with the cpu hotplug notify chain, and
> updates sched_domains_numa_masks every time a cpu is onlined or offlined.

Urgh, more hotplug notifiers.. ah well.

> Signed-off-by: Tang Chen <tangchen@cn.fujitsu.com>
> ---
>  kernel/sched/core.c |   62 +++++++++++++++++++++++++++++++++++++++++++++++++++
>  1 files changed, 62 insertions(+), 0 deletions(-)
>
> diff --git a/kernel/sched/core.c b/kernel/sched/core.c
> index fbf1fd0..66b36ab 100644
> --- a/kernel/sched/core.c
> +++ b/kernel/sched/core.c
> @@ -6711,6 +6711,14 @@ static void sched_init_numa(void)
>  	 * numbers.
>  	 */
>
> +	/*
> +	 * Since sched_domains_numa_levels is also used in other functions as
> +	 * an index for sched_domains_numa_masks[][], we should reset it here in
> +	 * case sched_domains_numa_masks[][] fails to be initialized. And set it
> +	 * to 'level' when sched_domains_numa_masks[][] is fully initialized.
> +	 */
> +	sched_domains_numa_levels = 0;

This isn't strictly needed for this patch right? I don't see anybody
calling sched_init_numa() a second time (although they should)..

>  	sched_domains_numa_masks = kzalloc(sizeof(void *) * level, GFP_KERNEL);
>  	if (!sched_domains_numa_masks)
>  		return;
> @@ -6765,11 +6773,64 @@ static void sched_init_numa(void)
>  	}
>
>  	sched_domain_topology = tl;
> +
> +	sched_domains_numa_levels = level;
> +}
> +
> +static void sched_domains_numa_masks_set(int cpu)
> +{
> +	int i, j;
> +	int node = cpu_to_node(cpu);
> +
> +	for (i = 0; i < sched_domains_numa_levels; i++)
> +		for (j = 0; j < nr_node_ids; j++)
> +			if (node_distance(j, node) <= sched_domains_numa_distance[i])
> +				cpumask_set_cpu(cpu, sched_domains_numa_masks[i][j]);
> +}
> +
> +static void sched_domains_numa_masks_clear(int cpu)
> +{
> +	int i, j;
> +	for (i = 0; i < sched_domains_numa_levels; i++)
> +		for (j = 0; j < nr_node_ids; j++)
> +			cpumask_clear_cpu(cpu, sched_domains_numa_masks[i][j]);
> +}

Aside from the coding style nit of wanting braces over multi-line
statements even though not strictly required, I really don't see how
this could possibly be right..

We do this because nr_node_ids changed, right? This means the entire
distance table grew/shrunk, which means we should do the level scan
again.

> @@ -7218,6 +7279,7 @@ void __init sched_init_smp(void)
>  	mutex_unlock(&sched_domains_mutex);
>  	put_online_cpus();
>
> +	hotcpu_notifier(sched_domains_numa_masks_update, CPU_PRI_SCHED_ACTIVE);
>  	hotcpu_notifier(cpuset_cpu_active, CPU_PRI_CPUSET_ACTIVE);
>  	hotcpu_notifier(cpuset_cpu_inactive, CPU_PRI_CPUSET_INACTIVE);

OK, so you really want your notifier to run before cpuset_cpu_active
because otherwise you get that crash, yet you fail with the whole order
thing.. You should not _ever_ rely on registration order.
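For readers following along, the "level scan" Peter refers to is the pass
at the top of sched_init_numa() that collects the distinct node_distance()
values; each distinct distance becomes one level. A simplified paraphrase
of its effect (the in-tree loop of that era is organized differently and
also yields the distances sorted in ascending order):

	/*
	 * Simplified paraphrase of the sched_init_numa() level scan;
	 * illustrative, not the in-tree code. Record each distinct
	 * distance once; 'level' ends up as the number of NUMA levels.
	 */
	int i, j, k, level = 0;

	for (i = 0; i < nr_node_ids; i++) {
		for (j = 0; j < nr_node_ids; j++) {
			int d = node_distance(i, j);
			bool seen = false;

			for (k = 0; k < level; k++)
				if (sched_domains_numa_distance[k] == d)
					seen = true;

			if (!seen)
				sched_domains_numa_distance[level++] = d;
		}
	}

His point is that if the distance table could change after boot, this scan
(not just the per-cpu mask bits) would have to be redone.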
* Re: [PATCH] Update sched_domains_numa_masks when new cpus are onlined.

From: Srivatsa S. Bhat @ 2012-09-24 9:57 UTC
To: Peter Zijlstra
Cc: Tang Chen, linux-kernel, x86, linux-numa, wency, mingo, Thomas Gleixner

On 09/24/2012 03:08 PM, Peter Zijlstra wrote:
>> +	hotcpu_notifier(sched_domains_numa_masks_update, CPU_PRI_SCHED_ACTIVE);
>>  	hotcpu_notifier(cpuset_cpu_active, CPU_PRI_CPUSET_ACTIVE);
>>  	hotcpu_notifier(cpuset_cpu_inactive, CPU_PRI_CPUSET_INACTIVE);
>
> OK, so you really want your notifier to run before cpuset_cpu_active
> because otherwise you get that crash, yet you fail with the whole order
> thing.. You should not _ever_ rely on registration order.
>

IMHO he isn't relying on registration order.. He uses the CPU_PRI_SCHED_ACTIVE
priority to ensure that the ordering of callbacks is right, doesn't he?

Regards,
Srivatsa S. Bhat
* Re: [PATCH] Update sched_domains_numa_masks when new cpus are onlined.

From: Peter Zijlstra @ 2012-09-24 10:12 UTC
To: Srivatsa S. Bhat
Cc: Tang Chen, linux-kernel, x86, linux-numa, wency, mingo, Thomas Gleixner

On Mon, 2012-09-24 at 15:27 +0530, Srivatsa S. Bhat wrote:
> On 09/24/2012 03:08 PM, Peter Zijlstra wrote:
>>> +	hotcpu_notifier(sched_domains_numa_masks_update, CPU_PRI_SCHED_ACTIVE);
>>>  	hotcpu_notifier(cpuset_cpu_active, CPU_PRI_CPUSET_ACTIVE);
>>>  	hotcpu_notifier(cpuset_cpu_inactive, CPU_PRI_CPUSET_INACTIVE);
>>
>> OK, so you really want your notifier to run before cpuset_cpu_active
>> because otherwise you get that crash, yet you fail with the whole order
>> thing.. You should not _ever_ rely on registration order.
>>
>
> IMHO he isn't relying on registration order.. He uses the CPU_PRI_SCHED_ACTIVE
> priority to ensure that the ordering of callbacks is right, doesn't he?

Oh argh indeed. I can't read :/
* Re: [PATCH] Update sched_domains_numa_masks when new cpus are onlined.

From: Srivatsa S. Bhat @ 2012-09-24 10:17 UTC
To: Peter Zijlstra
Cc: Tang Chen, linux-kernel, x86, linux-numa, wency, mingo, Thomas Gleixner

On 09/24/2012 03:42 PM, Peter Zijlstra wrote:
> On Mon, 2012-09-24 at 15:27 +0530, Srivatsa S. Bhat wrote:
>> On 09/24/2012 03:08 PM, Peter Zijlstra wrote:
>>>> +	hotcpu_notifier(sched_domains_numa_masks_update, CPU_PRI_SCHED_ACTIVE);
>>>>  	hotcpu_notifier(cpuset_cpu_active, CPU_PRI_CPUSET_ACTIVE);
>>>>  	hotcpu_notifier(cpuset_cpu_inactive, CPU_PRI_CPUSET_INACTIVE);
>>>
>>> OK, so you really want your notifier to run before cpuset_cpu_active
>>> because otherwise you get that crash, yet you fail with the whole order
>>> thing.. You should not _ever_ rely on registration order.
>>>
>>
>> IMHO he isn't relying on registration order.. He uses the CPU_PRI_SCHED_ACTIVE
>> priority to ensure that the ordering of callbacks is right, doesn't he?
>
> Oh argh indeed. I can't read :/
>

;-)

Regards,
Srivatsa S. Bhat
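For reference, the mechanism Srivatsa points at: hotcpu_notifier() bakes the
given priority into a notifier_block, and notifier chains are kept sorted so
higher-priority callbacks run first. Paraphrased from the v3.6-era headers
(illustrative, not verbatim; see include/linux/cpu.h and kernel/notifier.c of
that kernel for the exact definitions):

	/*
	 * Paraphrased from v3.6-era <linux/cpu.h>; illustrative, not
	 * verbatim. notifier_chain_register() keeps the chain sorted by
	 * .priority in descending order, so on CPU_ONLINE the NUMA-mask
	 * update (INT_MAX) runs before cpuset_cpu_active() (INT_MAX - 1),
	 * no matter which hotcpu_notifier() call happened first.
	 */
	enum {
		CPU_PRI_SCHED_ACTIVE	= INT_MAX,
		CPU_PRI_CPUSET_ACTIVE	= INT_MAX - 1,
		CPU_PRI_SCHED_INACTIVE	= INT_MIN + 1,
		CPU_PRI_CPUSET_INACTIVE	= INT_MIN,
	};

	#define hotcpu_notifier(fn, pri) do {				\
		static struct notifier_block fn##_nb =			\
			{ .notifier_call = fn, .priority = pri };	\
		register_cpu_notifier(&fn##_nb);			\
	} while (0)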
* Re: [PATCH] Update sched_domains_numa_masks when new cpus are onlined.

From: Tang Chen @ 2012-09-25 2:39 UTC
To: Srivatsa S. Bhat
Cc: Peter Zijlstra, linux-kernel, x86, linux-numa, wency, mingo, Thomas Gleixner

On 09/24/2012 05:57 PM, Srivatsa S. Bhat wrote:
> On 09/24/2012 03:08 PM, Peter Zijlstra wrote:
>>> +	hotcpu_notifier(sched_domains_numa_masks_update, CPU_PRI_SCHED_ACTIVE);
>>>  	hotcpu_notifier(cpuset_cpu_active, CPU_PRI_CPUSET_ACTIVE);
>>>  	hotcpu_notifier(cpuset_cpu_inactive, CPU_PRI_CPUSET_INACTIVE);
>>
>> OK, so you really want your notifier to run before cpuset_cpu_active
>> because otherwise you get that crash, yet you fail with the whole order
>> thing.. You should not _ever_ rely on registration order.
>>
>
> IMHO he isn't relying on registration order.. He uses the CPU_PRI_SCHED_ACTIVE
> priority to ensure that the ordering of callbacks is right, doesn't he?

Yes, that's right. Thanks. :)

> Regards,
> Srivatsa S. Bhat
* Re: [PATCH] Update sched_domains_numa_masks when new cpus are onlined.

From: Tang Chen @ 2012-09-25 2:39 UTC
To: Peter Zijlstra, Srivatsa S. Bhat
Cc: linux-kernel, x86, linux-numa, wency, mingo, Thomas Gleixner

Hi Peter,

Thanks for commenting on this patch. :)

On 09/24/2012 05:38 PM, Peter Zijlstra wrote:
> Why are you cc'ing x86 and numa folks but not a single scheduler person
> when you're patching scheduler stuff?

First of all, I'm sorry for this. I thought it was a NUMA or memory
related problem, and I got your address from the get_maintainer.pl script.

> On Tue, 2012-09-18 at 18:12 +0800, Tang Chen wrote:
>> Once the array sched_domains_numa_masks is defined, it is never updated.
>> When a new cpu on a new node is onlined,
>
> Hmm, so there's hardware where you can boot with smaller nr_node_ids
> than possible.. I guess that makes sense.

Yes. nr_node_ids represents the max number of nodes the system supports,
and usually we don't have that many.

>> the corresponding member in
>> sched_domains_numa_masks is not initialized, and all the masks are 0.
>> As a result, build_overlap_sched_groups() will initialize a NULL
>> sched_group for the new cpu on the new node, which will lead to a kernel panic.
>
> <snip>
>
>> This patch registers a new notifier with the cpu hotplug notify chain, and
>> updates sched_domains_numa_masks every time a cpu is onlined or offlined.
>
> Urgh, more hotplug notifiers.. ah well.
>
>> Signed-off-by: Tang Chen <tangchen@cn.fujitsu.com>
>> ---
>>  kernel/sched/core.c |   62 +++++++++++++++++++++++++++++++++++++++++++++++++++
>>  1 files changed, 62 insertions(+), 0 deletions(-)
>>
>> diff --git a/kernel/sched/core.c b/kernel/sched/core.c
>> index fbf1fd0..66b36ab 100644
>> --- a/kernel/sched/core.c
>> +++ b/kernel/sched/core.c
>> @@ -6711,6 +6711,14 @@ static void sched_init_numa(void)
>>  	 * numbers.
>>  	 */
>>
>> +	/*
>> +	 * Since sched_domains_numa_levels is also used in other functions as
>> +	 * an index for sched_domains_numa_masks[][], we should reset it here in
>> +	 * case sched_domains_numa_masks[][] fails to be initialized. And set it
>> +	 * to 'level' when sched_domains_numa_masks[][] is fully initialized.
>> +	 */
>> +	sched_domains_numa_levels = 0;
>
> This isn't strictly needed for this patch right? I don't see anybody
> calling sched_init_numa() a second time (although they should)..

Indeed, sched_init_numa() won't be called by anyone else. But I clear it
here in case the following memory allocation fails.

Actually, I want to use sched_domains_numa_levels to iterate all the numa
levels, such as in sched_domains_numa_masks_set().

Suppose sched_domains_numa_levels is 10. If allocating memory for
sched_domains_numa_masks[5] fails in sched_init_numa(), it will just
return, but only the first 5 members of sched_domains_numa_masks[] will
be valid. It could be dangerous.

So I set sched_domains_numa_levels to 0, and reset it to level at the end
of sched_init_numa(); see below.

>>  	sched_domains_numa_masks = kzalloc(sizeof(void *) * level, GFP_KERNEL);
>>  	if (!sched_domains_numa_masks)
>>  		return;
>> @@ -6765,11 +6773,64 @@ static void sched_init_numa(void)
>>  	}
>>
>>  	sched_domain_topology = tl;
>> +
>> +	sched_domains_numa_levels = level;

And I set it to level here again.

>> +}
>> +
>> +static void sched_domains_numa_masks_set(int cpu)
>> +{
>> +	int i, j;
>> +	int node = cpu_to_node(cpu);
>> +
>> +	for (i = 0; i < sched_domains_numa_levels; i++)
>> +		for (j = 0; j < nr_node_ids; j++)
>> +			if (node_distance(j, node) <= sched_domains_numa_distance[i])
>> +				cpumask_set_cpu(cpu, sched_domains_numa_masks[i][j]);
>> +}
>> +
>> +static void sched_domains_numa_masks_clear(int cpu)
>> +{
>> +	int i, j;
>> +	for (i = 0; i < sched_domains_numa_levels; i++)
>> +		for (j = 0; j < nr_node_ids; j++)
>> +			cpumask_clear_cpu(cpu, sched_domains_numa_masks[i][j]);
>> +}
>
> Aside from the coding style nit of wanting braces over multi-line
> statements even though not strictly required, I really don't see how
> this could possibly be right..

Thanks for telling me that. I'll fix the coding style. :)

> We do this because nr_node_ids changed, right? This means the entire
> distance table grew/shrunk, which means we should do the level scan
> again.

It seems that nr_node_ids will not change once the system is up. I'm not
quite sure; if I am wrong, please tell me. :)

I found that nr_node_ids is defined as MAX_NUMNODES in mm/page_alloc.c:

	int nr_node_ids __read_mostly = MAX_NUMNODES;

And all the functions that change this value are __init functions. So I
think nr_node_ids is just the number of possible nodes in the system, and
it won't change after system initialization has finished.

>> @@ -7218,6 +7279,7 @@ void __init sched_init_smp(void)
>>  	mutex_unlock(&sched_domains_mutex);
>>  	put_online_cpus();
>>
>> +	hotcpu_notifier(sched_domains_numa_masks_update, CPU_PRI_SCHED_ACTIVE);
>>  	hotcpu_notifier(cpuset_cpu_active, CPU_PRI_CPUSET_ACTIVE);
>>  	hotcpu_notifier(cpuset_cpu_inactive, CPU_PRI_CPUSET_INACTIVE);
>
> OK, so you really want your notifier to run before cpuset_cpu_active
> because otherwise you get that crash, yet you fail with the whole order
> thing.. You should not _ever_ rely on registration order.
* Re: [PATCH] Update sched_domains_numa_masks when new cpus are onlined.

From: Peter Zijlstra @ 2012-09-25 10:33 UTC
To: Tang Chen
Cc: Srivatsa S. Bhat, linux-kernel, x86, linux-numa, wency, mingo, Thomas Gleixner

On Tue, 2012-09-25 at 10:39 +0800, Tang Chen wrote:
> >> @@ -6765,11 +6773,64 @@ static void sched_init_numa(void)
> >>  	}
> >>
> >>  	sched_domain_topology = tl;
> >> +
> >> +	sched_domains_numa_levels = level;
>
> And I set it to level here again.
>

But it's already set there.. it's set every time we find a new level.
* Re: [PATCH] Update sched_domains_numa_masks when new cpus are onlined.

From: Tang Chen @ 2012-09-25 11:45 UTC
To: Peter Zijlstra
Cc: Srivatsa S. Bhat, linux-kernel, x86, linux-numa, wency, mingo, Thomas Gleixner

Hi Peter,

Sorry about the confusing log, and thanks for your patience. :)

Here, I want to say something more about sched_domains_numa_levels to
make myself clearer. :)

Let's have an example here.

sched_init_numa()
{
	...
	// A loop sets sched_domains_numa_levels to level.------------1

	// I set sched_domains_numa_levels to 0.
	sched_domains_numa_levels = 0;--------------------------------2

	// A loop allocating memory for sched_domains_numa_masks[][]
	for (i = 0; i < level; i++) {
		......
		// Allocate memory for sched_domains_numa_masks[i]----3
		......
	}
	......

	// I reset sched_domains_numa_levels to level.
	sched_domains_numa_levels = level;----------------------------4
}

// A new function I added.
static void sched_domains_numa_masks_clear(int cpu)
{
	int i, j;
	for (i = 0; i < sched_domains_numa_levels; i++)---------------5
		for (j = 0; j < nr_node_ids; j++)
			cpumask_clear_cpu(cpu, sched_domains_numa_masks[i][j]);
}

Suppose level is 10; at step 1, sched_domains_numa_levels is 10. If I
didn't set sched_domains_numa_levels to 0 at step 2, it would be 10 all
the time.

If the memory allocation at step 3 fails when i = 5, the array
sched_domains_numa_masks[][] will only have 5 valid members, while
sched_domains_numa_levels is still 10.

As you see, I added 2 functions that use sched_domains_numa_levels to
iterate sched_domains_numa_masks[][], such as at step 5. In this case,
the iteration would run past the 5 allocated members and access
unallocated memory. This could be dangerous.

So, I set sched_domains_numa_levels to 0 at step 2. This way, even if
any memory allocation at step 3 fails and sched_init_numa() returns,
anyone who uses sched_domains_numa_levels (which is 0) won't go wrong.

I'm not sure if this is the best way to settle this problem. If you have
a better idea, please tell me. Thanks. :)

On 09/25/2012 06:33 PM, Peter Zijlstra wrote:
> On Tue, 2012-09-25 at 10:39 +0800, Tang Chen wrote:
>>>> @@ -6765,11 +6773,64 @@ static void sched_init_numa(void)
>>>>  	}
>>>>
>>>>  	sched_domain_topology = tl;
>>>> +
>>>> +	sched_domains_numa_levels = level;
>>
>> And I set it to level here again.
>>
> But it's already set there.. it's set every time we find a new level.
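Condensed, the idiom Tang describes is "publish the array size only after
every element exists", so a half-built sched_domains_numa_masks[][] is never
reachable through sched_domains_numa_levels. A sketch of that control flow
(illustrative; alloc_level_masks() is a hypothetical stand-in for the real
per-level allocation loop):

	static void sched_init_numa(void)
	{
		int i, level;

		/* ... distance scan computes 'level' ... */

		sched_domains_numa_levels = 0;	/* readers now see an empty array */

		sched_domains_numa_masks = kzalloc(sizeof(void *) * level, GFP_KERNEL);
		if (!sched_domains_numa_masks)
			return;			/* levels stays 0: iterators are safe */

		for (i = 0; i < level; i++) {
			/* hypothetical helper standing in for the real allocation */
			sched_domains_numa_masks[i] = alloc_level_masks(i);
			if (!sched_domains_numa_masks[i])
				return;		/* partial array, but still unpublished */
		}

		/* ... build the topology list ... */

		sched_domains_numa_levels = level;	/* publish: fully initialized */
	}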
* Re: [PATCH] Update sched_domains_numa_masks when new cpus are onlined.

From: Peter Zijlstra @ 2012-09-25 11:50 UTC
To: Tang Chen
Cc: Srivatsa S. Bhat, linux-kernel, x86, linux-numa, wency, mingo, Thomas Gleixner

On Tue, 2012-09-25 at 19:45 +0800, Tang Chen wrote:

> Let's have an example here.
>
> sched_init_numa()
> {
> 	...
> 	// A loop sets sched_domains_numa_levels to level.------------1
>
> 	// I set sched_domains_numa_levels to 0.
> 	sched_domains_numa_levels = 0;--------------------------------2

Ah, ok, so this 'idiot in a hurry' (aka. me) placed the hunk wrong and
thought it was at the start of the function.

> 	// A loop allocating memory for sched_domains_numa_masks[][]
> 	for (i = 0; i < level; i++) {
> 		......
> 		// Allocate memory for sched_domains_numa_masks[i]----3
> 		......
> 	}
> 	......
>
> 	// I reset sched_domains_numa_levels to level.
> 	sched_domains_numa_levels = level;----------------------------4
> }

Yes, this makes sense. It might be best to have this as a separate patch.
* Re: [PATCH] Update sched_domains_numa_masks when new cpus are onlined.

From: Tang Chen @ 2012-09-25 12:02 UTC
To: Peter Zijlstra
Cc: Srivatsa S. Bhat, linux-kernel, x86, linux-numa, wency, mingo, Thomas Gleixner

Hi Peter,

I will make a v2 patch-set following your comments and resend it soon. :)

Thanks. :)

On 09/25/2012 07:50 PM, Peter Zijlstra wrote:
> On Tue, 2012-09-25 at 19:45 +0800, Tang Chen wrote:
>
> <snip>
>
> Yes, this makes sense. It might be best to have this as a separate patch.
* Re: [PATCH] Update sched_domains_numa_masks when new cpus are onlined.

From: Peter Zijlstra @ 2012-09-25 11:07 UTC
To: Tang Chen
Cc: Srivatsa S. Bhat, linux-kernel, x86, linux-numa, wency, mingo, Thomas Gleixner

On Tue, 2012-09-25 at 10:39 +0800, Tang Chen wrote:
> > We do this because nr_node_ids changed, right? This means the entire
> > distance table grew/shrunk, which means we should do the level scan
> > again.
>
> It seems that nr_node_ids will not change once the system is up. I'm not
> quite sure; if I am wrong, please tell me. :)

Ah, right you are.. So the problem is that cpumask_of_node() doesn't
contain offline cpus, and we might not boot the machine with all cpus
present. In that case I guess the suggested hotplug hooks are fine. It's
just that your changelog was confusing.
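That crux is visible in the mask-building loop itself: each entry is the
union of cpumask_of_node() over all nodes in range, and cpumask_of_node()
reflects only the cpus set up at that moment. Schematically (paraphrased
from the v3.6-era sched_init_numa(); not verbatim):

	/* Paraphrased from v3.6-era sched_init_numa(); illustrative. */
	for (k = 0; k < nr_node_ids; k++) {
		if (node_distance(j, k) > sched_domains_numa_distance[i])
			continue;
		/* cpus that come online later are missing from this map,
		 * hence the hotplug notifier added by this patch */
		cpumask_or(mask, mask, cpumask_of_node(k));
	}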