* [PATCH] Check for online cpus before bringing them up
@ 2006-03-16 17:44 Srivatsa Vaddagiri
2006-03-17 1:08 ` Andrew Morton
0 siblings, 1 reply; 31+ messages in thread
From: Srivatsa Vaddagiri @ 2006-03-16 17:44 UTC (permalink / raw)
To: Andrew Morton, torvalds; +Cc: linux-kernel, shaohua.li, bryce
Bryce reported a bug wherein offlining CPU0 (on x86 box) and then subsequently
onlining it resulted in a lockup.
On x86, CPU0 is never offlined. The subsequent attempt to online CPU0
doesn't take that into account. It actually tries to bootup the already
booted CPU. Following patch fixes the problem (as acknowledged by
Bryce). Please consider for inclusion in 2.6.16.
Check if cpu is already online.
Signed-off-by : Srivatsa Vaddagiri <vatsa@in.ibm.com>
-
arch/i386/kernel/smpboot.c | 6 ++++++
1 files changed, 6 insertions(+)
diff -puN arch/i386/kernel/smpboot.c~cpuhp arch/i386/kernel/smpboot.c
--- linux-2.6.16-rc5/arch/i386/kernel/smpboot.c~cpuhp 2006-03-14 14:42:26.000000000 +0530
+++ linux-2.6.16-rc5-root/arch/i386/kernel/smpboot.c 2006-03-14 14:43:21.000000000 +0530
@@ -1029,6 +1029,12 @@ int __devinit smp_prepare_cpu(int cpu)
int apicid, ret;
lock_cpu_hotplug();
+
+ if (cpu_online(cpu)) {
+ ret = -EINVAL;
+ goto exit;
+ }
+
apicid = x86_cpu_to_apicid[cpu];
if (apicid == BAD_APICID) {
ret = -ENODEV;
_
--
Regards,
vatsa
^ permalink raw reply [flat|nested] 31+ messages in thread* Re: [PATCH] Check for online cpus before bringing them up 2006-03-16 17:44 [PATCH] Check for online cpus before bringing them up Srivatsa Vaddagiri @ 2006-03-17 1:08 ` Andrew Morton 2006-03-17 1:16 ` Shaohua Li 2006-03-17 8:46 ` Srivatsa Vaddagiri 0 siblings, 2 replies; 31+ messages in thread From: Andrew Morton @ 2006-03-17 1:08 UTC (permalink / raw) To: vatsa; +Cc: torvalds, linux-kernel, shaohua.li, bryce Srivatsa Vaddagiri <vatsa@in.ibm.com> wrote: > > Bryce reported a bug wherein offlining CPU0 (on x86 box) and then subsequently > onlining it resulted in a lockup. > > On x86, CPU0 is never offlined. The subsequent attempt to online CPU0 > doesn't take that into account. It actually tries to bootup the already > booted CPU. Following patch fixes the problem (as acknowledged by > Bryce). Please consider for inclusion in 2.6.16. > > Is x86 the only architecture which is exposed to this? > > diff -puN arch/i386/kernel/smpboot.c~cpuhp arch/i386/kernel/smpboot.c > --- linux-2.6.16-rc5/arch/i386/kernel/smpboot.c~cpuhp 2006-03-14 14:42:26.000000000 +0530 > +++ linux-2.6.16-rc5-root/arch/i386/kernel/smpboot.c 2006-03-14 14:43:21.000000000 +0530 > @@ -1029,6 +1029,12 @@ int __devinit smp_prepare_cpu(int cpu) > int apicid, ret; > > lock_cpu_hotplug(); > + > + if (cpu_online(cpu)) { > + ret = -EINVAL; > + goto exit; > + } > + > apicid = x86_cpu_to_apicid[cpu]; > if (apicid == BAD_APICID) { > ret = -ENODEV; a) It's hard for the reader to understand what that test is doing there b) People copy code from x86, so other architectures which are not exposed to this problem will end up having a pointless test in there. IOW: please comment your code. I'll fix this one up. ^ permalink raw reply [flat|nested] 31+ messages in thread
* Re: [PATCH] Check for online cpus before bringing them up 2006-03-17 1:08 ` Andrew Morton @ 2006-03-17 1:16 ` Shaohua Li 2006-03-17 8:46 ` Srivatsa Vaddagiri 1 sibling, 0 replies; 31+ messages in thread From: Shaohua Li @ 2006-03-17 1:16 UTC (permalink / raw) To: Andrew Morton; +Cc: vatsa, torvalds, linux-kernel, bryce On Thu, 2006-03-16 at 17:08 -0800, Andrew Morton wrote: > Srivatsa Vaddagiri <vatsa@in.ibm.com> wrote: > > > > Bryce reported a bug wherein offlining CPU0 (on x86 box) and then subsequently > > onlining it resulted in a lockup. > > > > On x86, CPU0 is never offlined. The subsequent attempt to online CPU0 > > doesn't take that into account. It actually tries to bootup the already > > booted CPU. Following patch fixes the problem (as acknowledged by > > Bryce). Please consider for inclusion in 2.6.16. > > > > > > Is x86 the only architecture which is exposed to this? I guess i386 is the only architecture exposing to this. i386 uses a slight different method to do cpu hotadd (prepare cpu, and then up cpu). This is to workaround some issues like tsc synchronizing. Thanks, Shaohua ^ permalink raw reply [flat|nested] 31+ messages in thread
* Re: [PATCH] Check for online cpus before bringing them up 2006-03-17 1:08 ` Andrew Morton 2006-03-17 1:16 ` Shaohua Li @ 2006-03-17 8:46 ` Srivatsa Vaddagiri 2006-03-17 9:04 ` Andrew Morton 2006-03-17 12:21 ` [PATCH] Check for online cpus before bringing them up Ashok Raj 1 sibling, 2 replies; 31+ messages in thread From: Srivatsa Vaddagiri @ 2006-03-17 8:46 UTC (permalink / raw) To: Andrew Morton; +Cc: torvalds, linux-kernel, shaohua.li, bryce On Thu, Mar 16, 2006 at 05:08:14PM -0800, Andrew Morton wrote: > Is x86 the only architecture which is exposed to this? Currently only x86 implements smp_prepare_cpu(). On other arch, it is a no-op. Hence yes, only x86 is exposed to this bug. > > > > lock_cpu_hotplug(); > > + > > + if (cpu_online(cpu)) { > > + ret = -EINVAL; > > + goto exit; > > + } > > + > > apicid = x86_cpu_to_apicid[cpu]; > > if (apicid == BAD_APICID) { > > ret = -ENODEV; > > a) It's hard for the reader to understand what that test is doing there > > b) People copy code from x86, so other architectures which are not > exposed to this problem will end up having a pointless test in there. Well ..other arch-es need to have a similar check if they get around to implement physical hot-add (even if they allow offlining of all CPUs). This is required since a user can (by mistake maybe) try to bring up an already online CPU by writing a '1' to it's sysfs 'online' file. 'store_online' (drivers/base/cpu.c) unconditionally calls 'smp_prepare_cpu' w/o checking for this error condition. The check added in the patch catches such error conditions as well. > IOW: please comment your code. I'll fix this one up. Sorry about not commenting my code earlier! How does the patch below look? Add check for online cpus. Signed-off-by : Srivatsa Vaddagiri <vatsa@in.ibm.com> arch/i386/kernel/smpboot.c | 10 ++++++++++ 1 files changed, 10 insertions(+) diff -puN arch/i386/kernel/smpboot.c~cpu_hp arch/i386/kernel/smpboot.c --- linux-2.6.16-rc6/arch/i386/kernel/smpboot.c~cpu_hp 2006-03-17 14:27:15.000000000 +0530 +++ linux-2.6.16-rc6-root/arch/i386/kernel/smpboot.c 2006-03-17 14:38:50.000000000 +0530 @@ -1029,6 +1029,16 @@ int __devinit smp_prepare_cpu(int cpu) int apicid, ret; lock_cpu_hotplug(); + + /* Check if CPU is already online. This can happen if user tries to + * bringup an already online CPU or a previous offline attempt + * on this CPU has failed. + */ + if (cpu_online(cpu)) { + ret = -EINVAL; + goto exit; + } + apicid = x86_cpu_to_apicid[cpu]; if (apicid == BAD_APICID) { ret = -ENODEV; _ -- Regards, vatsa ^ permalink raw reply [flat|nested] 31+ messages in thread
* Re: [PATCH] Check for online cpus before bringing them up 2006-03-17 8:46 ` Srivatsa Vaddagiri @ 2006-03-17 9:04 ` Andrew Morton 2006-03-17 14:13 ` Srivatsa Vaddagiri 2006-10-06 23:10 ` Status on CPU hotplug issues Bryce Harrington 2006-03-17 12:21 ` [PATCH] Check for online cpus before bringing them up Ashok Raj 1 sibling, 2 replies; 31+ messages in thread From: Andrew Morton @ 2006-03-17 9:04 UTC (permalink / raw) To: vatsa; +Cc: torvalds, linux-kernel, shaohua.li, bryce Srivatsa Vaddagiri <vatsa@in.ibm.com> wrote: > > Well ..other arch-es need to have a similar check if they get around to > implement physical hot-add (even if they allow offlining of all CPUs). This is > required since a user can (by mistake maybe) try to bring up an already online > CPU by writing a '1' to it's sysfs 'online' file. 'store_online' > (drivers/base/cpu.c) unconditionally calls 'smp_prepare_cpu' w/o checking for > this error condition. The check added in the patch catches such error > conditions as well. OK.. I guess we should fix those architectures while we're thinking about it. > --- linux-2.6.16-rc6/arch/i386/kernel/smpboot.c~cpu_hp 2006-03-17 14:27:15.000000000 +0530 > +++ linux-2.6.16-rc6-root/arch/i386/kernel/smpboot.c 2006-03-17 14:38:50.000000000 +0530 > @@ -1029,6 +1029,16 @@ int __devinit smp_prepare_cpu(int cpu) > int apicid, ret; > > lock_cpu_hotplug(); > + > + /* Check if CPU is already online. This can happen if user tries to > + * bringup an already online CPU or a previous offline attempt > + * on this CPU has failed. > + */ > + if (cpu_online(cpu)) { > + ret = -EINVAL; > + goto exit; > + } > + How well tested is this? From my reading, this will cause enable_nonboot_cpus() to panic. Is that intended? ^ permalink raw reply [flat|nested] 31+ messages in thread
* Re: [PATCH] Check for online cpus before bringing them up 2006-03-17 9:04 ` Andrew Morton @ 2006-03-17 14:13 ` Srivatsa Vaddagiri 2006-03-18 14:09 ` Ashok Raj 2006-10-06 23:10 ` Status on CPU hotplug issues Bryce Harrington 1 sibling, 1 reply; 31+ messages in thread From: Srivatsa Vaddagiri @ 2006-03-17 14:13 UTC (permalink / raw) To: Andrew Morton; +Cc: torvalds, linux-kernel, shaohua.li, bryce, ashok.raj On Fri, Mar 17, 2006 at 01:04:12AM -0800, Andrew Morton wrote: > OK.. I guess we should fix those architectures while we're thinking about it. Only x86 has this bug, so only x86 needs to be fixed. Neverthless Ashok's patch [1] should address all architectures that may implement smp_prepare_cpu() in future as well. > How well tested is this? From my reading, this will cause > enable_nonboot_cpus() to panic. Is that intended? I have done some basic test of the patch using Bryce's scripts. Regarding enable_nonboot_cpus(), from my reading of it, it should not call smp_prepare_cpu on online cpus, so the check added should not cause it to panic. Am I missing something? Finally, I think the patch Ashok posted sometime back [1] is probably a more neater solution to this bug (he probably needs to modify it slightly to remove smp_prepare_cpu from enable_nonboot_cpus as well). [1] http://lkml.org/lkml/2006/3/17/103 -- Regards, vatsa ^ permalink raw reply [flat|nested] 31+ messages in thread
* Re: [PATCH] Check for online cpus before bringing them up 2006-03-17 14:13 ` Srivatsa Vaddagiri @ 2006-03-18 14:09 ` Ashok Raj 2006-03-21 1:08 ` Shaohua Li 0 siblings, 1 reply; 31+ messages in thread From: Ashok Raj @ 2006-03-18 14:09 UTC (permalink / raw) To: Srivatsa Vaddagiri Cc: Andrew Morton, torvalds, linux-kernel, shaohua.li, bryce, ashok.raj On Fri, Mar 17, 2006 at 07:43:22PM +0530, Srivatsa Vaddagiri wrote: > On Fri, Mar 17, 2006 at 01:04:12AM -0800, Andrew Morton wrote: > > OK.. I guess we should fix those architectures while we're thinking about it. > > Only x86 has this bug, so only x86 needs to be fixed. Neverthless > Ashok's patch [1] should address all architectures that may implement > smp_prepare_cpu() in future as well. > Hi Vatsa Since smp_prepare_cpu was introduced only to work around warm-boot on i386 due to the strange kickstart process of AP's. I have moved this al-together within __cpu_up() i386 arch code, so its less confusing to other arch's that have done it right already. Here is the updated patch, it removes smp_prepare_cpu() from being called from generic core cpuhotplug code, and moved it to i386 side. I did some basic up/down testing on a 2 way PIII box. Shaohua, could you take a look at this code, and also check suspend/resume is not affected? Andrew, could we queue it to -mm for some exposure before inclusion. thanks -- Cheers, Ashok Raj - Open Source Technology Center Check if cpu can be onlined before calling smp_prepare_cpu() - Moved check for online cpu out of smp_prepare_cpu() - Moved default declaration of smp_prepare_cpu() to kernel/cpu.c - Removed lock_cpu_hotplug() from smp_prepare_cpu() to around it, since its called from cpu_up() as well now. - Removed clearing from cpu_present_map during cpu_offline as it breaks using cpu_up() directly during a subsequent online operation. Signed-off-by: Ashok Raj <ashok.raj@intel.com> -------------------------------------------------------- arch/i386/kernel/smpboot.c | 21 +++++++++++++++++---- drivers/base/cpu.c | 9 +-------- include/linux/cpu.h | 1 - kernel/power/smp.c | 4 +--- 4 files changed, 19 insertions(+), 16 deletions(-) Index: linux-2.6.16-rc6-mm1/arch/i386/kernel/smpboot.c =================================================================== --- linux-2.6.16-rc6-mm1.orig/arch/i386/kernel/smpboot.c +++ linux-2.6.16-rc6-mm1/arch/i386/kernel/smpboot.c @@ -1002,7 +1002,6 @@ void cpu_exit_clear(void) cpu_clear(cpu, cpu_callout_map); cpu_clear(cpu, cpu_callin_map); - cpu_clear(cpu, cpu_present_map); cpu_clear(cpu, smp_commenced_mask); unmap_cpu_to_logical_apicid(cpu); @@ -1021,14 +1020,13 @@ static void __devinit do_warm_boot_cpu(v complete(info->complete); } -int __devinit smp_prepare_cpu(int cpu) +static int __devinit __smp_prepare_cpu(int cpu) { DECLARE_COMPLETION(done); struct warm_boot_cpu_info info; struct work_struct task; int apicid, ret; - lock_cpu_hotplug(); apicid = x86_cpu_to_apicid[cpu]; if (apicid == BAD_APICID) { ret = -ENODEV; @@ -1053,7 +1051,6 @@ int __devinit smp_prepare_cpu(int cpu) zap_low_mappings(); ret = 0; exit: - unlock_cpu_hotplug(); return ret; } #endif @@ -1379,6 +1376,22 @@ void __cpu_die(unsigned int cpu) int __devinit __cpu_up(unsigned int cpu) { + int ret=0; + +#ifdef CONFIG_HOTPLUG_CPU + /* + * We do warm boot only on cpus that had booted earlier + * Otherwise cold boot is all handled from smp_boot_cpus(). + * cpu_callin_map is set during AP kickstart process. Its reset + * when a cpu is taken offline from cpu_exit_clear(). + */ + if (!cpu_isset(cpu, cpu_callin_map)) + ret = __smp_prepare_cpu(cpu); + + if (ret) + return -EIO; +#endif + /* In case one didn't come up */ if (!cpu_isset(cpu, cpu_callin_map)) { printk(KERN_DEBUG "skipping cpu%d, didn't come online\n", cpu); Index: linux-2.6.16-rc6-mm1/drivers/base/cpu.c =================================================================== --- linux-2.6.16-rc6-mm1.orig/drivers/base/cpu.c +++ linux-2.6.16-rc6-mm1/drivers/base/cpu.c @@ -19,11 +19,6 @@ EXPORT_SYMBOL(cpu_sysdev_class); static struct sys_device *cpu_sys_devices[NR_CPUS]; #ifdef CONFIG_HOTPLUG_CPU -int __attribute__((weak)) smp_prepare_cpu (int cpu) -{ - return 0; -} - static ssize_t show_online(struct sys_device *dev, char *buf) { struct cpu *cpu = container_of(dev, struct cpu, sysdev); @@ -44,9 +39,7 @@ static ssize_t store_online(struct sys_d kobject_uevent(&dev->kobj, KOBJ_OFFLINE); break; case '1': - ret = smp_prepare_cpu(cpu->sysdev.id); - if (!ret) - ret = cpu_up(cpu->sysdev.id); + ret = cpu_up(cpu->sysdev.id); if (!ret) kobject_uevent(&dev->kobj, KOBJ_ONLINE); break; Index: linux-2.6.16-rc6-mm1/include/linux/cpu.h =================================================================== --- linux-2.6.16-rc6-mm1.orig/include/linux/cpu.h +++ linux-2.6.16-rc6-mm1/include/linux/cpu.h @@ -74,7 +74,6 @@ extern int lock_cpu_hotplug_interruptibl register_cpu_notifier(&fn##_nb); \ } int cpu_down(unsigned int cpu); -extern int __attribute__((weak)) smp_prepare_cpu(int cpu); #define cpu_is_offline(cpu) unlikely(!cpu_online(cpu)) #else #define lock_cpu_hotplug() do { } while (0) Index: linux-2.6.16-rc6-mm1/kernel/power/smp.c =================================================================== --- linux-2.6.16-rc6-mm1.orig/kernel/power/smp.c +++ linux-2.6.16-rc6-mm1/kernel/power/smp.c @@ -49,9 +49,7 @@ void enable_nonboot_cpus(void) printk("Thawing cpus ...\n"); for_each_cpu_mask(cpu, frozen_cpus) { - error = smp_prepare_cpu(cpu); - if (!error) - error = cpu_up(cpu); + error = cpu_up(cpu); if (!error) { printk("CPU%d is up\n", cpu); continue; ^ permalink raw reply [flat|nested] 31+ messages in thread
* Re: [PATCH] Check for online cpus before bringing them up 2006-03-18 14:09 ` Ashok Raj @ 2006-03-21 1:08 ` Shaohua Li 2006-03-21 1:25 ` Ashok Raj 0 siblings, 1 reply; 31+ messages in thread From: Shaohua Li @ 2006-03-21 1:08 UTC (permalink / raw) To: Ashok Raj Cc: Srivatsa Vaddagiri, Andrew Morton, torvalds, linux-kernel, bryce On Sat, 2006-03-18 at 06:09 -0800, Ashok Raj wrote: > On Fri, Mar 17, 2006 at 07:43:22PM +0530, Srivatsa Vaddagiri wrote: > > On Fri, Mar 17, 2006 at 01:04:12AM -0800, Andrew Morton wrote: > > > OK.. I guess we should fix those architectures while we're thinking about it. > > > > Only x86 has this bug, so only x86 needs to be fixed. Neverthless > > Ashok's patch [1] should address all architectures that may implement > > smp_prepare_cpu() in future as well. > > > Hi Vatsa > > Since smp_prepare_cpu was introduced only to work around warm-boot on i386 due to the > strange kickstart process of AP's. I have moved this al-together within __cpu_up() > i386 arch code, so its less confusing to other arch's that have done it right already. > > Here is the updated patch, it removes smp_prepare_cpu() from being called from generic > core cpuhotplug code, and moved it to i386 side. > > I did some basic up/down testing on a 2 way PIII box. > > Shaohua, could you take a look at this code, and also check suspend/resume is not > affected? > > Andrew, could we queue it to -mm for some exposure before inclusion. > > thanks > > -- > Cheers, > Ashok Raj > - Open Source Technology Center > > > Check if cpu can be onlined before calling smp_prepare_cpu() > > - Moved check for online cpu out of smp_prepare_cpu() > - Moved default declaration of smp_prepare_cpu() to kernel/cpu.c > - Removed lock_cpu_hotplug() from smp_prepare_cpu() to around it, since > its called from cpu_up() as well now. > - Removed clearing from cpu_present_map during cpu_offline as it breaks using cpu_up() > directly during a subsequent online operation. > > Signed-off-by: Ashok Raj <ashok.raj@intel.com> > -------------------------------------------------------- > arch/i386/kernel/smpboot.c | 21 +++++++++++++++++---- > drivers/base/cpu.c | 9 +-------- > include/linux/cpu.h | 1 - > kernel/power/smp.c | 4 +--- > 4 files changed, 19 insertions(+), 16 deletions(-) > > Index: linux-2.6.16-rc6-mm1/arch/i386/kernel/smpboot.c > =================================================================== > --- linux-2.6.16-rc6-mm1.orig/arch/i386/kernel/smpboot.c > +++ linux-2.6.16-rc6-mm1/arch/i386/kernel/smpboot.c > @@ -1002,7 +1002,6 @@ void cpu_exit_clear(void) > > cpu_clear(cpu, cpu_callout_map); > cpu_clear(cpu, cpu_callin_map); > - cpu_clear(cpu, cpu_present_map); > > cpu_clear(cpu, smp_commenced_mask); > unmap_cpu_to_logical_apicid(cpu); > @@ -1021,14 +1020,13 @@ static void __devinit do_warm_boot_cpu(v > complete(info->complete); > } > > -int __devinit smp_prepare_cpu(int cpu) > +static int __devinit __smp_prepare_cpu(int cpu) > { > DECLARE_COMPLETION(done); > struct warm_boot_cpu_info info; > struct work_struct task; > int apicid, ret; > > - lock_cpu_hotplug(); > apicid = x86_cpu_to_apicid[cpu]; > if (apicid == BAD_APICID) { > ret = -ENODEV; > @@ -1053,7 +1051,6 @@ int __devinit smp_prepare_cpu(int cpu) > zap_low_mappings(); > ret = 0; > exit: > - unlock_cpu_hotplug(); > return ret; > } > #endif > @@ -1379,6 +1376,22 @@ void __cpu_die(unsigned int cpu) > > int __devinit __cpu_up(unsigned int cpu) > { > + int ret=0; > + > +#ifdef CONFIG_HOTPLUG_CPU > + /* > + * We do warm boot only on cpus that had booted earlier > + * Otherwise cold boot is all handled from smp_boot_cpus(). > + * cpu_callin_map is set during AP kickstart process. Its reset > + * when a cpu is taken offline from cpu_exit_clear(). > + */ > + if (!cpu_isset(cpu, cpu_callin_map)) > + ret = __smp_prepare_cpu(cpu); Does this work for boot time? cpu_callin_map isn't set at boot time for APs. Thanks, Shaohua ^ permalink raw reply [flat|nested] 31+ messages in thread
* Re: [PATCH] Check for online cpus before bringing them up 2006-03-21 1:08 ` Shaohua Li @ 2006-03-21 1:25 ` Ashok Raj 2006-03-21 1:36 ` Shaohua Li 0 siblings, 1 reply; 31+ messages in thread From: Ashok Raj @ 2006-03-21 1:25 UTC (permalink / raw) To: Shaohua Li Cc: Ashok Raj, Srivatsa Vaddagiri, Andrew Morton, torvalds, linux-kernel, bryce On Tue, Mar 21, 2006 at 09:08:05AM +0800, Shaohua Li wrote: > > + > > +#ifdef CONFIG_HOTPLUG_CPU > > + /* > > + * We do warm boot only on cpus that had booted earlier > > + * Otherwise cold boot is all handled from smp_boot_cpus(). > > + * cpu_callin_map is set during AP kickstart process. Its reset > > + * when a cpu is taken offline from cpu_exit_clear(). > > + */ > > + if (!cpu_isset(cpu, cpu_callin_map)) > > + ret = __smp_prepare_cpu(cpu); > Does this work for boot time? cpu_callin_map isn't set at boot time for > APs. > Yes, I tested on a 2 way system and it booted fine. This is set durng the kickstart process when smp_callin() is called after AP receives boot ipi. -- Cheers, Ashok Raj - Open Source Technology Center ^ permalink raw reply [flat|nested] 31+ messages in thread
* Re: [PATCH] Check for online cpus before bringing them up 2006-03-21 1:25 ` Ashok Raj @ 2006-03-21 1:36 ` Shaohua Li 0 siblings, 0 replies; 31+ messages in thread From: Shaohua Li @ 2006-03-21 1:36 UTC (permalink / raw) To: Ashok Raj Cc: Srivatsa Vaddagiri, Andrew Morton, torvalds, linux-kernel, bryce On Mon, 2006-03-20 at 17:25 -0800, Ashok Raj wrote: > On Tue, Mar 21, 2006 at 09:08:05AM +0800, Shaohua Li wrote: > > > + > > > +#ifdef CONFIG_HOTPLUG_CPU > > > + /* > > > + * We do warm boot only on cpus that had booted earlier > > > + * Otherwise cold boot is all handled from smp_boot_cpus(). > > > + * cpu_callin_map is set during AP kickstart process. Its reset > > > + * when a cpu is taken offline from cpu_exit_clear(). > > > + */ > > > + if (!cpu_isset(cpu, cpu_callin_map)) > > > + ret = __smp_prepare_cpu(cpu); > > Does this work for boot time? cpu_callin_map isn't set at boot time for > > APs. > > > Yes, > > I tested on a 2 way system and it booted fine. This is set durng the kickstart process > when smp_callin() is called after AP receives boot ipi. Ok, thanks! It should work for suspend/resume. Thanks, Shaohua ^ permalink raw reply [flat|nested] 31+ messages in thread
* Status on CPU hotplug issues 2006-03-17 9:04 ` Andrew Morton 2006-03-17 14:13 ` Srivatsa Vaddagiri @ 2006-10-06 23:10 ` Bryce Harrington 2006-10-06 23:29 ` Andrew Morton 2006-10-07 21:57 ` Pavel Machek 1 sibling, 2 replies; 31+ messages in thread From: Bryce Harrington @ 2006-10-06 23:10 UTC (permalink / raw) To: Andrew Morton Cc: vatsa, torvalds, linux-kernel, shaohua.li, hotplug_sig, lhcs-devel On Fri, Mar 17, 2006 at 01:04:12AM -0800, Andrew Morton wrote: > Srivatsa Vaddagiri <vatsa@in.ibm.com> wrote: > > Well ..other arch-es need to have a similar check if they get around to > > implement physical hot-add (even if they allow offlining of all CPUs). This is > > required since a user can (by mistake maybe) try to bring up an already online > > CPU by writing a '1' to it's sysfs 'online' file. 'store_online' > > (drivers/base/cpu.c) unconditionally calls 'smp_prepare_cpu' w/o checking for > > this error condition. The check added in the patch catches such error > > conditions as well. > > OK.. I guess we should fix those architectures while we're thinking about it. > > > + /* Check if CPU is already online. This can happen if user tries to > > + * bringup an already online CPU or a previous offline attempt > > + * on this CPU has failed. > > + */ > > + if (cpu_online(cpu)) { > > + ret = -EINVAL; > > + goto exit; > > + } > > How well tested is this? From my reading, this will cause > enable_nonboot_cpus() to panic. Is that intended? Andrew, I wanted to give you an update on results of cpu testing I've done on recent kernels and several architectures. Since -rc1 is out, I wanted to give added visibility to the few issues that remain. The full results are available here: http://crucible.osdl.org/runs/hotplug_report.html This is actually a report for cpu hotplug tests generated hourly, however we run it against all of the kernel -git snapshots posted to kernel.org. Whereever you see a blank square, it indicates the kernel either failed to build or boot. Issues were found in four areas: General kernel, cpu hotplug, sysstat, and the test harness itself. I've reported the general kernel issues here, and they have been addressed (thanks). The issues in the test harness were also addressed. Of the two sysstat issues, one is fixed in recent releases, and the other has been reported and may simply be a sysstat design decision. So, focusing just on the remaining open CPU hotplug issues: 1. Oops offlining cpu twice on AMD64 (but not on EM64t) with the 2.6.18-git22 kernel Reported to hotplug lists 10/05: http://lists.osdl.org/pipermail/hotplug_sig/2006-October/000680.html To recreate: offline, online, and then offline a CPU, then oopses http://crucible.osdl.org/runs/2397/sysinfo/amd01.console http://crucible.osdl.org/runs/2397/sysinfo/amd01.2/proc/config Here's a snippet of the oops: # echo 0 > /sys/devices/system/cpu/cpu1/online Unable to handle kernel NULL pointer dereference at 0000000000000000 RIP: [<ffffffff80255287>] __drain_pages+0x29/0x5f PGD 7e56d067 PUD 7ee80067 PMD 0 Oops: 0000 [1] PREEMPT SMP CPU 0 Modules linked in: Pid: 7203, comm: bash Tainted: G M 2.6.18-git22 #1 RIP: 0010:[<ffffffff80255287>] [<ffffffff80255287>] __drain_pages+0x29/0x5f RSP: 0018:ffff81003f1b3dd8 EFLAGS: 00010082 RAX: 0000000000000001 RBX: 0000000000000082 RCX: 0000000000000000 RDX: ffff81000000c580 RSI: 00000000fffffffe RDI: ffff81000000c000 RBP: 0000000000000000 R08: 00000000fffffffe R09: ffff81007f1e63f0 R10: 0000000000000000 R11: 0000000000000000 R12: ffff81000000c580 R13: 0000000000000001 R14: 0000000000000001 R15: ffff81003f1b3f50 FS: 00002ab3e8a136d0(0000) GS:ffffffff806e3000(0000) knlGS:0000000000000000 CS: 0010 DS: 0000 ES: 0000 CR0: 000000008005003b CR2: 0000000000000000 CR3: 000000007eea1000 CR4: 00000000000006e0 Process bash (pid: 7203, threadinfo ffff81003f1b2000, task ffff81003f1c07c0) Stack: 0000000000000001 0000000000000001 0000000000000007 0000000000000007 0000000000000003 ffffffff802564b6 ffffffff8060d320 ffffffff805644a7 ffffffff8060dbe0 0000000000000001 0000000000000001 ffffffff8023c49d Call Trace: [<ffffffff802564b6>] page_alloc_cpu_notify+0x12/0x28 [<ffffffff805644a7>] notifier_call_chain+0x23/0x32 [<ffffffff8023c49d>] blocking_notifier_call_chain+0x22/0x36 [<ffffffff80248ff9>] _cpu_down+0x17f/0x23d [<ffffffff802490de>] cpu_down+0x27/0x3c [<ffffffff804419c7>] store_online+0x0/0x6b [<ffffffff804419ec>] store_online+0x25/0x6b [<ffffffff802ac4b7>] sysfs_write_file+0xad/0xd7 [<ffffffff80273ff7>] vfs_write+0xaf/0x14e [<ffffffff80274149>] sys_write+0x45/0x6e [<ffffffff8020965e>] system_call+0x7e/0x83 2. Oops during SMP bootup on AMD64 and EM64t Reported to hotplug and LKML lists 10/6: http://marc.theaimsgroup.com/?l=linux-kernel&m=116017427617557&w=2 Appears to have entered codebase between 2.6.19-rc1 and 2.6.19-rc1-git2. Seems ok on x86_32 and ia32. 3. Inconsistent error message when onlining already-onlined cpu Reported to hotplug lists 09/29: http://lists.osdl.org/pipermail/hotplug_sig/2006-September/000672.html Bugzilla ID 7249: http://bugzilla.kernel.org/show_bug.cgi?id=7249 This only appears on EM64t (I haven't been able to check AMD64 due to the aforementioned issues.) If you attempt to online an already onlined CPU on most architectures, the command simply exits with an error code of 1 and no error message to stderr. However, EM64t prints the following: # echo 1 > /sys/devices/system/cpu/cpu1/online # echo 1 > /sys/devices/system/cpu/cpu1/online -bash: echo: write error: Invalid argument Andi Kleen took at look at the report and says it looks like a bug to him. Thanks, Bryce ^ permalink raw reply [flat|nested] 31+ messages in thread
* Re: Status on CPU hotplug issues 2006-10-06 23:10 ` Status on CPU hotplug issues Bryce Harrington @ 2006-10-06 23:29 ` Andrew Morton 2006-10-07 0:00 ` Bryce Harrington 2006-10-07 10:24 ` Status on CPU hotplug issues Pavel Machek 2006-10-07 21:57 ` Pavel Machek 1 sibling, 2 replies; 31+ messages in thread From: Andrew Morton @ 2006-10-06 23:29 UTC (permalink / raw) To: Bryce Harrington, Pavel Machek Cc: vatsa, torvalds, linux-kernel, shaohua.li, hotplug_sig, lhcs-devel On Fri, 6 Oct 2006 16:10:12 -0700 Bryce Harrington <bryce@osdl.org> wrote: > On Fri, Mar 17, 2006 at 01:04:12AM -0800, Andrew Morton wrote: > > Srivatsa Vaddagiri <vatsa@in.ibm.com> wrote: > > > Well ..other arch-es need to have a similar check if they get around to > > > implement physical hot-add (even if they allow offlining of all CPUs). This is > > > required since a user can (by mistake maybe) try to bring up an already online > > > CPU by writing a '1' to it's sysfs 'online' file. 'store_online' > > > (drivers/base/cpu.c) unconditionally calls 'smp_prepare_cpu' w/o checking for > > > this error condition. The check added in the patch catches such error > > > conditions as well. > > > > OK.. I guess we should fix those architectures while we're thinking about it. > > > > > + /* Check if CPU is already online. This can happen if user tries to > > > + * bringup an already online CPU or a previous offline attempt > > > + * on this CPU has failed. > > > + */ > > > + if (cpu_online(cpu)) { > > > + ret = -EINVAL; > > > + goto exit; > > > + } > > > > How well tested is this? From my reading, this will cause > > enable_nonboot_cpus() to panic. Is that intended? > > Andrew, > > I wanted to give you an update on results of cpu testing I've done on > recent kernels and several architectures. Since -rc1 is out, I wanted > to give added visibility to the few issues that remain. > > The full results are available here: > > http://crucible.osdl.org/runs/hotplug_report.html > > This is actually a report for cpu hotplug tests generated hourly, > however we run it against all of the kernel -git snapshots posted to > kernel.org. Whereever you see a blank square, it indicates the kernel > either failed to build or boot. > Can you describe the nature of the cpu-hotplug tests you're running? I'd be fairly staggered if the kernel was able to survive a full-on cpu-hotplug stress test for more than one second, frankly. There's a lot of code in there which is non-hotplug-aware. Running a non-preemptible kernel would make things appear more stable, perhaps. iirc Pavel did some testing a month or two ago and was seeing userspace misbehaviour? > > Issues were found in four areas: General kernel, cpu hotplug, sysstat, > and the test harness itself. > It's surprising that AMD and Intel CPUs behave differently. Also a good start on diagnosing things. ^ permalink raw reply [flat|nested] 31+ messages in thread
* Re: Status on CPU hotplug issues 2006-10-06 23:29 ` Andrew Morton @ 2006-10-07 0:00 ` Bryce Harrington 2006-10-07 10:35 ` Pavel Machek 2006-10-07 10:24 ` Status on CPU hotplug issues Pavel Machek 1 sibling, 1 reply; 31+ messages in thread From: Bryce Harrington @ 2006-10-07 0:00 UTC (permalink / raw) To: Andrew Morton Cc: Pavel Machek, vatsa, torvalds, linux-kernel, shaohua.li, hotplug_sig, lhcs-devel On Fri, Oct 06, 2006 at 04:29:24PM -0700, Andrew Morton wrote: > Can you describe the nature of the cpu-hotplug tests you're running? I'd > be fairly staggered if the kernel was able to survive a full-on cpu-hotplug > stress test for more than one second, frankly. There's a lot of code in > there which is non-hotplug-aware. Running a non-preemptible kernel would > make things appear more stable, perhaps. Certainly, the testsuite is one the OSDL Hotplug SIG put together last summer, and consists of several test cases: http://developer.osdl.org/dev/HOTPLUG/planning/hotplug_cpu_test_plan_status.html hotplug01: Check IRQ behavior during cpu hotplug events hotplug02: Check process migration during cpu hotplug events hotplug03: Verify tasks get scheduled on newly onlined cpu's hotplug04: Verify disallowing offlining all CPU's hotplug05: (Unimplemented) hotplug06: Check userspace tools (sar, top) during cpu hotplug events hotplug07: Stress case doing kernel compile while cpu's are hotplugged on and off repeatedly It can be downloaded here: http://developer.osdl.org/dev/hotplug/tests/ Results are posted here: http://crucible.osdl.org/runs/hotplug_report.html We've been running this testsuite fairly continuously for several months, and irregularly for about a year before that. We find that on some platforms like PPC64 it's quite robust, and on others there are issues, but the developers tend to be quick to provide fixes as the issues are found. I'm glad to see that the results are finally showing green for ia64. > > Issues were found in four areas: General kernel, cpu hotplug, sysstat, > > and the test harness itself. > > It's surprising that AMD and Intel CPUs behave differently. Also a good > start on diagnosing things. I was also surprised to see this too. Note that the .config's for the amd and em64t machines are considerably different (different drivers, etc.), but the cpu config should be relatively comparable. Still, I wouldn't rule out the different behaviors being due to configuration differences. We could work on homogenizing the configs of the two systems if that would help in troubleshooting? Thanks, Bryce ^ permalink raw reply [flat|nested] 31+ messages in thread
* Re: Status on CPU hotplug issues 2006-10-07 0:00 ` Bryce Harrington @ 2006-10-07 10:35 ` Pavel Machek 2006-10-07 20:42 ` Bryce Harrington 0 siblings, 1 reply; 31+ messages in thread From: Pavel Machek @ 2006-10-07 10:35 UTC (permalink / raw) To: Bryce Harrington; +Cc: Andrew Morton, kernel list On Fri 2006-10-06 17:00:31, Bryce Harrington wrote: > On Fri, Oct 06, 2006 at 04:29:24PM -0700, Andrew Morton wrote: > > Can you describe the nature of the cpu-hotplug tests you're running? I'd > > be fairly staggered if the kernel was able to survive a full-on cpu-hotplug > > stress test for more than one second, frankly. There's a lot of code in > > there which is non-hotplug-aware. Running a non-preemptible kernel would > > make things appear more stable, perhaps. > > Certainly, the testsuite is one the OSDL Hotplug SIG put together last > summer, and consists of several test cases: > > http://developer.osdl.org/dev/HOTPLUG/planning/hotplug_cpu_test_plan_status.html Page actually lists test 1-6. > hotplug01: Check IRQ behavior during cpu hotplug events > hotplug02: Check process migration during cpu hotplug events > hotplug03: Verify tasks get scheduled on newly onlined cpu's > hotplug04: Verify disallowing offlining all CPU's > hotplug05: (Unimplemented) > hotplug06: Check userspace tools (sar, top) during cpu hotplug events > hotplug07: Stress case doing kernel compile while cpu's are > hotplugged on and off repeatedly Well, while nice for "it basically works", that will not stress hotplug subsystem too badly. If you want some real nasty tests: hotplug_locking: create 10 threads, make them try to online/offline random cpus, all in paralel. (This is what I was doing in smaller scale). You'll get some expected errors (like cpu already up), but system should survive. cpufreq: change cpufreq parameters on cpu (toggling min/max frequency?) while trying to online/offline that cpu from another thread. suspend: swapoff -a, then proceed like in hotplug_locking, while trying to suspend machine to disk (it will immediately wake up because of no swap available). Should be useful at pointing out bugs in suspend code. (but quite tricky to setup the test, so you may or may not want to do this one). > We've been running this testsuite fairly continuously for several > months, and irregularly for about a year before that. We find that on > some platforms like PPC64 it's quite robust, and on others there are > issues, but the developers tend to be quick to provide fixes as the > issues are found. I'm glad to see that the results are finally showing > green for ia64. Hmm, perhaps you should add ppc64 to the hotplug_report.html, so that some green can be seen :-). Pavel -- (english) http://www.livejournal.com/~pavelmachek (cesky, pictures) http://atrey.karlin.mff.cuni.cz/~pavel/picture/horses/blog.html ^ permalink raw reply [flat|nested] 31+ messages in thread
* Re: Status on CPU hotplug issues 2006-10-07 10:35 ` Pavel Machek @ 2006-10-07 20:42 ` Bryce Harrington 2006-10-08 18:29 ` Heiko Carstens 0 siblings, 1 reply; 31+ messages in thread From: Bryce Harrington @ 2006-10-07 20:42 UTC (permalink / raw) To: Pavel Machek; +Cc: Andrew Morton, kernel list On Sat, Oct 07, 2006 at 12:35:59PM +0200, Pavel Machek wrote: > On Fri 2006-10-06 17:00:31, Bryce Harrington wrote: > > On Fri, Oct 06, 2006 at 04:29:24PM -0700, Andrew Morton wrote: > > > Can you describe the nature of the cpu-hotplug tests you're running? I'd > > > be fairly staggered if the kernel was able to survive a full-on cpu-hotplug > > > stress test for more than one second, frankly. There's a lot of code in > > > there which is non-hotplug-aware. Running a non-preemptible kernel would > > > make things appear more stable, perhaps. > > > > Certainly, the testsuite is one the OSDL Hotplug SIG put together last > > summer, and consists of several test cases: > > > > http://developer.osdl.org/dev/HOTPLUG/planning/hotplug_cpu_test_plan_status.html > > Page actually lists test 1-6. Case 7 was based on a contribution by one of the kernel developers, that he had used for testing his cpu code. > > hotplug01: Check IRQ behavior during cpu hotplug events > > hotplug02: Check process migration during cpu hotplug events > > hotplug03: Verify tasks get scheduled on newly onlined cpu's > > hotplug04: Verify disallowing offlining all CPU's > > hotplug05: (Unimplemented) > > hotplug06: Check userspace tools (sar, top) during cpu hotplug events > > hotplug07: Stress case doing kernel compile while cpu's are > > hotplugged on and off repeatedly > > Well, while nice for "it basically works", that will not stress > hotplug subsystem too badly. > > If you want some real nasty tests: > > hotplug_locking: create 10 threads, make them try to online/offline > random cpus, all in paralel. (This is what I was doing in smaller > scale). You'll get some expected errors (like cpu already up), but > system should survive. > > cpufreq: change cpufreq parameters on cpu (toggling min/max > frequency?) while trying to online/offline that cpu from another > thread. > > suspend: swapoff -a, then proceed like in hotplug_locking, while > trying to suspend machine to disk (it will immediately wake up because > of no swap available). Should be useful at pointing out bugs in > suspend code. (but quite tricky to setup the test, so you may or may > not want to do this one). Thanks, I've added these to the todo list. > > We've been running this testsuite fairly continuously for several > > months, and irregularly for about a year before that. We find that on > > some platforms like PPC64 it's quite robust, and on others there are > > issues, but the developers tend to be quick to provide fixes as the > > issues are found. I'm glad to see that the results are finally showing > > green for ia64. > > Hmm, perhaps you should add ppc64 to the hotplug_report.html, so that > some green can be seen :-). I'd like to, however the issue has been that we cannot automatically do boot-once with yaboot on PPC like we can with other bootloaders, so when the machine locks up we have to manually boot the machine to test kernels. That was okay initially when we were developing the testsuite, but for running the -mm, -git, and -rc trees every day it hasn't been feasible to do. So, getting boot-once functionality enabled in yaboot (or getting grub2 stable for ppc64) is another issue we're tracking. Kirkland has done some work in this area, but it sounds like advice from someone with good knowledge of yaboot internals is necessary to get this solved. I'm sure we'll get there eventually, but this has been a roadblock for automating our ppc64 kernel testing automation so far. Bryce ^ permalink raw reply [flat|nested] 31+ messages in thread
* Re: Status on CPU hotplug issues 2006-10-07 20:42 ` Bryce Harrington @ 2006-10-08 18:29 ` Heiko Carstens 2006-10-08 19:14 ` Pavel Machek 0 siblings, 1 reply; 31+ messages in thread From: Heiko Carstens @ 2006-10-08 18:29 UTC (permalink / raw) To: Bryce Harrington; +Cc: Pavel Machek, Andrew Morton, kernel list > > > We've been running this testsuite fairly continuously for several > > > months, and irregularly for about a year before that. We find that on > > > some platforms like PPC64 it's quite robust, and on others there are > > > issues, but the developers tend to be quick to provide fixes as the > > > issues are found. I'm glad to see that the results are finally showing > > > green for ia64. > > > > Hmm, perhaps you should add ppc64 to the hotplug_report.html, so that > > some green can be seen :-). FWIW, s390 passes all tests as well ;) ^ permalink raw reply [flat|nested] 31+ messages in thread
* Re: Status on CPU hotplug issues 2006-10-08 18:29 ` Heiko Carstens @ 2006-10-08 19:14 ` Pavel Machek 2006-10-11 1:08 ` [BUG] 2.6.19-rc1-mm1: fs/file.c138 on ia64 Bryce Harrington 0 siblings, 1 reply; 31+ messages in thread From: Pavel Machek @ 2006-10-08 19:14 UTC (permalink / raw) To: Heiko Carstens; +Cc: Bryce Harrington, Andrew Morton, kernel list On Sun 2006-10-08 20:29:41, Heiko Carstens wrote: > > > > We've been running this testsuite fairly continuously for several > > > > months, and irregularly for about a year before that. We find that on > > > > some platforms like PPC64 it's quite robust, and on others there are > > > > issues, but the developers tend to be quick to provide fixes as the > > > > issues are found. I'm glad to see that the results are finally showing > > > > green for ia64. > > > > > > Hmm, perhaps you should add ppc64 to the hotplug_report.html, so that > > > some green can be seen :-). > > FWIW, s390 passes all tests as well ;) Add it to the table so that we can see more systems where it works ;-). Pavel -- (english) http://www.livejournal.com/~pavelmachek (cesky, pictures) http://atrey.karlin.mff.cuni.cz/~pavel/picture/horses/blog.html ^ permalink raw reply [flat|nested] 31+ messages in thread
* [BUG] 2.6.19-rc1-mm1: fs/file.c138 on ia64 2006-10-08 19:14 ` Pavel Machek @ 2006-10-11 1:08 ` Bryce Harrington 2006-10-11 1:15 ` Andrew Morton 0 siblings, 1 reply; 31+ messages in thread From: Bryce Harrington @ 2006-10-11 1:08 UTC (permalink / raw) To: Pavel Machek; +Cc: Heiko Carstens, Andrew Morton, kernel list On Sun, Oct 08, 2006 at 09:14:47PM +0200, Pavel Machek wrote: > On Sun 2006-10-08 20:29:41, Heiko Carstens wrote: > > > > > We've been running this testsuite fairly continuously for several > > > > > months, and irregularly for about a year before that. We find that on > > > > > some platforms like PPC64 it's quite robust, and on others there are > > > > > issues, but the developers tend to be quick to provide fixes as the > > > > > issues are found. I'm glad to see that the results are finally showing > > > > > green for ia64. Spoke too soon. ;-) We've noticed a new ia64 issue on the 2.6.19-rc1-mm1 kernel. It has not occurred on other 2.6.19 kernels we've tested. We aslo encountered this BUG only on ia64; the x86 and x86_64 systems booted without issue. Apologies if this is already known; I didn't spot it in the list archives. I have hotplug-cpu configured for this machine, however I don't know if it has anything to do with this BUG. I can test with it turned of if it'd help. The line referred to in the output is in copy_fdtable(): BUG_ON(nfdt->max_fds < ofdt->max_fds); http://crucible.osdl.org/runs/2511/sysinfo/ita01.console.log Freeing unused kernel memory: 1840kB freed kernel BUG at fs/file.c:138! hotplug[956]: bugcheck! 0 [1] Modules linked in: Pid: 956, CPU 1, comm: hotplug psr : 0000101008026018 ifs : 800000000000050d ip : [<a00000010017c990>] Not tainted ip is at copy_fdtable+0x70/0x180 unat: 0000000000000000 pfs : 000000000000050d rsc : 0000000000000003 rnat: 0000000000000000 bsps: 0000000000000000 pr : 0000000000019659 ldrs: 0000000000000000 ccv : 0000000000000000 fpsr: 0009804c8a70033f csd : 0000000000000000 ssd : 0000000000000000 b0 : a00000010017c990 b6 : a000000100002d70 b7 : a0000001000373a0 f6 : 1003e0000000cc6c97af8 f7 : 0ffd9a200000000000000 f8 : 1003e00000000000005dc f9 : 1003e0000000cc6c980d4 f10 : 1003e0000000000000001 f11 : 1003e0000000000000000 r1 : a000000100e8f210 r2 : 0000000000004000 r3 : 0000000000004000 r8 : 0000000000000020 r9 : 0000000000000000 r10 : a000000100b600b0 r11 : 0000000000003e97 r12 : e0000040fc30fe30 r13 : e0000040fc308000 r14 : a000000100b3fc20 r15 : 0000000000003e97 r16 : ffffffffffffc169 r17 : a000000100ca89d0 r18 : 0000000000000000 r19 : a000000100ca89c8 r20 : a000000100ca89d8 r21 : 0000000000004000 r22 : 0000000000000000 r23 : a000000100ca89c0 r24 : a000000100ca44a0 r25 : 0000000000000000 r26 : a000000100ca89b4 r27 : a000000100b3fc20 r28 : a000000100c8f4e8 r29 : 0000000000800000 r30 : 0000000000000000 r31 : a000000100ca8990 Call Trace: [<a000000100012760>] show_stack+0x40/0xa0 sp=e0000040fc30f9a0 bsp=e0000040fc308e40 [<a000000100012ff0>] show_regs+0x7d0/0x800 sp=e0000040fc30fb70 bsp=e0000040fc308df0 [<a000000100037940>] die+0x1c0/0x2c0 sp=e0000040fc30fb70 bsp=e0000040fc308da8 [<a000000100037a80>] die_if_kernel+0x40/0x60 sp=e0000040fc30fb90 bsp=e0000040fc308d78 [<a000000100037cc0>] ia64_bad_break+0x220/0x460 sp=e0000040fc30fb90 bsp=e0000040fc308d50 [<a00000010000c8a0>] ia64_leave_kernel+0x0/0x280 sp=e0000040fc30fc60 bsp=e0000040fc308d50 [<a00000010017c990>] copy_fdtable+0x70/0x180 sp=e0000040fc30fe30 bsp=e0000040fc308ce8 [<a00000010017cef0>] expand_fdtable+0x90/0x1e0 sp=e0000040fc30fe30 bsp=e0000040fc308cb0 [<a00000010017d0a0>] expand_files+0x60/0x80 sp=e0000040fc30fe30 bsp=e0000040fc308c88 [<a000000100167210>] sys_dup2+0x130/0x360 sp=e0000040fc30fe30 bsp=e0000040fc308c00 [<a00000010000c700>] ia64_ret_from_syscall+0x0/0x20 sp=e0000040fc30fe30 bsp=e0000040fc308c00 Config used: http://crucible.osdl.org/runs/2511/sysinfo/ita01.config More info about system: http://crucible.osdl.org/runs/2511/sysinfo/ For comparison, here is a run against patch-2.6.19-rc1-git6: http://crucible.osdl.org/runs/2505/ Bryce ^ permalink raw reply [flat|nested] 31+ messages in thread
* Re: [BUG] 2.6.19-rc1-mm1: fs/file.c138 on ia64 2006-10-11 1:08 ` [BUG] 2.6.19-rc1-mm1: fs/file.c138 on ia64 Bryce Harrington @ 2006-10-11 1:15 ` Andrew Morton 2006-10-11 5:38 ` Bryce Harrington 0 siblings, 1 reply; 31+ messages in thread From: Andrew Morton @ 2006-10-11 1:15 UTC (permalink / raw) To: Bryce Harrington; +Cc: Pavel Machek, Heiko Carstens, kernel list On Tue, 10 Oct 2006 18:08:26 -0700 Bryce Harrington <bryce@osdl.org> wrote: > On Sun, Oct 08, 2006 at 09:14:47PM +0200, Pavel Machek wrote: > > On Sun 2006-10-08 20:29:41, Heiko Carstens wrote: > > > > > > We've been running this testsuite fairly continuously for several > > > > > > months, and irregularly for about a year before that. We find that on > > > > > > some platforms like PPC64 it's quite robust, and on others there are > > > > > > issues, but the developers tend to be quick to provide fixes as the > > > > > > issues are found. I'm glad to see that the results are finally showing > > > > > > green for ia64. > > Spoke too soon. ;-) > > We've noticed a new ia64 issue on the 2.6.19-rc1-mm1 kernel. It has not > occurred on other 2.6.19 kernels we've tested. We aslo encountered this > BUG only on ia64; the x86 and x86_64 systems booted without issue. > Apologies if this is already known; I didn't spot it in the list > archives. > > I have hotplug-cpu configured for this machine, however I don't know if > it has anything to do with this BUG. I can test with it turned of if > it'd help. > > The line referred to in the output is in copy_fdtable(): > BUG_ON(nfdt->max_fds < ofdt->max_fds); > > > http://crucible.osdl.org/runs/2511/sysinfo/ita01.console.log > > Freeing unused kernel memory: 1840kB freed > kernel BUG at fs/file.c:138! Was that with ftp://ftp.kernel.org/pub/linux/kernel/people/akpm/patches/2.6/2.6.19-rc1/2.6.19-rc1-mm1/hot-fixes applied? ^ permalink raw reply [flat|nested] 31+ messages in thread
* Re: [BUG] 2.6.19-rc1-mm1: fs/file.c138 on ia64 2006-10-11 1:15 ` Andrew Morton @ 2006-10-11 5:38 ` Bryce Harrington 0 siblings, 0 replies; 31+ messages in thread From: Bryce Harrington @ 2006-10-11 5:38 UTC (permalink / raw) To: Andrew Morton; +Cc: Pavel Machek, Heiko Carstens, kernel list On Tue, Oct 10, 2006 at 06:15:53PM -0700, Andrew Morton wrote: > On Tue, 10 Oct 2006 18:08:26 -0700 > Bryce Harrington <bryce@osdl.org> wrote: > > We've noticed a new ia64 issue on the 2.6.19-rc1-mm1 kernel. It has not > > occurred on other 2.6.19 kernels we've tested. We aslo encountered this > > BUG only on ia64; the x86 and x86_64 systems booted without issue. > > Apologies if this is already known; I didn't spot it in the list > > archives. > > > > http://crucible.osdl.org/runs/2511/sysinfo/ita01.console.log > > > > Freeing unused kernel memory: 1840kB freed > > kernel BUG at fs/file.c:138! > > Was that with > ftp://ftp.kernel.org/pub/linux/kernel/people/akpm/patches/2.6/2.6.19-rc1/2.6.19-rc1-mm1/hot-fixes > applied? Vadim Lobanov <vlobanov@speakeasy.net> writes: > Dave, Olof, Linas, Bryce, > > Could you please test the patch at the bottom of the email to see if it > makes your computers happy again, if you have the time and inclination > to do so? I've run the tests with the hotfix and with Vadim's patch. Here are the results of compiling, booting, and running the hotplug-cpu test on it: run patch(es) result 2511 2.6.19-rc1-mm1 BUG at fs/file.c:138 2519 2.6.19-rc1-mm1 + hot-fix PASS 2521 2.6.19-rc1-mm1 + vadim's patch PASS 2520 2.6.19-rc1-mm1 + hot-fix + vadim's patch patch doesn't apply Results available at http://crucible.osdl.org/runs/$run/ Thanks, Bryce ^ permalink raw reply [flat|nested] 31+ messages in thread
* Re: Status on CPU hotplug issues 2006-10-06 23:29 ` Andrew Morton 2006-10-07 0:00 ` Bryce Harrington @ 2006-10-07 10:24 ` Pavel Machek 2006-10-07 20:25 ` Bryce Harrington 1 sibling, 1 reply; 31+ messages in thread From: Pavel Machek @ 2006-10-07 10:24 UTC (permalink / raw) To: Andrew Morton Cc: Bryce Harrington, vatsa, torvalds, linux-kernel, shaohua.li, hotplug_sig, lhcs-devel Hi! > > > How well tested is this? From my reading, this will cause > > > enable_nonboot_cpus() to panic. Is that intended? > > > > I wanted to give you an update on results of cpu testing I've done on > > recent kernels and several architectures. Since -rc1 is out, I wanted > > to give added visibility to the few issues that remain. > > > > The full results are available here: > > > > http://crucible.osdl.org/runs/hotplug_report.html > > > > This is actually a report for cpu hotplug tests generated hourly, > > however we run it against all of the kernel -git snapshots posted to > > kernel.org. Whereever you see a blank square, it indicates the kernel > > either failed to build or boot. So... patch-2.6.18-git4 failed to boot on all architectures? I'm seeing very little green fields there... actually I only see two green fields in whole table. (And it would be nice to call ia64 "ia64", not "ita64" :-) > Can you describe the nature of the cpu-hotplug tests you're running? I'd > be fairly staggered if the kernel was able to survive a full-on cpu-hotplug > stress test for more than one second, frankly. There's a lot of code in > there which is non-hotplug-aware. Running a non-preemptible kernel would > make things appear more stable, perhaps. > > iirc Pavel did some testing a month or two ago and was seeing userspace > misbehaviour? Pavel did some testing (like two threads trying to plug/unplug cpus at the same time), and seen machines dying real fast; but that was fixed, IIRC, and I did not really torture it after that. Pavel -- (english) http://www.livejournal.com/~pavelmachek (cesky, pictures) http://atrey.karlin.mff.cuni.cz/~pavel/picture/horses/blog.html ^ permalink raw reply [flat|nested] 31+ messages in thread
* Re: Status on CPU hotplug issues 2006-10-07 10:24 ` Status on CPU hotplug issues Pavel Machek @ 2006-10-07 20:25 ` Bryce Harrington 2006-10-08 19:13 ` Pavel Machek 0 siblings, 1 reply; 31+ messages in thread From: Bryce Harrington @ 2006-10-07 20:25 UTC (permalink / raw) To: Pavel Machek Cc: Andrew Morton, vatsa, torvalds, linux-kernel, shaohua.li, hotplug_sig, lhcs-devel On Sat, Oct 07, 2006 at 12:24:19PM +0200, Pavel Machek wrote: > > > > How well tested is this? From my reading, this will cause > > > > enable_nonboot_cpus() to panic. Is that intended? > > > > > > I wanted to give you an update on results of cpu testing I've done on > > > recent kernels and several architectures. Since -rc1 is out, I wanted > > > to give added visibility to the few issues that remain. > > > > > > The full results are available here: > > > > > > http://crucible.osdl.org/runs/hotplug_report.html > > > > > > This is actually a report for cpu hotplug tests generated hourly, > > > however we run it against all of the kernel -git snapshots posted to > > > kernel.org. Whereever you see a blank square, it indicates the kernel > > > either failed to build or boot. > > So... patch-2.6.18-git4 failed to boot on all architectures? I'm > seeing very little green fields there... actually I only see two green > fields in whole table. No, it failed to build due to a patching issue that has since been fixed (I can rerun those older runs if there is interest.) > (And it would be nice to call ia64 "ia64", not "ita64" :-) Done > > Can you describe the nature of the cpu-hotplug tests you're running? I'd > > be fairly staggered if the kernel was able to survive a full-on cpu-hotplug > > stress test for more than one second, frankly. There's a lot of code in > > there which is non-hotplug-aware. Running a non-preemptible kernel would > > make things appear more stable, perhaps. > > > > iirc Pavel did some testing a month or two ago and was seeing userspace > > misbehaviour? > > Pavel did some testing (like two threads trying to plug/unplug cpus at > the same time), and seen machines dying real fast; but that was fixed, > IIRC, and I did not really torture it after that. If this test is available, I could include it in my test runs if you think it would be worth tracking. Bryce ^ permalink raw reply [flat|nested] 31+ messages in thread
* Re: Status on CPU hotplug issues 2006-10-07 20:25 ` Bryce Harrington @ 2006-10-08 19:13 ` Pavel Machek 2006-10-09 7:42 ` Bryce Harrington 0 siblings, 1 reply; 31+ messages in thread From: Pavel Machek @ 2006-10-08 19:13 UTC (permalink / raw) To: Bryce Harrington Cc: Andrew Morton, vatsa, torvalds, linux-kernel, shaohua.li, hotplug_sig, lhcs-devel Hi! > > > > > How well tested is this? From my reading, this will cause > > > > > enable_nonboot_cpus() to panic. Is that intended? > > > > > > > > I wanted to give you an update on results of cpu testing I've done on > > > > recent kernels and several architectures. Since -rc1 is out, I wanted > > > > to give added visibility to the few issues that remain. > > > > > > > > The full results are available here: > > > > > > > > http://crucible.osdl.org/runs/hotplug_report.html > > > > > > > > This is actually a report for cpu hotplug tests generated hourly, > > > > however we run it against all of the kernel -git snapshots posted to > > > > kernel.org. Whereever you see a blank square, it indicates the kernel > > > > either failed to build or boot. > > > > So... patch-2.6.18-git4 failed to boot on all architectures? I'm > > seeing very little green fields there... actually I only see two green > > fields in whole table. > > No, it failed to build due to a patching issue that has since been fixed > (I can rerun those older runs if there is interest.) Reruning few of them and deleting rest from table would be nice. > > > iirc Pavel did some testing a month or two ago and was seeing userspace > > > misbehaviour? > > > > Pavel did some testing (like two threads trying to plug/unplug cpus at > > the same time), and seen machines dying real fast; but that was fixed, > > IIRC, and I did not really torture it after that. > > If this test is available, I could include it in my test runs if you > think it would be worth tracking. Is shell script acceptable form of a test? Pavel -- (english) http://www.livejournal.com/~pavelmachek (cesky, pictures) http://atrey.karlin.mff.cuni.cz/~pavel/picture/horses/blog.html ^ permalink raw reply [flat|nested] 31+ messages in thread
* Re: Status on CPU hotplug issues 2006-10-08 19:13 ` Pavel Machek @ 2006-10-09 7:42 ` Bryce Harrington 0 siblings, 0 replies; 31+ messages in thread From: Bryce Harrington @ 2006-10-09 7:42 UTC (permalink / raw) To: Pavel Machek Cc: Andrew Morton, vatsa, torvalds, linux-kernel, shaohua.li, hotplug_sig, lhcs-devel On Sun, Oct 08, 2006 at 09:13:50PM +0200, Pavel Machek wrote: > > > So... patch-2.6.18-git4 failed to boot on all architectures? I'm > > > seeing very little green fields there... actually I only see two green > > > fields in whole table. > > > > No, it failed to build due to a patching issue that has since been fixed > > (I can rerun those older runs if there is interest.) > > Reruning few of them and deleting rest from table would be nice. I've requeued a few. The runs will auto-expunge from the results in a week or two. > > > > iirc Pavel did some testing a month or two ago and was seeing userspace > > > > misbehaviour? > > > > > > Pavel did some testing (like two threads trying to plug/unplug cpus at > > > the same time), and seen machines dying real fast; but that was fixed, > > > IIRC, and I did not really torture it after that. > > > > If this test is available, I could include it in my test runs if you > > think it would be worth tracking. > > Is shell script acceptable form of a test? Yes. In fact all the tests are shell scripts so far. Bryce ^ permalink raw reply [flat|nested] 31+ messages in thread
* Re: Status on CPU hotplug issues 2006-10-06 23:10 ` Status on CPU hotplug issues Bryce Harrington 2006-10-06 23:29 ` Andrew Morton @ 2006-10-07 21:57 ` Pavel Machek 2006-10-09 21:40 ` Randy Dunlap 1 sibling, 1 reply; 31+ messages in thread From: Pavel Machek @ 2006-10-07 21:57 UTC (permalink / raw) To: Bryce Harrington Cc: Andrew Morton, vatsa, torvalds, linux-kernel, shaohua.li, hotplug_sig, lhcs-devel Hi! > 1. Oops offlining cpu twice on AMD64 (but not on EM64t) > with the 2.6.18-git22 kernel > > Reported to hotplug lists 10/05: > http://lists.osdl.org/pipermail/hotplug_sig/2006-October/000680.html > > To recreate: offline, online, and then offline a CPU, then oopses > http://crucible.osdl.org/runs/2397/sysinfo/amd01.console > http://crucible.osdl.org/runs/2397/sysinfo/amd01.2/proc/config > > Here's a snippet of the oops: > > # echo 0 > /sys/devices/system/cpu/cpu1/online > > Unable to handle kernel NULL pointer dereference at 0000000000000000 RIP: > [<ffffffff80255287>] __drain_pages+0x29/0x5f > PGD 7e56d067 PUD 7ee80067 PMD 0 > Oops: 0000 [1] PREEMPT SMP > CPU 0 > Modules linked in: > Pid: 7203, comm: bash Tainted: G M 2.6.18-git22 #1 ~~~~~ kernel is unhappy here. Forced module unload? -- Thanks for all the (sleeping) penguins. ^ permalink raw reply [flat|nested] 31+ messages in thread
* Re: Status on CPU hotplug issues 2006-10-07 21:57 ` Pavel Machek @ 2006-10-09 21:40 ` Randy Dunlap 2006-10-23 22:26 ` Bryce Harrington 0 siblings, 1 reply; 31+ messages in thread From: Randy Dunlap @ 2006-10-09 21:40 UTC (permalink / raw) To: Pavel Machek Cc: Bryce Harrington, Andrew Morton, vatsa, torvalds, linux-kernel, shaohua.li, hotplug_sig, lhcs-devel On Sat, 7 Oct 2006 21:57:49 +0000 Pavel Machek wrote: > Hi! > > > 1. Oops offlining cpu twice on AMD64 (but not on EM64t) > > with the 2.6.18-git22 kernel > > > > Reported to hotplug lists 10/05: > > http://lists.osdl.org/pipermail/hotplug_sig/2006-October/000680.html > > > > To recreate: offline, online, and then offline a CPU, then oopses > > http://crucible.osdl.org/runs/2397/sysinfo/amd01.console > > http://crucible.osdl.org/runs/2397/sysinfo/amd01.2/proc/config > > > > Here's a snippet of the oops: > > > > # echo 0 > /sys/devices/system/cpu/cpu1/online > > > > Unable to handle kernel NULL pointer dereference at 0000000000000000 RIP: > > [<ffffffff80255287>] __drain_pages+0x29/0x5f > > PGD 7e56d067 PUD 7ee80067 PMD 0 > > Oops: 0000 [1] PREEMPT SMP > > CPU 0 > > Modules linked in: > > Pid: 7203, comm: bash Tainted: G M 2.6.18-git22 #1 > ~~~~~ > kernel is unhappy here. Forced module unload? Machine check exception. 'G' is Good, same place where 'P' for proprietary would be. But yes, kernel or machine is unhappy. --- ~Randy ^ permalink raw reply [flat|nested] 31+ messages in thread
* Re: Status on CPU hotplug issues 2006-10-09 21:40 ` Randy Dunlap @ 2006-10-23 22:26 ` Bryce Harrington 2006-11-08 5:35 ` Randy Dunlap 0 siblings, 1 reply; 31+ messages in thread From: Bryce Harrington @ 2006-10-23 22:26 UTC (permalink / raw) To: Randy Dunlap Cc: Pavel Machek, Andrew Morton, vatsa, torvalds, linux-kernel, shaohua.li, hotplug_sig, lhcs-devel On Mon, Oct 09, 2006 at 02:40:24PM -0700, Randy Dunlap wrote: > On Sat, 7 Oct 2006 21:57:49 +0000 Pavel Machek wrote: > > > Hi! > > > > > 1. Oops offlining cpu twice on AMD64 (but not on EM64t) > > > with the 2.6.18-git22 kernel > > > > > > Reported to hotplug lists 10/05: > > > http://lists.osdl.org/pipermail/hotplug_sig/2006-October/000680.html > > > > > > To recreate: offline, online, and then offline a CPU, then oopses > > > http://crucible.osdl.org/runs/2397/sysinfo/amd01.console > > > http://crucible.osdl.org/runs/2397/sysinfo/amd01.2/proc/config > > > > > > Here's a snippet of the oops: > > > > > > # echo 0 > /sys/devices/system/cpu/cpu1/online > > > > > > Unable to handle kernel NULL pointer dereference at 0000000000000000 RIP: > > > [<ffffffff80255287>] __drain_pages+0x29/0x5f > > > PGD 7e56d067 PUD 7ee80067 PMD 0 > > > Oops: 0000 [1] PREEMPT SMP > > > CPU 0 > > > Modules linked in: > > > Pid: 7203, comm: bash Tainted: G M 2.6.18-git22 #1 > > ~~~~~ > > kernel is unhappy here. Forced module unload? > > Machine check exception. 'G' is Good, same place where 'P' > for proprietary would be. But yes, kernel or machine is unhappy. To followup on this issue... I found a BIOS update for the motherboard of this machine indicating it includes a fix for MCE during hibernate operations; my guess is that cpu hotplug may be triggering this bug. Meanwhile, we checked against a couple other different AMD64 systems; these are behaving correctly. Anyway, thanks for the pointers, it sounds like this is probably just a hardware issue. I'll report back if I find differently. Bryce ^ permalink raw reply [flat|nested] 31+ messages in thread
* Re: Status on CPU hotplug issues 2006-10-23 22:26 ` Bryce Harrington @ 2006-11-08 5:35 ` Randy Dunlap 2006-11-08 5:52 ` Bryce Harrington 0 siblings, 1 reply; 31+ messages in thread From: Randy Dunlap @ 2006-11-08 5:35 UTC (permalink / raw) To: Bryce Harrington Cc: Pavel Machek, Andrew Morton, vatsa, torvalds, linux-kernel, shaohua.li, hotplug_sig, lhcs-devel On Mon, 23 Oct 2006 15:26:24 -0700 Bryce Harrington wrote: > On Mon, Oct 09, 2006 at 02:40:24PM -0700, Randy Dunlap wrote: > > On Sat, 7 Oct 2006 21:57:49 +0000 Pavel Machek wrote: > > > > > Hi! > > > > > > > 1. Oops offlining cpu twice on AMD64 (but not on EM64t) > > > > with the 2.6.18-git22 kernel > > > > > > > > Reported to hotplug lists 10/05: > > > > http://lists.osdl.org/pipermail/hotplug_sig/2006-October/000680.html > > > > > > > > To recreate: offline, online, and then offline a CPU, then oopses > > > > http://crucible.osdl.org/runs/2397/sysinfo/amd01.console > > > > http://crucible.osdl.org/runs/2397/sysinfo/amd01.2/proc/config > > > > > > > > Here's a snippet of the oops: > > > > > > > > # echo 0 > /sys/devices/system/cpu/cpu1/online > > > > > > > > Unable to handle kernel NULL pointer dereference at 0000000000000000 RIP: > > > > [<ffffffff80255287>] __drain_pages+0x29/0x5f > > > > PGD 7e56d067 PUD 7ee80067 PMD 0 > > > > Oops: 0000 [1] PREEMPT SMP > > > > CPU 0 > > > > Modules linked in: > > > > Pid: 7203, comm: bash Tainted: G M 2.6.18-git22 #1 > > > ~~~~~ > > > kernel is unhappy here. Forced module unload? > > > > Machine check exception. 'G' is Good, same place where 'P' > > for proprietary would be. But yes, kernel or machine is unhappy. > > To followup on this issue... > > I found a BIOS update for the motherboard of this machine indicating it > includes a fix for MCE during hibernate operations; my guess is that > cpu hotplug may be triggering this bug. > > Meanwhile, we checked against a couple other different AMD64 systems; > these are behaving correctly. > > Anyway, thanks for the pointers, it sounds like this is probably just a > hardware issue. I'll report back if I find differently. Regarding the MCE (that the BIOS update did not fix for this one particular machine), I don't see the actual Machine Check Exception kernel message log anywhere. It would have happened before the oops that is printed by this CPU hotplug test. Is the complete kerne log available or can it be reproduced? --- ~Randy ^ permalink raw reply [flat|nested] 31+ messages in thread
* Re: Status on CPU hotplug issues 2006-11-08 5:35 ` Randy Dunlap @ 2006-11-08 5:52 ` Bryce Harrington 0 siblings, 0 replies; 31+ messages in thread From: Bryce Harrington @ 2006-11-08 5:52 UTC (permalink / raw) To: Randy Dunlap Cc: Pavel Machek, Andrew Morton, vatsa, torvalds, linux-kernel, shaohua.li, hotplug_sig, lhcs-devel On Tue, Nov 07, 2006 at 09:35:32PM -0800, Randy Dunlap wrote: > On Mon, 23 Oct 2006 15:26:24 -0700 Bryce Harrington wrote: > > > On Mon, Oct 09, 2006 at 02:40:24PM -0700, Randy Dunlap wrote: > > > On Sat, 7 Oct 2006 21:57:49 +0000 Pavel Machek wrote: > > > > > > > Hi! > > > > > > > > > 1. Oops offlining cpu twice on AMD64 (but not on EM64t) > > > > > with the 2.6.18-git22 kernel > > > > > > > > > > Reported to hotplug lists 10/05: > > > > > http://lists.osdl.org/pipermail/hotplug_sig/2006-October/000680.html > > > > > > > > > > To recreate: offline, online, and then offline a CPU, then oopses > > > > > http://crucible.osdl.org/runs/2397/sysinfo/amd01.console > > > > > http://crucible.osdl.org/runs/2397/sysinfo/amd01.2/proc/config > > > > > > > > > > Here's a snippet of the oops: > > > > > > > > > > # echo 0 > /sys/devices/system/cpu/cpu1/online > > > > > > > > > > Unable to handle kernel NULL pointer dereference at 0000000000000000 RIP: > > > > > [<ffffffff80255287>] __drain_pages+0x29/0x5f > > > > > PGD 7e56d067 PUD 7ee80067 PMD 0 > > > > > Oops: 0000 [1] PREEMPT SMP > > > > > CPU 0 > > > > > Modules linked in: > > > > > Pid: 7203, comm: bash Tainted: G M 2.6.18-git22 #1 > > > > ~~~~~ > > > > kernel is unhappy here. Forced module unload? > > > > > > Machine check exception. 'G' is Good, same place where 'P' > > > for proprietary would be. But yes, kernel or machine is unhappy. > > > > To followup on this issue... > > > > I found a BIOS update for the motherboard of this machine indicating it > > includes a fix for MCE during hibernate operations; my guess is that > > cpu hotplug may be triggering this bug. > > > > Meanwhile, we checked against a couple other different AMD64 systems; > > these are behaving correctly. > > > > Anyway, thanks for the pointers, it sounds like this is probably just a > > hardware issue. I'll report back if I find differently. > > Regarding the MCE (that the BIOS update did not fix for this one > particular machine), I don't see the actual Machine Check Exception > kernel message log anywhere. It would have happened before the > oops that is printed by this CPU hotplug test. Is the complete > kerne log available or can it be reproduced? I never could actually get a bonified MCE error message - the only clue it was an MCE was your read on the line you highlighted above. All the info we have is what we posted. Bryce ^ permalink raw reply [flat|nested] 31+ messages in thread
* Re: [PATCH] Check for online cpus before bringing them up 2006-03-17 8:46 ` Srivatsa Vaddagiri 2006-03-17 9:04 ` Andrew Morton @ 2006-03-17 12:21 ` Ashok Raj 2006-03-17 13:59 ` Srivatsa Vaddagiri 1 sibling, 1 reply; 31+ messages in thread From: Ashok Raj @ 2006-03-17 12:21 UTC (permalink / raw) To: Srivatsa Vaddagiri Cc: Andrew Morton, torvalds, linux-kernel, Li, Shaohua, bryce On Fri, Mar 17, 2006 at 12:46:53AM -0800, Srivatsa Vaddagiri wrote: > > > Well ..other arch-es need to have a similar check if they get around > to > implement physical hot-add (even if they allow offlining of all CPUs). > This is really not for physical hotplug, but due to fact i386 startup code wasnt clean enough. On x86_64 Andi cleaned this up and we dont need those hacks on x86_64. > + if (cpu_online(cpu)) { Should we add !cpu_present(cpu) check as well just to be consistent with checks in cpu_up() ? Probably better if we can move smp_prepare_cpu() to within cpu_up()? How does the attached patch look. Check if cpu can be onlined before calling smp_prepare_cpu() - Moved check for online cpu out of smp_prepare_cpu() - Moved default declaration of smp_prepare_cpu() to kernel/cpu.c - Removed lock_cpu_hotplug() from smp_prepare_cpu() to around it, since its called from cpu_up() as well now. Signed-off-by: Ashok Raj <ashok.raj@intel.com> -------------------------------------------------------- arch/i386/kernel/smpboot.c | 2 -- drivers/base/cpu.c | 9 +-------- kernel/cpu.c | 9 +++++++++ kernel/power/smp.c | 2 ++ 4 files changed, 12 insertions(+), 10 deletions(-) Index: linux-2.6.16-rc6-mm1/arch/i386/kernel/smpboot.c =================================================================== --- linux-2.6.16-rc6-mm1.orig/arch/i386/kernel/smpboot.c +++ linux-2.6.16-rc6-mm1/arch/i386/kernel/smpboot.c @@ -1028,7 +1028,6 @@ int __devinit smp_prepare_cpu(int cpu) struct work_struct task; int apicid, ret; - lock_cpu_hotplug(); apicid = x86_cpu_to_apicid[cpu]; if (apicid == BAD_APICID) { ret = -ENODEV; @@ -1053,7 +1052,6 @@ int __devinit smp_prepare_cpu(int cpu) zap_low_mappings(); ret = 0; exit: - unlock_cpu_hotplug(); return ret; } #endif Index: linux-2.6.16-rc6-mm1/drivers/base/cpu.c =================================================================== --- linux-2.6.16-rc6-mm1.orig/drivers/base/cpu.c +++ linux-2.6.16-rc6-mm1/drivers/base/cpu.c @@ -19,11 +19,6 @@ EXPORT_SYMBOL(cpu_sysdev_class); static struct sys_device *cpu_sys_devices[NR_CPUS]; #ifdef CONFIG_HOTPLUG_CPU -int __attribute__((weak)) smp_prepare_cpu (int cpu) -{ - return 0; -} - static ssize_t show_online(struct sys_device *dev, char *buf) { struct cpu *cpu = container_of(dev, struct cpu, sysdev); @@ -44,9 +39,7 @@ static ssize_t store_online(struct sys_d kobject_uevent(&dev->kobj, KOBJ_OFFLINE); break; case '1': - ret = smp_prepare_cpu(cpu->sysdev.id); - if (!ret) - ret = cpu_up(cpu->sysdev.id); + ret = cpu_up(cpu->sysdev.id); if (!ret) kobject_uevent(&dev->kobj, KOBJ_ONLINE); break; Index: linux-2.6.16-rc6-mm1/kernel/cpu.c =================================================================== --- linux-2.6.16-rc6-mm1.orig/kernel/cpu.c +++ linux-2.6.16-rc6-mm1/kernel/cpu.c @@ -198,6 +198,11 @@ out: } #endif /*CONFIG_HOTPLUG_CPU*/ +int __attribute__((weak)) smp_prepare_cpu (int cpu) +{ + return 0; +} + int __devinit cpu_up(unsigned int cpu) { int ret; @@ -211,6 +216,10 @@ int __devinit cpu_up(unsigned int cpu) goto out; } + ret = smp_prepare_cpu(cpu); + if (ret) + goto out; + ret = notifier_call_chain(&cpu_chain, CPU_UP_PREPARE, hcpu); if (ret == NOTIFY_BAD) { printk("%s: attempt to bring up CPU %u failed\n", Index: linux-2.6.16-rc6-mm1/kernel/power/smp.c =================================================================== --- linux-2.6.16-rc6-mm1.orig/kernel/power/smp.c +++ linux-2.6.16-rc6-mm1/kernel/power/smp.c @@ -49,7 +49,9 @@ void enable_nonboot_cpus(void) printk("Thawing cpus ...\n"); for_each_cpu_mask(cpu, frozen_cpus) { + lock_cpu_hotplug(); error = smp_prepare_cpu(cpu); + unlock_cpu_hotplug(); if (!error) error = cpu_up(cpu); if (!error) { ^ permalink raw reply [flat|nested] 31+ messages in thread
* Re: [PATCH] Check for online cpus before bringing them up 2006-03-17 12:21 ` [PATCH] Check for online cpus before bringing them up Ashok Raj @ 2006-03-17 13:59 ` Srivatsa Vaddagiri 0 siblings, 0 replies; 31+ messages in thread From: Srivatsa Vaddagiri @ 2006-03-17 13:59 UTC (permalink / raw) To: Ashok Raj; +Cc: Andrew Morton, torvalds, linux-kernel, Li, Shaohua, bryce On Fri, Mar 17, 2006 at 04:21:54AM -0800, Ashok Raj wrote: > Should we add !cpu_present(cpu) check as well just to be consistent with checks > in cpu_up() ? Probably better if we can move smp_prepare_cpu() to within cpu_up()? > > How does the attached patch look. I think this is much better (except for a minot nit - see below). [snip] > @@ -49,7 +49,9 @@ void enable_nonboot_cpus(void) > > printk("Thawing cpus ...\n"); > for_each_cpu_mask(cpu, frozen_cpus) { > + lock_cpu_hotplug(); > error = smp_prepare_cpu(cpu); > + unlock_cpu_hotplug(); Can we remove this smp_prepare_cpu call also (since it is being called in cpu_up() now in your patch)? -- Regards, vatsa ^ permalink raw reply [flat|nested] 31+ messages in thread
end of thread, other threads:[~2006-11-08 5:54 UTC | newest] Thread overview: 31+ messages (download: mbox.gz follow: Atom feed -- links below jump to the message on this page -- 2006-03-16 17:44 [PATCH] Check for online cpus before bringing them up Srivatsa Vaddagiri 2006-03-17 1:08 ` Andrew Morton 2006-03-17 1:16 ` Shaohua Li 2006-03-17 8:46 ` Srivatsa Vaddagiri 2006-03-17 9:04 ` Andrew Morton 2006-03-17 14:13 ` Srivatsa Vaddagiri 2006-03-18 14:09 ` Ashok Raj 2006-03-21 1:08 ` Shaohua Li 2006-03-21 1:25 ` Ashok Raj 2006-03-21 1:36 ` Shaohua Li 2006-10-06 23:10 ` Status on CPU hotplug issues Bryce Harrington 2006-10-06 23:29 ` Andrew Morton 2006-10-07 0:00 ` Bryce Harrington 2006-10-07 10:35 ` Pavel Machek 2006-10-07 20:42 ` Bryce Harrington 2006-10-08 18:29 ` Heiko Carstens 2006-10-08 19:14 ` Pavel Machek 2006-10-11 1:08 ` [BUG] 2.6.19-rc1-mm1: fs/file.c138 on ia64 Bryce Harrington 2006-10-11 1:15 ` Andrew Morton 2006-10-11 5:38 ` Bryce Harrington 2006-10-07 10:24 ` Status on CPU hotplug issues Pavel Machek 2006-10-07 20:25 ` Bryce Harrington 2006-10-08 19:13 ` Pavel Machek 2006-10-09 7:42 ` Bryce Harrington 2006-10-07 21:57 ` Pavel Machek 2006-10-09 21:40 ` Randy Dunlap 2006-10-23 22:26 ` Bryce Harrington 2006-11-08 5:35 ` Randy Dunlap 2006-11-08 5:52 ` Bryce Harrington 2006-03-17 12:21 ` [PATCH] Check for online cpus before bringing them up Ashok Raj 2006-03-17 13:59 ` Srivatsa Vaddagiri
This is a public inbox, see mirroring instructions for how to clone and mirror all data and code used for this inbox