From: Arjan van de Ven <arjan@linux.intel.com>
To: Andrew Morton <akpm@linux-foundation.org>
Cc: Michael Neuling <mikey@neuling.org>,
	Stephen Rothwell <sfr@canb.auug.org.au>,
	LKML <linux-kernel@vger.kernel.org>,
	gregkh@linuxfoundation.org, linux-next@vger.kernel.org,
	ppc-dev <linuxppc-dev@lists.ozlabs.org>,
	Milton Miller <miltonm@bga.com>
Subject: Re: Boot failure with next-20120208
Date: Mon, 13 Feb 2012 12:16:41 -0800
Message-ID: <4F396FA9.90606@linux.intel.com>
In-Reply-To: <20120213120549.eab7e2b9.akpm@linux-foundation.org>

-----BEGIN PGP SIGNED MESSAGE-----
Hash: SHA1

On 2/13/2012 12:05 PM, Andrew Morton wrote:
> On Mon, 13 Feb 2012 06:18:34 -0800 Arjan van de Ven
> <arjan@linux.intel.com> wrote:
> 
>> On 2/12/2012 7:04 PM, Michael Neuling wrote:
>>>> Just a quick note to say I got a boot OOPs with next-20120208
>>>> and 9 on a Power7 blade (my other PowerPC boot tests are ok).
>>>> I'll investigate this further on Monday.
> 
> Thanks for testing linux-next.  Very useful.
> 
>>>> The line referenced below is:
>>>> 
>>>> BUG_ON(!kobj || !kobj->sd || !attr);
>>>> 
>>>> in sysfs_create_file().
> 
> Yes, this is exactly why we should never use BUG_ON(a || b).  We
> don't know which of those three expressions triggered.
> 
>>>> calling  .topology_init+0x0/0x1ac @ 1
>>>> initcall 7_.async_cpu_up+0x0/0x40 returned 0 after 9765 usecs
>>>> async_continuing @ 20 after 9765 usec
>>>> ------------[ cut here ]------------
>>>> kernel BUG at fs/sysfs/file.c:573!
>>>> Oops: Exception in kernel mode, sig: 5 [#1]
>>>> SMP NR_CPUS=32 NUMA pSeries
>>>> Modules linked in:
>>>> NIP: c00000000024a35c LR: c0000000004ee050 CTR: c00000000083ca24
>>>> REGS: c0000003fd9e7560 TRAP: 0700   Not tainted  (3.3.0-rc2-autokern1)
>>>> MSR: 8000000000029032 <SF,EE,ME,IR,DR,RI>  CR: 88002082  XER: 0000000f
>>>> CFAR: c00000000024a370
>>>> TASK = c0000003fd9e8000[20] 'kworker/u:6' THREAD: c0000003fd9e4000 CPU: 0
>>>> GPR00: 0000000000000001 c0000003fd9e77e0 c000000000d19bb8 0000000000000000
>>>> GPR04: c000000000bf37a8 0000000000000008 8000000002096400 0000000000000000
>>>> GPR08: 0000000000000000 c000000000f80028 c000000000d52bd8 0000000000000000
>>>> GPR12: 0000000048002088 c00000000f33b000 0000000001affa78 00000000009aa000
>>>> GPR16: 0000000000e1f3c8 0000000002d517f0 0000000001aff984 0000000000000060
>>>> GPR20: 0000000000000000 ffffffffffffffff 0000000000000000 c000000000c45128
>>>> GPR24: 0000000000000000 0000000000000008 0000000000000000 c000000000c44200
>>>> GPR28: c000000000f80028 0000000000000008 c000000000c85038 0000000000000002
>>>> NIP [c00000000024a35c] .sysfs_create_file+0x1c/0x40
>>>> LR [c0000000004ee050] .device_create_file+0x20/0x40
>>>> Call Trace:
>>>> [c0000003fd9e77e0] [c0000003fd9e78a0] 0xc0000003fd9e78a0 (unreliable)
>>>> [c0000003fd9e7850] [c00000000083c9a4] .register_cpu_online+0x1d0/0x250
>>>> [c0000003fd9e7900] [c00000000083ca8c] .sysfs_cpu_notify+0x68/0x28c
>>>> [c0000003fd9e79b0] [c00000000083769c] .notifier_call_chain+0x9c/0x100
>>>> [c0000003fd9e7a50] [c0000000000a5878] .__cpu_notify+0x38/0x80
>>>> [c0000003fd9e7ad0] [c00000000083e124] ._cpu_up+0x10c/0x178
>>>> [c0000003fd9e7b90] [c00000000083e2c8] .cpu_up+0x138/0x164
>>>> [c0000003fd9e7c20] [c000000000ba46d0] .async_cpu_up+0x28/0x40
>>>> [c0000003fd9e7ca0] [c0000000000d81ec] .async_run_entry_fn+0xbc/0x1f0
>>>> [c0000003fd9e7d50] [c0000000000c7cbc] .process_one_work+0x19c/0x590
>>>> [c0000003fd9e7e10] [c0000000000c8618] .worker_thread+0x188/0x4b0
>>>> [c0000003fd9e7ed0] [c0000000000ce57c] .kthread+0xbc/0xd0
>>>> [c0000003fd9e7f90] [c000000000021448] .kernel_thread+0x54/0x70
>>>> Instruction dump:
>>>> 7fa3eb78 ebe1fff8 eba1ffe8 7c0803a6 4e800020 2c230000 41820024 e8630030
>>>> 7c800074 7800d182 2fa30000 419e0014 <0b000000> 38a00002 4bfffebc e8630030
>>>> ---[ end trace 31fd0ba7d8756001 ]---
>>>> initcall .topology_init+0x0/0x1ac returned 0 after 0 usecs
>>>> calling  .pcibios_init+0x0/0xe8 @ 1
>>>> PCI: Probing PCI hardware
>>>> PCI: Probing PCI hardware done
>>>> initcall .pcibios_init+0x0/0xe8 returned 0 after 0 usecs
>>>> calling .add_system_ram_resources+0x0/0x140 @ 1
>>>> initcall .add_system_ram_resources+0x0/0x140 returned 0 after 0 usecs
>>>> calling .__machine_initcall_powermac_pmac_i2c_create_platform_devices+0x0/0xc8 @ 1
>>>> initcall .__machine_initcall_powermac_pmac_i2c_create_platform_devices+0x0/0xc8 returned 0 after 0 usecs
>>>> calling  .opal_init+0x0/0x1cc @ 1
>>>> opal: Node not found
>>>> initcall .opal_init+0x0/0x1cc returned -19 after 0 usecs
>>>> calling .__machine_initcall_pseries_ioei_init+0x0/0xa0 @ 1
>>> 
>>> Reverting "smp: start up non-boot CPUs asynchronously"
>>> (8de7a96405 from next-20120208) fixes this problem for me.
>>> 
>> if that fixes it, it means PPC has a race somewhere in the cpu
>> hotplug code, since all the patch does is hotplug the cpus one by
>> one (which the normal kernel also does, just not in parallel with
>> other work)
> 
> The bug looks pretty generic, nothing very PPC-specific there.  It 
> might affect other architectures - we won't know until we find out
> what caused it.

well one half of the race looks pretty generic...
..... doesn't mean the other half of the race is though....


> 
> Ho hum, I suppose I should pull the patch out of linux-next, to
> avoid disrupting other testing.  This means it's going to be hard
> to get the bug fixed.

it means losing this one big PPC machine indeed.... until they hit
that same race some other way with regular real cpu hotplug ;-(
-----BEGIN PGP SIGNATURE-----
Version: GnuPG v2.0.17 (MingW32)

iQEcBAEBAgAGBQJPOW+pAAoJEEHdSxh4DVnEJXEIAKHFQwrGw1R66bEjTMxfgOpH
ZlUwjciyNlo2KpqgaUgulHDHNzWQa29nzjPfuk6qG7mHGKnbgyn/IBCnUD5uJqri
6yu7Md0vekwJnoilZQvEuvF6qHrOYOcaWvW60x2y3W+fBesa5zxpqwDLbKj4Qvu3
YAbxeMaAr0W/d7pKEubKds3YJnr1S06qbK8Jw7DF92YEd7xDTBTSpuSF3vA6BlLQ
jEV1xwxVwnLEa6A/DnkvQ67Kayj0zfC2CSCqlt2T2BiDl81XkCPC/U3yHuHb/ISI
ykwFHy4L2VH54n5dQH8S6qfzNS1vh8IQ0xzVtJH6CHh8XXK7T97T07jZZ/LWMWI=
=GQrj
-----END PGP SIGNATURE-----
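
Andrew's remark about BUG_ON(a || b) can be illustrated outside the
kernel. What follows is a minimal user-space sketch, not kernel code
and not a proposed patch: the struct names (sysfs_dirent_stub,
kobject_stub, attribute_stub) and the helper first_failed_check() are
hypothetical stand-ins invented for this illustration. It tests the
three conditions from the quoted BUG_ON one at a time, so a failure
report can name the condition that tripped instead of only a file and
line number.

```c
#include <stddef.h>

/* Hypothetical stand-ins for the kernel types; only the fields the
 * quoted BUG_ON() looks at are modelled here. */
struct sysfs_dirent_stub { int unused; };
struct kobject_stub { struct sysfs_dirent_stub *sd; };
struct attribute_stub { const char *name; };

/*
 * Evaluate the three conditions separately, in the same order as
 * BUG_ON(!kobj || !kobj->sd || !attr), and return the text of the
 * first one that is true. Returns NULL when none of them is true.
 */
static const char *first_failed_check(const struct kobject_stub *kobj,
                                      const struct attribute_stub *attr)
{
    if (!kobj)
        return "!kobj";
    if (!kobj->sd)          /* safe: kobj is known non-NULL here */
        return "!kobj->sd";
    if (!attr)
        return "!attr";
    return NULL;
}
```

In kernel terms the equivalent would be one check per condition, so
that each lands on its own source line and the "kernel BUG at
file:line" message identifies it. That is an assumption about how such
a change might look, not something this thread shows.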
