From: raz ben yehuda <raz@scalemp.com>
To: Ingo Molnar <mingo@elte.hu>
Cc: linux-kernel@vger.kernel.org, mingo@redhat.com,
Peter Zijlstra <a.p.zijlstra@chello.nl>,
Mike Galbraith <efault@gmx.de>, Jack Steiner <steiner@sgi.com>,
Cliff Wickman <cpw@sgi.com>, Mike Travis <travis@sgi.com>,
Thomas Gleixner <tglx@linutronix.de>,
"H. Peter Anvin" <hpa@zytor.com>
Subject: Re: [BUG] soft lockup while booting machine with more than 700 cores
Date: Thu, 10 Feb 2011 08:09:22 +0200
Message-ID: <1297318162.2428.3.camel@raz.scalemp.com>
In-Reply-To: <20110210123937.GD26094@elte.hu>

On Thu, 2011-02-10 at 13:39 +0100, Ingo Molnar wrote:
> * raz ben yehuda <raz@scalemp.com> wrote:
>
> > Mingo Hello
> >
> > Below is the boot log of a 2.6.32.19 kernel on a machine with more than 700
> > cores. It fails to boot due to a soft lockup in the rebalance_domains area. I
> > did not find anything related in mainline git or in the kernel bugzilla.
> >
> > thank you
> > Raz
> >
> >
> > [ 929.614315] TCP cubic registered
> > [ 929.614577] NET: Registered protocol family 17
> > [ 930.785915] Bridge firewalling registered
> > [ 930.928396] Freeing unused kernel memory: 1380k freed
> > ===============================================================================
> > Running /disklessrc
> > Mounting /proc
> > Creating /dev
> > Creating initial device nodes
> > [ 931.327841] usb 5-1: configuration #1 chosen from 1 choice
> > [ 931.657469] input: HP Virtual Keyboard as /class/input/input0
> > [ 931.671560] generic-usb 0003:03F0:1027.0001: input: USB HID v1.01 Keyboard [HP Virtual Keyboard] on usb-0000:01:04.0-1/input0
> > [ 931.911480] input: HP Virtual Keyboard as /class/input/input1
> > [ 931.926135] generic-usb 0003:03F0:1027.0002: input: USB HID v1.01 Mouse [HP Virtual Keyboard] on usb-0000:01:04.0-1/input1
> > [ 932.247432] scsi 0:0:0:0: Direct-Access Generic USB Flash Disk 0.00 PQ: 0 ANSI: 2
> > [ 932.301626] sd 0:0:0:0: Attached scsi generic sg0 type 0
> > [ 932.416279] sd 0:0:0:0: [sda] 7892992 512-byte logical blocks: (4.04 GB/3.76 GiB)
> > [ 932.559424] sd 0:0:0:0: [sda] Write Protect is off
> > [ 932.563238] sd 0:0:0:0: [sda] Assuming drive cache: write through
> > [ 932.802006] sd 0:0:0:0: [sda] Assuming drive cache: write through
> > [ 932.805070] sda: sda1
> > [ 934.315071] sd 0:0:0:0: [sda] Assuming drive cache: write through
> > [ 934.318055] sd 0:0:0:0: [sda] Attached SCSI removable disk
> > Loading nfs module...
> > [ 1011.681028] BUG: soft lockup - CPU#240 stuck for 62s! [swapper:0]
> > [ 1011.744482] Modules linked in: sunrpc(+)
> > [ 1011.789117] CPU 240:
> > [ 1011.828757] Modules linked in: sunrpc(+)
> > [ 1011.874003] Pid: 0, comm: swapper Not tainted 2.6.32.19-3.vSMP #2 vSMP 3.5
> > [ 1011.935843] RIP: 0010:[<ffffffff8105ac32>] [<ffffffff8105ac32>] weighted_cpuload+0x12/0x20
> > [ 1012.051597] RSP: 0018:ffff89468e803c88 EFLAGS: 00010286
> > [ 1012.115020] RAX: 00000000000115c0 RBX: 0000000000000002 RCX: 000000000000001d
> > [ 1012.162897] RDX: ffff8acd2e840000 RSI: 0000000000000002 RDI: 000000000000021d
> > [ 1012.243858] RBP: ffffffff81033133 R08: 0000000000000200 R09: ffff894f0ca3d450
> > [ 1012.309760] R10: 0000000000000000 R11: ffff89468e803dc0 R12: ffff89468e803c00
> > [ 1012.358023] R13: 00000000000115c0 R14: ffffffff8104b6dc R15: ffffffff81046ea6
> > [ 1012.417072] FS: 0000000000000000(0000) GS:ffff89468e800000(0000) knlGS:0000000000000000
> > [ 1012.494488] CS: 0010 DS: 0018 ES: 0018 CR0: 000000008005003b
> > [ 1012.559412] CR2: 00000000008d3988 CR3: 0000000001001000 CR4: 00000000000026e0
> > [ 1012.619828] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
> > [ 1012.675491] DR3: 0000000000000000 DR6: 0000000000000000 DR7: 0000000000000000
> > [ 1012.739386] Call Trace:
> > [ 1012.790082] <IRQ> [<ffffffff81039705>] ? sched_clock+0x5/0x10
> > [ 1012.868687] [<ffffffff8105ac6b>] ? source_load+0x2b/0x70
> > [ 1012.923473] [<ffffffff810602d5>] ? find_busiest_group+0x1b5/0xa30
> > [ 1012.973482] [<ffffffff81063487>] ? rebalance_domains+0x117/0x470
> > [ 1013.031838] [<ffffffff81065a4e>] ? run_rebalance_domains+0x3e/0xe0
> > [ 1013.081837] [<ffffffff8106fbbe>] ? __do_softirq+0xae/0x140
> > [ 1013.134496] [<ffffffff81085da0>] ? ktime_get+0x50/0xd0
> > [ 1013.182834] [<ffffffff8103374c>] ? call_softirq+0x1c/0x30
> > [ 1013.246263] [<ffffffff81035745>] ? do_softirq+0x65/0xa0
> > [ 1013.314801] [<ffffffff8106fb0c>] ? irq_exit+0x7c/0x80
> > [ 1013.355605] [<ffffffff81046eab>] ? smp_apic_timer_interrupt+0x6b/0xa0
> > [ 1013.391166] [<ffffffff8104b6dc>] ? native_apic_msr_write+0x2c/0x40
> > [ 1013.391166] [<ffffffff81033133>] ? apic_timer_interrupt+0x13/0x20
> > [ 1013.478307] <EOI> [<ffffffff8104dc92>] ? native_safe_halt+0x2/0x10
> > [ 1013.515916] [<ffffffff8103a481>] ? default_idle+0x21/0x40
> > [ 1013.572168] [<ffffffff81031537>] ? cpu_idle+0x57/0x90
> > [ 1112.445978] BUG: soft lockup - CPU#240 stuck for 62s! [swapper:0]
> > [ 1112.445978] Modules linked in: sunrpc(+)
>
> Interesting.
>
> Could you boot up with just enough cores for it to not lock up, and run perf top and
> see where the overhead is?
Thank you for your reply. I will get back to you on this later; at the
moment I am having technical problems reproducing the test.
Thanks
raz
>
> Ingo
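A rough, hypothetical cost model (not from the thread, and not kernel code) may help when reading the call trace above. The trace shows find_busiest_group() reaching weighted_cpuload() via source_load(), which suggests that a balance pass reads the load of every CPU in a sched domain's span. Assuming the top-level domain spans all online CPUs, and that one CPU can end up balancing on behalf of many idle CPUs (as nohz idle load balancing in 2.6.32 can do), the work landing on a single CPU grows roughly with the square of the core count. The function names and the topology sizes (2, 16, N) below are illustrative assumptions only.

```python
# Hypothetical, simplified cost model of periodic scheduler load balancing.
# Assumptions (not taken from the kernel source): at every sched-domain level a
# balance pass reads the load of each CPU in that level's span once, and the
# top-level domain spans all online CPUs.


def loads_read_per_pass(domain_spans):
    """Per-CPU load reads for one balance pass over all domain levels."""
    return sum(domain_spans)


def loads_read_on_one_cpu(nr_balanced_cpus, domain_spans):
    """Load reads if one CPU runs a full pass for nr_balanced_cpus CPUs,
    e.g. when it balances on behalf of many idle CPUs."""
    return nr_balanced_cpus * loads_read_per_pass(domain_spans)


if __name__ == "__main__":
    # Illustrative topology: 2 SMT siblings, 16 cores per package, then all CPUs.
    for nr_cpus in (64, 256, 768):
        spans = [2, 16, nr_cpus]
        print(nr_cpus, loads_read_per_pass(spans),
              loads_read_on_one_cpu(nr_cpus, spans))
```

Under these assumptions, going from 64 to 768 CPUs (12 times more) multiplies the reads done on that one CPU by roughly 115 (5248 versus 603648). The model ignores the cost of each individual read, so a perf top run on a configuration that still boots, as suggested above, is what would show whether this path actually dominates.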
Thread overview: 8+ messages
2011-02-09 7:27 [BUG] soft lockup while booting machine with more than 700 cores raz ben yehuda
2011-02-10 4:47 ` Mike Galbraith
2011-02-10 12:39 ` Ingo Molnar
2011-02-10 6:09 ` raz ben yehuda [this message]
2011-02-10 20:56 ` Jack Steiner
2011-02-10 21:03 ` David Miller
2011-02-10 21:12 ` Jack Steiner
2011-02-16 15:04 ` Dimitri Sivanich