public inbox for linux-rt-users@vger.kernel.org
From: Thomas Gleixner <tglx@linutronix.de>
To: Sven-Thorsten Dietrich <thebigcorporation@gmail.com>
Cc: "linux-rt-users@vger.kernel.org" <linux-rt-users@vger.kernel.org>
Subject: Re: softlockup on 3.6.8-rt19
Date: Wed, 5 Dec 2012 16:49:28 +0100 (CET)	[thread overview]
Message-ID: <alpine.LFD.2.02.1212051543300.2701@ionos> (raw)
In-Reply-To: <12BAC08C-7C55-4BC5-B69E-DC33E023C8BF@gmail.com>

On Wed, 5 Dec 2012, Sven-Thorsten Dietrich wrote:
> 
> This is the softlockup I am seeing on one of our HP blades.
> 
> I haven't fully ruled out bad hardware, trying to reproduce on another machine.
> 
> Sven
> 
> 
> [  128.371195] BUG: soft lockup - CPU#9 stuck for 22s! [git:6333]
> [  132.387637] BUG: soft lockup - CPU#10 stuck for 23s! [agetty:674]
> [  144.398987] BUG: soft lockup - CPU#11 stuck for 22s! [flush-8:0:336]
> [  156.353376] BUG: soft lockup - CPU#9 stuck for 22s! [git:6333]
> [  160.369814] BUG: soft lockup - CPU#10 stuck for 22s! [agetty:674]
> [  192.330459] BUG: soft lockup - CPU#9 stuck for 23s! [git:6333]
> [  192.349444] BUG: soft lockup - CPU#10 stuck for 23s! [agetty:674]
> [  192.368428] BUG: soft lockup - CPU#11 stuck for 23s! [flush-8:0:336]
> [  195.632116] BUG: spinlock lockup suspected on CPU#9, git/6333
> [  195.632122] general protection fault: 0000 [#1] PREEMPT SMP 

So we fault in spin_dump, which is not surprising once we decode the
faulting instruction:

	 44 8b 83 e4 02 00 00 	mov    0x2e4(%rbx),%r8d

> [  195.632138] RIP: 0010:[<ffffffff816438c1>]  [<ffffffff816438c1>] spin_dump+0x56/0x91
> [  195.632138] RSP: 0000:ffff880be0077818  EFLAGS: 00010206
> [  195.632139] RAX: 0000000000000031 RBX: 1067a77cb2247fcc RCX: 0000000000000871

RBX contains a random number. Ditto in the next dump on CPU10

> [  200.084385] BUG: spinlock lockup suspected on CPU#10, agetty/674
> [  200.084388] general protection fault: 0000 [#2] PREEMPT SMP 

> [  200.084403] RIP: 0010:[<ffffffff816438c1>]  [<ffffffff816438c1>] spin_dump+0x56/0x91
> [  200.084403] RSP: 0018:ffff8805e03877a8  EFLAGS: 00010286
> [  200.084404] RAX: 0000000000000034 RBX: cdc5c4fabb8bf87b RCX: 00000000000008d5

0000000000000000 <spin_dump>:
   0:	55                   	push   %rbp
   1:	48 89 e5             	mov    %rsp,%rbp
   4:	41 54                	push   %r12
   6:	49 89 fc             	mov    %rdi,%r12
   9:	53                   	push   %rbx
   a:	48 8b 5f 10          	mov    0x10(%rdi),%rbx

RBX is initialized with lock->owner (offset 0x10 of the lock)

   e:	48 c7 c7 00 00 00 00 	mov    $0x0,%rdi
  15:	48 8d 43 ff          	lea    -0x1(%rbx),%rax
  19:	48 83 f8 fe          	cmp    $0xfffffffffffffffe,%rax
  1d:	b8 00 00 00 00       	mov    $0x0,%eax
  22:	48 0f 43 d8          	cmovae %rax,%rbx
  26:	65 48 8b 04 25 00 00 	mov    %gs:0x0,%rax
  2d:	00 00 
  2f:	44 8b 80 e4 02 00 00 	mov    0x2e4(%rax),%r8d
  36:	48 8d 88 90 04 00 00 	lea    0x490(%rax),%rcx
  3d:	31 c0                	xor    %eax,%eax
  3f:	65 8b 14 25 00 00 00 	mov    %gs:0x0,%edx
  46:	00 
  47:	e8 00 00 00 00       	callq  4c <spin_dump+0x4c>
  4c:	48 85 db             	test   %rbx,%rbx
  4f:	45 8b 4c 24 08       	mov    0x8(%r12),%r9d

Here we read lock->owner_cpu into R9. Random numbers as well:

     R09: 000000004642dad1

     R09: 0000000017f07438

  54:	74 10                	je     66 <spin_dump+0x66>
  56:	44 8b 83 e4 02 00 00 	mov    0x2e4(%rbx),%r8d

And of course here we crash. Let's look at the call chain

> [  200.084416]  [<ffffffff81343189>] do_raw_spin_lock+0xf9/0x140
> [  200.084417]  [<ffffffff81649f44>] _raw_spin_lock+0x44/0x50
> [  200.084418]  [<ffffffff81648d63>] ? rt_spin_lock_slowlock+0x43/0x380
> [  200.084420]  [<ffffffff81648d63>] ? rt_spin_lock_slowlock+0x43/0x380
> [  200.084421]  [<ffffffff81648d63>] rt_spin_lock_slowlock+0x43/0x380
> [  200.084422]  [<ffffffff81649817>] rt_spin_lock+0x27/0x60
> [  200.084424]  [<ffffffff8113f4bd>] __lru_cache_add+0x5d/0x1f0

That's the per cpu local lock swap_lock protecting the pagevec
operations. So something is corrupting the per cpu locks really badly.

The lock addresses look reasonable:

CPU9:	 R12: ffff880bc1867c00
CPU10:	 R12: ffff880bc1887c00
CPU11:	 R12: ffff880bc18a7c00

That's a spacing of 0x20000 per cpu.

I really have no idea what scribbles over those locks. Can you check
what is next to those locks in the per_cpu area?

Thanks,

	tglx


Thread overview: 2+ messages
2012-12-05 11:19 softlockup on 3.6.8-rt19 Sven-Thorsten Dietrich
2012-12-05 15:49 ` Thomas Gleixner [this message]
