[3.10] Oopses in kmem_cache_allocate() via prepare

linux-kernel.vger.kernel.org archive mirror
 help / color / mirror / Atom feed

* [3.10] Oopses in kmem_cache_allocate() via prepare_creds()
@ 2013-07-06  0:09 Simon Kirby
  2013-07-06  8:27 ` Pekka Enberg
  0 siblings, 1 reply; 19+ messages in thread
From: Simon Kirby @ 2013-07-06  0:09 UTC (permalink / raw)
  To: linux-kernel

We saw two Oopses overnight on two separate boxes that seem possibly
related, but both are weird. These boxes typically run btrfs for rsync
snapshot backups (and usually Oops in btrfs ;), but not this time!
backup02 was running 3.10-rc6 plus btrfs-next at the time, and backup03
was running 3.10 release plus btrfs-next from yesterday. Full kern.log
and .config at http://0x.ca/sim/ref/3.10/

backup02's first Oops:

BUG: unable to handle kernel paging request at 0000000100000000
IP: [<ffffffff81124beb>] kmem_cache_alloc+0x4b/0x110
PGD 1f54f7067 PUD 0
Oops: 0000 [#1] SMP
Modules linked in: ipmi_devintf ipmi_si ipmi_msghandler aoe microcode serio_raw bnx2 evdev
CPU: 0 PID: 23112 Comm: ionice Not tainted 3.10.0-rc6-hw+ #46
Hardware name: Dell Inc. PowerEdge 2950/0NH278, BIOS 2.7.0 10/30/2010
task: ffff8802c3f08000 ti: ffff8801b4876000 task.ti: ffff8801b4876000
RIP: 0010:[<ffffffff81124beb>]  [<ffffffff81124beb>] kmem_cache_alloc+0x4b/0x110
RSP: 0018:ffff8801b4877e88  EFLAGS: 00010206
RAX: 0000000000000000 RBX: ffff8802c3f08000 RCX: 00000000017f040e
RDX: 00000000017f040d RSI: 00000000000000d0 RDI: ffffffff8107a503
RBP: ffff8801b4877ec8 R08: 0000000000016a80 R09: 0000000000000000
R10: 00007fff025fe120 R11: 0000000000000246 R12: 00000000000000d0
R13: ffff88042d8019c0 R14: 0000000100000000 R15: 00007fc3588ee97f
FS:  0000000000000000(0000) GS:ffff88043fc00000(0000) knlGS:0000000000000000
CS:  0010 DS: 0000 ES: 0000 CR0: 000000008005003b
CR2: 0000000100000000 CR3: 0000000409d68000 CR4: 00000000000007f0
DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400
Stack:
 ffff8801b4877ed8 ffffffff8112a1bc ffff8800985acd20 ffff8802c3f08000
 0000000000000001 00007fc3588ee334 00007fc358af5758 00007fc3588ee97f
 ffff8801b4877ee8 ffffffff8107a503 ffff8801b4877ee8 ffffffffffffffea
Call Trace:
 [<ffffffff8112a1bc>] ? __fput+0x12c/0x240
 [<ffffffff8107a503>] prepare_creds+0x23/0x150
 [<ffffffff811272d4>] SyS_faccessat+0x34/0x1f0
 [<ffffffff811274a3>] SyS_access+0x13/0x20
 [<ffffffff8179e7a9>] system_call_fastpath+0x16/0x1b
Code: 75 f0 4c 89 7d f8 49 8b 4d 00 65 48 03 0c 25 68 da 00 00 48 8b 51 08 4c 8b 31 4d 85 f6 74 5f 49 63 45 20 4d 8b 45 00 48 8d 4a 01 <49> 8b 1c 06 4c 89 f0 65 49 0f c7 08 0f 94 c0 84 c0 74 c8 49 63
RIP  [<ffffffff81124beb>] kmem_cache_alloc+0x4b/0x110
 RSP <ffff8801b4877e88>
CR2: 0000000100000000
---[ end trace 744477356cd98306 ]---

backup03's first Oops:

BUG: unable to handle kernel paging request at ffff880502efc240
IP: [<ffffffff81124c4b>] kmem_cache_alloc+0x4b/0x110
PGD 1d3a067 PUD 0
Oops: 0000 [#1] SMP
Modules linked in: aoe ipmi_devintf ipmi_msghandler bnx2 microcode serio_raw evdev
CPU: 6 PID: 14066 Comm: perl Not tainted 3.10.0-hw+ #2
Hardware name: Dell Inc. PowerEdge R510/0DPRKF, BIOS 1.11.0 07/23/2012
task: ffff88040111c3b0 ti: ffff8803c23ae000 task.ti: ffff8803c23ae000
RIP: 0010:[<ffffffff81124c4b>]  [<ffffffff81124c4b>] kmem_cache_alloc+0x4b/0x110
RSP: 0018:ffff8803c23afd90  EFLAGS: 00010282
RAX: 0000000000000000 RBX: ffff88040111c3b0 RCX: 000000000002a76e
RDX: 000000000002a76d RSI: 00000000000000d0 RDI: ffffffff8107a4e3
RBP: ffff8803c23afdd0 R08: 0000000000016a80 R09: 00000000ffffffff
R10: fffffffffffffffe R11: ffffffffffffffd0 R12: 00000000000000d0
R13: ffff88041d403980 R14: ffff880502efc240 R15: ffff88010e375a40
FS:  00007f2cae496700(0000) GS:ffff88041f2c0000(0000) knlGS:0000000000000000
CS:  0010 DS: 0000 ES: 0000 CR0: 000000008005003b
CR2: ffff880502efc240 CR3: 00000001e0ced000 CR4: 00000000000007e0
DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400
Stack:
 ffff8803c23afe98 ffff8803c23afdb8 ffffffff81133811 ffff88040111c3b0
 ffff88010e375a40 0000000001200011 00007f2cae4969d0 ffff88010e375a40
 ffff8803c23afdf0 ffffffff8107a4e3 ffffffff81b49b80 0000000001200011
Call Trace:
 [<ffffffff81133811>] ? final_putname+0x21/0x50
 [<ffffffff8107a4e3>] prepare_creds+0x23/0x150
 [<ffffffff8107ab11>] copy_creds+0x31/0x160
 [<ffffffff8101a97b>] ? unlazy_fpu+0x9b/0xb0
 [<ffffffff8104ef09>] copy_process.part.49+0x239/0x1390
 [<ffffffff81143312>] ? __alloc_fd+0x42/0x100
 [<ffffffff81050134>] do_fork+0xa4/0x320
 [<ffffffff81131b77>] ? __do_pipe_flags+0x77/0xb0
 [<ffffffff81143426>] ? __fd_install+0x26/0x60
 [<ffffffff81050431>] SyS_clone+0x11/0x20
 [<ffffffff817ad849>] stub_clone+0x69/0x90
 [<ffffffff817ad569>] ? system_call_fastpath+0x16/0x1b
Code: 75 f0 4c 89 7d f8 49 8b 4d 00 65 48 03 0c 25 68 da 00 00 48 8b 51 08 4c 8b 31 4d 85 f6 74 5f 49 63 45 20 4d 8b 45 00 48 8d 4a 01 <49> 8b 1c 06 4c 89 f0 65 49 0f c7 08 0f 94 c0 84 c0 74 c8 49 63
RIP  [<ffffffff81124c4b>] kmem_cache_alloc+0x4b/0x110
 RSP <ffff8803c23afd90>
CR2: ffff880502efc240
---[ end trace 956d153150ecc57f ]---

Simon-

^ permalink raw reply	[flat|nested] 19+ messages in thread

* Re: [3.10] Oopses in kmem_cache_allocate() via prepare_creds()
  2013-07-06  0:09 [3.10] Oopses in kmem_cache_allocate() via prepare_creds() Simon Kirby
@ 2013-07-06  8:27 ` Pekka Enberg
  2013-08-19 20:17   ` Simon Kirby
  0 siblings, 1 reply; 19+ messages in thread
From: Pekka Enberg @ 2013-07-06  8:27 UTC (permalink / raw)
  To: Simon Kirby; +Cc: LKML, Christoph Lameter, Chris Mason, Linus Torvalds

On Sat, Jul 6, 2013 at 3:09 AM, Simon Kirby <sim@hostway.ca> wrote:
> We saw two Oopses overnight on two separate boxes that seem possibly
> related, but both are weird. These boxes typically run btrfs for rsync
> snapshot backups (and usually Oops in btrfs ;), but not this time!
> backup02 was running 3.10-rc6 plus btrfs-next at the time, and backup03
> was running 3.10 release plus btrfs-next from yesterday. Full kern.log
> and .config at http://0x.ca/sim/ref/3.10/
>
> backup02's first Oops:
>
> BUG: unable to handle kernel paging request at 0000000100000000
> IP: [<ffffffff81124beb>] kmem_cache_alloc+0x4b/0x110
> PGD 1f54f7067 PUD 0
> Oops: 0000 [#1] SMP
> Modules linked in: ipmi_devintf ipmi_si ipmi_msghandler aoe microcode serio_raw bnx2 evdev
> CPU: 0 PID: 23112 Comm: ionice Not tainted 3.10.0-rc6-hw+ #46
> Hardware name: Dell Inc. PowerEdge 2950/0NH278, BIOS 2.7.0 10/30/2010
> task: ffff8802c3f08000 ti: ffff8801b4876000 task.ti: ffff8801b4876000
> RIP: 0010:[<ffffffff81124beb>]  [<ffffffff81124beb>] kmem_cache_alloc+0x4b/0x110
> RSP: 0018:ffff8801b4877e88  EFLAGS: 00010206
> RAX: 0000000000000000 RBX: ffff8802c3f08000 RCX: 00000000017f040e
> RDX: 00000000017f040d RSI: 00000000000000d0 RDI: ffffffff8107a503
> RBP: ffff8801b4877ec8 R08: 0000000000016a80 R09: 0000000000000000
> R10: 00007fff025fe120 R11: 0000000000000246 R12: 00000000000000d0
> R13: ffff88042d8019c0 R14: 0000000100000000 R15: 00007fc3588ee97f
> FS:  0000000000000000(0000) GS:ffff88043fc00000(0000) knlGS:0000000000000000
> CS:  0010 DS: 0000 ES: 0000 CR0: 000000008005003b
> CR2: 0000000100000000 CR3: 0000000409d68000 CR4: 00000000000007f0
> DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
> DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400
> Stack:
>  ffff8801b4877ed8 ffffffff8112a1bc ffff8800985acd20 ffff8802c3f08000
>  0000000000000001 00007fc3588ee334 00007fc358af5758 00007fc3588ee97f
>  ffff8801b4877ee8 ffffffff8107a503 ffff8801b4877ee8 ffffffffffffffea
> Call Trace:
>  [<ffffffff8112a1bc>] ? __fput+0x12c/0x240
>  [<ffffffff8107a503>] prepare_creds+0x23/0x150
>  [<ffffffff811272d4>] SyS_faccessat+0x34/0x1f0
>  [<ffffffff811274a3>] SyS_access+0x13/0x20
>  [<ffffffff8179e7a9>] system_call_fastpath+0x16/0x1b
> Code: 75 f0 4c 89 7d f8 49 8b 4d 00 65 48 03 0c 25 68 da 00 00 48 8b 51 08 4c 8b 31 4d 85 f6 74 5f 49 63 45 20 4d 8b 45 00 48 8d 4a 01 <49> 8b 1c 06 4c 89 f0 65 49 0f c7 08 0f 94 c0 84 c0 74 c8 49 63
> RIP  [<ffffffff81124beb>] kmem_cache_alloc+0x4b/0x110
>  RSP <ffff8801b4877e88>
> CR2: 0000000100000000
> ---[ end trace 744477356cd98306 ]---
>
> backup03's first Oops:
>
> BUG: unable to handle kernel paging request at ffff880502efc240
> IP: [<ffffffff81124c4b>] kmem_cache_alloc+0x4b/0x110
> PGD 1d3a067 PUD 0
> Oops: 0000 [#1] SMP
> Modules linked in: aoe ipmi_devintf ipmi_msghandler bnx2 microcode serio_raw evdev
> CPU: 6 PID: 14066 Comm: perl Not tainted 3.10.0-hw+ #2
> Hardware name: Dell Inc. PowerEdge R510/0DPRKF, BIOS 1.11.0 07/23/2012
> task: ffff88040111c3b0 ti: ffff8803c23ae000 task.ti: ffff8803c23ae000
> RIP: 0010:[<ffffffff81124c4b>]  [<ffffffff81124c4b>] kmem_cache_alloc+0x4b/0x110
> RSP: 0018:ffff8803c23afd90  EFLAGS: 00010282
> RAX: 0000000000000000 RBX: ffff88040111c3b0 RCX: 000000000002a76e
> RDX: 000000000002a76d RSI: 00000000000000d0 RDI: ffffffff8107a4e3
> RBP: ffff8803c23afdd0 R08: 0000000000016a80 R09: 00000000ffffffff
> R10: fffffffffffffffe R11: ffffffffffffffd0 R12: 00000000000000d0
> R13: ffff88041d403980 R14: ffff880502efc240 R15: ffff88010e375a40
> FS:  00007f2cae496700(0000) GS:ffff88041f2c0000(0000) knlGS:0000000000000000
> CS:  0010 DS: 0000 ES: 0000 CR0: 000000008005003b
> CR2: ffff880502efc240 CR3: 00000001e0ced000 CR4: 00000000000007e0
> DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
> DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400
> Stack:
>  ffff8803c23afe98 ffff8803c23afdb8 ffffffff81133811 ffff88040111c3b0
>  ffff88010e375a40 0000000001200011 00007f2cae4969d0 ffff88010e375a40
>  ffff8803c23afdf0 ffffffff8107a4e3 ffffffff81b49b80 0000000001200011
> Call Trace:
>  [<ffffffff81133811>] ? final_putname+0x21/0x50
>  [<ffffffff8107a4e3>] prepare_creds+0x23/0x150
>  [<ffffffff8107ab11>] copy_creds+0x31/0x160
>  [<ffffffff8101a97b>] ? unlazy_fpu+0x9b/0xb0
>  [<ffffffff8104ef09>] copy_process.part.49+0x239/0x1390
>  [<ffffffff81143312>] ? __alloc_fd+0x42/0x100
>  [<ffffffff81050134>] do_fork+0xa4/0x320
>  [<ffffffff81131b77>] ? __do_pipe_flags+0x77/0xb0
>  [<ffffffff81143426>] ? __fd_install+0x26/0x60
>  [<ffffffff81050431>] SyS_clone+0x11/0x20
>  [<ffffffff817ad849>] stub_clone+0x69/0x90
>  [<ffffffff817ad569>] ? system_call_fastpath+0x16/0x1b
> Code: 75 f0 4c 89 7d f8 49 8b 4d 00 65 48 03 0c 25 68 da 00 00 48 8b 51 08 4c 8b 31 4d 85 f6 74 5f 49 63 45 20 4d 8b 45 00 48 8d 4a 01 <49> 8b 1c 06 4c 89 f0 65 49 0f c7 08 0f 94 c0 84 c0 74 c8 49 63
> RIP  [<ffffffff81124c4b>] kmem_cache_alloc+0x4b/0x110
>  RSP <ffff8803c23afd90>
> CR2: ffff880502efc240
> ---[ end trace 956d153150ecc57f ]---

Looks like slab corruption to me.

Please try reproducing with "slub_debug" passed as a kernel parameter.
It should give us some more debug output for catching the caller
that's messing up slab.

Btw, there are some btrfs related lockup warnings in the logs so I'm
also CC'ing Chris.

                        Pekka

^ permalink raw reply	[flat|nested] 19+ messages in thread

* Re: [3.10] Oopses in kmem_cache_allocate() via prepare_creds()
  2013-07-06  8:27 ` Pekka Enberg
@ 2013-08-19 20:17   ` Simon Kirby
  2013-08-19 20:29     ` Christoph Lameter
  0 siblings, 1 reply; 19+ messages in thread
From: Simon Kirby @ 2013-08-19 20:17 UTC (permalink / raw)
  To: Pekka Enberg; +Cc: LKML, Christoph Lameter, Chris Mason, Linus Torvalds

On Sat, Jul 06, 2013 at 11:27:38AM +0300, Pekka Enberg wrote:

> On Sat, Jul 6, 2013 at 3:09 AM, Simon Kirby <sim@hostway.ca> wrote:
> > We saw two Oopses overnight on two separate boxes that seem possibly
> > related, but both are weird. These boxes typically run btrfs for rsync
> > snapshot backups (and usually Oops in btrfs ;), but not this time!
> > backup02 was running 3.10-rc6 plus btrfs-next at the time, and backup03
> > was running 3.10 release plus btrfs-next from yesterday. Full kern.log
> > and .config at http://0x.ca/sim/ref/3.10/
> >
> > backup02's first Oops:
> >
> > BUG: unable to handle kernel paging request at 0000000100000000
> > IP: [<ffffffff81124beb>] kmem_cache_alloc+0x4b/0x110
> > PGD 1f54f7067 PUD 0
> > Oops: 0000 [#1] SMP
> > Modules linked in: ipmi_devintf ipmi_si ipmi_msghandler aoe microcode serio_raw bnx2 evdev
> > CPU: 0 PID: 23112 Comm: ionice Not tainted 3.10.0-rc6-hw+ #46
> > Hardware name: Dell Inc. PowerEdge 2950/0NH278, BIOS 2.7.0 10/30/2010
> > task: ffff8802c3f08000 ti: ffff8801b4876000 task.ti: ffff8801b4876000
> > RIP: 0010:[<ffffffff81124beb>]  [<ffffffff81124beb>] kmem_cache_alloc+0x4b/0x110
> > RSP: 0018:ffff8801b4877e88  EFLAGS: 00010206
> > RAX: 0000000000000000 RBX: ffff8802c3f08000 RCX: 00000000017f040e
> > RDX: 00000000017f040d RSI: 00000000000000d0 RDI: ffffffff8107a503
> > RBP: ffff8801b4877ec8 R08: 0000000000016a80 R09: 0000000000000000
> > R10: 00007fff025fe120 R11: 0000000000000246 R12: 00000000000000d0
> > R13: ffff88042d8019c0 R14: 0000000100000000 R15: 00007fc3588ee97f
> > FS:  0000000000000000(0000) GS:ffff88043fc00000(0000) knlGS:0000000000000000
> > CS:  0010 DS: 0000 ES: 0000 CR0: 000000008005003b
> > CR2: 0000000100000000 CR3: 0000000409d68000 CR4: 00000000000007f0
> > DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
> > DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400
> > Stack:
> >  ffff8801b4877ed8 ffffffff8112a1bc ffff8800985acd20 ffff8802c3f08000
> >  0000000000000001 00007fc3588ee334 00007fc358af5758 00007fc3588ee97f
> >  ffff8801b4877ee8 ffffffff8107a503 ffff8801b4877ee8 ffffffffffffffea
> > Call Trace:
> >  [<ffffffff8112a1bc>] ? __fput+0x12c/0x240
> >  [<ffffffff8107a503>] prepare_creds+0x23/0x150
> >  [<ffffffff811272d4>] SyS_faccessat+0x34/0x1f0
> >  [<ffffffff811274a3>] SyS_access+0x13/0x20
> >  [<ffffffff8179e7a9>] system_call_fastpath+0x16/0x1b
> > Code: 75 f0 4c 89 7d f8 49 8b 4d 00 65 48 03 0c 25 68 da 00 00 48 8b 51 08 4c 8b 31 4d 85 f6 74 5f 49 63 45 20 4d 8b 45 00 48 8d 4a 01 <49> 8b 1c 06 4c 89 f0 65 49 0f c7 08 0f 94 c0 84 c0 74 c8 49 63
> > RIP  [<ffffffff81124beb>] kmem_cache_alloc+0x4b/0x110
> >  RSP <ffff8801b4877e88>
> > CR2: 0000000100000000
> > ---[ end trace 744477356cd98306 ]---
> >
> > backup03's first Oops:
> >
> > BUG: unable to handle kernel paging request at ffff880502efc240
> > IP: [<ffffffff81124c4b>] kmem_cache_alloc+0x4b/0x110
> > PGD 1d3a067 PUD 0
> > Oops: 0000 [#1] SMP
> > Modules linked in: aoe ipmi_devintf ipmi_msghandler bnx2 microcode serio_raw evdev
> > CPU: 6 PID: 14066 Comm: perl Not tainted 3.10.0-hw+ #2
> > Hardware name: Dell Inc. PowerEdge R510/0DPRKF, BIOS 1.11.0 07/23/2012
> > task: ffff88040111c3b0 ti: ffff8803c23ae000 task.ti: ffff8803c23ae000
> > RIP: 0010:[<ffffffff81124c4b>]  [<ffffffff81124c4b>] kmem_cache_alloc+0x4b/0x110
> > RSP: 0018:ffff8803c23afd90  EFLAGS: 00010282
> > RAX: 0000000000000000 RBX: ffff88040111c3b0 RCX: 000000000002a76e
> > RDX: 000000000002a76d RSI: 00000000000000d0 RDI: ffffffff8107a4e3
> > RBP: ffff8803c23afdd0 R08: 0000000000016a80 R09: 00000000ffffffff
> > R10: fffffffffffffffe R11: ffffffffffffffd0 R12: 00000000000000d0
> > R13: ffff88041d403980 R14: ffff880502efc240 R15: ffff88010e375a40
> > FS:  00007f2cae496700(0000) GS:ffff88041f2c0000(0000) knlGS:0000000000000000
> > CS:  0010 DS: 0000 ES: 0000 CR0: 000000008005003b
> > CR2: ffff880502efc240 CR3: 00000001e0ced000 CR4: 00000000000007e0
> > DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
> > DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400
> > Stack:
> >  ffff8803c23afe98 ffff8803c23afdb8 ffffffff81133811 ffff88040111c3b0
> >  ffff88010e375a40 0000000001200011 00007f2cae4969d0 ffff88010e375a40
> >  ffff8803c23afdf0 ffffffff8107a4e3 ffffffff81b49b80 0000000001200011
> > Call Trace:
> >  [<ffffffff81133811>] ? final_putname+0x21/0x50
> >  [<ffffffff8107a4e3>] prepare_creds+0x23/0x150
> >  [<ffffffff8107ab11>] copy_creds+0x31/0x160
> >  [<ffffffff8101a97b>] ? unlazy_fpu+0x9b/0xb0
> >  [<ffffffff8104ef09>] copy_process.part.49+0x239/0x1390
> >  [<ffffffff81143312>] ? __alloc_fd+0x42/0x100
> >  [<ffffffff81050134>] do_fork+0xa4/0x320
> >  [<ffffffff81131b77>] ? __do_pipe_flags+0x77/0xb0
> >  [<ffffffff81143426>] ? __fd_install+0x26/0x60
> >  [<ffffffff81050431>] SyS_clone+0x11/0x20
> >  [<ffffffff817ad849>] stub_clone+0x69/0x90
> >  [<ffffffff817ad569>] ? system_call_fastpath+0x16/0x1b
> > Code: 75 f0 4c 89 7d f8 49 8b 4d 00 65 48 03 0c 25 68 da 00 00 48 8b 51 08 4c 8b 31 4d 85 f6 74 5f 49 63 45 20 4d 8b 45 00 48 8d 4a 01 <49> 8b 1c 06 4c 89 f0 65 49 0f c7 08 0f 94 c0 84 c0 74 c8 49 63
> > RIP  [<ffffffff81124c4b>] kmem_cache_alloc+0x4b/0x110
> >  RSP <ffff8803c23afd90>
> > CR2: ffff880502efc240
> > ---[ end trace 956d153150ecc57f ]---
> 
> Looks like slab corruption to me.
> 
> Please try reproducing with "slub_debug" passed as a kernel parameter.
> It should give us some more debug output for catching the caller
> that's messing up slab.
> 
> Btw, there are some btrfs related lockup warnings in the logs so I'm
> also CC'ing Chris.

So, with slub_debug, we are seeing "Poison overwritten" on two separate
boxes (we have four running roughly the same NFS-to-btrfs snapshot backup
tasks). One does it about weekly, the other has only done it once. The
alloc/free traces are always the same -- always alloc_pipe_info and
free_pipe_info. This is seen on 3.10 and (now) 3.11-rc4:

=============================================================================
BUG kmalloc-192 (Not tainted): Poison overwritten
-----------------------------------------------------------------------------

Disabling lock debugging due to kernel taint
INFO: 0xffff880090f19e7c-0xffff880090f19e7c. First byte 0x6c instead of 0x6b
INFO: Allocated in alloc_pipe_info+0x1f/0xb0 age=15 cpu=6 pid=21914
        __slab_alloc.constprop.66+0x35b/0x3a0
        kmem_cache_alloc_trace+0xa0/0x100
        alloc_pipe_info+0x1f/0xb0
        create_pipe_files+0x41/0x1f0
        __do_pipe_flags+0x3c/0xb0
        SyS_pipe2+0x1b/0xa0
        SyS_pipe+0xb/0x10
        system_call_fastpath+0x16/0x1b
INFO: Freed in free_pipe_info+0x6a/0x70 age=14 cpu=6 pid=21914
        __slab_free+0x2d/0x2df
        kfree+0xfd/0x130
        free_pipe_info+0x6a/0x70
        pipe_release+0x94/0xf0
        __fput+0xa7/0x230
        ____fput+0x9/0x10
        task_work_run+0x97/0xd0
        do_notify_resume+0x66/0x70
        int_signal+0x12/0x17
INFO: Slab 0xffffea000243c600 objects=31 used=31 fp=0x          (null) flags=0x4000000000004080
INFO: Object 0xffff880090f19e78 @offset=7800 fp=0xffff880090f1b6d8

Bytes b4 ffff880090f19e68: 11 a2 b0 07 01 00 00 00 5a 5a 5a 5a 5a 5a 5a 5a  ........ZZZZZZZZ
Object ffff880090f19e78: 6b 6b 6b 6b 6c 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b  kkkklkkkkkkkkkkk
Object ffff880090f19e88: 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b  kkkkkkkkkkkkkkkk
Object ffff880090f19e98: 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b  kkkkkkkkkkkkkkkk
Object ffff880090f19ea8: 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b  kkkkkkkkkkkkkkkk
Object ffff880090f19eb8: 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b  kkkkkkkkkkkkkkkk
Object ffff880090f19ec8: 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b  kkkkkkkkkkkkkkkk
Object ffff880090f19ed8: 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b  kkkkkkkkkkkkkkkk
Object ffff880090f19ee8: 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b  kkkkkkkkkkkkkkkk
Object ffff880090f19ef8: 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b  kkkkkkkkkkkkkkkk
Object ffff880090f19f08: 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b  kkkkkkkkkkkkkkkk
Object ffff880090f19f18: 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b  kkkkkkkkkkkkkkkk
Object ffff880090f19f28: 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b a5  kkkkkkkkkkkkkkk.
Redzone ffff880090f19f38: bb bb bb bb bb bb bb bb                          ........
Padding ffff880090f1a078: 5a 5a 5a 5a 5a 5a 5a 5a                          ZZZZZZZZ
CPU: 6 PID: 21914 Comm: perl Tainted: G    B        3.11.0-rc4-hw+ #48
Hardware name: Dell Inc. PowerEdge R510/0DPRKF, BIOS 1.11.0 07/23/2012
 ffff880090f19e78 ffff8800a0f03c98 ffffffff817af54c 0000000000000007
 ffff88041d404900 ffff8800a0f03cc8 ffffffff81131c89 ffff880090f19e7d
 ffff88041d404900 000000000000006b ffff880090f19e78 ffff8800a0f03d18
Call Trace:
 [<ffffffff817af54c>] dump_stack+0x46/0x58
 [<ffffffff81131c89>] print_trailer+0xf9/0x160
 [<ffffffff81131e22>] check_bytes_and_report+0xe2/0x120
 [<ffffffff81132027>] check_object+0x1c7/0x240
 [<ffffffff8113fd9f>] ? alloc_pipe_info+0x1f/0xb0
 [<ffffffff817abaae>] alloc_debug_processing+0x153/0x168
 [<ffffffff817abe1e>] __slab_alloc.constprop.66+0x35b/0x3a0
 [<ffffffff8113fd9f>] ? alloc_pipe_info+0x1f/0xb0
 [<ffffffff811333a0>] kmem_cache_alloc_trace+0xa0/0x100
 [<ffffffff8114f26d>] ? inode_init_always+0xed/0x1b0
 [<ffffffff8113fd9f>] alloc_pipe_info+0x1f/0xb0
 [<ffffffff811402c1>] create_pipe_files+0x41/0x1f0
 [<ffffffff811404ac>] __do_pipe_flags+0x3c/0xb0
 [<ffffffff81152206>] ? __fd_install+0x26/0x60
 [<ffffffff8114057b>] SyS_pipe2+0x1b/0xa0
 [<ffffffff8114060b>] SyS_pipe+0xb/0x10
 [<ffffffff817bce69>] system_call_fastpath+0x16/0x1b
FIX kmalloc-192: Restoring 0xffff880090f19e7c-0xffff880090f19e7c=0x6b

FIX kmalloc-192: Marking all objects used

This and more traces posted here: http://0x.ca/sim/ref/3.11-rc4/

Is there anything more we should turn on to get more information?
CONFIG_EFENCE? :)

Simon-

^ permalink raw reply	[flat|nested] 19+ messages in thread

* Re: [3.10] Oopses in kmem_cache_allocate() via prepare_creds()
  2013-08-19 20:17   ` Simon Kirby
@ 2013-08-19 20:29     ` Christoph Lameter
  2013-08-19 21:16       ` Linus Torvalds
  0 siblings, 1 reply; 19+ messages in thread
From: Christoph Lameter @ 2013-08-19 20:29 UTC (permalink / raw)
  To: Simon Kirby; +Cc: Pekka Enberg, LKML, Chris Mason, Linus Torvalds

On Mon, 19 Aug 2013, Simon Kirby wrote:

> Object ffff880090f19e78: 6b 6b 6b 6b 6c 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b  kkkklkkkkkkkkkkk

This looks like an increment after free in the second 32 bit value of the
structure. First 32 bit value's poison is unchanged.

> CONFIG_EFENCE? :)

I think the developers should have enough information now to fix the
issue. Find the struct definition. Look for the second 32 bit value and
find out where that value is incremented.

^ permalink raw reply	[flat|nested] 19+ messages in thread

* Re: [3.10] Oopses in kmem_cache_allocate() via prepare_creds()
  2013-08-19 20:29     ` Christoph Lameter
@ 2013-08-19 21:16       ` Linus Torvalds
  2013-08-19 21:24         ` Chris Mason
  2013-08-20  4:06         ` Al Viro
  0 siblings, 2 replies; 19+ messages in thread
From: Linus Torvalds @ 2013-08-19 21:16 UTC (permalink / raw)
  To: Christoph Lameter, Al Viro; +Cc: Simon Kirby, Pekka Enberg, LKML, Chris Mason

On Mon, Aug 19, 2013 at 1:29 PM, Christoph Lameter <cl@gentwo.org> wrote:
> On Mon, 19 Aug 2013, Simon Kirby wrote:
>
>>    [... ]  The
>> alloc/free traces are always the same -- always alloc_pipe_info and
>> free_pipe_info. This is seen on 3.10 and (now) 3.11-rc4:
>>
>> Object ffff880090f19e78: 6b 6b 6b 6b 6c 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b  kkkklkkkkkkkkkkk
>
> This looks like an increment after free in the second 32 bit value of the
> structure. First 32 bit value's poison is unchanged.

Ugh. If that is "struct pipe_inode_info" and I read it right, that's
the "wait_lock" spinlock that is part of the mutex.

Doing a "spin_lock()" could indeed cause an increment operation. But
it still sounds like a very odd case. And even for some wild pointer
I'd then expect the spin_unlock to also happen, and to then increment
the next byte (or word) too. More importantly, for a mutex, I'd expect
the *other* fields to be corrupted too (the "waiter" field etc). That
is, unless we're still spinning waiting for the mutex, but with that
value we shouldn't, as far as I can see.

But it kind of does match at least one of your oopses that you had
before using slab debugging: one of them had a pointer that should
have been NULL that was 0000000100000000. Which again is "increment
the second 32-bit word", and could be explained by the slab entry
being re-used for another allocation that just happened to have a
pointer in the first 8 bytes instead.

And I think the timing is interesting, and there is data to back up
the fact that it is that mutex field: the field was introduced by
commit 72b0d9aacb89 ("pipe: don't use ->i_mutex"), which was merged
into 3.10-rc1. So it matches the timing Simon sees. So while I think
the pipe mutex spinlock field is a bit odd,

Al Viro added to the participants list. Because that
pipe->mutex->mutex_lock corruption doesn't really make sense to me,
but there are certainly interesting coincidences wrt timing.

Simon - it *might* be interesting to do this with DEBUG_PAGEALLOC, and
make the pipe_inode_info allocations use a full page instead of a
kmalloc() in order to trigger that way. So now it uses

    pipe = kzalloc(sizeof(struct pipe_inode_info), GFP_KERNEL);

and

    kfree(pipe);

in alloc_pipe_info/free_pipe_info respectively, could you make it use

    pipe = (void *)get_zeroed_page(GFP_KERNEL);

and

    free_page((unsigned long)pipe);

instead respectively, and then enable DEBUG_PAGEALLOC? That *should*
trigger an exception on the actual bad access, if it really is this
pipe_inode_info that is having problems..

                  Linus

^ permalink raw reply	[flat|nested] 19+ messages in thread

* Re: [3.10] Oopses in kmem_cache_allocate() via prepare_creds()
  2013-08-19 21:16       ` Linus Torvalds
@ 2013-08-19 21:24         ` Chris Mason
  2013-08-19 23:31           ` Simon Kirby
  2013-08-20  4:06         ` Al Viro
  1 sibling, 1 reply; 19+ messages in thread
From: Chris Mason @ 2013-08-19 21:24 UTC (permalink / raw)
  To: Linus Torvalds, Christoph Lameter, Al Viro
  Cc: Simon Kirby, Pekka Enberg, LKML

Quoting Linus Torvalds (2013-08-19 17:16:36)
> On Mon, Aug 19, 2013 at 1:29 PM, Christoph Lameter <cl@gentwo.org> wrote:
> > On Mon, 19 Aug 2013, Simon Kirby wrote:
> >
> >>    [... ]  The
> >> alloc/free traces are always the same -- always alloc_pipe_info and
> >> free_pipe_info. This is seen on 3.10 and (now) 3.11-rc4:
> >>
> >> Object ffff880090f19e78: 6b 6b 6b 6b 6c 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b  kkkklkkkkkkkkkkk
> >
> > This looks like an increment after free in the second 32 bit value of the
> > structure. First 32 bit value's poison is unchanged.
> 
> Ugh. If that is "struct pipe_inode_info" and I read it right, that's
> the "wait_lock" spinlock that is part of the mutex.
> 
> Doing a "spin_lock()" could indeed cause an increment operation. But
> it still sounds like a very odd case. And even for some wild pointer
> I'd then expect the spin_unlock to also happen, and to then increment
> the next byte (or word) too. More importantly, for a mutex, I'd expect
> the *other* fields to be corrupted too (the "waiter" field etc). That
> is, unless we're still spinning waiting for the mutex, but with that
> value we shouldn't, as far as I can see.
> 

Simon, is this box doing btrfs send/receive?  If so, it's probably where
this pipe is coming from.

Linus' CONFIG_DEBUG_PAGE_ALLOC suggestions are going to be the fastest
way to find it, I can give you a patch if it'll help.

It would be nice if you could trigger this on plain 3.11-rcX instead of
btrfs-next.

-chris

^ permalink raw reply	[flat|nested] 19+ messages in thread

* Re: [3.10] Oopses in kmem_cache_allocate() via prepare_creds()
  2013-08-19 21:24         ` Chris Mason
@ 2013-08-19 23:31           ` Simon Kirby
  2013-09-03 20:43             ` Simon Kirby
  0 siblings, 1 reply; 19+ messages in thread
From: Simon Kirby @ 2013-08-19 23:31 UTC (permalink / raw)
  To: Chris Mason
  Cc: Linus Torvalds, Christoph Lameter, Al Viro, Pekka Enberg, LKML

On Mon, Aug 19, 2013 at 05:24:41PM -0400, Chris Mason wrote:

> Quoting Linus Torvalds (2013-08-19 17:16:36)
> > On Mon, Aug 19, 2013 at 1:29 PM, Christoph Lameter <cl@gentwo.org> wrote:
> > > On Mon, 19 Aug 2013, Simon Kirby wrote:
> > >
> > >>    [... ]  The
> > >> alloc/free traces are always the same -- always alloc_pipe_info and
> > >> free_pipe_info. This is seen on 3.10 and (now) 3.11-rc4:
> > >>
> > >> Object ffff880090f19e78: 6b 6b 6b 6b 6c 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b  kkkklkkkkkkkkkkk
> > >
> > > This looks like an increment after free in the second 32 bit value of the
> > > structure. First 32 bit value's poison is unchanged.
> > 
> > Ugh. If that is "struct pipe_inode_info" and I read it right, that's
> > the "wait_lock" spinlock that is part of the mutex.
> > 
> > Doing a "spin_lock()" could indeed cause an increment operation. But
> > it still sounds like a very odd case. And even for some wild pointer
> > I'd then expect the spin_unlock to also happen, and to then increment
> > the next byte (or word) too. More importantly, for a mutex, I'd expect
> > the *other* fields to be corrupted too (the "waiter" field etc). That
> > is, unless we're still spinning waiting for the mutex, but with that
> > value we shouldn't, as far as I can see.
> > 
> 
> Simon, is this box doing btrfs send/receive?  If so, it's probably where
> this pipe is coming from.

No, not for some time (a few kernel versions ago).

> Linus' CONFIG_DEBUG_PAGE_ALLOC suggestions are going to be the fastest
> way to find it, I can give you a patch if it'll help.

I presume it's just:

diff --git a/fs/pipe.c b/fs/pipe.c
index d2c45e1..30d5b8d 100644
--- a/fs/pipe.c
+++ b/fs/pipe.c
@@ -780,7 +780,7 @@ struct pipe_inode_info *alloc_pipe_info(void)
 {
 	struct pipe_inode_info *pipe;
 
-	pipe = kzalloc(sizeof(struct pipe_inode_info), GFP_KERNEL);
+	pipe = (void *)get_zeroed_page(GFP_KERNEL);
 	if (pipe) {
 		pipe->bufs = kzalloc(sizeof(struct pipe_buffer) * PIPE_DEF_BUFFERS, GFP_KERNEL);
 		if (pipe->bufs) {
@@ -790,7 +790,7 @@ struct pipe_inode_info *alloc_pipe_info(void)
 			mutex_init(&pipe->mutex);
 			return pipe;
 		}
-		kfree(pipe);
+		free_page((unsigned long)pipe);
 	}
 
 	return NULL;
@@ -808,7 +808,7 @@ void free_pipe_info(struct pipe_inode_info *pipe)
 	if (pipe->tmp_page)
 		__free_page(pipe->tmp_page);
 	kfree(pipe->bufs);
-	kfree(pipe);
+	free_page((unsigned long)pipe);
 }
 
 static struct vfsmount *pipe_mnt __read_mostly;

...and CONFIG_DEBUG_PAGEALLOC enabled.

> It would be nice if you could trigger this on plain 3.11-rcX instead of
> btrfs-next.

On 3.10 it was with some btrfs-next pulled in, but the 3.11-rc4 traces
were from 3.11-rc4 with just some of our local patches:

> git diff --stat v3.11-rc4..master
 firmware/Makefile                         |    4 +-
 firmware/bnx2/bnx2-mips-06-6.2.3.fw.ihex  | 5804 ++++++++++++++++++++++
 firmware/bnx2/bnx2-mips-09-6.2.1b.fw.ihex | 6496 +++++++++++++++++++++++++
 kernel/acct.c                             |   21 +-
 net/sunrpc/auth.c                         |    2 +-
 net/sunrpc/clnt.c                         |   10 +
 net/sunrpc/xprt.c                         |    8 +-
 7 files changed, 12335 insertions(+), 10 deletions(-)

None of them look relevant, but I'm building vanilla -rc4 with
CONFIG_DEBUG_PAGEALLOC and the patch above.

Simon-

^ permalink raw reply related	[flat|nested] 19+ messages in thread

* Re: [3.10] Oopses in kmem_cache_allocate() via prepare_creds()
  2013-08-19 21:16       ` Linus Torvalds
  2013-08-19 21:24         ` Chris Mason
@ 2013-08-20  4:06         ` Al Viro
  2013-08-20  7:17           ` Ian Applegate
  1 sibling, 1 reply; 19+ messages in thread
From: Al Viro @ 2013-08-20  4:06 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Christoph Lameter, Simon Kirby, Pekka Enberg, LKML, Chris Mason

On Mon, Aug 19, 2013 at 02:16:36PM -0700, Linus Torvalds wrote:
> On Mon, Aug 19, 2013 at 1:29 PM, Christoph Lameter <cl@gentwo.org> wrote:
> > On Mon, 19 Aug 2013, Simon Kirby wrote:
> >
> >>    [... ]  The
> >> alloc/free traces are always the same -- always alloc_pipe_info and
> >> free_pipe_info. This is seen on 3.10 and (now) 3.11-rc4:
> >>
> >> Object ffff880090f19e78: 6b 6b 6b 6b 6c 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b  kkkklkkkkkkkkkkk
> >
> > This looks like an increment after free in the second 32 bit value of the
> > structure. First 32 bit value's poison is unchanged.
> 
> Ugh. If that is "struct pipe_inode_info" and I read it right, that's
> the "wait_lock" spinlock that is part of the mutex.
> 
> Doing a "spin_lock()" could indeed cause an increment operation. But
> it still sounds like a very odd case. And even for some wild pointer
> I'd then expect the spin_unlock to also happen, and to then increment
> the next byte (or word) too. More importantly, for a mutex, I'd expect
> the *other* fields to be corrupted too (the "waiter" field etc). That
> is, unless we're still spinning waiting for the mutex, but with that
> value we shouldn't, as far as I can see.

Point...  I would probably start with CONFIG_DEBUG_MUTEXES and see if
it screams about mutex magic, etc. having been buggered.  FWIW, pipe is
neither a FIFO nor an internal per-task one - it's a usual pipe(2) one,
which should've excluded most of the weird codepaths...

^ permalink raw reply	[flat|nested] 19+ messages in thread

* Re: [3.10] Oopses in kmem_cache_allocate() via prepare_creds()
  2013-08-20  4:06         ` Al Viro
@ 2013-08-20  7:17           ` Ian Applegate
  2013-08-20  7:21             ` Al Viro
  0 siblings, 1 reply; 19+ messages in thread
From: Ian Applegate @ 2013-08-20  7:17 UTC (permalink / raw)
  To: Al Viro
  Cc: Linus Torvalds, Christoph Lameter, Simon Kirby, Pekka Enberg,
	LKML, Chris Mason

We are also seeing this or a similar issue. On a fairly widespread
deployment of 3.10.1 & 3.10.6 this occurred fairly consistently on the
order of 36 days (combined MTBF.)

[28974.739774] ------------[ cut here ]------------
[28974.744980] kernel BUG at mm/slub.c:3352!
[28974.749502] invalid opcode: 0000 [#1] SMP
[28974.754143] Modules linked in: ipmi_devintf ipmi_si ipmi_msghandler
dm_mod md_mod nf_conntrack_ipv6 nf_defrag_ipv6 ip6table_filter
ip6table_raw ip6_tables nf_conntrack_ipv4 nf_defrag_ipv4 xt_conntrack
iptable_filter xt_tcpudp xt_CT nf_conntrack xt_multiport iptable_raw
ip_tables x_tables rpcsec_gss_krb5 auth_rpcgss oid_registry nfsv4 fuse
nfsv3 nfs_acl nfs fscache lockd sunrpc bonding tcp_cubic sg sfc(O) mtd
mdio igb dca i2c_algo_bit i2c_core ptp pps_core coretemp kvm_intel kvm
crc32c_intel aesni_intel ablk_helper cryptd lrw gf128mul glue_helper
aes_x86_64 acpi_cpufreq evdev sd_mod crc_t10dif snd_pcm tpm_tis
microcode tpm snd_timer tpm_bios snd soundcore snd_page_alloc pcspkr
ahci libahci uhci_hcd ehci_pci ehci_hcd lpc_ich libata mfd_core
usbcore usb_common hpsa scsi_mod mperf button processor thermal_sys
[28974.835407] CPU: 17 PID: 21400 Comm: nginx-fl Tainted: G
O 3.10.1-cloudflare-trace+ #2
[28974.845307] Hardware name: HP ProLiant DL180 G6  , BIOS O20 01/24/2011
[28974.852653] task: ffff8805bea7b390 ti: ffff8809a5f0e000 task.ti:
ffff8809a5f0e000
[28974.861095] RIP: 0010:[<ffffffff810ebcb2>]  [<ffffffff810ebcb2>]
kfree+0x59/0xe7
[28974.869453] RSP: 0018:ffff8809a5f0fe30  EFLAGS: 00010246
[28974.875434] RAX: 0000000000000000 RBX: ffff880b21e79d40 RCX: 0000000000000028
[28974.883458] RDX: 006ffc0000080068 RSI: 000000000001183c RDI: ffffea002c879e40
[28974.891483] RBP: ffffea002c879e40 R08: 0000000000017100 R09: ffffea0009ea0480
[28974.899507] R10: ffff880ae9746250 R11: 0000000000000000 R12: ffffffff81119456
[28974.907533] R13: 000000000000000a R14: ffff8809a5f0ff48 R15: 0000000000013b80
[28974.915558] FS:  00007fb36c115710(0000) GS:ffff880627d60000(0000)
knlGS:0000000000000000
[28974.924681] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[28974.931149] CR2: ffffffffff600400 CR3: 0000000761b93000 CR4: 00000000000007e0
[28974.939173] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
[28974.947190] DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400
[28974.955205] Stack:
[28974.957486]  ffff880a3e72a700 000000000001183c 0000000000000000
ffffffff81119456
[28974.965873]  ffff8809a5f0fe48 0000000000000000 ffff880b21e79d40
000000000000c350
[28974.974253]  ffff8809a5f0fec0 ffff8802b4781bc0 0000000000000200
ffffffff811d96eb
[28974.982646] Call Trace:
[28974.985419]  [<ffffffff81119456>] ? do_readv_writev+0xfc/0x135
[28974.991987]  [<ffffffff811d96eb>] ? ep_poll+0x215/0x286
[28974.997875]  [<ffffffff8112ccb4>] ? fget_light+0x2e/0x7c
[28975.003861]  [<ffffffff811182fd>] ? fdget+0x16/0x1c
[28975.009359]  [<ffffffff8111956c>] ? SyS_readv+0x5a/0xb0
[28975.015245]  [<ffffffff811da1ae>] ? SyS_epoll_wait+0x86/0xc1
[28975.021622]  [<ffffffff814e4d92>] ? system_call_fastpath+0x16/0x1b
[28975.028578] Code: 48 83 fb 10 0f 86 aa 00 00 00 48 89 df e8 8c de
ff ff 48 89 c7 48 89 c5 e8 b5 d7 ff ff 85 c0 75 22 48 f7 45 00 00 c0
00 00 75 02 <0f> 0b 48 89 ef e8 a8 d7 ff ff 5b 48 89 ef 89 c6 5d 41 5c
e9 9a
[28975.050651] RIP  [<ffffffff810ebcb2>] kfree+0x59/0xe7
[28975.056354]  RSP <ffff8809a5f0fe30>
[28975.062479] ---[ end trace 29f90372a2c3b0ac ]---

The machine hobbles along until all processes crash at this point.

[14064.855658] =============================================================================
[14064.864884] BUG kmalloc-192 (Tainted: G        W   ): Poison overwritten
[14064.872424] -----------------------------------------------------------------------------
[14064.872424]
[14064.883322] Disabling lock debugging due to kernel taint
[14064.889304] INFO: 0xffff880930a54824-0xffff880930a54824. First byte
0x6c instead of 0x6b
[14064.898433] INFO: Allocated in alloc_pipe_info+0x17/0x94 age=80
cpu=8 pid=28897
[14064.906684]  set_track+0x5c/0xd7
[14064.910332]  alloc_debug_processing+0x76/0xfd
[14064.915250]  __slab_alloc+0x3e0/0x417
[14064.919386]  alloc_inode+0x26/0x6c
[14064.923232]  alloc_pipe_info+0x17/0x94
[14064.927467]  kmem_cache_alloc_trace+0xbe/0x14e
[14064.932486]  alloc_pipe_info+0x17/0x94
[14064.936721]  alloc_pipe_info+0x17/0x94
[14064.940956]  create_pipe_files+0x3c/0x1e5
[14064.945481]  __do_pipe_flags+0x23/0xa7
[14064.949718]  SyS_pipe2+0x18/0x86
[14064.953374]  system_call_fastpath+0x16/0x1b
[14064.958098] INFO: Freed in pipe_release+0xc4/0xcd age=76 cpu=23 pid=28897
[14064.965740]  set_track+0x5c/0xd7
[14064.969389]  free_debug_processing+0x11d/0x1a9
[14064.974391]  __slab_free+0x32/0x30a
[14064.978334]  free_pages_prepare+0x104/0x128
[14064.983053]  pipe_release+0xc4/0xcd
[14064.986991]  __fput+0xe1/0x1e4
[14064.990443]  task_work_run+0x7b/0x8f
[14064.994482]  do_notify_resume+0x59/0x68
[14064.998816]  int_signal+0x12/0x17
[14065.002561] INFO: Slab 0xffffea0024c29500 objects=31 used=28
fp=0xffff880930a570c0 flags=0x6ffc0000004080
[14065.013341] INFO: Object 0xffff880930a54820 @offset=2080
fp=0xffff880930a55040
[14065.023177] Bytes b4 ffff880930a54810: 4d fa 14 00 01 00 00 00 5a
5a 5a 5a 5a 5a 5a 5a  M.......ZZZZZZZZ
[14065.033864] Object ffff880930a54820: 6b 6b 6b 6b 6c 6b 6b 6b 6b 6b
6b 6b 6b 6b 6b 6b  kkkklkkkkkkkkkkk
[14065.044354] Object ffff880930a54830: 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b
6b 6b 6b 6b 6b 6b  kkkkkkkkkkkkkkkk
[14065.054845] Object ffff880930a54840: 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b
6b 6b 6b 6b 6b 6b  kkkkkkkkkkkkkkkk
[14065.065335] Object ffff880930a54850: 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b
6b 6b 6b 6b 6b 6b  kkkkkkkkkkkkkkkk
[14065.075827] Object ffff880930a54860: 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b
6b 6b 6b 6b 6b 6b  kkkkkkkkkkkkkkkk
[14065.086321] Object ffff880930a54870: 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b
6b 6b 6b 6b 6b 6b  kkkkkkkkkkkkkkkk
[14065.096806] Object ffff880930a54880: 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b
6b 6b 6b 6b 6b 6b  kkkkkkkkkkkkkkkk
[14065.107298] Object ffff880930a54890: 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b
6b 6b 6b 6b 6b 6b  kkkkkkkkkkkkkkkk
[14065.117788] Object ffff880930a548a0: 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b
6b 6b 6b 6b 6b 6b  kkkkkkkkkkkkkkkk
[14065.128278] Object ffff880930a548b0: 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b
6b 6b 6b 6b 6b 6b  kkkkkkkkkkkkkkkk
[14065.138770] Object ffff880930a548c0: 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b
6b 6b 6b 6b 6b 6b  kkkkkkkkkkkkkkkk
[14065.149262] Object ffff880930a548d0: 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b
6b 6b 6b 6b 6b a5  kkkkkkkkkkkkkkk.
[14065.159751] Redzone ffff880930a548e0: bb bb bb bb bb bb bb bb
                   ........
[14065.169559] Padding ffff880930a54a20: 5a 5a 5a 5a 5a 5a 5a 5a
                   ZZZZZZZZ
[14065.179372] CPU: 10 PID: 7969 Comm: nginx-cache Tainted: G    B   W
   3.10.6-cloudflare-debug+ #5
[14065.189469] Hardware name: HP ProLiant DL180 G6  , BIOS O20 01/24/2011
[14065.196807]  0000000000000000 ffffffff810ee8d2 ffff880930a54820
ffffffff81827fe7
[14065.205185]  ffffea0024c29500 ffff880627804900 ffff880930a54820
ffffea0024c29500
[14065.213576]  ffffffff81223ebb ffff880930a54820 ffffea0024c29500
ffffffff810ee9fe
[14065.221965] Call Trace:
[14065.224743]  [<ffffffff810ee8d2>] ? check_bytes_and_report+0xa7/0xf4
[14065.231902]  [<ffffffff81223ebb>] ? kmem_alloc+0x43/0x6d
[14065.237890]  [<ffffffff810ee9fe>] ? check_object+0xdf/0x19e
[14065.244170]  [<ffffffff81223e9d>] ? kmem_alloc+0x25/0x6d
[14065.250154]  [<ffffffff814e834d>] ? alloc_debug_processing+0x5c/0xfd
[14065.257310]  [<ffffffff814e87ce>] ? __slab_alloc+0x3e0/0x417
[14065.263687]  [<ffffffff81223e9d>] ? kmem_alloc+0x25/0x6d
[14065.269671]  [<ffffffff810efe41>] ? __kmalloc+0x15b/0x191
[14065.275755]  [<ffffffff81223e9d>] ? kmem_alloc+0x25/0x6d
[14065.281733]  [<ffffffff81223e9d>] ? kmem_alloc+0x25/0x6d
[14065.287709]  [<ffffffff8124ce2f>] ? xfs_idata_realloc+0xd5/0xf7
[14065.294378]  [<ffffffff810efde0>] ? __kmalloc+0xfa/0x191
[14065.300366]  [<ffffffff81223e9d>] ? kmem_alloc+0x25/0x6d
[14065.306354]  [<ffffffff81223e9d>] ? kmem_alloc+0x25/0x6d
[14065.312340]  [<ffffffff8125c4ab>] ? xfs_log_commit_cil+0xe8/0x3ce
[14065.319202]  [<ffffffff812589a1>] ? xfs_trans_commit+0x5c/0x1c5
[14065.325873]  [<ffffffff81222ed4>] ? xfs_create+0x404/0x527
[14065.332055]  [<ffffffff8121d07d>] ? xfs_vn_mknod+0xb4/0x15e
[14065.338334]  [<ffffffff8112d010>] ? vfs_create+0x5f/0x84
[14065.344318]  [<ffffffff8112e0f6>] ? do_last+0x54b/0x9b7
[14065.350206]  [<ffffffff8112e624>] ? path_openat+0xc2/0x327
[14065.356386]  [<ffffffff81123b89>] ? do_readv_writev+0x119/0x135
[14065.363053]  [<ffffffff8112eb2c>] ? do_filp_open+0x2a/0x6e
[14065.369232]  [<ffffffff8112bda0>] ? getname_flags.part.32+0x22/0x10a
[14065.376386]  [<ffffffff8113791b>] ? __alloc_fd+0xcc/0xdb
[14065.382373]  [<ffffffff81122803>] ? do_sys_open+0x5c/0xe0
[14065.388457]  [<ffffffff814f43d2>] ? system_call_fastpath+0x16/0x1b
[14065.395413] FIX kmalloc-192: Restoring
0xffff880930a54824-0xffff880930a54824=0x6b
[14065.395413]
[14065.405533] FIX kmalloc-192: Marking all objects used

On Mon, Aug 19, 2013 at 9:06 PM, Al Viro <viro@zeniv.linux.org.uk> wrote:
> On Mon, Aug 19, 2013 at 02:16:36PM -0700, Linus Torvalds wrote:
>> On Mon, Aug 19, 2013 at 1:29 PM, Christoph Lameter <cl@gentwo.org> wrote:
>> > On Mon, 19 Aug 2013, Simon Kirby wrote:
>> >
>> >>    [... ]  The
>> >> alloc/free traces are always the same -- always alloc_pipe_info and
>> >> free_pipe_info. This is seen on 3.10 and (now) 3.11-rc4:
>> >>
>> >> Object ffff880090f19e78: 6b 6b 6b 6b 6c 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b  kkkklkkkkkkkkkkk
>> >
>> > This looks like an increment after free in the second 32 bit value of the
>> > structure. First 32 bit value's poison is unchanged.
>>
>> Ugh. If that is "struct pipe_inode_info" and I read it right, that's
>> the "wait_lock" spinlock that is part of the mutex.
>>
>> Doing a "spin_lock()" could indeed cause an increment operation. But
>> it still sounds like a very odd case. And even for some wild pointer
>> I'd then expect the spin_unlock to also happen, and to then increment
>> the next byte (or word) too. More importantly, for a mutex, I'd expect
>> the *other* fields to be corrupted too (the "waiter" field etc). That
>> is, unless we're still spinning waiting for the mutex, but with that
>> value we shouldn't, as far as I can see.
>
> Point...  I would probably start with CONFIG_DEBUG_MUTEXES and see if
> it screams about mutex magic, etc. having been buggered.  FWIW, pipe is
> neither a FIFO nor an internal per-task one - it's a usual pipe(2) one,
> which should've excluded most of the weird codepaths...
> --
> To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
> the body of a message to majordomo@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html
> Please read the FAQ at  http://www.tux.org/lkml/

^ permalink raw reply	[flat|nested] 19+ messages in thread

* Re: [3.10] Oopses in kmem_cache_allocate() via prepare_creds()
  2013-08-20  7:17           ` Ian Applegate
@ 2013-08-20  7:21             ` Al Viro
  2013-08-20  7:51               ` Ian Applegate
  0 siblings, 1 reply; 19+ messages in thread
From: Al Viro @ 2013-08-20  7:21 UTC (permalink / raw)
  To: Ian Applegate
  Cc: Linus Torvalds, Christoph Lameter, Simon Kirby, Pekka Enberg,
	LKML, Chris Mason

On Tue, Aug 20, 2013 at 12:17:52AM -0700, Ian Applegate wrote:
> We are also seeing this or a similar issue. On a fairly widespread
> deployment of 3.10.1 & 3.10.6 this occurred fairly consistently on the
> order of 36 days (combined MTBF.)

Do you have any boxen with CONFIG_DEBUG_MUTEXES among those?  What
kind of setup do you have on those, BTW?

^ permalink raw reply	[flat|nested] 19+ messages in thread

* Re: [3.10] Oopses in kmem_cache_allocate() via prepare_creds()
  2013-08-20  7:21             ` Al Viro
@ 2013-08-20  7:51               ` Ian Applegate
  2013-11-26  0:44                 ` Simon Kirby
  0 siblings, 1 reply; 19+ messages in thread
From: Ian Applegate @ 2013-08-20  7:51 UTC (permalink / raw)
  To: Al Viro
  Cc: Linus Torvalds, Christoph Lameter, Simon Kirby, Pekka Enberg,
	LKML, Chris Mason

Unfortunately no boxen with CONFIG_DEBUG_MUTEXES among them. I can
enable on a few and should have some results within the day. These
mainly serve (quite a bit of) HTTP/S cache traffic.

On Tue, Aug 20, 2013 at 12:21 AM, Al Viro <viro@zeniv.linux.org.uk> wrote:
> On Tue, Aug 20, 2013 at 12:17:52AM -0700, Ian Applegate wrote:
>> We are also seeing this or a similar issue. On a fairly widespread
>> deployment of 3.10.1 & 3.10.6 this occurred fairly consistently on the
>> order of 36 days (combined MTBF.)
>
> Do you have any boxen with CONFIG_DEBUG_MUTEXES among those?  What
> kind of setup do you have on those, BTW?
> --
> To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
> the body of a message to majordomo@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html
> Please read the FAQ at  http://www.tux.org/lkml/

^ permalink raw reply	[flat|nested] 19+ messages in thread

* Re: [3.10] Oopses in kmem_cache_allocate() via prepare_creds()
  2013-08-19 23:31           ` Simon Kirby
@ 2013-09-03 20:43             ` Simon Kirby
  0 siblings, 0 replies; 19+ messages in thread
From: Simon Kirby @ 2013-09-03 20:43 UTC (permalink / raw)
  To: Chris Mason
  Cc: Linus Torvalds, Christoph Lameter, Al Viro, Pekka Enberg, LKML

On Mon, Aug 19, 2013 at 04:31:38PM -0700, Simon Kirby wrote:

> On Mon, Aug 19, 2013 at 05:24:41PM -0400, Chris Mason wrote:
> 
> > Quoting Linus Torvalds (2013-08-19 17:16:36)
> > > On Mon, Aug 19, 2013 at 1:29 PM, Christoph Lameter <cl@gentwo.org> wrote:
> > > > On Mon, 19 Aug 2013, Simon Kirby wrote:
> > > >
> > > >>    [... ]  The
> > > >> alloc/free traces are always the same -- always alloc_pipe_info and
> > > >> free_pipe_info. This is seen on 3.10 and (now) 3.11-rc4:
> > > >>
> > > >> Object ffff880090f19e78: 6b 6b 6b 6b 6c 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b  kkkklkkkkkkkkkkk
> > > >
> > > > This looks like an increment after free in the second 32 bit value of the
> > > > structure. First 32 bit value's poison is unchanged.
> > > 
> > > Ugh. If that is "struct pipe_inode_info" and I read it right, that's
> > > the "wait_lock" spinlock that is part of the mutex.
> > > 
> > > Doing a "spin_lock()" could indeed cause an increment operation. But
> > > it still sounds like a very odd case. And even for some wild pointer
> > > I'd then expect the spin_unlock to also happen, and to then increment
> > > the next byte (or word) too. More importantly, for a mutex, I'd expect
> > > the *other* fields to be corrupted too (the "waiter" field etc). That
> > > is, unless we're still spinning waiting for the mutex, but with that
> > > value we shouldn't, as far as I can see.
> > > 
> > 
> > Simon, is this box doing btrfs send/receive?  If so, it's probably where
> > this pipe is coming from.
> 
> No, not for some time (a few kernel versions ago).
> 
> > Linus' CONFIG_DEBUG_PAGE_ALLOC suggestions are going to be the fastest
> > way to find it, I can give you a patch if it'll help.
> 
> I presume it's just:
> 
> diff --git a/fs/pipe.c b/fs/pipe.c
> index d2c45e1..30d5b8d 100644
> --- a/fs/pipe.c
> +++ b/fs/pipe.c
> @@ -780,7 +780,7 @@ struct pipe_inode_info *alloc_pipe_info(void)
>  {
>  	struct pipe_inode_info *pipe;
>  
> -	pipe = kzalloc(sizeof(struct pipe_inode_info), GFP_KERNEL);
> +	pipe = (void *)get_zeroed_page(GFP_KERNEL);
>  	if (pipe) {
>  		pipe->bufs = kzalloc(sizeof(struct pipe_buffer) * PIPE_DEF_BUFFERS, GFP_KERNEL);
>  		if (pipe->bufs) {
> @@ -790,7 +790,7 @@ struct pipe_inode_info *alloc_pipe_info(void)
>  			mutex_init(&pipe->mutex);
>  			return pipe;
>  		}
> -		kfree(pipe);
> +		free_page((unsigned long)pipe);
>  	}
>  
>  	return NULL;
> @@ -808,7 +808,7 @@ void free_pipe_info(struct pipe_inode_info *pipe)
>  	if (pipe->tmp_page)
>  		__free_page(pipe->tmp_page);
>  	kfree(pipe->bufs);
> -	kfree(pipe);
> +	free_page((unsigned long)pipe);
>  }
>  
>  static struct vfsmount *pipe_mnt __read_mostly;
> 
> ...and CONFIG_DEBUG_PAGEALLOC enabled.
> 
> > It would be nice if you could trigger this on plain 3.11-rcX instead of
> > btrfs-next.
> 
> On 3.10 it was with some btrfs-next pulled in, but the 3.11-rc4 traces
> were from 3.11-rc4 with just some of our local patches:
> 
> > git diff --stat v3.11-rc4..master
>  firmware/Makefile                         |    4 +-
>  firmware/bnx2/bnx2-mips-06-6.2.3.fw.ihex  | 5804 ++++++++++++++++++++++
>  firmware/bnx2/bnx2-mips-09-6.2.1b.fw.ihex | 6496 +++++++++++++++++++++++++
>  kernel/acct.c                             |   21 +-
>  net/sunrpc/auth.c                         |    2 +-
>  net/sunrpc/clnt.c                         |   10 +
>  net/sunrpc/xprt.c                         |    8 +-
>  7 files changed, 12335 insertions(+), 10 deletions(-)
> 
> None of them look relevant, but I'm building vanilla -rc4 with
> CONFIG_DEBUG_PAGEALLOC and the patch above.

Stock 3.11-rc4 plus the above get_zeroed_page() for pipe allocations has
been running since August 19th on a few btrfs boxes. It has been quiet
until a few days ago, where we hit this:

BUG: soft lockup - CPU#5 stuck for 22s! [btrfs-cleaner:5871]
Modules linked in: ipmi_devintf ipmi_si ipmi_msghandler aoe serio_raw bnx2 evdev
CPU: 5 PID: 5871 Comm: btrfs-cleaner Not tainted 3.11.0-rc4-hw+ #48
Hardware name: Dell Inc. PowerEdge 2950/0NH278, BIOS 2.7.0 10/30/2010
task: ffff8804261117d0 ti: ffff8804120d8000 task.ti: ffff8804120d8000
RIP: 0010:[<ffffffff817b4c8c>]  [<ffffffff817b4c8c>] _raw_spin_unlock_irqrestore+0xc/0x20
RSP: 0018:ffff8804120d98b8  EFLAGS: 00000296
RAX: ffff88042d800a80 RBX: ffffffff81130ffc RCX: 0000000000000000
RDX: 0000000000000000 RSI: 0000000000000296 RDI: 0000000000000296
RBP: ffff8804120d98b8 R08: ffff8804120d9840 R09: 0000000000000000
R10: ffff8803c6f47400 R11: ffff88042d807180 R12: ffff8803c6f474a8
R13: 00000000000000cc R14: ffffffff8101f6da R15: ffff8804120d9828
FS:  0000000000000000(0000) GS:ffff88043fd40000(0000) knlGS:0000000000000000
CS:  0010 DS: 0000 ES: 0000 CR0: 000000008005003b
CR2: ffffffffff600400 CR3: 00000001a9fc4000 CR4: 00000000000007e0
Stack:
 ffff8804120d9958 ffffffff817abea8 ffff8804120d98f8 ffffffff813733d0
 00000034ffffffff 0000000000000296 ffff88004edd99d0 0000000000000010
 ffff8804120d9918 ffffffff81319211 0000000000000004 ffff88004edd99d0
Call Trace:
 [<ffffffff817abea8>] __slab_free+0x45/0x2df
 [<ffffffff813733d0>] ? btrfs_set_lock_blocking_rw+0xa0/0xe0
 [<ffffffff81319211>] ? btrfs_set_path_blocking+0x31/0x70
 [<ffffffff8131f15b>] ? btrfs_search_old_slot+0x75b/0x980
 [<ffffffff8139899e>] ? ulist_free+0x1e/0x30
 [<ffffffff811336ed>] kfree+0xfd/0x130
 [<ffffffff8139899e>] ulist_free+0x1e/0x30
 [<ffffffff81396b96>] __resolve_indirect_refs+0x436/0x5b0
 [<ffffffff8139757c>] find_parent_nodes+0x47c/0x6c0
 [<ffffffff81397851>] btrfs_find_all_roots+0x91/0x100
 [<ffffffff8139b7e8>] btrfs_qgroup_account_ref+0x148/0x500
 [<ffffffff81328016>] btrfs_delayed_refs_qgroup_accounting+0xa6/0x100
 [<ffffffff8132f6f9>] btrfs_run_delayed_refs+0x49/0x5b0
 [<ffffffff81328460>] ? walk_down_proc+0x120/0x2d0
 [<ffffffff8135e359>] ? free_extent_buffer+0x59/0xb0
 [<ffffffff8132c741>] ? walk_up_tree+0xe1/0x1e0
 [<ffffffff8133f03a>] btrfs_should_end_transaction+0x4a/0x70
 [<ffffffff8132df2b>] btrfs_drop_snapshot+0x3db/0x6a0
 [<ffffffff8138e726>] ? btrfs_kill_all_delayed_nodes+0x56/0xf0
 [<ffffffff817b4bee>] ? _raw_spin_lock+0xe/0x30
 [<ffffffff81340b06>] btrfs_clean_one_deleted_snapshot+0xc6/0x130
 [<ffffffff81337ab0>] cleaner_kthread+0x120/0x180
 [<ffffffff81337990>] ? transaction_kthread+0x210/0x210
 [<ffffffff8107af5b>] kthread+0xbb/0xc0
 [<ffffffff81010000>] ? xen_smp_intr_init+0x100/0x220
 [<ffffffff8107aea0>] ? kthread_freezable_should_stop+0x60/0x60
 [<ffffffff817bcdbc>] ret_from_fork+0x7c/0xb0
 [<ffffffff8107aea0>] ? kthread_freezable_should_stop+0x60/0x60
Code: 66 0f c1 07 0f b6 d4 38 c2 74 0d 0f 1f 40 00 f3 90 0f b6 07 38 d0 75 f7 5d c3 0f 1f 44 00 00 55 48 89 e5 80 07 01 48 89 f7 57 9d <0f> 1f 44 00 00 5d c3 66 66 66 66 2e 0f 1f 84 00 00 00 00 00 55

Perhaps something is still be wrong with the btrfs ulist stuff, even with
35f0399db6658f465b00893bdd13b992a0acfef0?

I was thinking of going back to the regular slab pipe allocations and
DEBUG_MUTEXES, but it may not be pipe-related at this point...

Simon-

^ permalink raw reply	[flat|nested] 19+ messages in thread

* Re: [3.10] Oopses in kmem_cache_allocate() via prepare_creds()
  2013-08-20  7:51               ` Ian Applegate
@ 2013-11-26  0:44                 ` Simon Kirby
  2013-11-26 23:16                   ` Linus Torvalds
  0 siblings, 1 reply; 19+ messages in thread
From: Simon Kirby @ 2013-11-26  0:44 UTC (permalink / raw)
  To: Ian Applegate
  Cc: Al Viro, Linus Torvalds, Christoph Lameter, Pekka Enberg, LKML,
	Chris Mason

On Tue, Aug 20, 2013 at 12:51:11AM -0700, Ian Applegate wrote:

> Unfortunately no boxen with CONFIG_DEBUG_MUTEXES among them. I can
> enable on a few and should have some results within the day. These
> mainly serve (quite a bit of) HTTP/S cache traffic.
> 
> On Tue, Aug 20, 2013 at 12:21 AM, Al Viro <viro@zeniv.linux.org.uk> wrote:
> > On Tue, Aug 20, 2013 at 12:17:52AM -0700, Ian Applegate wrote:
> >> We are also seeing this or a similar issue. On a fairly widespread
> >> deployment of 3.10.1 & 3.10.6 this occurred fairly consistently on the
> >> order of 36 days (combined MTBF.)
> >
> > Do you have any boxen with CONFIG_DEBUG_MUTEXES among those?  What
> > kind of setup do you have on those, BTW?

Hmm. So, we went through a few months of running with Linus' suggested
culprit-catching patch w/DEBUG_PAGE_ALLOC, but it never tripped. We also
ran with DEBUG_MUTEXES, but that never seem to catch anything, either.

Ian, is it true that what you saw involved no btrfs? I was still guessing
this is related to btrfs, as we are only seeing this on boxes doing btrfs
rsync-snapshot backups. I don't know what else is interesting about our
workload there, since we're not doing anything exotic.

Meanwhille, with DEBUG_LIST on 3.12-rc, we found list corruption, which
Josef fixed in 93858769172c4e3678917810e9d5de360eb991cc. This missed
3.12, unfortunately, so I built a 3.12 with Josef's btrfs-next merged (to
54563d41a58be77e9bd9ef7af1ea4026cf0e7e07, which contained that fix).

I was hoping this or something else by 3.12 would have fixed it, so after
testing we deployed this everywhere and turned off the rest of the debug
options. I missed slub_debug on one server, though...and it just hit
another case of overwritten poison.

Is it true that with slub_debug, aliasing of equal-sized objects is
turned off, and so they shouldn't be immediately side-by-side? In other
words, would there be similar scrawling victim chances as allocating
pipe_inode_info with pages instead of slabs? "slabinfo -a" is empty.

[158037.526662] =============================================================================
[158037.528014] BUG kmalloc-192 (Not tainted): Poison overwritten
[158037.528014] -----------------------------------------------------------------------------
[158037.528014] 
[158037.528014] Disabling lock debugging due to kernel taint
[158037.528014] INFO: 0xffff88013af3da6c-0xffff88013af3da6c. First byte 0x6c instead of 0x6b
[158037.528014] INFO: Allocated in alloc_pipe_info+0x1f/0xb0 age=22 cpu=3 pid=26402
[158037.528014]         __slab_alloc.constprop.63+0x35b/0x3a0
[158037.528014]         kmem_cache_alloc_trace+0xab/0x110
[158037.528014]         alloc_pipe_info+0x1f/0xb0
[158037.528014]         create_pipe_files+0x41/0x1f0
[158037.528014]         __do_pipe_flags+0x3c/0xb0   
[158037.528014]         SyS_pipe2+0x1b/0xa0
[158037.528014]         SyS_pipe+0xb/0x10  
[158037.528014]         system_call_fastpath+0x16/0x1b
[158037.528014] INFO: Freed in free_pipe_info+0x6a/0x70 age=39 cpu=1 pid=26402
[158037.528014]         __slab_free+0x2d/0x2d4
[158037.528014]         kfree+0xfd/0x130
[158037.528014]         free_pipe_info+0x6a/0x70
[158037.528014]         pipe_release+0x94/0xf0  
[158037.528014]         __fput+0xa7/0x230
[158037.528014]         ____fput+0x9/0x10
[158037.528014]         task_work_run+0x97/0xd0
[158037.528014]         do_notify_resume+0x66/0x70
[158037.528014]         int_signal+0x12/0x17
[158037.528014] INFO: Slab 0xffffea0004ebcf00 objects=31 used=29 fp=0xffff88013af3e080 flags=0x8000000000004080
[158037.528014] INFO: Object 0xffff88013af3da68 @offset=6760 fp=0xffff88013af3ca28
[158037.528014] 
[158037.528014] Bytes b4 ffff88013af3da58: 97 b8 59 02 01 00 00 00 5a 5a 5a 5a 5a 5a 5a 5a  ..Y.....ZZZZZZZZ
[158037.528014] Object ffff88013af3da68: 6b 6b 6b 6b 6c 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b  kkkklkkkkkkkkkkk  
[158037.528014] Object ffff88013af3da78: 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b  kkkkkkkkkkkkkkkk  
[158037.528014] Object ffff88013af3da88: 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b  kkkkkkkkkkkkkkkk  
[158037.528014] Object ffff88013af3da98: 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b  kkkkkkkkkkkkkkkk  
[158037.528014] Object ffff88013af3daa8: 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b  kkkkkkkkkkkkkkkk  
[158037.528014] Object ffff88013af3dab8: 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b  kkkkkkkkkkkkkkkk  
[158037.528014] Object ffff88013af3dac8: 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b  kkkkkkkkkkkkkkkk  
[158037.528014] Object ffff88013af3dad8: 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b  kkkkkkkkkkkkkkkk  
[158037.528014] Object ffff88013af3dae8: 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b  kkkkkkkkkkkkkkkk  
[158037.528014] Object ffff88013af3daf8: 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b  kkkkkkkkkkkkkkkk  
[158037.528014] Object ffff88013af3db08: 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b  kkkkkkkkkkkkkkkk  
[158037.528014] Object ffff88013af3db18: 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b a5  kkkkkkkkkkkkkkk.  
[158037.528014] Redzone ffff88013af3db28: bb bb bb bb bb bb bb bb                          ........
[158037.528014] Padding ffff88013af3dc68: 5a 5a 5a 5a 5a 5a 5a 5a                          ZZZZZZZZ
[158037.528014] CPU: 3 PID: 26402 Comm: perl Tainted: G    B        3.12.0-hw+ #75
[158037.528014] Hardware name: Dell Inc. PowerEdge 1950/0UR033, BIOS 2.0.1 10/30/2007
[158037.528014]  ffff88013af3da68 ffff88012252fc98 ffffffff817f6c6e 0000000000000007 
[158037.528014]  ffff88042d804900 ffff88012252fcc8 ffffffff81135099 ffff88013af3da6d 
[158037.528014]  ffff88042d804900 000000000000006b ffff88013af3da68 ffff88012252fd18 
[158037.528014] Call Trace:
[158037.528014]  [<ffffffff817f6c6e>] dump_stack+0x46/0x58
[158037.528014]  [<ffffffff81135099>] print_trailer+0xf9/0x160
[158037.528014]  [<ffffffff81135232>] check_bytes_and_report+0xe2/0x120
[158037.528014]  [<ffffffff81135437>] check_object+0x1c7/0x240
[158037.528014]  [<ffffffff8114325f>] ? alloc_pipe_info+0x1f/0xb0
[158037.528014]  [<ffffffff817f2d0c>] alloc_debug_processing+0x153/0x168
[158037.528014]  [<ffffffff817f307c>] __slab_alloc.constprop.63+0x35b/0x3a0
[158037.528014]  [<ffffffff8114325f>] ? alloc_pipe_info+0x1f/0xb0
[158037.528014]  [<ffffffff811369ab>] kmem_cache_alloc_trace+0xab/0x110
[158037.528014]  [<ffffffff81152f2d>] ? inode_init_always+0xed/0x1b0   
[158037.528014]  [<ffffffff8114325f>] alloc_pipe_info+0x1f/0xb0
[158037.528014]  [<ffffffff81143781>] create_pipe_files+0x41/0x1f0
[158037.528014]  [<ffffffff8114396c>] __do_pipe_flags+0x3c/0xb0   
[158037.528014]  [<ffffffff8107808c>] ? task_work_run+0x9c/0xd0   
[158037.528014]  [<ffffffff81143a3b>] SyS_pipe2+0x1b/0xa0
[158037.528014]  [<ffffffff818045f0>] ? int_signal+0x12/0x17
[158037.528014]  [<ffffffff81143acb>] SyS_pipe+0xb/0x10
[158037.528014]  [<ffffffff818043b9>] system_call_fastpath+0x16/0x1b
[158037.528014] FIX kmalloc-192: Restoring 0xffff88013af3da6c-0xffff88013af3da6c=0x6b
[158037.528014] 
[158037.528014] FIX kmalloc-192: Marking all objects used

There was a crash from btrfs memcpy shortly after this, which may or may
not be related, followed by another poison overwritten. Writeback was
wedged at this point.

Full kernel log: http://0x.ca/sim/ref/3.12/dmesg
Other stuff: http://0x.ca/sim/ref/3.12/

Simon-

^ permalink raw reply	[flat|nested] 19+ messages in thread

* Re: [3.10] Oopses in kmem_cache_allocate() via prepare_creds()
  2013-11-26  0:44                 ` Simon Kirby
@ 2013-11-26 23:16                   ` Linus Torvalds
  2013-11-26 23:44                     ` Linus Torvalds
  2013-11-30  9:43                     ` Simon Kirby
  0 siblings, 2 replies; 19+ messages in thread
From: Linus Torvalds @ 2013-11-26 23:16 UTC (permalink / raw)
  To: Simon Kirby
  Cc: Ian Applegate, Al Viro, Christoph Lameter, Pekka Enberg, LKML,
	Chris Mason

On Mon, Nov 25, 2013 at 4:44 PM, Simon Kirby <sim@hostway.ca> wrote:
>
> I was hoping this or something else by 3.12 would have fixed it, so after
> testing we deployed this everywhere and turned off the rest of the debug
> options. I missed slub_debug on one server, though...and it just hit
> another case of overwritten poison.

Your thing is *very* consistent, it's once more four bytes into that
pipe-info. And it's once more that exact same "increment second word
in the allocation" pattern.

But the fact that it goes away when you enable other debug options is
really really annoying. It's consistent only if you don't have the
options on that might help us debug it further. Damn.

> Is it true that with slub_debug, aliasing of equal-sized objects is
> turned off, and so they shouldn't be immediately side-by-side? In other
> words, would there be similar scrawling victim chances as allocating
> pipe_inode_info with pages instead of slabs? "slabinfo -a" is empty.

So the thing is, with slub debugging, slub shouldn't be merging
different slab caches.

HOWEVER.

The pipe-info structure isn't using its own slab cache, it's just
using "kmalloc()". So it by definition will merge with all other
kmalloc() allocations of the same size (or, to be exact, of "similar
enough size to hit the same size bucket"). In your case it's the
192-byte-sized bucket.

But all the debugging code talks purely about pipe_info allocations -
both the previous kmalloc/kfree _and_ the kmalloc() that actually sees
the slub debugging error. So if it's mixing with something else, I'm
not seeing what that would be. It would have to be an older allocation
(as it "it got re-allocated to a pipe in between") or another type
that was the similar size.

Which doesn't look all that likely. Not when your problems are so
consistent, and seem to be *always* about that pipe_inode_info.

But dammit, that it such a simple set of allocations. I still don't
see how they could be to blame. And if it's some suspicious access to
the pipe mutex (that second word is still the "wait_lock" spinlock in
the pipe inode mutex) I really would have expected the mutex debugging
to have screamed loudly. Or the DEBUG_PAGEALLOC.

I'm really not very happy with the whole pipe locking logic (or the
refcounting we do, separately from the "struct inode"), and in that
sense I'm perfectly willing to blame that code for doing bad things.
But the fact that it all goes away with debugging makes me very very
unhappy.

                Linus

^ permalink raw reply	[flat|nested] 19+ messages in thread

* Re: [3.10] Oopses in kmem_cache_allocate() via prepare_creds()
  2013-11-26 23:16                   ` Linus Torvalds
@ 2013-11-26 23:44                     ` Linus Torvalds
  2013-11-30  9:43                     ` Simon Kirby
  1 sibling, 0 replies; 19+ messages in thread
From: Linus Torvalds @ 2013-11-26 23:44 UTC (permalink / raw)
  To: Simon Kirby
  Cc: Ian Applegate, Al Viro, Christoph Lameter, Pekka Enberg, LKML,
	Chris Mason

[-- Attachment #1: Type: text/plain, Size: 1397 bytes --]

On Tue, Nov 26, 2013 at 3:16 PM, Linus Torvalds
<torvalds@linux-foundation.org> wrote:
>
> I'm really not very happy with the whole pipe locking logic (or the
> refcounting we do, separately from the "struct inode"), and in that
> sense I'm perfectly willing to blame that code for doing bad things.
> But the fact that it all goes away with debugging makes me very very
> unhappy.

Al, I really hate the "pipe_lock()" function that tests for
"pipe->files" being non-zero. I don't think there are any valid cases
where pipe->file ever *could* be zero, and if there are, they are
fundamentally racy.

Is there really any reason for that test? It used to test for
"pipe->inode" and that kind of made sense as the pipe didn't even have
a lock (it re-used the inode one). But these days that test makes zero
sense, except as a "don't even bother locking if this is some fake
internal pipe", but is that even valid?

Also, why does "pipe_release()" still use i_pipe. I'd much rather it
used "file->private_data" like everything else (and then it
unconditionally clear it). We are, after all, talking about releasing
the *file*, and we shouldn't be mixing up that inode in there.

IOW, what is wrong with the attached patch? Then we could/should

 - make the free_pipe_info() happen from the drop_inode()

 - delete pipe->files counter entirely because it has no valid use

Hmm?

                 Linus

[-- Attachment #2: patch.diff --]
[-- Type: text/plain, Size: 1113 bytes --]

 fs/pipe.c | 11 ++++++-----
 1 file changed, 6 insertions(+), 5 deletions(-)

diff --git a/fs/pipe.c b/fs/pipe.c
index d2c45e14e6d8..719214ed5e5e 100644
--- a/fs/pipe.c
+++ b/fs/pipe.c
@@ -56,8 +56,8 @@ unsigned int pipe_min_size = PAGE_SIZE;
 
 static void pipe_lock_nested(struct pipe_inode_info *pipe, int subclass)
 {
-	if (pipe->files)
-		mutex_lock_nested(&pipe->mutex, subclass);
+	WARN_ON_ONCE(!pipe->files);
+	mutex_lock_nested(&pipe->mutex, subclass);
 }
 
 void pipe_lock(struct pipe_inode_info *pipe)
@@ -71,8 +71,8 @@ EXPORT_SYMBOL(pipe_lock);
 
 void pipe_unlock(struct pipe_inode_info *pipe)
 {
-	if (pipe->files)
-		mutex_unlock(&pipe->mutex);
+	WARN_ON_ONCE(!pipe->files);
+	mutex_unlock(&pipe->mutex);
 }
 EXPORT_SYMBOL(pipe_unlock);
 
@@ -729,9 +729,10 @@ pipe_poll(struct file *filp, poll_table *wait)
 static int
 pipe_release(struct inode *inode, struct file *file)
 {
-	struct pipe_inode_info *pipe = inode->i_pipe;
+	struct pipe_inode_info *pipe = file->private_data;
 	int kill = 0;
 
+	file->private_data = NULL;
 	__pipe_lock(pipe);
 	if (file->f_mode & FMODE_READ)
 		pipe->readers--;

^ permalink raw reply related	[flat|nested] 19+ messages in thread

* Re: [3.10] Oopses in kmem_cache_allocate() via prepare_creds()
  2013-11-26 23:16                   ` Linus Torvalds
  2013-11-26 23:44                     ` Linus Torvalds
@ 2013-11-30  9:43                     ` Simon Kirby
  2013-11-30 17:25                       ` Linus Torvalds
  2013-11-30 21:08                       ` Linus Torvalds
  1 sibling, 2 replies; 19+ messages in thread
From: Simon Kirby @ 2013-11-30  9:43 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Ian Applegate, Al Viro, Christoph Lameter, Pekka Enberg, LKML,
	Chris Mason

On Tue, Nov 26, 2013 at 03:16:09PM -0800, Linus Torvalds wrote:

> On Mon, Nov 25, 2013 at 4:44 PM, Simon Kirby <sim@hostway.ca> wrote:
> >
> > I was hoping this or something else by 3.12 would have fixed it, so after
> > testing we deployed this everywhere and turned off the rest of the debug
> > options. I missed slub_debug on one server, though...and it just hit
> > another case of overwritten poison.
> 
> Your thing is *very* consistent, it's once more four bytes into that
> pipe-info. And it's once more that exact same "increment second word
> in the allocation" pattern.
> 
> > Is it true that with slub_debug, aliasing of equal-sized objects is
> > turned off, and so they shouldn't be immediately side-by-side? In other
> > words, would there be similar scrawling victim chances as allocating
> > pipe_inode_info with pages instead of slabs? "slabinfo -a" is empty.
> 
> So the thing is, with slub debugging, slub shouldn't be merging
> different slab caches.
> 
> HOWEVER.
> 
> The pipe-info structure isn't using its own slab cache, it's just
> using "kmalloc()". So it by definition will merge with all other
> kmalloc() allocations of the same size (or, to be exact, of "similar
> enough size to hit the same size bucket"). In your case it's the
> 192-byte-sized bucket.

I turned on kmalloc-192 tracing to find what else is using it: struct
nfs_fh, struct bio, and struct cred. Poking around those, struct bio has
bi_cnt, but it is way down in the struct. struct cred has "usage", but it
comes first. Hmm. Nevertheless, I set:

CONFIG_DEBUG_MUTEXES=y
CONFIG_DEBUG_LIST=y
CONFIG_DEBUG_CREDENTIALS=y

And tried:

diff --git a/include/linux/bio.h b/include/linux/bio.h
index ec48bac..216dc43 100644
--- a/include/linux/bio.h
+++ b/include/linux/bio.h
@@ -168,7 +168,7 @@ static inline void *bio_data(struct bio *bio)
  * returns. and then bio would be freed memory when if (bio->bi_flags ...)
  * runs
  */
-#define bio_get(bio)	atomic_inc(&(bio)->bi_cnt)
+#define bio_get(bio)	WARN_ON(atomic_inc_return(&(bio)->bi_cnt) == 0x6c)
 
 #if defined(CONFIG_BLK_DEV_INTEGRITY)
 /*
diff --git a/include/linux/cred.h b/include/linux/cred.h
index 04421e8..2646fe9 100644
--- a/include/linux/cred.h
+++ b/include/linux/cred.h
@@ -205,7 +205,9 @@ static inline void validate_process_creds(void)
  */
 static inline struct cred *get_new_cred(struct cred *cred)
 {
-	atomic_inc(&cred->usage);
+	if (atomic_inc_return(&cred->usage) == 0x6c) {
+		WARN_ON(cred->uid == 0x6b);
+	}
 	return cred;
 }
 
On the same server, this last hunk warned fairly quickly:

[  850.303535] ------------[ cut here ]------------
[  850.312774] WARNING: CPU: 3 PID: 6169 at include/linux/cred.h:209 get_empty_filp+0x109/0x1b0()
[  850.329974] Modules linked in: ipmi_devintf aoe ipmi_si bnx2 ipmi_msghandler evdev serio_raw
[  850.346913] CPU: 3 PID: 6169 Comm: omreport Not tainted 3.12.0-hw-debug-mutexes+ #83
[  850.362374] Hardware name: Dell Inc. PowerEdge 1950/0UR033, BIOS 2.0.1 10/30/2007
[  850.377316]  0000000000000009 ffff880428d0fd28 ffffffff817f2407 ffff88043fccf9e8
[  850.392134]  0000000000000000 ffff880428d0fd68 ffffffff8105a537 ffff880428d0fd58
[  850.406936]  ffff880428d89e00 ffff88042960f480 ffff880428d0ff24 ffff88042a190000
[  850.421746] Call Trace:
[  850.426627]  [<ffffffff817f2407>] dump_stack+0x46/0x58
[  850.436888]  [<ffffffff8105a537>] warn_slowpath_common+0x87/0xb0
[  850.448878]  [<ffffffff8105a575>] warn_slowpath_null+0x15/0x20
[  850.460523]  [<ffffffff8113c7c9>] get_empty_filp+0x109/0x1b0
[  850.471818]  [<ffffffff811499c3>] path_openat+0x43/0x660
[  850.482426]  [<ffffffff8118595b>] ? fcntl_setlk+0x5b/0x2d0
[  850.493391]  [<ffffffff8114a38e>] do_filp_open+0x3e/0xa0
[  850.504008]  [<ffffffff81157bc4>] ? mntput_no_expire+0x44/0x130
[  850.515842]  [<ffffffff81156032>] ? __alloc_fd+0x42/0x110
[  850.526630]  [<ffffffff81139e9c>] do_sys_open+0x13c/0x230
[  850.537428]  [<ffffffff81187946>] compat_SyS_open+0x16/0x20
[  850.548579]  [<ffffffff81802268>] sysenter_dispatch+0x7/0x25
[  850.559888] ---[ end trace acdbea3e141dbaec ]---

All traces are the same, and all Comms are "omreport", which is from the
Dell OpenManage tools blob, executed regularly for RAID monitoring.
Running it directly does not seem to cause the warning. kern.log shows it
seems to warn every 20 minutes. No warnings from CONFIG_DEBUG_CREDENTIALS
magic checking at all.

Is there anything interesting about this tool? It is 32-bit. I can hook
path_openat() and check for the cred contents there to print the path, if
that would help.

Simon-

^ permalink raw reply related	[flat|nested] 19+ messages in thread

* Re: [3.10] Oopses in kmem_cache_allocate() via prepare_creds()
  2013-11-30  9:43                     ` Simon Kirby
@ 2013-11-30 17:25                       ` Linus Torvalds
  2013-11-30 21:04                         ` Simon Kirby
  2013-11-30 21:08                       ` Linus Torvalds
  1 sibling, 1 reply; 19+ messages in thread
From: Linus Torvalds @ 2013-11-30 17:25 UTC (permalink / raw)
  To: Simon Kirby
  Cc: Ian Applegate, Al Viro, Christoph Lameter, Pekka Enberg, LKML,
	Chris Mason

On Sat, Nov 30, 2013 at 1:43 AM, Simon Kirby <sim@hostway.ca> wrote:
> On Tue, Nov 26, 2013 at 03:16:09PM -0800, Linus Torvalds wrote:
>>
>> The pipe-info structure isn't using its own slab cache, it's just
>> using "kmalloc()". So it by definition will merge with all other
>> kmalloc() allocations of the same size (or, to be exact, of "similar
>> enough size to hit the same size bucket"). In your case it's the
>> 192-byte-sized bucket.
>
> I turned on kmalloc-192 tracing to find what else is using it: struct
> nfs_fh, struct bio, and struct cred. Poking around those, struct bio has
> bi_cnt, but it is way down in the struct. struct cred has "usage", but it
> comes first. Hmm. Nevertheless, I set:
>
> CONFIG_DEBUG_MUTEXES=y
> CONFIG_DEBUG_LIST=y
> CONFIG_DEBUG_CREDENTIALS=y
>
> And tried:
>
> diff --git a/include/linux/cred.h b/include/linux/cred.h
> index 04421e8..2646fe9 100644
> --- a/include/linux/cred.h
> +++ b/include/linux/cred.h
> @@ -205,7 +205,9 @@ static inline void validate_process_creds(void)
>   */
>  static inline struct cred *get_new_cred(struct cred *cred)
>  {
> -       atomic_inc(&cred->usage);
> +       if (atomic_inc_return(&cred->usage) == 0x6c) {
> +               WARN_ON(cred->uid == 0x6b);

Oh, damn, I thought you had found it, and got very excited and already
wrote a long email about things I wanted you to try. And then I
started looking closer...

That test is wrong. Both of those fields are 32-bit, so testing them
against 0x6b/0x6c is bogus: you're just catching real cases. The
reason it catches omreport is presumably because omreport runs as some
special user that happens to have uid 107 (on my machine that happens
to be qemu). And having a usage count of 108 isn't particularly
strange either - creds get a lot of re-use.

So close. It *might* still be one of those cases, but it doesn't
really sound very likely. "bi_cnt" is deep inside the struct bio, and
"usage" is at offset 0, not offset 4. And the ns_fh isn't very
interesting.

             Linus

^ permalink raw reply	[flat|nested] 19+ messages in thread

* Re: [3.10] Oopses in kmem_cache_allocate() via prepare_creds()
  2013-11-30 17:25                       ` Linus Torvalds
@ 2013-11-30 21:04                         ` Simon Kirby
  0 siblings, 0 replies; 19+ messages in thread
From: Simon Kirby @ 2013-11-30 21:04 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Ian Applegate, Al Viro, Christoph Lameter, Pekka Enberg, LKML,
	Chris Mason

On Sat, Nov 30, 2013 at 09:25:33AM -0800, Linus Torvalds wrote:

> On Sat, Nov 30, 2013 at 1:43 AM, Simon Kirby <sim@hostway.ca> wrote:
>
> > I turned on kmalloc-192 tracing to find what else is using it: struct
> > nfs_fh, struct bio, and struct cred. Poking around those, struct bio has
> > bi_cnt, but it is way down in the struct. struct cred has "usage", but it
> > comes first. Hmm. Nevertheless, I set:
> >
> > CONFIG_DEBUG_MUTEXES=y
> > CONFIG_DEBUG_LIST=y
> > CONFIG_DEBUG_CREDENTIALS=y
> >
> > And tried:
> >
> > diff --git a/include/linux/cred.h b/include/linux/cred.h
> > index 04421e8..2646fe9 100644
> > --- a/include/linux/cred.h
> > +++ b/include/linux/cred.h
> > @@ -205,7 +205,9 @@ static inline void validate_process_creds(void)
> >   */
> >  static inline struct cred *get_new_cred(struct cred *cred)
> >  {
> > -       atomic_inc(&cred->usage);
> > +       if (atomic_inc_return(&cred->usage) == 0x6c) {
> > +               WARN_ON(cred->uid == 0x6b);
> 
> Oh, damn, I thought you had found it, and got very excited and already
> wrote a long email about things I wanted you to try. And then I
> started looking closer...
> 
> That test is wrong. Both of those fields are 32-bit, so testing them
> against 0x6b/0x6c is bogus: you're just catching real cases. The
> reason it catches omreport is presumably because omreport runs as some
> special user that happens to have uid 107 (on my machine that happens
> to be qemu). And having a usage count of 108 isn't particularly
> strange either - creds get a lot of re-use.
> 
> So close. It *might* still be one of those cases, but it doesn't
> really sound very likely. "bi_cnt" is deep inside the struct bio, and
> "usage" is at offset 0, not offset 4. And the ns_fh isn't very
> interesting.

*head smack* Too much 8-bit AVR coding...

Makes sense now: uid=107(nagios) gid=109(nagios) groups=109(nagios)

Well, the chances of atomic_inc intentionally incrementing 0x6b6b6b6b are
probably pretty rare. I'll try that.

Simon-

^ permalink raw reply	[flat|nested] 19+ messages in thread

* Re: [3.10] Oopses in kmem_cache_allocate() via prepare_creds()
  2013-11-30  9:43                     ` Simon Kirby
  2013-11-30 17:25                       ` Linus Torvalds
@ 2013-11-30 21:08                       ` Linus Torvalds
  1 sibling, 0 replies; 19+ messages in thread
From: Linus Torvalds @ 2013-11-30 21:08 UTC (permalink / raw)
  To: Simon Kirby
  Cc: Ian Applegate, Al Viro, Christoph Lameter, Pekka Enberg, LKML,
	Chris Mason

[-- Attachment #1: Type: text/plain, Size: 752 bytes --]

On Sat, Nov 30, 2013 at 1:43 AM, Simon Kirby <sim@hostway.ca> wrote:
>
> I turned on kmalloc-192 tracing to find what else is using it: struct
> nfs_fh, struct bio, and struct cred. Poking around those, struct bio has
> bi_cnt, but it is way down in the struct. struct cred has "usage", but it
> comes first.

You could also try to avoid the whole aliasing by using a separate
slab for just the pipe_inode_info.

Something like the attached patch (which also enables debugging just
for that slab).

I still don't see what could be wrong with the pipe_inode_info thing,
but the fact that it's been so consistent in your traces does make me
suspect it really is *that* particular slab. But I dunno, and this
would just confirm it.

                Linus

[-- Attachment #2: patch.diff --]
[-- Type: text/plain, Size: 1901 bytes --]

 fs/pipe.c | 21 ++++++++++++++++-----
 1 file changed, 16 insertions(+), 5 deletions(-)

diff --git a/fs/pipe.c b/fs/pipe.c
index d2c45e14e6d8..7cfd6cdf0147 100644
--- a/fs/pipe.c
+++ b/fs/pipe.c
@@ -729,7 +729,7 @@ pipe_poll(struct file *filp, poll_table *wait)
 static int
 pipe_release(struct inode *inode, struct file *file)
 {
-	struct pipe_inode_info *pipe = inode->i_pipe;
+	struct pipe_inode_info *pipe = file->private_data;
 	int kill = 0;
 
 	__pipe_lock(pipe);
@@ -776,11 +776,13 @@ pipe_fasync(int fd, struct file *filp, int on)
 	return retval;
 }
 
+static struct kmem_cache *pipe_info_slab;
+
 struct pipe_inode_info *alloc_pipe_info(void)
 {
 	struct pipe_inode_info *pipe;
 
-	pipe = kzalloc(sizeof(struct pipe_inode_info), GFP_KERNEL);
+	pipe = kmem_cache_zalloc(pipe_info_slab, GFP_KERNEL);
 	if (pipe) {
 		pipe->bufs = kzalloc(sizeof(struct pipe_buffer) * PIPE_DEF_BUFFERS, GFP_KERNEL);
 		if (pipe->bufs) {
@@ -790,7 +792,7 @@ struct pipe_inode_info *alloc_pipe_info(void)
 			mutex_init(&pipe->mutex);
 			return pipe;
 		}
-		kfree(pipe);
+		kmem_cache_free(pipe_info_slab, pipe);
 	}
 
 	return NULL;
@@ -808,7 +810,7 @@ void free_pipe_info(struct pipe_inode_info *pipe)
 	if (pipe->tmp_page)
 		__free_page(pipe->tmp_page);
 	kfree(pipe->bufs);
-	kfree(pipe);
+	kmem_cache_free(pipe_info_slab, pipe);
 }
 
 static struct vfsmount *pipe_mnt __read_mostly;
@@ -1311,8 +1313,17 @@ static struct file_system_type pipe_fs_type = {
 
 static int __init init_pipe_fs(void)
 {
-	int err = register_filesystem(&pipe_fs_type);
+	int err;
+
+	pipe_info_slab = kmem_cache_create("pipe-inode-info",
+		sizeof(struct pipe_inode_info),
+		0,
+		SLAB_RED_ZONE | SLAB_POISON | SLAB_STORE_USER | SLAB_DEBUG_FREE,
+		NULL);
+	if (!pipe_info_slab)
+		return -ENOMEM;
 
+	err = register_filesystem(&pipe_fs_type);
 	if (!err) {
 		pipe_mnt = kern_mount(&pipe_fs_type);
 		if (IS_ERR(pipe_mnt)) {

^ permalink raw reply related	[flat|nested] 19+ messages in thread

end of thread, other threads:[~2013-11-30 21:08 UTC | newest]

Thread overview: 19+ messages (download: mbox.gz follow: Atom feed
-- links below jump to the message on this page --
2013-07-06  0:09 [3.10] Oopses in kmem_cache_allocate() via prepare_creds() Simon Kirby
2013-07-06  8:27 ` Pekka Enberg
2013-08-19 20:17   ` Simon Kirby
2013-08-19 20:29     ` Christoph Lameter
2013-08-19 21:16       ` Linus Torvalds
2013-08-19 21:24         ` Chris Mason
2013-08-19 23:31           ` Simon Kirby
2013-09-03 20:43             ` Simon Kirby
2013-08-20  4:06         ` Al Viro
2013-08-20  7:17           ` Ian Applegate
2013-08-20  7:21             ` Al Viro
2013-08-20  7:51               ` Ian Applegate
2013-11-26  0:44                 ` Simon Kirby
2013-11-26 23:16                   ` Linus Torvalds
2013-11-26 23:44                     ` Linus Torvalds
2013-11-30  9:43                     ` Simon Kirby
2013-11-30 17:25                       ` Linus Torvalds
2013-11-30 21:04                         ` Simon Kirby
2013-11-30 21:08                       ` Linus Torvalds

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).