* kernel crash when using libnuma
@ 2012-02-06 8:57 Trevor Kramer
From: Trevor Kramer @ 2012-02-06 8:57 UTC
To: linux-numa
I have a program that can allocate memory either with numa_alloc_onnode()
from libnuma or with malloc(). In malloc mode everything works fine, but
in libnuma mode I get consistent kernel panics with the following traces.
This only happens when multiple threads are running. Has anyone seen this
before, or does anyone have recommendations on how to debug it further?
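
In case it helps, the allocation pattern is roughly the sketch below.
The thread count, allocation size, node choice and iteration count are
illustrative only, not the values from the real program:

/*
 * Minimal sketch of the allocation pattern (illustrative values only).
 * Hypothetical build line: gcc -O2 -pthread repro.c -lnuma
 */
#include <numa.h>
#include <pthread.h>
#include <stdio.h>
#include <string.h>

#define NTHREADS   8
#define ALLOC_SZ   (4UL << 20)          /* 4 MB per allocation */
#define ITERATIONS 1000

static void *worker(void *arg)
{
        int node = (int)(long)arg % (numa_max_node() + 1);
        int i;

        for (i = 0; i < ITERATIONS; i++) {
                /* node-bound allocation instead of malloc() */
                void *p = numa_alloc_onnode(ALLOC_SZ, node);

                if (!p)
                        continue;
                memset(p, 0xa5, ALLOC_SZ);      /* fault the pages in */
                numa_free(p, ALLOC_SZ);         /* ends up in munmap() */
        }
        return NULL;
}

int main(void)
{
        pthread_t tid[NTHREADS];
        long i;

        if (numa_available() < 0) {
                fprintf(stderr, "NUMA not available\n");
                return 1;
        }
        for (i = 0; i < NTHREADS; i++)
                pthread_create(&tid[i], NULL, worker, (void *)i);
        for (i = 0; i < NTHREADS; i++)
                pthread_join(tid[i], NULL);
        return 0;
}

The backtrace from the crash utility: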
crash> bt
PID: 62333 TASK: ffff883ff5698b40 CPU: 17 COMMAND: "test"
#0 [ffff883ff58378f0] machine_kexec at ffffffff810310cb
#1 [ffff883ff5837950] crash_kexec at ffffffff810b6392
#2 [ffff883ff5837a20] oops_end at ffffffff814de670
#3 [ffff883ff5837a50] die at ffffffff8100f2eb
#4 [ffff883ff5837a80] do_trap at ffffffff814ddf64
#5 [ffff883ff5837ae0] do_invalid_op at ffffffff8100ceb5
#6 [ffff883ff5837b80] invalid_op at ffffffff8100bf5b
[exception RIP: split_huge_page+2021]
RIP: ffffffff8116c605 RSP: ffff883ff5837c38 RFLAGS: 00010297
RAX: 0000000000000001 RBX: ffff880ff704bc38 RCX: 000000000000fe9e
RDX: 0000000000000000 RSI: 0000000000000046 RDI: 0000000000000246
RBP: ffff883ff5837d08 R8: 0000000000000000 R9: 0000000000000004
R10: 0000000000000001 R11: ffff880ff6fb7906 R12: ffff880ff84b7aa8
R13: fffffffffffffff2 R14: ffffea006c34c000 R15: ffffea006c34c000
ORIG_RAX: ffffffffffffffff CS: 0010 SS: 0018
#7 [ffff883ff5837c30] split_huge_page at ffffffff8116c5aa
#8 [ffff883ff5837d10] __split_huge_page_pmd at ffffffff8116c6d1
#9 [ffff883ff5837d40] unmap_vmas at ffffffff8113559e
#10 [ffff883ff5837e80] unmap_region at ffffffff8113cce1
#11 [ffff883ff5837ef0] do_munmap at ffffffff8113d3a6
#12 [ffff883ff5837f50] sys_munmap at ffffffff8113d4e6
#13 [ffff883ff5837f80] system_call_fastpath at ffffffff8100b172
RIP: 00007f12d33154d2 RSP: 00007f12884731f8 RFLAGS: 00010283
RAX: 000000000000000b RBX: ffffffff8100b172 RCX: 0000000000000020
RDX: 0000000000000000 RSI: 00000000003fe560 RDI: 00007f129f460000
RBP: 00000000003fe560 R8: 00007f1288475300 R9: 00007f1288475300
R10: 0000003d9c0eb3b0 R11: 0000000000000246 R12: 0000003d9c0f1fc0
R13: 0000003d9c0f0e00 R14: 00007f129f460000 R15: 00007f129f460000
ORIG_RAX: 000000000000000b CS: 0033 SS: 002b
The machine is running Red Hat Enterprise Linux Server 6 with kernel
2.6.32-220.4.1.el6.x86_64.
Thanks,
Trevor
* Re: kernel crash when using libnuma
@ 2012-02-06 22:23 Andi Kleen
From: Andi Kleen @ 2012-02-06 22:23 UTC
To: Trevor Kramer; +Cc: linux-numa, aarcange
On Mon, Feb 06, 2012 at 03:57:52AM -0500, Trevor Kramer wrote:
> I have a program that can allocate memory either with numa_alloc_onnode()
> from libnuma or with malloc(). In malloc mode everything works fine, but
> in libnuma mode I get consistent kernel panics with the following traces.
> This only happens when multiple threads are running. Has anyone seen this
> before, or does anyone have recommendations on how to debug it further?
Looks like a THP problem.

For RHEL issues you normally need to talk to Red Hat; these lists
are more for mainline kernels.
-Andi
--
ak@linux.intel.com -- Speaking for myself only.
* Re: kernel crash when using libnuma
@ 2012-02-06 22:43 Andrea Arcangeli
From: Andrea Arcangeli @ 2012-02-06 22:43 UTC
To: Andi Kleen; +Cc: Trevor Kramer, linux-numa
On Mon, Feb 06, 2012 at 11:23:18PM +0100, Andi Kleen wrote:
> On Mon, Feb 06, 2012 at 03:57:52AM -0500, Trevor Kramer wrote:
> > I have a program that can allocate memory either with numa_alloc_onnode()
> > from libnuma or with malloc(). In malloc mode everything works fine, but
> > in libnuma mode I get consistent kernel panics with the following traces.
> > This only happens when multiple threads are running. Has anyone seen this
> > before, or does anyone have recommendations on how to debug it further?
>
>
> Looks like a THP problem.
>
> For RHEL issues you normally need to talk to Red Hat; these lists
> are more for mainline kernels.
Well, at this point we don't know yet whether this affects mainline
too or not.

To be sure, you can file a bug at bugzilla.redhat.com and we'll fix it
ASAP, and submit the fix upstream if it happens there too.

Best of all would be if you could attach the source of the program
that triggers this to the bugzilla (or send it by email), or create a
small source testcase that reproduces it.
A BUG_ON is triggering, probably the rmap mapcount vs page->mapcount
one. I'm unsure why this is related to libnuma only; a wild guess is
that the vma policy does something wrong to the vmas, to the point
that rmap can't find the pmds (wrong vma splitting or something like
that). But thanks to the BUG_ON there is close to zero risk of data or
memory corruption; it's just an annoyance we need to fix (plus the
program, I assume, requires root to use libnuma).

The last time I saw this (more than a year ago) it was a bug in the
vma splitting. So, again guessing wildly, it could be a missing
vma_adjust_trans_huge() in mempolicy.c or some other place that splits
vmas to create a partial vma policy on an existing vma.
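
If it is that path, something along the lines of the sketch below might
tickle it from userspace. This is only a guess at a testcase, not a
known reproducer; the mapping size, the split point and the node mask
are arbitrary. The idea: fault in a THP-backed anonymous mapping, apply
a policy to only part of it so the vma gets split, then unmap it.

/*
 * Guess at a testcase for the suspected path (all constants arbitrary).
 * Hypothetical build line: gcc -O2 thp-split.c -lnuma
 */
#include <sys/mman.h>
#include <numaif.h>
#include <stdio.h>
#include <string.h>

#ifndef MADV_HUGEPAGE
#define MADV_HUGEPAGE 14                /* may be missing from older headers */
#endif

#define SZ (8UL << 20)                  /* 8 MB anonymous mapping */

int main(void)
{
        unsigned long nodemask = 1;     /* node 0 only */
        char *p;

        p = mmap(NULL, SZ, PROT_READ | PROT_WRITE,
                 MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
        if (p == MAP_FAILED) {
                perror("mmap");
                return 1;
        }
        madvise(p, SZ, MADV_HUGEPAGE);  /* hint: back it with THP */
        memset(p, 0, SZ);               /* fault in (hopefully huge) pages */

        /* policy over only part of the mapping -> the vma gets split */
        if (mbind(p + SZ / 4, SZ / 2, MPOL_BIND, &nodemask,
                  sizeof(nodemask) * 8, MPOL_MF_MOVE))
                perror("mbind");

        munmap(p, SZ);                  /* the unmap has to split huge pages */
        return 0;
}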
Thanks,
Andrea