[BUG?] kernel OOPS at kmem_cache_alloc_node() because of smp_processor

public inbox for linux-kernel@vger.kernel.org

* [BUG?] kernel OOPS at kmem_cache_alloc_node() because of smp_processor_id()
@ 2015-10-15 12:16 Gonglei (Arei)
  2015-10-15 13:48 ` Christoph Lameter
  2015-10-15 14:39 ` Christoph Lameter
  0 siblings, 2 replies; 7+ messages in thread
From: Gonglei (Arei) @ 2015-10-15 12:16 UTC (permalink / raw)
  To: linux-kernel@vger.kernel.org
  Cc: akpm@linux-foundation.org, vdavydov@parallels.com, cl@linux.com,
	rientjes@google.com, Lizefan, lqymgt@gmail.com, paulus@samba.org,
	tglx@linutronix.de, mingo@kernel.org, bp@suse.de,
	boris.ostrovsky@oracle.com

Hi,

When I start a SLES11-sp1-x86_64 virtual machine (kernel version 2.6.32.12-0.7)
configured with 16 CPUs, the guest gets stuck during boot.

I got the following information from the core dump file:

crash> log
... ...
[    0.038989] ..TIMER: vector=0x30 apic1=0 pin1=2 apic2=0 pin2=0
[    0.138151] CPU0: Intel(R) Xeon(R) CPU           E5530  @ 2.40GHz stepping 05
[    0.140001] APIC calibration not consistent with PM-Timer: 181ms instead of 100ms
[    0.140001] APIC delta adjusted to PM-Timer: 625000 (1135380)
[    0.140001] Booting Node   0, Processors  #1
[    0.016000] mce: CPU supports 0 MCE banks
[    0.584104]  #2
[    0.016000] calibrate_delay_direct() failed to get a good estimate for loops_per_jiffy.
[    0.016000] Probably due to long platform interrupts. Consider using "lpj=" boot option.
[    0.016000] mce: CPU supports 0 MCE banks
[    1.048123]  #3
[    0.016000] mce: CPU supports 0 MCE banks
[    1.400115]  #4
[    0.016000] mce: CPU supports 0 MCE banks
[    1.683052]  #5
[    0.016000] mce: CPU supports 0 MCE banks
[    2.080109]  #6
[    0.016000] mce: CPU supports 0 MCE banks
[    2.408427]  #7
[    0.016000] mce: CPU supports 0 MCE banks
[    2.812048]  Ok.
[    2.812056] Booting Node   1, Processors  #8
[    0.016000] calibrate_delay_direct() failed to get a good estimate for loops_per_jiffy.
[    0.016000] Probably due to long platform interrupts. Consider using "lpj=" boot option.
[    0.016000] mce: CPU supports 0 MCE banks
[    3.332131]  #9
[    0.016000] mce: CPU supports 0 MCE banks
[    3.464121]  #10
[    0.016000] calibrate_delay_direct() failed to get a good estimate for loops_per_jiffy.
[    0.016000] Probably due to long platform interrupts. Consider using "lpj=" boot option.
[    0.016000] mce: CPU supports 0 MCE banks
[    3.816124]  #11
[    0.016000] mce: CPU supports 0 MCE banks
[    4.092492]  #12
[    0.016000] mce: CPU supports 0 MCE banks
[    4.240126]  #13
[    4.244103] CPU13: Stuck ??
[    4.304127]  #14
[    4.313094] CPU14: Stuck ??		<-----------------
[    4.336091]  #15
[    4.345380] CPU15: Stuck ??
[    4.345499] Brought up 13 CPUs
[    4.345499] Total of 13 processors activated (45884.68 BogoMIPS).
[    0.016000] mce: CPU supports 0 MCE banks
[    0.016000] mce: CPU supports 0 MCE banks
[    0.016000] BUG: unable to handle kernel NULL pointer dereference at 000000000000000c
[    0.016000] IP: [<ffffffff810ed93f>] kmem_cache_alloc_node+0xbf/0x140
[    0.016000] PGD 0 
[    0.016000] Oops: 0000 [#1] SMP 
[    0.016000] last sysfs file: 
[    0.016000] CPU 14              <----------------- That's very strange! Because CPU14 is stuck!
[    0.016000] Modules linked in:
[    0.016000] Supported: Yes
[    0.016000] Pid: 0, comm: swapper Not tainted 2.6.32.12-0.7-default #1 HVM domU
[    0.016000] RIP: 0010:[<ffffffff810ed93f>]  [<ffffffff810ed93f>] kmem_cache_alloc_node+0xbf/0x140
[    0.016000] RSP: 0018:ffff88001f32fe18  EFLAGS: 00010046
[    0.016000] RAX: 0000000000000000 RBX: ffff88001f32fe68 RCX: 000000000000000e
[    0.016000] RDX: 0000000000000001 RSI: 0000000000000020 RDI: ffff88001fc90240
[    0.016000] RBP: 0000000000000020 R08: 0000000000000010 R09: 0000000000001fff
[    0.016000] R10: 00000000ffffffff R11: 00000000000f4240 R12: 0000000000000001
[    0.016000] R13: ffff88001fc90240 R14: 0000000000000002 R15: ffffffff81015580
[    0.016000] FS:  0000000000000000(0000) GS:ffff880020ac0000(0000) knlGS:0000000000000000
[    0.016000] CS:  0010 DS: 0018 ES: 0018 CR0: 000000008005003b
[    0.016000] CR2: 000000000000000c CR3: 0000000001804000 CR4: 00000000000006e0
[    0.016000] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
[    0.016000] DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400
[    0.016000] Process swapper (pid: 0, threadinfo ffff88001f32e000, task ffff88003e07a200)
[    0.016000] Stack:
[    0.016000]  ffff88001f32fe68 ffffffff81927400 000000000000000e ffffffff81927400
[    0.016000] <0> ffff88001f32fed8 ffffffff811dc176 ffff88003e05ce00 ffffffff81019888
[    0.016000] <0> ffffffff81927400 000000000000000e ffffffff81927400 ffff880020acfee0
[    0.016000] Call Trace:
[    0.016000]  [<ffffffff811dc176>] alloc_cpumask_var_node+0x16/0x70
[    0.016000]  [<ffffffff81019888>] native_send_call_func_ipi+0x18/0xf0
[    0.016000]  [<ffffffff810783ee>] smp_call_function_many+0x1ae/0x250
[    0.016000]  [<ffffffff810784b0>] smp_call_function+0x20/0x30
[    0.016000]  [<ffffffff8101580a>] set_mtrr+0x5a/0x140
[    0.016000]  [<ffffffff8138ef17>] smp_callin+0xf0/0x1b4
[    0.016000]  [<ffffffff8138efe9>] start_secondary+0xe/0xb5
[    0.016000] Code: 24 18 4c 8b 74 24 20 48 83 c4 28 c3 65 44 8b 24 25 c8 e2 00 00 eb 88 0f 1f 44 00 00 65 8b 04 25 98 cd 00 00 48 98 49 8b 44 c5 00 <f6> 40 0c 02 75 3b 65 8b 04 25 98 cd 00 00 48 98 49 8b 54 c5 00 
[    0.016000] RIP  [<ffffffff810ed93f>] kmem_cache_alloc_node+0xbf/0x140
[    0.016000]  RSP <ffff88001f32fe18>
[    0.016000] CR2: 000000000000000c
[    0.016000] ---[ end trace 4eaa2a86a8e2da22 ]---
[    0.016000] Kernel panic - not syncing: Attempted to kill the idle task!
[    0.016000] Pid: 0, comm: swapper Tainted: G      D      2.6.32.12-0.7-default #1
[    0.016000] Call Trace:
[    0.016000]  [<ffffffff810061dc>] dump_trace+0x6c/0x2d0
[    0.016000]  [<ffffffff81394288>] dump_stack+0x69/0x71
[    0.016000]  [<ffffffff81394308>] panic+0x78/0x199
[    0.016000]  [<ffffffff81050fff>] do_exit+0x26f/0x360
[    0.016000]  [<ffffffff813980a1>] oops_end+0xe1/0xf0
[    0.016000]  [<ffffffff8102d955>] __bad_area_nosemaphore+0x155/0x230
[    0.016000]  [<ffffffff813972ef>] page_fault+0x1f/0x30
[    0.016000]  [<ffffffff810ed93f>] kmem_cache_alloc_node+0xbf/0x140
[    0.016000]  [<ffffffff811dc176>] alloc_cpumask_var_node+0x16/0x70
[    0.016000]  [<ffffffff81019888>] native_send_call_func_ipi+0x18/0xf0
[    0.016000]  [<ffffffff810783ee>] smp_call_function_many+0x1ae/0x250
[    0.016000]  [<ffffffff810784b0>] smp_call_function+0x20/0x30
[    0.016000]  [<ffffffff8101580a>] set_mtrr+0x5a/0x140
[    0.016000]  [<ffffffff8138ef17>] smp_callin+0xf0/0x1b4
[    0.016000]  [<ffffffff8138efe9>] start_secondary+0xe/0xb5
crash>  
crash> p cache_cache
cache_cache = $7 = {
  array = {0xffff88001fe99940, 0xffff88001f102cc0, 0xffff88001f1321c0, 0xffff88001f1621c0,
    0xffff88001f1911c0, 0xffff88001f1c01c0, 0xffff88001f1f01c0, 0xffff88001f2211c0,
    0xffff88003ebaf0c0, 0xffff88003eba07c0, 0xffff88003ebaf440, 0xffff88003e00d1c0,
    0xffff88003e0341c0,
    0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0,
    0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0,
    0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0,
    0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0,
    0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0,
    0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0,
    0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0,
    0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0,
    0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0,
    0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0,
    0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0,
    0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0,
    0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0,
    0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0,
    0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0,
    0x0, 0x0, 0x0...},
  batchcount = 4, 
  limit = 8, 
  shared = 0, 
  buffer_size = 32896, 
  reciprocal_buffer_size = 130562,
[skip]
crash> struct kmem_cache ffff88001fc90240 -x
struct kmem_cache {
  array = {0xffff88001fe94200, 0xffff88001f100400, 0xffff88001f13a600, 0xffff88001f16a600,
    0xffff88001f199600, 0xffff88001f1c8600, 0xffff88001f1f8600, 0xffff88001f229600,
    0xffff88003eb9e800, 0xffff88003ebc7a00, 0xffff88003ebed600, 0xffff88003e015600,
    0xffff88003e03d600,
    0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0,
    0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0,
    0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0,
    0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0,
    0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0,
    0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0,
    0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0,
    0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0,
    0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0,
    0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0,
    0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0,
    0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0,
    0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0,
    0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0,
    0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0,
    0x0, 0x0, 0x0...},
  batchcount = 0x1b, 
  limit = 0x36, 
  shared = 0x8, 
  buffer_size = 0x200, 
  reciprocal_buffer_size = 0x800000, 
  flags = 0x80042000,
[skip]

We can get the corresponding code with the dis command:

static inline struct array_cache *cpu_cache_get(struct kmem_cache *cachep)
{
        return cachep->array[smp_processor_id()];
     724:       65 8b 04 25 00 00 00    mov    %gs:0x0,%eax
     72b:       00 
     72c:       48 98                   cltq
     72e:       41 89 f5                mov    %esi,%r13d
 * If the last page came from the reserves, and the current allocation context
 * does not have access to them, force an allocation to test the watermarks.
 */
static inline int slab_force_alloc(struct kmem_cache *cachep, gfp_t flags)
{
        if (unlikely(cpu_cache_get(cachep)->reserve) &&
     731:       48 8b 04 c7             mov    (%rdi,%rax,8),%rax
     735:       f6 40 0c 02             testb  $0x2,0xc(%rax)
     739:       0f 85 09 01 00 00       jne    848 <____cache_alloc_node+0x148>
        if (unlikely(slab_force_alloc(cachep, flags)))
                goto force_grow;


smp_processor_id() returns 14, i.e. CPU14, but CPU14 is *stuck*, so cachep->array[14] is NULL.
Why did this situation happen and cause a NULL pointer dereference? Is this a kernel bug?

Any help will be appreciated! If you need more information, please let me know. :)

Regards,
-Gonglei


^ permalink raw reply	[flat|nested] 7+ messages in thread

* Re: [BUG?] kernel OOPS at kmem_cache_alloc_node() because of smp_processor_id()
  2015-10-15 12:16 [BUG?] kernel OOPS at kmem_cache_alloc_node() because of smp_processor_id() Gonglei (Arei)
@ 2015-10-15 13:48 ` Christoph Lameter
  2015-10-15 14:26   ` Gonglei
  2015-10-15 14:39 ` Christoph Lameter
  1 sibling, 1 reply; 7+ messages in thread
From: Christoph Lameter @ 2015-10-15 13:48 UTC (permalink / raw)
  To: Gonglei (Arei)
  Cc: linux-kernel@vger.kernel.org, akpm@linux-foundation.org,
	vdavydov@parallels.com, rientjes@google.com, Lizefan,
	lqymgt@gmail.com, paulus@samba.org, tglx@linutronix.de,
	mingo@kernel.org, bp@suse.de, boris.ostrovsky@oracle.com

On Thu, 15 Oct 2015, Gonglei (Arei) wrote:

> I got the following information from the core dump file:

Please run with the parameter slub_debug on the kernel command line (grub)
to debug this. Likely an overwrite of an object after free.


* Re: [BUG?] kernel OOPS at kmem_cache_alloc_node() because of smp_processor_id()
  2015-10-15 13:48 ` Christoph Lameter
@ 2015-10-15 14:26   ` Gonglei
  0 siblings, 0 replies; 7+ messages in thread
From: Gonglei @ 2015-10-15 14:26 UTC (permalink / raw)
  To: Christoph Lameter
  Cc: linux-kernel@vger.kernel.org, akpm@linux-foundation.org,
	vdavydov@parallels.com, rientjes@google.com, Lizefan,
	lqymgt@gmail.com, paulus@samba.org, tglx@linutronix.de,
	mingo@kernel.org, bp@suse.de, boris.ostrovsky@oracle.com

On 2015/10/15 21:48, Christoph Lameter wrote:
> On Thu, 15 Oct 2015, Gonglei (Arei) wrote:
> 
>> I got the following information from the core dump file:
> 
> Please run with the parameter slub_debug on the kernel command line (grub)
> to debug this. Likely an overwrite of an object after free.
> 
Unfortunately, I can't reproduce this problem again :(

I only have the core dump file.

Regards,
-Gonglei



* Re: [BUG?] kernel OOPS at kmem_cache_alloc_node() because of smp_processor_id()
  2015-10-15 12:16 [BUG?] kernel OOPS at kmem_cache_alloc_node() because of smp_processor_id() Gonglei (Arei)
  2015-10-15 13:48 ` Christoph Lameter
@ 2015-10-15 14:39 ` Christoph Lameter
  2015-10-16  3:49   ` Gonglei
  1 sibling, 1 reply; 7+ messages in thread
From: Christoph Lameter @ 2015-10-15 14:39 UTC (permalink / raw)
  To: Gonglei (Arei)
  Cc: linux-kernel@vger.kernel.org, akpm@linux-foundation.org,
	vdavydov@parallels.com, rientjes@google.com, Lizefan,
	lqymgt@gmail.com, paulus@samba.org, tglx@linutronix.de,
	mingo@kernel.org, bp@suse.de, boris.ostrovsky@oracle.com

On Thu, 15 Oct 2015, Gonglei (Arei) wrote:

> [    0.016000] Call Trace:
> [    0.016000]  [<ffffffff810061dc>] dump_trace+0x6c/0x2d0
> [    0.016000]  [<ffffffff81394288>] dump_stack+0x69/0x71
> [    0.016000]  [<ffffffff81394308>] panic+0x78/0x199
> [    0.016000]  [<ffffffff81050fff>] do_exit+0x26f/0x360
> [    0.016000]  [<ffffffff813980a1>] oops_end+0xe1/0xf0
> [    0.016000]  [<ffffffff8102d955>] __bad_area_nosemaphore+0x155/0x230
> [    0.016000]  [<ffffffff813972ef>] page_fault+0x1f/0x30
> [    0.016000]  [<ffffffff810ed93f>] kmem_cache_alloc_node+0xbf/0x140
> [    0.016000]  [<ffffffff811dc176>] alloc_cpumask_var_node+0x16/0x70
> [    0.016000]  [<ffffffff81019888>] native_send_call_func_ipi+0x18/0xf0
> [    0.016000]  [<ffffffff810783ee>] smp_call_function_many+0x1ae/0x250
> [    0.016000]  [<ffffffff810784b0>] smp_call_function+0x20/0x30
> [    0.016000]  [<ffffffff8101580a>] set_mtrr+0x5a/0x140
> [    0.016000]  [<ffffffff8138ef17>] smp_callin+0xf0/0x1b4
> [    0.016000]  [<ffffffff8138efe9>] start_secondary+0xe/0xb5

This happened during IPI processing?

> crash> p cache_cache

Arg. This is the SLAB allocator. You cannot enable debugging without
rebuilding the kernel with CONFIG_DEBUG_SLAB.

> smp_processor_id() returns 14, i.e. CPU14, but CPU14 is *stuck*, so
> cachep->array[14] is NULL.
> Why did this situation happen and cause a NULL pointer dereference? Is this a
> kernel bug?

It's likely a bug in some obscure code in a driver that corrupted memory or
messed up the way memory was handled. set_mtrr()? What was going on at the
time? A special graphics driver being loaded? That could cause issues.



* Re: [BUG?] kernel OOPS at kmem_cache_alloc_node() because of smp_processor_id()
  2015-10-15 14:39 ` Christoph Lameter
@ 2015-10-16  3:49   ` Gonglei
  2015-10-16  8:08     ` Igor Mammedov
  0 siblings, 1 reply; 7+ messages in thread
From: Gonglei @ 2015-10-16  3:49 UTC (permalink / raw)
  To: Christoph Lameter
  Cc: linux-kernel@vger.kernel.org, akpm@linux-foundation.org,
	vdavydov@parallels.com, rientjes@google.com, Lizefan,
	lqymgt@gmail.com, paulus@samba.org, tglx@linutronix.de,
	mingo@kernel.org, bp@suse.de, boris.ostrovsky@oracle.com,
	imammedo@redhat.com

On 2015/10/15 22:39, Christoph Lameter wrote:
> On Thu, 15 Oct 2015, Gonglei (Arei) wrote:
> 
>> [    0.016000] Call Trace:
>> [    0.016000]  [<ffffffff810061dc>] dump_trace+0x6c/0x2d0
>> [    0.016000]  [<ffffffff81394288>] dump_stack+0x69/0x71
>> [    0.016000]  [<ffffffff81394308>] panic+0x78/0x199
>> [    0.016000]  [<ffffffff81050fff>] do_exit+0x26f/0x360
>> [    0.016000]  [<ffffffff813980a1>] oops_end+0xe1/0xf0
>> [    0.016000]  [<ffffffff8102d955>] __bad_area_nosemaphore+0x155/0x230
>> [    0.016000]  [<ffffffff813972ef>] page_fault+0x1f/0x30
>> [    0.016000]  [<ffffffff810ed93f>] kmem_cache_alloc_node+0xbf/0x140
>> [    0.016000]  [<ffffffff811dc176>] alloc_cpumask_var_node+0x16/0x70
>> [    0.016000]  [<ffffffff81019888>] native_send_call_func_ipi+0x18/0xf0
>> [    0.016000]  [<ffffffff810783ee>] smp_call_function_many+0x1ae/0x250
>> [    0.016000]  [<ffffffff810784b0>] smp_call_function+0x20/0x30
>> [    0.016000]  [<ffffffff8101580a>] set_mtrr+0x5a/0x140
>> [    0.016000]  [<ffffffff8138ef17>] smp_callin+0xf0/0x1b4
>> [    0.016000]  [<ffffffff8138efe9>] start_secondary+0xe/0xb5
> 
> This happened during IPI processing?
> 
>> crash> p cache_cache
> 
> Arg. This is the SLAB allocator. You cannot enable debugging without
> rebuilding the kernel with CONFIG_DEBUG_SLAB.
> 
>> smp_processor_id() returns 14, i.e. CPU14, but CPU14 is *stuck*, so
>> cachep->array[14] is NULL.
>> Why did this situation happen and cause a NULL pointer dereference? Is this a
>> kernel bug?
> 
> It's likely a bug in some obscure code in a driver that corrupted memory or
> messed up the way memory was handled. set_mtrr()? What was going on at the
> time? A special graphics driver being loaded? That could cause issues.
> 

It seems that the problem was fixed by Igor, right?
	https://lkml.org/lkml/2014/3/6/257

Cced Igor Mammedov.

Regards,
-Gonglei



* Re: [BUG?] kernel OOPS at kmem_cache_alloc_node() because of smp_processor_id()
  2015-10-16  3:49   ` Gonglei
@ 2015-10-16  8:08     ` Igor Mammedov
  2015-10-16  8:56       ` Gonglei
  0 siblings, 1 reply; 7+ messages in thread
From: Igor Mammedov @ 2015-10-16  8:08 UTC (permalink / raw)
  To: Gonglei
  Cc: Christoph Lameter, linux-kernel@vger.kernel.org,
	akpm@linux-foundation.org, vdavydov@parallels.com,
	rientjes@google.com, Lizefan, lqymgt@gmail.com, paulus@samba.org,
	tglx@linutronix.de, mingo@kernel.org, bp@suse.de,
	boris.ostrovsky@oracle.com

On Fri, 16 Oct 2015 11:49:36 +0800
Gonglei <arei.gonglei@huawei.com> wrote:

> On 2015/10/15 22:39, Christoph Lameter wrote:
> > On Thu, 15 Oct 2015, Gonglei (Arei) wrote:
> > 
> >> [    0.016000] Call Trace:
> >> [    0.016000]  [<ffffffff810061dc>] dump_trace+0x6c/0x2d0
> >> [    0.016000]  [<ffffffff81394288>] dump_stack+0x69/0x71
> >> [    0.016000]  [<ffffffff81394308>] panic+0x78/0x199
> >> [    0.016000]  [<ffffffff81050fff>] do_exit+0x26f/0x360
> >> [    0.016000]  [<ffffffff813980a1>] oops_end+0xe1/0xf0
> >> [    0.016000]  [<ffffffff8102d955>] __bad_area_nosemaphore+0x155/0x230
> >> [    0.016000]  [<ffffffff813972ef>] page_fault+0x1f/0x30
> >> [    0.016000]  [<ffffffff810ed93f>] kmem_cache_alloc_node+0xbf/0x140
> >> [    0.016000]  [<ffffffff811dc176>] alloc_cpumask_var_node+0x16/0x70
> >> [    0.016000]  [<ffffffff81019888>] native_send_call_func_ipi+0x18/0xf0
> >> [    0.016000]  [<ffffffff810783ee>] smp_call_function_many+0x1ae/0x250
> >> [    0.016000]  [<ffffffff810784b0>] smp_call_function+0x20/0x30
> >> [    0.016000]  [<ffffffff8101580a>] set_mtrr+0x5a/0x140
> >> [    0.016000]  [<ffffffff8138ef17>] smp_callin+0xf0/0x1b4
> >> [    0.016000]  [<ffffffff8138efe9>] start_secondary+0xe/0xb5
> > 
> > This happened during IPI processing?
> > 
> >> crash> p cache_cache
> > 
> > Arg. This is the SLAB allocator. You cannot enable debugging without
> > rebuilding the kernel with CONFIG_DEBUG_SLAB.
> > 
> >> smp_processor_id() returns 14, i.e. CPU14, but CPU14 is *stuck*, so
> >> cachep->array[14] is NULL.
> >> Why did this situation happen and cause a NULL pointer dereference? Is this a
> >> kernel bug?
> > 
> > It's likely a bug in some obscure code in a driver that corrupted memory or
> > messed up the way memory was handled. set_mtrr()? What was going on at the
> > time? A special graphics driver being loaded? That could cause issues.
> > 
> 
> It seems that the problem was fixed by Igor, right?
> 	https://lkml.org/lkml/2014/3/6/257
That might help.
A "stuck" CPU14 means that the master CPU has given up on the attempt
to online the AP and tried to clean it up from different maps,
*but* the AP is still running, and that may lead to unexpected
behavior.

> 
> Cced Igor Mammedov.
> 
> Regards,
> -Gonglei
> 



* Re: [BUG?] kernel OOPS at kmem_cache_alloc_node() because of smp_processor_id()
  2015-10-16  8:08     ` Igor Mammedov
@ 2015-10-16  8:56       ` Gonglei
  0 siblings, 0 replies; 7+ messages in thread
From: Gonglei @ 2015-10-16  8:56 UTC (permalink / raw)
  To: Igor Mammedov
  Cc: Christoph Lameter, linux-kernel@vger.kernel.org,
	akpm@linux-foundation.org, vdavydov@parallels.com,
	rientjes@google.com, Lizefan, lqymgt@gmail.com, paulus@samba.org,
	tglx@linutronix.de, mingo@kernel.org, bp@suse.de,
	boris.ostrovsky@oracle.com

On 2015/10/16 16:08, Igor Mammedov wrote:
> On Fri, 16 Oct 2015 11:49:36 +0800
> Gonglei <arei.gonglei@huawei.com> wrote:
> 
>> On 2015/10/15 22:39, Christoph Lameter wrote:
>>> On Thu, 15 Oct 2015, Gonglei (Arei) wrote:
>>>
>>>> [    0.016000] Call Trace:
>>>> [    0.016000]  [<ffffffff810061dc>] dump_trace+0x6c/0x2d0
>>>> [    0.016000]  [<ffffffff81394288>] dump_stack+0x69/0x71
>>>> [    0.016000]  [<ffffffff81394308>] panic+0x78/0x199
>>>> [    0.016000]  [<ffffffff81050fff>] do_exit+0x26f/0x360
>>>> [    0.016000]  [<ffffffff813980a1>] oops_end+0xe1/0xf0
>>>> [    0.016000]  [<ffffffff8102d955>] __bad_area_nosemaphore+0x155/0x230
>>>> [    0.016000]  [<ffffffff813972ef>] page_fault+0x1f/0x30
>>>> [    0.016000]  [<ffffffff810ed93f>] kmem_cache_alloc_node+0xbf/0x140
>>>> [    0.016000]  [<ffffffff811dc176>] alloc_cpumask_var_node+0x16/0x70
>>>> [    0.016000]  [<ffffffff81019888>] native_send_call_func_ipi+0x18/0xf0
>>>> [    0.016000]  [<ffffffff810783ee>] smp_call_function_many+0x1ae/0x250
>>>> [    0.016000]  [<ffffffff810784b0>] smp_call_function+0x20/0x30
>>>> [    0.016000]  [<ffffffff8101580a>] set_mtrr+0x5a/0x140
>>>> [    0.016000]  [<ffffffff8138ef17>] smp_callin+0xf0/0x1b4
>>>> [    0.016000]  [<ffffffff8138efe9>] start_secondary+0xe/0xb5
>>>
>>> This happened during IPI processing?
>>>
>>>> crash> p cache_cache
>>>
>>> Arg. This is the SLAB allocator. You cannot enable debugging without
>>> rebuilding the kernel with CONFIG_DEBUG_SLAB.
>>>
>>>> smp_processor_id() returns 14, i.e. CPU14, but CPU14 is *stuck*, so
>>>> cachep->array[14] is NULL.
>>>> Why did this situation happen and cause a NULL pointer dereference? Is this a
>>>> kernel bug?
>>>
>>> It's likely a bug in some obscure code in a driver that corrupted memory or
>>> messed up the way memory was handled. set_mtrr()? What was going on at the
>>> time? A special graphics driver being loaded? That could cause issues.
>>>
>>
>> It seems that the problem was fixed by Igor, right?
>> 	https://lkml.org/lkml/2014/3/6/257
> That might help.
> A "stuck" CPU14 means that the master CPU has given up on the attempt
> to online the AP and tried to clean it up from different maps,
> *but* the AP is still running, and that may lead to unexpected
> behavior.
> 
IIUC, this might be an ordering problem between BP processing and AP processing?

Regards,
-Gonglei

>>
>> Cced Igor Mammedov.
>>
>> Regards,
>> -Gonglei
>>
> 




end of thread

Thread overview: 7+ messages
2015-10-15 12:16 [BUG?] kernel OOPS at kmem_cache_alloc_node() because of smp_processor_id() Gonglei (Arei)
2015-10-15 13:48 ` Christoph Lameter
2015-10-15 14:26   ` Gonglei
2015-10-15 14:39 ` Christoph Lameter
2015-10-16  3:49   ` Gonglei
2015-10-16  8:08     ` Igor Mammedov
2015-10-16  8:56       ` Gonglei
