Is this a kernel bug?

public inbox for linux-kernel@vger.kernel.org

* Is this a kernel bug?
From: Cyberman Wu @ 2012-11-03  8:03 UTC (permalink / raw)
  To: linux-kernel

A few days ago we hit an exception in a kernel thread [kworker/n:m].
The exception handler then called do_group_exit() -> do_exit() ->
schedule(), and a second exception occurred in schedule():
			/*
			 * If a worker is going to sleep, notify and
			 * ask workqueue whether it wants to wake up a
			 * task to maintain concurrency.  If so, wake
			 * up the task.
			 */
			if (prev->flags & PF_WQ_WORKER) {
				struct task_struct *to_wakeup;

				to_wakeup = wq_worker_sleeping(prev, cpu);
				if (to_wakeup)
					try_to_wake_up_local(to_wakeup);
			}

The exception occurred in wq_worker_sleeping() -> kthread_data().

That is because do_exit() -> exit_mm() -> mm_release() has already
cleared vfork_done, which kthread_data() relies on to find the thread's
data:
	/* notify parent sleeping on vfork() */
	if (vfork_done) {
		tsk->vfork_done = NULL;
		complete(vfork_done);
	}

I'm using a patched version of kernel 2.6.38.8, but I've checked the
code of kernel 3.6.5, and it seems to follow the same path, only with
the files and functions split up.
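
The link between the cleared vfork_done and the fault in kthread_data()
can be illustrated in user space. The sketch below is only a model: it
assumes kthread_data() finds its struct kthread with container_of() on
tsk->vfork_done, as the 2.6.38-era kernel/kthread.c appears to do, and it
uses a mock struct whose field order is copied from that file. The names
kthread_mock, completion_mock and kthread_data_addr are hypothetical.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Stand-in for struct completion; only its position in the struct matters. */
struct completion_mock {
	unsigned int done;
};

/* Mock of the 2.6.38-era struct kthread (field order is an assumption). */
struct kthread_mock {
	int should_stop;
	void *data;
	struct completion_mock exited;
};

/*
 * Address that a container_of()-based kthread_data() would read, given
 * the value of tsk->vfork_done. Integer arithmetic is used so that the
 * NULL case stays well defined in user space.
 */
uintptr_t kthread_data_addr(uintptr_t vfork_done)
{
	/* container_of(vfork_done, struct kthread, exited) */
	uintptr_t kthread_base =
		vfork_done - offsetof(struct kthread_mock, exited);

	/* &kthread->data */
	return kthread_base + offsetof(struct kthread_mock, data);
}
```

On an LP64 machine kthread_data_addr(0) evaluates to 0xfffffffffffffff8,
which is the value of r0 in the second register dump later in this
thread.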



-- 
Cyberman Wu


* Re: Is this a kernel bug?
From: Tejun Heo @ 2012-11-07 16:28 UTC (permalink / raw)
  To: Cyberman Wu; +Cc: linux-kernel, Andrew Morton

Hello, Cyberman.

On Sat, Nov 03, 2012 at 04:03:21PM +0800, Cyberman Wu wrote:
> Recent days we got a exception in kernel thread [kworker/n:m], but
> exception handler

Can you please post kernel messages for the initial exception?

Thanks.

-- 
tejun


* Re: Is this a kernel bug?
From: Cyberman Wu @ 2012-11-09  0:53 UTC (permalink / raw)
  To: Tejun Heo; +Cc: linux-kernel, Andrew Morton

There are a lot of these messages on many CPUs:


 Pid: 906, comm:         kworker/16:1, CPU: 16
 r0 : 0xfffffe00f9fbfea0 r1 : 0x0000000000000010 r2 : 0x0000000000000002
 r3 : 0xfffffff5001017e4 r4 : 0xfffffffffffffe00 r5 : 0xfffffffffe0000a4
 r6 : 0xfffffffffffffe00 r7 : 0x0000000000000002 r8 : 0x0000000000000000
 r9 : 0xfffffff5001017e0 r10: 0xfffffff5001017dc r11: 0xfffffff5001017c8
 r12: 0x0000000000000001 r13: 0xfffffe40fc690090 r14: 0x0000000000000000
 r15: 0x0000000000000000 r16: 0xfffffe40fc690088 r17: 0xfffffe00f841be80
 r18: 0xfffffe00f841be80 r19: 0xfffffff500101790 r20: 0x0000000000000001
 r21: 0xfffffe40fe710ce8 r22: 0xfffffffffe0000b5 r23: 0xfffffff5001017d8
 r24: 0xfffffe00008e3c80 r25: 0x000001f4ff820000 r26: 0xfffffe0000a40080
 r27: 0xfffffffffe00008e r28: 0x0000000000000010 r29: 0xfffffe0000a40000
 r30: 0x0000000000000000 r31: 0xfffffe00f9fbfe98 r32: 0xfffffffffffffe00
 r33: 0xfffffff5001017c8 r34: 0xfffffe00008e3c80 r35: 0xfffffe40fc6900a0
 r36: 0xfffffe40fc6900a0 r37: 0xfffffff5001017dc r38: 0xfffffe0000b5ad00
 r39: 0xfffffe0000a40000 r40: 0xfffffe0000b5ad04 r41: 0xfffffe00008e0040
 r42: 0xfffffff5001017c8 r43: 0xfffffe00009aa9a0 r44: 0xfffffe00008e3c80
 r45: 0xfffffe40fc6900b0 r46: 0xfffffff5001017d8 r47: 0xfffffe0000b5ad05
 r48: 0xfffffe00008e3c80 r49: 0xfffffe40fc6900b8 r50: 0xfffffff5001017e4
 r51: 0xfffffff5001017c0 r52: 0xfffffe00008e3c80 tp : 0x000001f4ff820000
 sp : 0xfffffe00f9fbfe78 lr : 0x0000000000000002
 pc : 0xfffffff7002fc488 ex1: 1     faultnum: 17

Starting stack dump of tid 906, pid 906 (kworker/16:1) on cpu 16 at cycle 416925425702833
  frame 0: 0xfffffff7002fc488 worker_enter_idle+0x1c8/0x2e8 (sp 0xfffffe00f9fbfe78)
  frame 1: 0xfffffff7002750c8 worker_thread+0x4c8/0x898 (sp 0xfffffe00f9fbfea0)
  frame 2: 0xfffffff7000f0530 kthread+0xe0/0xe8 (sp 0xfffffe00f9fbff80)
  frame 3: 0xfffffff7000bab38 start_kernel_thread+0x18/0x20 (sp 0xfffffe00f9fbffe8)
Stack dump complete
Unable to handle kernel paging request
 at virtual address 0x00000000fffffff8, pc 0xfffffff700375f58

 Pid: 906, comm:         kworker/16:1, CPU: 16
 r0 : 0xfffffffffffffff8 r1 : 0x0000000000000000 r2 : 0xfffffe00f841c1b8
 r3 : 0x0000000000003459 r4 : 0x0000000000000001 r5 : 0x0000000000000000
 r6 : 0xfffffe00f9fb0028 r7 : 0x000001f4ff820000 r8 : 0xfffffe00f9fb0000
 r9 : 0x0000000000000000 r10: 0x0000000000000081 r11: 0xfffffe00f841be9c
 r12: 0xfffffff500103c68 r13: 0xfffffe00f9fbf488 r14: 0xfffffe00f9fbf4c8
 r15: 0xfffffe00f9fbf490 r16: 0xfffffe00f9fbf498 r17: 0xfffffe00f9fbf4a0
 r18: 0xfffffe00f841c5b0 r19: 0xfffffe00f9fbf4a8 r20: 0xfffffe00f841c0e8
 r21: 0xffffffff8420806c r22: 0x0000000000000020 r23: 0xfffffe0000a7b988
 r24: 0xfffffe00f841be94 r25: 0xfffffffffffffe00 r26: 0xfffffffffe0000a7
 r27: 0xfffffe00f9fbf440 r28: 0xfffffe00f9fbf438 r29: 0xfffffe00f9fbf448
 r30: 0x0000000000000010 r31: 0xfffffe00f841be80 r32: 0x00000000001a1174
 r33: 0x00000000001a1173 r34: 0xfffffe00f9fbf610 r35: 0x00000001f9fbf398
 r36: 0xfffffe401d9008c0 r37: 0xfffffe401d9008c0 r38: 0xfffffe401d9008c8
 r39: 0xfffffe0000a9c770 r40: 0xfffffe0000a9c750 r41: 0x0000000000000001
 r42: 0xfffffe401d900990 r43: 0xfffffff7003dd1b0 r44: 0xfffffe00f9fbf350
 r45: 0xfffffe0000b5865b r46: 0x0000000000000002 r47: 0xfffffe0000b58a50
 r48: 0xfffffff7003dfbe8 r49: 0xfffffe00f9fbf400 r50: 0xffffffff6c102009
 r51: 0x6639666266666538 r52: 0xfffffe00f9fbf790 tp : 0x000001f4ff820000
 sp : 0xfffffe00f9fbf430 lr : 0xfffffff700357fe8
 pc : 0xfffffff700375f58 ex1: 1     faultnum: 18

Starting stack dump of tid 906, pid 906 (kworker/16:1) on cpu 16 at cycle 416925426066163
  frame 0: 0xfffffff700375f58 kthread_data+0x18/0x20 (sp 0xfffffe00f9fbf430)
  frame 1: 0xfffffff700357fe8 wq_worker_sleeping+0x28/0xf8 (sp 0xfffffe00f9fbf430)
  frame 2: 0xfffffff700021ab8 schedule+0xd00/0x1538 (sp 0xfffffe00f9fbf448)
  frame 3: 0xfffffff70041f950 do_exit+0x510/0x658 (sp 0xfffffe00f9fbf790)
  frame 4: 0xfffffff7000ade50 do_group_exit+0xc0/0x220 (sp 0xfffffe00f9fbf840)
  frame 5: 0xfffffff7001137a0 jit_bundle_gen+0xf20/0x27d8 (sp 0xfffffe00f9fbf878)
  frame 6: 0xfffffff70034e830 do_unaligned+0xe0/0x5b0 (sp 0xfffffe00f9fbfac8)
  frame 7: 0xfffffff700139af8 handle_interrupt+0x270/0x278 (sp 0xfffffe00f9fbfc00)
  <interrupt 17 while in kernel mode>
  frame 8: 0xfffffff7002fc488 worker_enter_idle+0x1c8/0x2e8 (sp 0xfffffe00f9fbfe78)
  frame 9: 0xfffffff7002750c8 worker_thread+0x4c8/0x898 (sp 0xfffffe00f9fbfea0)
  frame 10: 0xfffffff7000f0530 kthread+0xe0/0xe8 (sp 0xfffffe00f9fbff80)
  frame 11: 0xfffffff7000bab38 start_kernel_thread+0x18/0x20 (sp 0xfffffe00f9fbffe8)
Stack dump complete
Fixing recursive fault but reboot is needed!

The first exception is platform specific and looks like a hardware error:
fffffff7002fc480:       180906cfc0128d82        { addi r2, sp, 40 ; addi r31, sp, 32 }
fffffff7002fc488:       87b886ca04218d95        { addi r21, sp, 24 ; addi r20, sp, 16 ; ld lr, r2 }
When 'ld lr, r2' executed, r2 should have been sp+40, but its value was 2.
I analyzed the execution snapshot and found:
1. r2 should have been 2 before 'addi r2, sp, 40' executed.
2. r0's value was sp+40 when the exception occurred, but following the
    execution flow of that function it should not have had that value.
So it seems that when 'addi r2, sp, 40' was supposed to execute, what
really executed was 'addi r0, sp, 40'. Maybe the instruction was loaded
with a flipped bit because of a memory error, a cache error, or a CPU
problem? I'm not sure, since it never occurred again.
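
The arithmetic behind this reading of the register dump can be checked
mechanically. The sketch below is a hypothetical helper, not part of any
kernel: it copies sp, r0 and r2 from the first dump above, confirms that
r0 equals sp+40 while r2 does not, and tests whether register numbers 2
and 0 differ in a single bit. It assumes the destination register number
is stored as a plain binary field in the instruction bundle; the real
Tile-Gx encoding was not checked here.

```c
#include <assert.h>
#include <stdint.h>

/* Values copied from the first register dump in this message. */
static const uint64_t DUMP_SP = 0xfffffe00f9fbfe78ULL;
static const uint64_t DUMP_R0 = 0xfffffe00f9fbfea0ULL;
static const uint64_t DUMP_R2 = 0x0000000000000002ULL;

/* Returns 1 if a and b differ in exactly one bit position. */
int differs_by_one_bit(unsigned int a, unsigned int b)
{
	unsigned int x = a ^ b;

	/* A nonzero value with a single bit set is a power of two. */
	return x != 0 && (x & (x - 1)) == 0;
}

/* Returns 1 if r0 holds the value 'addi r2, sp, 40' should have produced. */
int r0_holds_sp_plus_40(void)
{
	return DUMP_R0 == DUMP_SP + 40;
}

/* Returns 1 if r2 was left without the expected sp+40 value. */
int r2_missed_update(void)
{
	return DUMP_R2 != DUMP_SP + 40;
}
```

All three checks hold for the dumped values, which is consistent with
the single-bit-flip guess but does not prove it.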

What I think may be a kernel bug is the second exception. I reproduced
it by deliberately generating an exception in a kworker, and it occurred
again. Then I checked the code, and the execution flow I described in my
first mail is what causes the problem. I also checked the newest kernel,
and it seems to have the same issue.
I only tested this on the Gx platform from Tilera, but the second
exception should occur on any platform if a kworker takes an exception
that it can't recover from.



On Thu, Nov 8, 2012 at 12:28 AM, Tejun Heo <tj@kernel.org> wrote:
> Hello, Cyberman.
>
> On Sat, Nov 03, 2012 at 04:03:21PM +0800, Cyberman Wu wrote:
>> Recent days we got a exception in kernel thread [kworker/n:m], but
>> exception handler
>
> Can you please post kernel messages for the initial exception?
>
> Thanks.
>
> --
> tejun



-- 
Cyberman Wu


* Re: Is this a kernel bug?
From: Tejun Heo @ 2012-11-09  1:11 UTC (permalink / raw)
  To: Cyberman Wu; +Cc: linux-kernel, Andrew Morton

Hello,

On Fri, Nov 09, 2012 at 08:53:49AM +0800, Cyberman Wu wrote:
> A lot of these message on many CPU:

What I'm really curious about is the *first* exception.

Is the following the first one?  Some lines (why the stackdump is
happening) are missing at the top.

>  Pid: 906, comm:         kworker/16:1, CPU: 16
...
>  pc : 0xfffffff7002fc488 ex1: 1     faultnum: 17
> 
> Starting stack dump of tid 906, pid 906 (kworker/16:1) on cpu 16 at
> cycle 416925425702833
>   frame 0: 0xfffffff7002fc488 worker_enter_idle+0x1c8/0x2e8 (sp
> 0xfffffe00f9fbfe78)
>   frame 1: 0xfffffff7002750c8 worker_thread+0x4c8/0x898 (sp 0xfffffe00f9fbfea0)
>   frame 2: 0xfffffff7000f0530 kthread+0xe0/0xe8 (sp 0xfffffe00f9fbff80)
>   frame 3: 0xfffffff7000bab38 start_kernel_thread+0x18/0x20 (sp

Is it triggering one of BUG_ON() in worker_enter_idle()?  Can you map
the pc to the source line number using addr2line?

> The first exception is platform specific and should be a hardware error:
> fffffff7002fc480:       180906cfc0128d82        { addi r2, sp, 40 ;
> addi r31, sp, 32 }
> fffffff7002fc488:       87b886ca04218d95        { addi r21, sp, 24 ;
> addi r20, sp, 16 ; ld lr, r2 }
> While 'ld lr, r2' executed, r2 should be sp+40, but it value is 2.
> I've analysis the execute
> snap shot and:
> 1. r2 should be 2 before 'addi r2, sp, 40' executed.
> 2. r0's value is sp+40 when exception ocurred, but it shouldn't be
> that value following
>     executing flow in that function.
> So it seems while 'addi r2, sp 40' be executed, what it really
> executed is 'addi r0, sp, 40',
> maybe the instruction was load with a bit reverted for memory error,
> or cache error or
> problem of CPU? I'm not sure since it never occurred again.

So, the first exception wasn't a software bug?

> What I thought maybe a kernel bug is that second exception. I've
> simulated it try to
> generate a exception in kworker, and it occurred again. Then I checked
> the code and

After a fatal exception in kernel space, nothing is guaranteed to
work.  It's usually in the realm of "if it limps along, great;
otherwise, too bad", so it isn't really a bug.  There are only so many
things you can do after a program segfaults after all.  That said, it
might be a good idea to clear PF_WQ_WORKER from do_exit() so that at
least we can avoid oops from irq context after a work item messes up.
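
The effect of that idea can be modelled in user space. The sketch below
is not a kernel patch: task_mock, PF_WQ_WORKER_MOCK (with an arbitrary
value) and both functions are hypothetical stand-ins, and the model only
captures two facts from this thread, namely that schedule() takes the
workqueue path when the worker flag is set, and that mm_release() clears
vfork_done during exit.

```c
#include <assert.h>
#include <stddef.h>

/* Arbitrary bit chosen for this model; not taken from kernel headers. */
#define PF_WQ_WORKER_MOCK 0x20u

struct task_mock {
	unsigned int flags;
	void *vfork_done;
};

/*
 * Model of the hook in schedule(): the workqueue path runs only for
 * tasks flagged as workers, and it faults if vfork_done is already NULL.
 * Returns 1 if the modelled path would fault, 0 otherwise.
 */
static int schedule_would_fault(const struct task_mock *tsk)
{
	if (!(tsk->flags & PF_WQ_WORKER_MOCK))
		return 0;	/* worker path skipped entirely */
	return tsk->vfork_done == NULL;
}

/*
 * Model of the exit path followed by a final schedule(). When
 * clear_worker_flag is nonzero the worker flag is dropped first,
 * which is the change suggested above.
 */
int exit_then_schedule_faults(struct task_mock *tsk, int clear_worker_flag)
{
	if (clear_worker_flag)
		tsk->flags &= ~PF_WQ_WORKER_MOCK;
	tsk->vfork_done = NULL;	/* what mm_release() does */
	return schedule_would_fault(tsk);
}
```

In this model a worker that exits with the flag still set reaches the
faulting path, while one whose flag was cleared first does not.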

Thanks.

-- 
tejun


* Re: Is this a kernel bug?
From: Andrew Morton @ 2012-11-09  2:07 UTC (permalink / raw)
  To: Cyberman Wu; +Cc: Tejun Heo, linux-kernel

On Fri, 9 Nov 2012 08:53:49 +0800 Cyberman Wu <cypher.w@gmail.com> wrote:

> A lot of these message on many CPU:
> 
> ...
>
> Starting stack dump of tid 906, pid 906 (kworker/16:1) on cpu 16 at
> cycle 416925426066163
>   frame 0: 0xfffffff700375f58 kthread_data+0x18/0x20 (sp 0xfffffe00f9fbf430)
>   frame 1: 0xfffffff700357fe8 wq_worker_sleeping+0x28/0xf8 (sp
> 0xfffffe00f9fbf430)
>   frame 2: 0xfffffff700021ab8 schedule+0xd00/0x1538 (sp 0xfffffe00f9fbf448)
>   frame 3: 0xfffffff70041f950 do_exit+0x510/0x658 (sp 0xfffffe00f9fbf790)
>   frame 4: 0xfffffff7000ade50 do_group_exit+0xc0/0x220 (sp 0xfffffe00f9fbf840)
>   frame 5: 0xfffffff7001137a0 jit_bundle_gen+0xf20/0x27d8 (sp
> 0xfffffe00f9fbf878)

I don't recognize jit_bundle_gen.  Has this kernel been modified?

>   frame 6: 0xfffffff70034e830 do_unaligned+0xe0/0x5b0 (sp 0xfffffe00f9fbfac8)
>   frame 7: 0xfffffff700139af8 handle_interrupt+0x270/0x278 (sp
> 0xfffffe00f9fbfc00)
>   <interrupt 17 while in kernel mode>
>   frame 8: 0xfffffff7002fc488 worker_enter_idle+0x1c8/0x2e8 (sp
> 0xfffffe00f9fbfe78)
>   frame 9: 0xfffffff7002750c8 worker_thread+0x4c8/0x898 (sp 0xfffffe00f9fbfea0)
>   frame 10: 0xfffffff7000f0530 kthread+0xe0/0xe8 (sp 0xfffffe00f9fbff80)
>   frame 11: 0xfffffff7000bab38 start_kernel_thread+0x18/0x20 (sp



* Re: Is this a kernel bug?
From: Cyberman Wu @ 2012-11-12  2:42 UTC (permalink / raw)
  To: Tejun Heo; +Cc: linux-kernel, Andrew Morton

On Fri, Nov 9, 2012 at 9:11 AM, Tejun Heo <tj@kernel.org> wrote:
> Hello,
>
> On Fri, Nov 09, 2012 at 08:53:49AM +0800, Cyberman Wu wrote:
>> A lot of these message on many CPU:
>
> What I'm really curious about is the *first* exception.
>
> Is the following the first one?  Some lines (why the stackdump is
> happening) are missing at the top.

It really is the first one. The task dump happens because Gx handles
unaligned accesses in software rather than in hardware. When an
unaligned access occurs in kernel space outside of get_user() or
put_user(), the exception handler dumps this information and tries to
kill the process group that caused the exception.
The second exception occurred while the handler for the first exception
was trying to kill the kworker kernel thread.
>
>>  Pid: 906, comm:         kworker/16:1, CPU: 16
> ...
>>  pc : 0xfffffff7002fc488 ex1: 1     faultnum: 17
>>
>> Starting stack dump of tid 906, pid 906 (kworker/16:1) on cpu 16 at
>> cycle 416925425702833
>>   frame 0: 0xfffffff7002fc488 worker_enter_idle+0x1c8/0x2e8 (sp
>> 0xfffffe00f9fbfe78)
>>   frame 1: 0xfffffff7002750c8 worker_thread+0x4c8/0x898 (sp 0xfffffe00f9fbfea0)
>>   frame 2: 0xfffffff7000f0530 kthread+0xe0/0xe8 (sp 0xfffffe00f9fbff80)
>>   frame 3: 0xfffffff7000bab38 start_kernel_thread+0x18/0x20 (sp
>
> Is it triggering one of BUG_ON() in worker_enter_idle()?  Can you map
> the pc to the source line number using addr2line?

Instead of using addr2line, I disassembled the whole function and
analyzed it. The exception occurred while the function was trying to
load the return address into LR from the address pointed to by r2.
>
>> The first exception is platform specific and should be a hardware error:
>> fffffff7002fc480:       180906cfc0128d82        { addi r2, sp, 40 ;
>> addi r31, sp, 32 }
>> fffffff7002fc488:       87b886ca04218d95        { addi r21, sp, 24 ;
>> addi r20, sp, 16 ; ld lr, r2 }
>> While 'ld lr, r2' executed, r2 should be sp+40, but it value is 2.
>> I've analysis the execute
>> snap shot and:
>> 1. r2 should be 2 before 'addi r2, sp, 40' executed.
>> 2. r0's value is sp+40 when exception ocurred, but it shouldn't be
>> that value following
>>     executing flow in that function.
>> So it seems while 'addi r2, sp 40' be executed, what it really
>> executed is 'addi r0, sp, 40',
>> maybe the instruction was load with a bit reverted for memory error,
>> or cache error or
>> problem of CPU? I'm not sure since it never occurred again.
>
> So, the first exception wasn't a software bug?

I don't think it's a software bug, since the execution flow shouldn't
be able to generate that register snapshot.
>
>> What I thought maybe a kernel bug is that second exception. I've
>> simulated it try to
>> generate a exception in kworker, and it occurred again. Then I checked
>> the code and
>
> After a fatal exception in kernel space, nothing is guaranteed to
> work.  It's usually in the realm of "if it limps along, great;
> otherwise, too bad", so it isn't really a bug.  There are only so many
> things you can do after a program segfaults after all.  That said, it
> might be a good idea to clear PF_WQ_WORKER from do_exit() so that at
> least we can avoid oops from irq context after a work item messes up.
>
> Thanks.
>
> --
> tejun



-- 
Cyberman Wu

