public inbox for linux-kernel@vger.kernel.org

* 6.12-rc1: general protection fault at pick_task_fair()
@ 2024-09-30 12:52 Breno Leitao
  2024-09-30 14:48 ` Peter Zijlstra
  0 siblings, 1 reply; 3+ messages in thread
From: Breno Leitao @ 2024-09-30 12:52 UTC
  To: mingo, peterz, juri.lelli, vincent.guittot, dietmar.eggemann,
	rostedt, bsegall, mgorman, vschneid
  Cc: linux-kernel

Hello,

I've been testing v6.12-rc1 and hit some crashes that I would like to
share, since I haven't seen anything about them on the mailing list yet.

This kernel was compiled with some debug options, against 11a299a7933e
("Merge tag 'for-6.12/block-20240925' of git://git.kernel.dk/linux").


   [146800.130180] Oops: general protection fault, probably for non-canonical address 0xdffffc000000000a: 0000 [#1] PREEMPT SMP DEBUG_PAGEALLOC KASAN
   [146800.156067] KASAN: null-ptr-deref in range [0x0000000000000050-0x0000000000000057]
   [146800.200109] Tainted: [S]=CPU_OUT_OF_SPEC, [W]=WARN, [E]=UNSIGNED_MODULE, [L]=SOFTLOCKUP, [N]=TEST
   [146800.218119] Hardware name: Quanta Delta Lake MP 29F0EMA00E0/Delta Lake-Class1, BIOS F0E_3A19 04/27/2023
   [146800.237177] Workqueue:  0x0 (events)
   [146800.244615] RIP: 0010:pick_task_fair (kernel/sched/fair.c:5626 kernel/sched/fair.c:8856) 
   [146800.253955] Code: 74 08 48 89 df e8 3d 78 01 00 e9 29 01 00 00 0f 1f 44 00 00 48 89 df e8 5b ef 01 00 49 89 c6 48 8d 68 51 48 89 eb 48 c1 eb 03 <42> 0f b6 04 3b 84 c0 0f 85 98 00 00 00 80 7d 00 00 0f 84 3a 03 00
   All code
   ========
      0:	74 08                	je     0xa
      2:	48 89 df             	mov    %rbx,%rdi
      5:	e8 3d 78 01 00       	call   0x17847
      a:	e9 29 01 00 00       	jmp    0x138
      f:	0f 1f 44 00 00       	nopl   0x0(%rax,%rax,1)
     14:	48 89 df             	mov    %rbx,%rdi
     17:	e8 5b ef 01 00       	call   0x1ef77
     1c:	49 89 c6             	mov    %rax,%r14
     1f:	48 8d 68 51          	lea    0x51(%rax),%rbp
     23:	48 89 eb             	mov    %rbp,%rbx
     26:	48 c1 eb 03          	shr    $0x3,%rbx
     2a:*	42 0f b6 04 3b       	movzbl (%rbx,%r15,1),%eax		<-- trapping instruction
     2f:	84 c0                	test   %al,%al
     31:	0f 85 98 00 00 00    	jne    0xcf
     37:	80 7d 00 00          	cmpb   $0x0,0x0(%rbp)
     3b:	0f                   	.byte 0xf
     3c:	84 3a                	test   %bh,(%rdx)
     3e:	03 00                	add    (%rax),%eax
   
   Code starting with the faulting instruction
   ===========================================
      0:	42 0f b6 04 3b       	movzbl (%rbx,%r15,1),%eax
      5:	84 c0                	test   %al,%al
      7:	0f 85 98 00 00 00    	jne    0xa5
      d:	80 7d 00 00          	cmpb   $0x0,0x0(%rbp)
     11:	0f                   	.byte 0xf
     12:	84 3a                	test   %bh,(%rdx)
     14:	03 00                	add    (%rax),%eax
   [146800.291790] RSP: 0018:ffff8889ae3dfbc0 EFLAGS: 00010006
   [146800.302506] RAX: 0000000000000000 RBX: 000000000000000a RCX: dffffc0000000000
   [146800.317040] RDX: ffff8889ae3dfd30 RSI: ffff8889af5bca00 RDI: ffff888e38546380
   [146800.331573] RBP: 0000000000000051 R08: ffffffff86ddef37 R09: 1ffffffff0dbbde6
   [146800.346109] R10: dffffc0000000000 R11: fffffbfff0dbbde7 R12: 1ffff111c70a8c6a
   [146800.360642] R13: 1ffff111c70a8c71 R14: 0000000000000000 R15: dffffc0000000000
   [146800.375175] FS:  0000000000000000(0000) GS:ffff888e38500000(0000) knlGS:0000000000000000
   [146800.391619] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
   [146800.403369] CR2: 0000559372540094 CR3: 0000000017c8c001 CR4: 00000000007706f0
   [146800.417906] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
   [146800.432437] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
   [146800.446969] PKRU: 55555554
   [146800.452638] Call Trace:
   [146800.457784]  <TASK>
   [146800.462238] ? __die_body (arch/x86/kernel/dumpstack.c:421) 
   [146800.469479] ? die_addr (arch/x86/kernel/dumpstack.c:?) 
   [146800.476373] ? exc_general_protection (arch/x86/kernel/traps.c:? arch/x86/kernel/traps.c:693) 
   [146800.486078] ? asm_exc_general_protection (./arch/x86/include/asm/idtentry.h:617) 
   [146800.496111] ? pick_task_fair (kernel/sched/fair.c:5626 kernel/sched/fair.c:8856) 
   [146800.504219] ? rcu_is_watching (./include/linux/context_tracking.h:128 kernel/rcu/tree.c:737) 
   [146800.512321] ? util_est_update (./include/trace/events/sched.h:814 kernel/sched/fair.c:5054) 
   [146800.520777] pick_next_task_fair (kernel/sched/fair.c:8877) 
   [146800.529408] __schedule (kernel/sched/core.c:5956 kernel/sched/core.c:6477 kernel/sched/core.c:6629) 
   [146800.536841] ? sched_submit_work (kernel/sched/core.c:6708) 
   [146800.545472] schedule (kernel/sched/core.c:6753 kernel/sched/core.c:6767) 
   [146800.552189] worker_thread (kernel/workqueue.c:3344) 
   [146800.559983] kthread (kernel/kthread.c:390) 
   [146800.566696] ? pr_cont_work (kernel/workqueue.c:3337) 
   [146800.574623] ? kthread_blkcg (kernel/kthread.c:342) 
   [146800.582379] ret_from_fork (arch/x86/kernel/process.c:153) 
   [146800.589784] ? kthread_blkcg (kernel/kthread.c:342) 
   [146800.597537] ret_from_fork_asm (arch/x86/entry/entry_64.S:257) 
   [146800.605664]  </TASK>

It is worth noting that I also see the following warning before the
crash:

 workqueue: drain_vmap_area_work hogged CPU for >20000us 4 times, consider switching to WQ_UNBOUND
 ------------[ cut here ]------------
           !se->on_rq
           WARNING: CPU: 24 PID: 17 at kernel/sched/fair.c:704 dequeue_entity+0xd21/0x17c0

Is this helpful?

Thanks
--breno
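
The faulting address in the Oops line can be reproduced from the register
dump. The following is a minimal sketch in Python, not kernel code. It
assumes generic KASAN on x86_64, where the shadow byte for an address lives
at (addr >> 3) + KASAN_SHADOW_OFFSET, and it takes the offset value from
R15 in the dump. The helper names (mem_to_shadow, shadow_to_mem_range,
is_canonical_48bit) are hypothetical and exist only for this illustration.

```python
# Sketch: reproduce the faulting address from the register dump.
# Assumption: generic KASAN on x86_64, shadow = (addr >> 3) + offset.
# The offset is the constant visible in R15 (also RCX and R10) in the dump.

KASAN_SHADOW_OFFSET = 0xDFFFFC0000000000
KASAN_SHADOW_SCALE_SHIFT = 3
MASK64 = (1 << 64) - 1  # emulate 64-bit register wrap-around


def mem_to_shadow(addr):
    """Address of the shadow byte that KASAN checks for `addr`."""
    return ((addr >> KASAN_SHADOW_SCALE_SHIFT) + KASAN_SHADOW_OFFSET) & MASK64


def shadow_to_mem_range(shadow):
    """Inclusive range of the 8 bytes covered by one shadow byte."""
    granule = (shadow - KASAN_SHADOW_OFFSET) & MASK64
    start = (granule << KASAN_SHADOW_SCALE_SHIFT) & MASK64
    return start, start + (1 << KASAN_SHADOW_SCALE_SHIFT) - 1


def is_canonical_48bit(addr):
    """True if bits 63..47 are all equal (4-level paging)."""
    top = addr >> 47
    return top == 0 or top == 0x1FFFF


if __name__ == "__main__":
    rax = 0x0         # RAX/R14 in the dump: the call returned 0
    rbp = rax + 0x51  # lea 0x51(%rax),%rbp
    fault = mem_to_shadow(rbp)  # shr $0x3,%rbx ; movzbl (%rbx,%r15,1),%eax
    low, high = shadow_to_mem_range(fault)
    print(f"shadow address checked: {fault:#018x}")
    print(f"canonical:              {is_canonical_48bit(fault)}")
    print(f"covered range:          [{low:#x}-{high:#x}]")
```

Under these assumptions the computed shadow address is 0xdffffc000000000a
and the covered range is [0x50-0x57], matching the Oops and KASAN lines.
This suggests that the call just before the trapping instruction returned
NULL and that the code then tried to read a one-byte field at offset 0x51
from that pointer (the later "cmpb $0x0,0x0(%rbp)").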

* Re: 6.12-rc1: general protection fault at pick_task_fair()
  2024-09-30 12:52 6.12-rc1: general protection fault at pick_task_fair() Breno Leitao
@ 2024-09-30 14:48 ` Peter Zijlstra
  2024-09-30 16:56   ` Breno Leitao
  0 siblings, 1 reply; 3+ messages in thread
From: Peter Zijlstra @ 2024-09-30 14:48 UTC
  To: Breno Leitao
  Cc: mingo, juri.lelli, vincent.guittot, dietmar.eggemann, rostedt,
	bsegall, mgorman, vschneid, linux-kernel

On Mon, Sep 30, 2024 at 05:52:49AM -0700, Breno Leitao wrote:
> Hello,
> 
> I've been testing v6.12-rc1 and I got some crashes that I would like to
> share, since I haven't seen anything in the mailing list yet.
> 
> This kernel was compiled with some debug options, against 11a299a7933e
> ("Merge tag 'for-6.12/block-20240925' of git://git.kernel.dk/linux").
> 
> 
>    [146800.130180] Oops: general protection fault, probably for non-canonical address 0xdffffc000000000a: 0000 [#1] PREEMPT SMP DEBUG_PAGEALLOC KASAN
>    [146800.156067] KASAN: null-ptr-deref in range [0x0000000000000050-0x0000000000000057]
>    [146800.200109] Tainted: [S]=CPU_OUT_OF_SPEC, [W]=WARN, [E]=UNSIGNED_MODULE, [L]=SOFTLOCKUP, [N]=TEST
>    [146800.218119] Hardware name: Quanta Delta Lake MP 29F0EMA00E0/Delta Lake-Class1, BIOS F0E_3A19 04/27/2023
>    [146800.237177] Workqueue:  0x0 (events)
>    [146800.244615] RIP: 0010:pick_task_fair (kernel/sched/fair.c:5626 kernel/sched/fair.c:8856) 
>    [146800.253955] Code: 74 08 48 89 df e8 3d 78 01 00 e9 29 01 00 00 0f 1f 44 00 00 48 89 df e8 5b ef 01 00 49 89 c6 48 8d 68 51 48 89 eb 48 c1 eb 03 <42> 0f b6 04 3b 84 c0 0f 85 98 00 00 00 80 7d 00 00 0f 84 3a 03 00
>    All code
>    ========
>       0:	74 08                	je     0xa
>       2:	48 89 df             	mov    %rbx,%rdi
>       5:	e8 3d 78 01 00       	call   0x17847
>       a:	e9 29 01 00 00       	jmp    0x138
>       f:	0f 1f 44 00 00       	nopl   0x0(%rax,%rax,1)
>      14:	48 89 df             	mov    %rbx,%rdi
>      17:	e8 5b ef 01 00       	call   0x1ef77
>      1c:	49 89 c6             	mov    %rax,%r14
>      1f:	48 8d 68 51          	lea    0x51(%rax),%rbp
>      23:	48 89 eb             	mov    %rbp,%rbx
>      26:	48 c1 eb 03          	shr    $0x3,%rbx
>      2a:*	42 0f b6 04 3b       	movzbl (%rbx,%r15,1),%eax		<-- trapping instruction
>      2f:	84 c0                	test   %al,%al
>      31:	0f 85 98 00 00 00    	jne    0xcf
>      37:	80 7d 00 00          	cmpb   $0x0,0x0(%rbp)
>      3b:	0f                   	.byte 0xf
>      3c:	84 3a                	test   %bh,(%rdx)
>      3e:	03 00                	add    (%rax),%eax
>    
>    Code starting with the faulting instruction
>    ===========================================
>       0:	42 0f b6 04 3b       	movzbl (%rbx,%r15,1),%eax
>       5:	84 c0                	test   %al,%al
>       7:	0f 85 98 00 00 00    	jne    0xa5
>       d:	80 7d 00 00          	cmpb   $0x0,0x0(%rbp)
>      11:	0f                   	.byte 0xf
>      12:	84 3a                	test   %bh,(%rdx)
>      14:	03 00                	add    (%rax),%eax
>    [146800.291790] RSP: 0018:ffff8889ae3dfbc0 EFLAGS: 00010006
>    [146800.302506] RAX: 0000000000000000 RBX: 000000000000000a RCX: dffffc0000000000
>    [146800.317040] RDX: ffff8889ae3dfd30 RSI: ffff8889af5bca00 RDI: ffff888e38546380
>    [146800.331573] RBP: 0000000000000051 R08: ffffffff86ddef37 R09: 1ffffffff0dbbde6
>    [146800.346109] R10: dffffc0000000000 R11: fffffbfff0dbbde7 R12: 1ffff111c70a8c6a
>    [146800.360642] R13: 1ffff111c70a8c71 R14: 0000000000000000 R15: dffffc0000000000
>    [146800.375175] FS:  0000000000000000(0000) GS:ffff888e38500000(0000) knlGS:0000000000000000
>    [146800.391619] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
>    [146800.403369] CR2: 0000559372540094 CR3: 0000000017c8c001 CR4: 00000000007706f0
>    [146800.417906] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
>    [146800.432437] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
>    [146800.446969] PKRU: 55555554
>    [146800.452638] Call Trace:
>    [146800.457784]  <TASK>
>    [146800.462238] ? __die_body (arch/x86/kernel/dumpstack.c:421) 
>    [146800.469479] ? die_addr (arch/x86/kernel/dumpstack.c:?) 
>    [146800.476373] ? exc_general_protection (arch/x86/kernel/traps.c:? arch/x86/kernel/traps.c:693) 
>    [146800.486078] ? asm_exc_general_protection (./arch/x86/include/asm/idtentry.h:617) 
>    [146800.496111] ? pick_task_fair (kernel/sched/fair.c:5626 kernel/sched/fair.c:8856) 
>    [146800.504219] ? rcu_is_watching (./include/linux/context_tracking.h:128 kernel/rcu/tree.c:737) 
>    [146800.512321] ? util_est_update (./include/trace/events/sched.h:814 kernel/sched/fair.c:5054) 
>    [146800.520777] pick_next_task_fair (kernel/sched/fair.c:8877) 
>    [146800.529408] __schedule (kernel/sched/core.c:5956 kernel/sched/core.c:6477 kernel/sched/core.c:6629) 
>    [146800.536841] ? sched_submit_work (kernel/sched/core.c:6708) 
>    [146800.545472] schedule (kernel/sched/core.c:6753 kernel/sched/core.c:6767) 
>    [146800.552189] worker_thread (kernel/workqueue.c:3344) 
>    [146800.559983] kthread (kernel/kthread.c:390) 
>    [146800.566696] ? pr_cont_work (kernel/workqueue.c:3337) 
>    [146800.574623] ? kthread_blkcg (kernel/kthread.c:342) 
>    [146800.582379] ret_from_fork (arch/x86/kernel/process.c:153) 
>    [146800.589784] ? kthread_blkcg (kernel/kthread.c:342) 
>    [146800.597537] ret_from_fork_asm (arch/x86/entry/entry_64.S:257) 
>    [146800.605664]  </TASK>
> 
> Important to say that I am also seeing the following warning before the
> crash:
> 
>  workqueue: drain_vmap_area_work hogged CPU for >20000us 4 times, consider switching to WQ_UNBOUND
>  ------------[ cut here ]------------
>            !se->on_rq
>            WARNING: CPU: 24 PID: 17 at kernel/sched/fair.c:704 dequeue_entity+0xd21/0x17c0
> 
> Is this helpful?
> 
> Thanks
> --breno

This looks to be the same issue as reported here:

  https://lkml.kernel.org/r/20240930144157.GH5594@noisy.programming.kicks-ass.net

Is there anything you can share about your setup / workload that manages
to trigger this?

* Re: 6.12-rc1: general protection fault at pick_task_fair()
  2024-09-30 14:48 ` Peter Zijlstra
@ 2024-09-30 16:56   ` Breno Leitao
  0 siblings, 0 replies; 3+ messages in thread
From: Breno Leitao @ 2024-09-30 16:56 UTC
  To: Peter Zijlstra
  Cc: mingo, juri.lelli, vincent.guittot, dietmar.eggemann, rostedt,
	bsegall, mgorman, vschneid, linux-kernel

Hello Peter,

On Mon, Sep 30, 2024 at 04:48:09PM +0200, Peter Zijlstra wrote:
> On Mon, Sep 30, 2024 at 05:52:49AM -0700, Breno Leitao wrote:
> > Important to say that I am also seeing the following warning before the
> > crash:
> > 
> >  workqueue: drain_vmap_area_work hogged CPU for >20000us 4 times, consider switching to WQ_UNBOUND
> >  ------------[ cut here ]------------
> >            !se->on_rq
> >            WARNING: CPU: 24 PID: 17 at kernel/sched/fair.c:704 dequeue_entity+0xd21/0x17c0
> > 

> This looks to be the same issue as reported here:
> 
>   https://lkml.kernel.org/r/20240930144157.GH5594@noisy.programming.kicks-ass.net

Ok, I thought they would be different, given that the other report was
crashing at update_entity_lag() and not at pick_task_fair().

> Is there anything you can share about your setup / workload that manages
> to trigger this?

Nothing really. I have a test system where I install a debug kernel
(with some DEBUG, KASAN and LOCKDEP options enabled) and keep the host
mostly idle; I hit the crash about twice a day on average.

PS: I have the crashdump if you need me to investigate anything.
