public inbox for linux-kernel@vger.kernel.org
* general protection fault, probably for non-canonical address in pick_next_task_fair()
@ 2024-02-29 15:55 Breno Leitao
  2024-02-29 22:58 ` Thomas Gleixner
                   ` (2 more replies)
  0 siblings, 3 replies; 5+ messages in thread
From: Breno Leitao @ 2024-02-29 15:55 UTC (permalink / raw)
  To: peterz, bp, tglx; +Cc: linux-kernel

I've been running some stress tests using stress-ng on a kernel with some
debug options enabled, such as KASAN and friends (see the config below).

I saw it in rc4, and the decoded instructions are a bit off (as they are here
too: search for movabs in the dmesg below and you will find something shown as
`(bad)`), so I thought it was a machine issue. But now I see it again, so I am
sharing it for awareness.

This is happening on an upstream kernel at the following commit:
d206a76d7d2726 ("Linux 6.8-rc6")

This is the excerpt that shows up before the crash:

	general protection fault, probably for non-canonical address 0xdffffc0000000014: 0000 [#1] PREEMPT SMP DEBUG_PAGEALLOC KASAN
	KASAN: null-ptr-deref in range [0x00000000000000a0-0x00000000000000a7]
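
The two numbers in the excerpt are related through the KASAN shadow mapping. A minimal sketch of that arithmetic follows, assuming the x86-64 generic KASAN constants (shadow offset 0xdffffc0000000000, one shadow byte per 8 bytes of memory) and 48-bit virtual addresses. The helper names are hypothetical and only mirror the idea, not the kernel's own code.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Assumed x86-64 generic KASAN constants. */
#define KASAN_SHADOW_OFFSET      0xdffffc0000000000ULL
#define KASAN_SHADOW_SCALE_SHIFT 3

/* Shadow byte that KASAN instrumentation reads before touching addr. */
static uint64_t mem_to_shadow(uint64_t addr)
{
	return (addr >> KASAN_SHADOW_SCALE_SHIFT) + KASAN_SHADOW_OFFSET;
}

/* First byte of the 8-byte granule that a shadow address describes. */
static uint64_t shadow_to_mem(uint64_t shadow)
{
	assert(shadow >= KASAN_SHADOW_OFFSET);
	return (shadow - KASAN_SHADOW_OFFSET) << KASAN_SHADOW_SCALE_SHIFT;
}

/* With 48-bit virtual addresses, bits 63..47 must all be equal. */
static bool is_canonical_48(uint64_t addr)
{
	uint64_t top = addr >> 47;	/* the 17 uppermost bits */

	return top == 0 || top == 0x1ffff;
}
```

Under these assumptions, an access near NULL at 0xa0 maps to shadow address 0xdffffc0000000014, which is non-canonical. That would explain why the failure shows up as a general protection fault on the shadow read rather than as a page fault, and why KASAN reports the range 0xa0-0xa7 (one 8-byte granule).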

This is the stack trace at the time of the fault:

	? __die_body (arch/x86/kernel/dumpstack.c:421)
	? die_addr (arch/x86/kernel/dumpstack.c:460)
	? exc_general_protection (arch/x86/kernel/traps.c:? arch/x86/kernel/traps.c:643)
	? asm_exc_general_protection (arch/x86/include/asm/idtentry.h:564)
	? pick_next_task_fair (kernel/sched/sched.h:1453 kernel/sched/fair.c:8435)
	? pick_next_task_fair (kernel/sched/fair.c:5463 kernel/sched/fair.c:8434)
	? update_rq_clock_task (kernel/sched/core.c:?)
	__schedule (kernel/sched/core.c:6022 kernel/sched/core.c:6545 kernel/sched/core.c:6691)
	schedule (kernel/sched/core.c:6803 kernel/sched/core.c:6817)
	syscall_exit_to_user_mode (kernel/entry/common.c:98 include/linux/entry-common.h:328 kernel/entry/common.c:201 kernel/entry/common.c:212) 
	do_syscall_64 (arch/x86/entry/common.c:102)
	? irqentry_exit_to_user_mode (kernel/entry/common.c:228)
	entry_SYSCALL_64_after_hwframe (arch/x86/entry/entry_64.S:129)

Full dmesg: https://paste.mozilla.org/RiLnt4QO#
Configs: https://paste.mozilla.org/XJ9wbdRp


* Re: general protection fault, probably for non-canonical address in pick_next_task_fair()
  2024-02-29 15:55 general protection fault, probably for non-canonical address in pick_next_task_fair() Breno Leitao
@ 2024-02-29 22:58 ` Thomas Gleixner
  2024-03-01  3:43 ` Abel Wu
  2024-03-01  3:47 ` Abel Wu
  2 siblings, 0 replies; 5+ messages in thread
From: Thomas Gleixner @ 2024-02-29 22:58 UTC (permalink / raw)
  To: Breno Leitao, peterz, bp
  Cc: linux-kernel, Ingo Molnar, Vincent Guittot, Dietmar Eggemann

On Thu, Feb 29 2024 at 07:55, Breno Leitao wrote:
> I've been running some stress tests using stress-ng on a kernel with some
> debug options enabled, such as KASAN and friends (see the config below).
>
> I saw it in rc4, and the decoded instructions are a bit off (as they are here
> too: search for movabs in the dmesg below and you will find something shown as
> `(bad)`), so I thought it was a machine issue. But now I see it again, so I am
> sharing it for awareness.

The (bad) is after the faulting instruction, but it gives a hint:

  2e:	0f 84 67 ff ff ff    	je     0xffffffffffffff9b
  34:	48 89 ef             	mov    %rbp,%rdi
  37:	e8 cf 70 76 00       	call   0x76710b
  3c:	e9                   	.byte 0xe9

That's an invalid opcode, which means that memory is corrupted.
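
The branch targets in the decode above can be recomputed by hand, which is a quick way to confirm that the bytes before the `.byte` line decode consistently. The sketch below assumes the usual x86 rule for near branches with a 32-bit displacement (target = address of the next instruction + sign-extended displacement). The function name `rel32_target` is hypothetical.

```c
#include <assert.h>
#include <stdint.h>

/*
 * Target of an x86 near branch that carries a 32-bit displacement.
 * insn_off:   offset of the first byte of the instruction
 * insn_len:   total instruction length (opcode bytes + 4 displacement bytes)
 * disp_bytes: the 4 displacement bytes, little-endian, as shown in the dump
 */
static uint64_t rel32_target(uint64_t insn_off, unsigned int insn_len,
			     const uint8_t disp_bytes[4])
{
	uint32_t raw;
	int64_t disp;

	assert(insn_len >= 5);	/* at least one opcode byte plus rel32 */

	raw = (uint32_t)disp_bytes[0] |
	      (uint32_t)disp_bytes[1] << 8 |
	      (uint32_t)disp_bytes[2] << 16 |
	      (uint32_t)disp_bytes[3] << 24;

	/* Sign-extend the 32-bit displacement without relying on casts. */
	disp = raw >= 0x80000000u ? (int64_t)raw - 0x100000000LL : (int64_t)raw;

	/* Unsigned wrap-around gives the same result objdump prints. */
	return insn_off + insn_len + (uint64_t)disp;
}
```

For the `je` at offset 0x2e (6 bytes, displacement 67 ff ff ff) this gives 0xffffffffffffff9b, and for the `call` at offset 0x37 (5 bytes, displacement cf 70 76 00) it gives 0x76710b, matching the listing.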

Thanks,

        tglx


* Re: general protection fault, probably for non-canonical address in pick_next_task_fair()
  2024-02-29 15:55 general protection fault, probably for non-canonical address in pick_next_task_fair() Breno Leitao
  2024-02-29 22:58 ` Thomas Gleixner
@ 2024-03-01  3:43 ` Abel Wu
  2024-03-01  3:47 ` Abel Wu
  2 siblings, 0 replies; 5+ messages in thread
From: Abel Wu @ 2024-03-01  3:43 UTC (permalink / raw)
  To: Breno Leitao, peterz, bp, tglx; +Cc: linux-kernel

Hi Breno, this seems to be a known issue under discussion.

https://lore.kernel.org/lkml/202401301012.2ed95df0-oliver.sang@intel.com/
https://lore.kernel.org/lkml/20240226082349.302363-1-yu.c.chen@intel.com/



* Re: general protection fault, probably for non-canonical address in pick_next_task_fair()
  2024-02-29 15:55 general protection fault, probably for non-canonical address in pick_next_task_fair() Breno Leitao
  2024-02-29 22:58 ` Thomas Gleixner
  2024-03-01  3:43 ` Abel Wu
@ 2024-03-01  3:47 ` Abel Wu
  2024-03-01  7:14   ` Chen Yu
  2 siblings, 1 reply; 5+ messages in thread
From: Abel Wu @ 2024-03-01  3:47 UTC (permalink / raw)
  To: Breno Leitao, peterz, bp, tglx; +Cc: linux-kernel, Chen Yu, Oliver Sang

(+ Chen Yu, Oliver Sang)



* Re: general protection fault, probably for non-canonical address in pick_next_task_fair()
  2024-03-01  3:47 ` Abel Wu
@ 2024-03-01  7:14   ` Chen Yu
  0 siblings, 0 replies; 5+ messages in thread
From: Chen Yu @ 2024-03-01  7:14 UTC (permalink / raw)
  To: Abel Wu; +Cc: Breno Leitao, peterz, bp, tglx, linux-kernel, Oliver Sang

On 2024-03-01 at 11:47:05 +0800, Abel Wu wrote:
> (+ Chen Yu, Oliver Sang)
> 
> On 2/29/24 11:55 PM, Breno Leitao Wrote:
> > I've been running some stress tests using stress-ng on a kernel with some
> > debug options enabled, such as KASAN and friends (see the config below).
> > 
> > I saw it in rc4, and the decoded instructions are a bit off (as they are here
> > too: search for movabs in the dmesg below and you will find something shown as
> > `(bad)`), so I thought it was a machine issue. But now I see it again, so I am
> > sharing it for awareness.
> > 
> > This is happening on an upstream kernel at the following commit:
> > d206a76d7d2726 ("Linux 6.8-rc6")
> > 
> > This is the excerpt that shows up before the crash:
> > 
> > 	general protection fault, probably for non-canonical address 0xdffffc0000000014: 0000 [#1] PREEMPT SMP DEBUG_PAGEALLOC KASAN
> > 	KASAN: null-ptr-deref in range [0x00000000000000a0-0x00000000000000a7]
> > 
> > This is the stack trace at the time of the fault:
> > 
> > 	? __die_body (arch/x86/kernel/dumpstack.c:421)
> > 	? die_addr (arch/x86/kernel/dumpstack.c:460)
> > 	? exc_general_protection (arch/x86/kernel/traps.c:? arch/x86/kernel/traps.c:643)
> > 	? asm_exc_general_protection (arch/x86/include/asm/idtentry.h:564)
> > 	? pick_next_task_fair (kernel/sched/sched.h:1453 kernel/sched/fair.c:8435)

This seems to have the same cause: pick_eevdf() returns NULL, and it panics here:
cfs_rq = group_cfs_rq(se);
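
To make the failure mode concrete, here is a toy model of that walk down the group hierarchy. It is only a sketch under stated assumptions: the structs are simplified stand-ins, the `pick` field replaces the real entity selection, the 0xa0 padding is chosen purely so that `my_q` lands on the offset reported by KASAN (the real layout depends on the config), and `pick_leaf_checked` is a hypothetical name. It is not the upstream fix.

```c
#include <assert.h>
#include <stddef.h>

struct cfs_rq;

/* Toy stand-in for the scheduler entity. */
struct sched_entity {
	char pad[0xa0];		/* assumption: places my_q at offset 0xa0 */
	struct cfs_rq *my_q;	/* runqueue owned by a group entity, NULL for a task */
};

/* Toy stand-in for a CFS runqueue. */
struct cfs_rq {
	struct sched_entity *pick;	/* what the toy "pick" returns; may be NULL */
};

/*
 * Reads se->my_q. The toy asserts on a NULL se; without such a check,
 * a NULL se means a load from address offsetof(struct sched_entity, my_q).
 */
static struct cfs_rq *group_cfs_rq(struct sched_entity *se)
{
	assert(se != NULL);
	return se->my_q;
}

/* Descend until a task entity is reached; NULL if a pick fails on the way. */
static struct sched_entity *pick_leaf_checked(struct cfs_rq *cfs_rq)
{
	struct sched_entity *se = NULL;

	while (cfs_rq) {
		se = cfs_rq->pick;
		if (!se)
			return NULL;	/* an unchecked walk would fault on the next line */
		cfs_rq = group_cfs_rq(se);
	}
	return se;
}
```

In this model, a pick that returns NULL at any level turns into a read at a small offset from NULL as soon as `group_cfs_rq()` is called on it, which is consistent with the null-ptr-deref range in the report.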

I remember lkp runs stress-ng regularly for regression testing, but it has
not detected this yet.

thanks,
Chenyu

