[PATCH v3 1/2] x86/fpu: Fix NULL dereference in avx512

linux-kernel.vger.kernel.org archive mirror
 help / color / mirror / Atom feed

* [PATCH v3 1/2] x86/fpu: Fix NULL dereference in avx512_status()
@ 2025-07-24  1:34 Sohil Mehta
  2025-07-24  1:34 ` [PATCH v3 2/2] x86/fpu: Update the debug flow for x86_task_fpu() Sohil Mehta
  0 siblings, 1 reply; 9+ messages in thread
From: Sohil Mehta @ 2025-07-24  1:34 UTC (permalink / raw)
  To: Dave Hansen, x86
  Cc: Borislav Petkov, H . Peter Anvin, Thomas Gleixner, Ingo Molnar,
	Dave Hansen, Sohil Mehta, Sean Christopherson, Peter Zijlstra,
	Vignesh Balasubramanian, Rick Edgecombe, Oleg Nesterov,
	Chang S . Bae, Brian Gerst, Eric Biggers, Kees Cook, Chao Gao,
	Fushuai Wang, linux-kernel, stable

From: Fushuai Wang <wangfushuai@baidu.com>

Problem
-------
With CONFIG_X86_DEBUG_FPU enabled, reading /proc/[kthread]/arch_status
causes a kernel NULL pointer dereference.

Kernel threads aren't expected to access the FPU state directly. Kernel
usage of FPU registers is contained within kernel_fpu_begin()/_end()
sections.

However, to report AVX-512 usage, the avx512_timestamp variable within
struct fpu needs to be accessed, which triggers a warning in
x86_task_fpu().

For Kthreads:
  proc_pid_arch_status()
    avx512_status()
      x86_task_fpu() => Warning and returns NULL
      x86_task_fpu()->avx512_timestamp => NULL dereference

The warning is a false alarm in this case, since the access isn't
intended for modifying the FPU state. All kernel threads (except the
init_task) have a "struct fpu" with an accessible avx512_timestamp
variable. The init_task (PID 0) never follows this path since it is not
exposed in /proc.

Solution
--------
One option is to get rid of the warning in x86_task_fpu() for kernel
threads. However, that warning was recently added and might be useful to
catch any potential misuse of the FPU state in kernel threads.

Another option is to avoid the access altogether. The kernel does not
track AVX-512 usage for kernel threads.
save_fpregs_to_fpstate()->update_avx_timestamp() is never invoked for
kernel threads, so avx512_timestamp is always guaranteed to be 0.

Also, the legacy behavior of reporting "AVX512_elapsed_ms: -1", which
signifies "no AVX-512 usage", is misleading. The kernel usage just isn't
tracked.

Update the ABI for kernel threads and do not report AVX-512 usage for
them. Not having a value in the file avoids the NULL dereference as well
as the misleading report.

Suggested-by: Dave Hansen <dave.hansen@intel.com>
Fixes: 22aafe3bcb67 ("x86/fpu: Remove init_task FPU state dependencies, add debugging warning for PF_KTHREAD tasks")
Cc: stable@vger.kernel.org
Signed-off-by: Fushuai Wang <wangfushuai@baidu.com>
Co-developed-by: Sohil Mehta <sohil.mehta@intel.com>
Signed-off-by: Sohil Mehta <sohil.mehta@intel.com>
---
v3:
 - Do not report anything for kernel threads. (DaveH)
 - Make the commit message more precise.

v2: https://lore.kernel.org/lkml/20250721215302.3562784-1-sohil.mehta@intel.com/
 - Avoid making the fix dependent on CONFIG_X86_DEBUG_FPU.
 - Include PF_USER_WORKER in the kernel thread check.
 - Update commit message for clarity.
---
 arch/x86/kernel/fpu/xstate.c | 19 ++++++++++---------
 1 file changed, 10 insertions(+), 9 deletions(-)

diff --git a/arch/x86/kernel/fpu/xstate.c b/arch/x86/kernel/fpu/xstate.c
index 9aa9ac8399ae..b90b2eec8fb8 100644
--- a/arch/x86/kernel/fpu/xstate.c
+++ b/arch/x86/kernel/fpu/xstate.c
@@ -1855,19 +1855,20 @@ long fpu_xstate_prctl(int option, unsigned long arg2)
 #ifdef CONFIG_PROC_PID_ARCH_STATUS
 /*
  * Report the amount of time elapsed in millisecond since last AVX512
- * use in the task.
+ * use in the task. Report -1 if no AVX-512 usage.
  */
 static void avx512_status(struct seq_file *m, struct task_struct *task)
 {
-	unsigned long timestamp = READ_ONCE(x86_task_fpu(task)->avx512_timestamp);
-	long delta;
+	unsigned long timestamp;
+	long delta = -1;
 
-	if (!timestamp) {
-		/*
-		 * Report -1 if no AVX512 usage
-		 */
-		delta = -1;
-	} else {
+	/* AVX-512 usage is not tracked for kernel threads. */
+	if (task->flags & (PF_KTHREAD | PF_USER_WORKER))
+		return;
+
+	timestamp = READ_ONCE(x86_task_fpu(task)->avx512_timestamp);
+
+	if (timestamp) {
 		delta = (long)(jiffies - timestamp);
 		/*
 		 * Cap to LONG_MAX if time difference > LONG_MAX
-- 
2.43.0


^ permalink raw reply related	[flat|nested] 9+ messages in thread

* [PATCH v3 2/2] x86/fpu: Update the debug flow for x86_task_fpu()
  2025-07-24  1:34 [PATCH v3 1/2] x86/fpu: Fix NULL dereference in avx512_status() Sohil Mehta
@ 2025-07-24  1:34 ` Sohil Mehta
  2025-08-08  3:26   ` Lai, Yi
  0 siblings, 1 reply; 9+ messages in thread
From: Sohil Mehta @ 2025-07-24  1:34 UTC (permalink / raw)
  To: Dave Hansen, x86
  Cc: Borislav Petkov, H . Peter Anvin, Thomas Gleixner, Ingo Molnar,
	Dave Hansen, Sohil Mehta, Sean Christopherson, Peter Zijlstra,
	Vignesh Balasubramanian, Rick Edgecombe, Oleg Nesterov,
	Chang S . Bae, Brian Gerst, Eric Biggers, Kees Cook, Chao Gao,
	Fushuai Wang, linux-kernel

Kernel threads aren't expected to directly access struct fpu using
x86_task_fpu(). They typically access the FPU state using pairs of
kernel_fpu_begin()/kernel_fpu_end().

When CONFIG_X86_DEBUG_FPU is set, any usage of x86_task_fpu() by kernel
threads is flagged with a WARN_ON_ONCE(). However, along with the
warning, x86_task_fpu() returns a NULL pointer, which deviates from the
flow without the debug config.

Changing the return value could make failures harder to debug by masking
problems or introducing its own set of issues. Keep the behavior of
x86_task_fpu() consistent across debug and non-debug configurations,
except for the warning.

Also, update the warning to include PF_USER_WORKER, as these tasks are
treated as equivalent to kernel threads during FPU state management as
well as context switch.

Signed-off-by: Sohil Mehta <sohil.mehta@intel.com>
---
v3: Improve commit message
v2: New patch

This patch is less urgent than Patch 1, which fixes the real issue.
---
 arch/x86/kernel/fpu/core.c | 3 +--
 1 file changed, 1 insertion(+), 2 deletions(-)

diff --git a/arch/x86/kernel/fpu/core.c b/arch/x86/kernel/fpu/core.c
index ea138583dd92..ba16dda697b1 100644
--- a/arch/x86/kernel/fpu/core.c
+++ b/arch/x86/kernel/fpu/core.c
@@ -58,8 +58,7 @@ DEFINE_PER_CPU(struct fpu *, fpu_fpregs_owner_ctx);
 #ifdef CONFIG_X86_DEBUG_FPU
 struct fpu *x86_task_fpu(struct task_struct *task)
 {
-	if (WARN_ON_ONCE(task->flags & PF_KTHREAD))
-		return NULL;
+	WARN_ON_ONCE(task->flags & (PF_KTHREAD | PF_USER_WORKER));

 	return (void *)task + sizeof(*task);
 }
-- 
2.43.0

^ permalink raw reply related	[flat|nested] 9+ messages in thread

* Re: [PATCH v3 2/2] x86/fpu: Update the debug flow for x86_task_fpu()
  2025-07-24  1:34 ` [PATCH v3 2/2] x86/fpu: Update the debug flow for x86_task_fpu() Sohil Mehta
@ 2025-08-08  3:26   ` Lai, Yi
  2025-08-08  7:49     ` Oleg Nesterov
  0 siblings, 1 reply; 9+ messages in thread
From: Lai, Yi @ 2025-08-08  3:26 UTC (permalink / raw)
  To: Sohil Mehta
  Cc: Dave Hansen, x86, Borislav Petkov, H . Peter Anvin,
	Thomas Gleixner, Ingo Molnar, Dave Hansen, Sean Christopherson,
	Peter Zijlstra, Vignesh Balasubramanian, Rick Edgecombe,
	Oleg Nesterov, Chang S . Bae, Brian Gerst, Eric Biggers,
	Kees Cook, Chao Gao, Fushuai Wang, linux-kernel, yi1.lai

Resent the report to LKML.

Hi Sohil,

I used Syzkaller and found that there is WARNING in x86_task_fpu in v6.16 kernel.

After bisection and the first bad commit is this patch - x86/fpu: Update the debug flow for x86_task_fpu()

All detailed into can be found at:
https://github.com/laifryiee/syzkaller_logs/tree/main/250806_033745_x86_task_fpu
Syzkaller repro code:
https://github.com/laifryiee/syzkaller_logs/tree/main/250806_033745_x86_task_fpu/repro.c
Syzkaller repro syscall steps:
https://github.com/laifryiee/syzkaller_logs/tree/main/250806_033745_x86_task_fpu/repro.prog
Syzkaller report:
https://github.com/laifryiee/syzkaller_logs/tree/main/250806_033745_x86_task_fpu/repro.report
Kconfig(make olddefconfig):
https://github.com/laifryiee/syzkaller_logs/tree/main/250806_033745_x86_task_fpu/kconfig_origin
Bisect info:
https://github.com/laifryiee/syzkaller_logs/tree/main/250806_033745_x86_task_fpu/bisect_info.log
bzImage:
https://github.com/laifryiee/syzkaller_logs/raw/refs/heads/main/250806_033745_x86_task_fpu/bzImage_9efb9ebfb1faa8aa0315b86eacebee9ba825aa97
Issue dmesg:
https://github.com/laifryiee/syzkaller_logs/blob/main/250806_033745_x86_task_fpu/9efb9ebfb1faa8aa0315b86eacebee9ba825aa97_dmesg.log

"
test login: [   17.470482] repro[731]: segfault at 0 ip 000000000040157d sp 00007ffeb0762f00 error 6 in repro[157d,401000+1000] likely on CPU 0 (core 0, socket 0)
[   17.471251] Code: 8b 55 dc 8b 45 d0 41 b9 00 00 00 00 41 89 d0 b9 01 80 00 00 ba 03 00 00 00 48 89 c6 bf 00 00 00 00 e8 57 fb ff ff 48 8b 55 e8 <48> 89 02 48 8b 45 f0 8b 00 c1 e0 06 89 45 cc 8b 55 dc 8b 45 cc 41
[   17.474495] ------------[ cut here ]------------
[   17.474769] WARNING: CPU: 1 PID: 731 at arch/x86/kernel/fpu/core.c:61 x86_task_fpu+0x76/0x90
[   17.475245] Modules linked in:
[   17.475440] CPU: 1 UID: 0 PID: 731 Comm: repro Not tainted 6.16.0-9efb9ebfb1fa+ #1 PREEMPT(voluntary) 
[   17.475909] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.16.0-0-gd239552ce722-prebuilt.qemu.org 04/01/2014
[   17.476478] RIP: 0010:x86_task_fpu+0x76/0x90
[   17.476710] Code: 40 20 00 44 89 e6 e8 49 fe 54 00 45 85 e4 75 15 e8 8f 03 55 00 48 8d 83 40 20 00 00 5b 41 5c 5d c3 cc cc cc cc e8 7a 03 55 00 <0f> 0b eb e2 e8 d1 4a c1 00 eb c1 66 66 2e 0f 1f 84 00 00 00 00 00
[   17.477627] RSP: 0018:ffff88801794f408 EFLAGS: 00010293
[   17.477905] RAX: 0000000000000000 RBX: ffff88801dc1ca80 RCX: ffffffff813246a7
[   17.478274] RDX: ffff88800ed32540 RSI: ffffffff813246c6 RDI: 0000000000000005
[   17.478655] RBP: ffff88801794f418 R08: 0000000000000400 R09: ffff88806b8f8650
[   17.479024] R10: 0000000000004000 R11: 000000000000000e R12: 0000000000004000
[   17.479391] R13: ffff88801dc1ca80 R14: 0000000000000200 R15: 0000000000000200
[   17.479756] FS:  00007f49bfd9e740(0000) GS:ffff8880e3653000(0000) knlGS:0000000000000000
[   17.480165] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[   17.480464] CR2: 00007f13183cc088 CR3: 0000000011fb0000 CR4: 0000000000750ef0
[   17.480831] PKRU: 55555554
[   17.480983] Call Trace:
[   17.481120]  <TASK>
[   17.481244]  xfpregs_get+0x9c/0x1e0
[   17.481436]  ? __kvmalloc_node_noprof+0x448/0x660
[   17.481698]  ? __pfx_xfpregs_get+0x10/0x10
[   17.481917]  ? regset_get_alloc+0xf8/0x200
[   17.482150]  regset_get_alloc+0x137/0x200
[   17.482374]  elf_core_dump+0x10e3/0x3800
[   17.482613]  ? stack_depot_save_flags+0x445/0xa40
[   17.482884]  ? __pfx_regset_xregset_fpregs_active+0x10/0x10
[   17.483178]  ? __pfx_elf_core_dump+0x10/0x10
[   17.483407]  ? irqentry_exit_to_user_mode+0x125/0x1e0
[   17.483687]  ? exc_page_fault+0xee/0x1b0
[   17.483896]  ? asm_exc_page_fault+0x2b/0x30
[   17.484119]  ? __this_cpu_preempt_check+0x21/0x30
[   17.484372]  ? ___slab_alloc+0x2f0/0x12c0
[   17.484593]  ? lock_is_held_type+0xef/0x150
[   17.484814]  ? ___slab_alloc+0x2ff/0x12c0
[   17.485043]  ? lock_release+0x14f/0x2c0
[   17.485304]  do_coredump+0x370e/0x5060
[   17.485539]  ? do_coredump+0x370e/0x5060
[   17.485749]  ? stack_depot_save_flags+0x445/0xa40
[   17.486018]  ? __pfx_do_coredump+0x10/0x10
[   17.486251]  ? stack_depot_save_flags+0x445/0xa40
[   17.486519]  ? kasan_save_stack+0x40/0x60
[   17.486738]  ? kasan_save_stack+0x2c/0x60
[   17.486957]  ? kasan_save_track+0x18/0x40
[   17.487169]  ? kasan_save_free_info+0x3f/0x60
[   17.487401]  ? __kasan_slab_free+0x3d/0x60
[   17.487618]  ? kmem_cache_free+0x2ea/0x520
[   17.487836]  ? __sigqueue_free+0xcc/0x140
[   17.488053]  ? get_signal+0xbca/0x2430
[   17.488257]  ? arch_do_signal_or_restart+0x8d/0x7d0
[   17.488519]  ? irqentry_exit_to_user_mode+0x125/0x1e0
[   17.488784]  ? irqentry_exit+0x6f/0xa0
[   17.488986]  ? exc_page_fault+0xee/0x1b0
[   17.489195]  ? asm_exc_page_fault+0x2b/0x30
[   17.489424]  ? percpu_ref_put_many.constprop.0+0x80/0x1d0
[   17.489718]  ? __memcg_slab_free_hook+0xed/0x570
[   17.489966]  ? __memcg_slab_free_hook+0xed/0x570
[   17.490217]  ? __this_cpu_preempt_check+0x21/0x30
[   17.490469]  ? lock_release+0x14f/0x2c0
[   17.490707]  ? __this_cpu_preempt_check+0x21/0x30
[   17.490956]  ? _raw_spin_unlock_irq+0x2c/0x60
[   17.491194]  get_signal+0x1ae2/0x2430
[   17.491394]  ? get_signal+0x1ae2/0x2430
[   17.491610]  ? __pfx_get_signal+0x10/0x10
[   17.491824]  ? __pfx_force_sig_fault+0x10/0x10
[   17.492067]  arch_do_signal_or_restart+0x8d/0x7d0
[   17.492319]  ? trace_irq_disable+0xd1/0x110
[   17.492547]  ? __pfx_arch_do_signal_or_restart+0x10/0x10
[   17.492819]  ? trace_hardirqs_off+0x3e/0x50
[   17.493047]  ? __bad_area_nosemaphore+0x377/0x650
[   17.493304]  ? __this_cpu_preempt_check+0x21/0x30
[   17.493563]  ? irqentry_exit_to_user_mode+0xf0/0x1e0
[   17.493889]  irqentry_exit_to_user_mode+0x125/0x1e0
[   17.494220]  irqentry_exit+0x6f/0xa0
[   17.494460]  exc_page_fault+0xee/0x1b0
[   17.494741]  asm_exc_page_fault+0x2b/0x30
[   17.495005] RIP: 0033:0x40157d
[   17.495220] Code: 8b 55 dc 8b 45 d0 41 b9 00 00 00 00 41 89 d0 b9 01 80 00 00 ba 03 00 00 00 48 89 c6 bf 00 00 00 00 e8 57 fb ff ff 48 8b 55 e8 <48> 89 02 48 8b 45 f0 8b 00 c1 e0 06 89 45 cc 8b 55 dc 8b 45 cc 41
[   17.496346] RSP: 002b:00007ffeb0762f00 EFLAGS: 00010207
[   17.496682] RAX: 00007f49bfc7d000 RBX: 0000000000000000 RCX: 00007f49bfb47fe7
[   17.497128] RDX: 0000000000000000 RSI: 0000000000120040 RDI: 0000000000000000
[   17.497571] RBP: 00007ffeb0762f60 R08: 0000000000000003 R09: 0000000000000000
[   17.498013] R10: 0000000000008001 R11: 0000000000000246 R12: 00007ffeb07630f8
[   17.498460] R13: 0000000000401d31 R14: 0000000000403e08 R15: 00007f49bfdeb000
[   17.498924]  </TASK>
[   17.499080] irq event stamp: 1637
[   17.499297] hardirqs last  enabled at (1645): [<ffffffff81651e25>] __up_console_sem+0x95/0xb0
[   17.499829] hardirqs last disabled at (1652): [<ffffffff81651e0a>] __up_console_sem+0x7a/0xb0
[   17.500357] softirqs last  enabled at (1546): [<ffffffff8147fa1e>] __irq_exit_rcu+0x10e/0x170
[   17.500892] softirqs last disabled at (1541): [<ffffffff8147fa1e>] __irq_exit_rcu+0x10e/0x170
[   17.501420] ---[ end trace 0000000000000000 ]---
[   17.903446] repro[736]: segfault at 0 ip 000000000040157d sp 00007ffeb0762f00 error 6 in repro[157d,401000+1000] likely on CPU 0 (core 0, socket 0)
[   17.904173] Code: 8b 55 dc 8b 45 d0 41 b9 00 00 00 00 41 89 d0 b9 01 80 00 00 ba 03 00 00 00 48 89 c6 bf 00 00 00 00 e8 57 fb ff ff 48 8b 55 e8 <48> 89 02 48 8b 45 f0 8b 00 c1 e0 06 89 45 cc 8b 55 dc 8b 45 cc 41
[   18.214869] repro[744]: segfault at 0 ip 000000000040157d sp 00007ffeb0762f00 error 6 in repro[157d,401000+1000] likely on CPU 1 (core 1, socket 0)
[   18.215750] Code: 8b 55 dc 8b 45 d0 41 b9 00 00 00 00 41 89 d0 b9 01 80 00 00 ba 03 00 00 00 48 89 c6 bf 00 00 00 00 e8 57 fb ff ff 48 8b 55 e8 <48> 89 02 48 8b 45 f0 8b 00 c1 e0 06 89 45 cc 8b 55 dc 8b 45 cc 41
[   18.527142] repro[752]: segfault at 0 ip 000000000040157d sp 00007ffeb0762f00 error 6 in repro[157d,401000+1000] likely on CPU 1 (core 1, socket 0)
[   18.528028] Code: 8b 55 dc 8b 45 d0 41 b9 00 00 00 00 41 89 d0 b9 01 80 00 00 ba 03 00 00 00 48 89 c6 bf 00 00 00 00 e8 57 fb ff ff 48 8b 55 e8 <48> 89 02 48 8b 45 f0 8b 00 c1 e0 06 89 45 cc 8b 55 dc 8b 45 cc 41
[   18.899276] repro[761]: segfault at 0 ip 000000000040157d sp 00007ffeb0762f00 error 6 in repro[157d,401000+1000] likely on CPU 1 (core 1, socket 0)
[   18.899991] Code: 8b 55 dc 8b 45 d0 41 b9 00 00 00 00 41 89 d0 b9 01 80 00 00 ba 03 00 00 00 48 89 c6 bf 00 00 00 00 e8 57 fb ff ff 48 8b 55 e8 <48> 89 02 48 8b 45 f0 8b 00 c1 e0 06 89 45 cc 8b 55 dc 8b 45 cc 41
[   19.225731] repro[769]: segfault at 0 ip 000000000040157d sp 00007ffeb0762f00 error 6 in repro[157d,401000+1000] likely on CPU 1 (core 1, socket 0)
[   19.226868] Code: 8b 55 dc 8b 45 d0 41 b9 00 00 00 00 41 89 d0 b9 01 80 00 00 ba 03 00 00 00 48 89 c6 bf 00 00 00 00 e8 57 fb ff ff 48 8b 55 e8 <48> 89 02 48 8b 45 f0 8b 00 c1 e0 06 89 45 cc 8b 55 dc 8b 45 cc 41
"

Hope this cound be insightful to you.

Regards,
Yi Lai

---

If you don't need the following environment to reproduce the problem or if you already have one reproduced environment, please ignore the following information.

How to reproduce:
git clone https://gitlab.com/xupengfe/repro_vm_env.git
cd repro_vm_env
tar -xvf repro_vm_env.tar.gz
cd repro_vm_env; ./start3.sh  // it needs qemu-system-x86_64 and I used v7.1.0
  // start3.sh will load bzImage_2241ab53cbb5cdb08a6b2d4688feb13971058f65 v6.2-rc5 kernel
  // You could change the bzImage_xxx as you want
  // Maybe you need to remove line "-drive if=pflash,format=raw,readonly=on,file=./OVMF_CODE.fd \" for different qemu version You could use below command to log in, there is no password for root.
ssh -p 10023 root@localhost

After login vm(virtual machine) successfully, you could transfer reproduced binary to the vm by below way, and reproduce the problem in vm:
gcc -pthread -o repro repro.c
scp -P 10023 repro root@localhost:/root/

Get the bzImage for target kernel:
Please use target kconfig and copy it to kernel_src/.config make olddefconfig
make -jx bzImage           //x should equal or less than cpu num your pc has

Fill the bzImage file into above start3.sh to load the target kernel in vm.


Tips:
If you already have qemu-system-x86_64, please ignore below info.
If you want to install qemu v7.1.0 version:
git clone https://github.com/qemu/qemu.git cd qemu git checkout -f v7.1.0 mkdir build cd build yum install -y ninja-build.x86_64 yum -y install libslirp-devel.x86_64 ../configure --target-list=x86_64-softmmu --enable-kvm --enable-vnc --enable-gtk --enable-sdl --enable-usb-redir --enable-slirp make make install 

On Wed, Jul 23, 2025 at 06:34:22PM -0700, Sohil Mehta wrote:
> Kernel threads aren't expected to directly access struct fpu using
> x86_task_fpu(). They typically access the FPU state using pairs of
> kernel_fpu_begin()/kernel_fpu_end().
> 
> When CONFIG_X86_DEBUG_FPU is set, any usage of x86_task_fpu() by kernel
> threads is flagged with a WARN_ON_ONCE(). However, along with the
> warning, x86_task_fpu() returns a NULL pointer, which deviates from the
> flow without the debug config.
> 
> Changing the return value could make failures harder to debug by masking
> problems or introducing its own set of issues. Keep the behavior of
> x86_task_fpu() consistent across debug and non-debug configurations,
> except for the warning.
> 
> Also, update the warning to include PF_USER_WORKER, as these tasks are
> treated as equivalent to kernel threads during FPU state management as
> well as context switch.
> 
> Signed-off-by: Sohil Mehta <sohil.mehta@intel.com>
> ---
> v3: Improve commit message
> v2: New patch
> 
> This patch is less urgent than Patch 1, which fixes the real issue.
> ---
>  arch/x86/kernel/fpu/core.c | 3 +--
>  1 file changed, 1 insertion(+), 2 deletions(-)
> 
> diff --git a/arch/x86/kernel/fpu/core.c b/arch/x86/kernel/fpu/core.c
> index ea138583dd92..ba16dda697b1 100644
> --- a/arch/x86/kernel/fpu/core.c
> +++ b/arch/x86/kernel/fpu/core.c
> @@ -58,8 +58,7 @@ DEFINE_PER_CPU(struct fpu *, fpu_fpregs_owner_ctx);
>  #ifdef CONFIG_X86_DEBUG_FPU
>  struct fpu *x86_task_fpu(struct task_struct *task)
>  {
> -	if (WARN_ON_ONCE(task->flags & PF_KTHREAD))
> -		return NULL;
> +	WARN_ON_ONCE(task->flags & (PF_KTHREAD | PF_USER_WORKER));
>  
>  	return (void *)task + sizeof(*task);
>  }
> -- 
> 2.43.0
> 

^ permalink raw reply	[flat|nested] 9+ messages in thread

* Re: [PATCH v3 2/2] x86/fpu: Update the debug flow for x86_task_fpu()
  2025-08-08  3:26   ` Lai, Yi
@ 2025-08-08  7:49     ` Oleg Nesterov
  2025-08-08 13:59       ` Sohil Mehta
  0 siblings, 1 reply; 9+ messages in thread
From: Oleg Nesterov @ 2025-08-08  7:49 UTC (permalink / raw)
  To: Lai, Yi
  Cc: Sohil Mehta, Dave Hansen, x86, Borislav Petkov, H . Peter Anvin,
	Thomas Gleixner, Ingo Molnar, Dave Hansen, Sean Christopherson,
	Peter Zijlstra, Vignesh Balasubramanian, Rick Edgecombe,
	Chang S . Bae, Brian Gerst, Eric Biggers, Kees Cook, Chao Gao,
	Fushuai Wang, linux-kernel, yi1.lai

On 08/08, Lai, Yi wrote:
>
> [   17.474769] WARNING: CPU: 1 PID: 731 at arch/x86/kernel/fpu/core.c:61 x86_task_fpu+0x76/0x90

...

> [   17.481244]  xfpregs_get+0x9c/0x1e0

...

> [   17.485304]  do_coredump+0x370e/0x5060

Damn, I was going to check the ptrace / coredump paths but didn't have
time and then forgot :/

For now, I think we need to remove PF_USER_WORKER from x86_task_fpu().
Then, we add it back later.

It is not clear what should we do if debugger does xfpregs_set()...

Oleg.


^ permalink raw reply	[flat|nested] 9+ messages in thread

* Re: [PATCH v3 2/2] x86/fpu: Update the debug flow for x86_task_fpu()
  2025-08-08  7:49     ` Oleg Nesterov
@ 2025-08-08 13:59       ` Sohil Mehta
  2025-08-08 15:11         ` Oleg Nesterov
  0 siblings, 1 reply; 9+ messages in thread
From: Sohil Mehta @ 2025-08-08 13:59 UTC (permalink / raw)
  To: Oleg Nesterov, Lai, Yi
  Cc: Dave Hansen, x86, Borislav Petkov, H . Peter Anvin,
	Thomas Gleixner, Ingo Molnar, Dave Hansen, Sean Christopherson,
	Peter Zijlstra, Vignesh Balasubramanian, Rick Edgecombe,
	Chang S . Bae, Brian Gerst, Eric Biggers, Kees Cook, Chao Gao,
	Fushuai Wang, linux-kernel, yi1.lai

On 8/8/2025 12:49 AM, Oleg Nesterov wrote:
> On 08/08, Lai, Yi wrote:
>>
>> [   17.474769] WARNING: CPU: 1 PID: 731 at arch/x86/kernel/fpu/core.c:61 x86_task_fpu+0x76/0x90
> 
> ...
> 
>> [   17.481244]  xfpregs_get+0x9c/0x1e0
> 
> ...
> 
>> [   17.485304]  do_coredump+0x370e/0x5060
> 

The warning here is mostly benign, right?

x86_task_fpu(target) and x86_task_fpu(current) wouldn't match, causing
sync_fpstate() to return early.

However, independent of this warning, can xfpregs_get()->sync_fpstate()
be called in the context of the PF_USER_WORKER thread? Would that be
problematic?

> Damn, I was going to check the ptrace / coredump paths but didn't have
> time and then forgot :/
> 
> For now, I think we need to remove PF_USER_WORKER from x86_task_fpu().
> Then, we add it back later.
> 

Adding PF_USER_WORKER to the warning is only proposed in this patch so
no harm done yet.

But, I am also skeptical about the x86_task_fpu() warnings. We have
"struct fpu *" comparisons that seem to be getting flagged but should be
valid in principle.

> It is not clear what should we do if debugger does xfpregs_set()...
> 

Yeah, I am wondering whether treating PF_USER_WORKER threads as
equivalent to PF_KTHREAD is truly accurate in the FPU.

^ permalink raw reply	[flat|nested] 9+ messages in thread

* Re: [PATCH v3 2/2] x86/fpu: Update the debug flow for x86_task_fpu()
  2025-08-08 13:59       ` Sohil Mehta
@ 2025-08-08 15:11         ` Oleg Nesterov
  2025-08-08 18:52           ` Sohil Mehta
  0 siblings, 1 reply; 9+ messages in thread
From: Oleg Nesterov @ 2025-08-08 15:11 UTC (permalink / raw)
  To: Sohil Mehta
  Cc: Lai, Yi, Dave Hansen, x86, Borislav Petkov, H . Peter Anvin,
	Thomas Gleixner, Ingo Molnar, Dave Hansen, Sean Christopherson,
	Peter Zijlstra, Vignesh Balasubramanian, Rick Edgecombe,
	Chang S . Bae, Brian Gerst, Eric Biggers, Kees Cook, Chao Gao,
	Fushuai Wang, linux-kernel, yi1.lai

On 08/08, Sohil Mehta wrote:
>
> On 8/8/2025 12:49 AM, Oleg Nesterov wrote:
> > On 08/08, Lai, Yi wrote:
> >>
> >> [   17.474769] WARNING: CPU: 1 PID: 731 at arch/x86/kernel/fpu/core.c:61 x86_task_fpu+0x76/0x90
> >
> > ...
> >
> >> [   17.481244]  xfpregs_get+0x9c/0x1e0
> >
> > ...
> >
> >> [   17.485304]  do_coredump+0x370e/0x5060
> >
>
> The warning here is mostly benign, right?
>
> x86_task_fpu(target) and x86_task_fpu(current) wouldn't match, causing
> sync_fpstate() to return early.

Sorry, I am not sure I understand...

I only meant that the PF_USER_WORKER check in x86_task_fpu() is not
correct right now.

> However, independent of this warning, can xfpregs_get()->sync_fpstate()
> be called in the context of the PF_USER_WORKER thread?

Probably not but I need to recheck.

> Yeah, I am wondering whether treating PF_USER_WORKER threads as
> equivalent to PF_KTHREAD is truly accurate in the FPU.

I think it is more or less equivalent but needs some work. In fact
I was thinking about it a long ago, see
https://lore.kernel.org/all/20240606120038.GB22450@redhat.com/

I'll try to do something next week.

Oleg.


^ permalink raw reply	[flat|nested] 9+ messages in thread

* Re: [PATCH v3 2/2] x86/fpu: Update the debug flow for x86_task_fpu()
  2025-08-08 15:11         ` Oleg Nesterov
@ 2025-08-08 18:52           ` Sohil Mehta
  2025-08-12 12:57             ` Oleg Nesterov
  0 siblings, 1 reply; 9+ messages in thread
From: Sohil Mehta @ 2025-08-08 18:52 UTC (permalink / raw)
  To: Oleg Nesterov
  Cc: Lai, Yi, Dave Hansen, x86, Borislav Petkov, H . Peter Anvin,
	Thomas Gleixner, Ingo Molnar, Dave Hansen, Sean Christopherson,
	Peter Zijlstra, Vignesh Balasubramanian, Rick Edgecombe,
	Chang S . Bae, Brian Gerst, Eric Biggers, Kees Cook, Chao Gao,
	Fushuai Wang, linux-kernel, yi1.lai

On 8/8/2025 8:11 AM, Oleg Nesterov wrote:
>> The warning here is mostly benign, right?
>>
>> x86_task_fpu(target) and x86_task_fpu(current) wouldn't match, causing
>> sync_fpstate() to return early.
> 
> Sorry, I am not sure I understand...
> 

I was mainly trying to confirm that the check in sync_fpstate():

	if (fpu == x86_task_fpu(current))
		fpu_sync_fpstate(fpu);

would always fail in the scenario reported by Yi Lai.

> I only meant that the PF_USER_WORKER check in x86_task_fpu() is not
> correct right now.
> 

Yes, agreed.

>> However, independent of this warning, can xfpregs_get()->sync_fpstate()
>> be called in the context of the PF_USER_WORKER thread?
> 
> Probably not but I need to recheck.

IIUC, if a PF_USER_WORKER thread encounters a fault, coredump could get
triggered in its context. That could cause the above check in
sync_fpstate() to pass. Maybe I am missing something?

> 
>> Yeah, I am wondering whether treating PF_USER_WORKER threads as
>> equivalent to PF_KTHREAD is truly accurate in the FPU.
> 
> I think it is more or less equivalent but needs some work. In fact
> I was thinking about it a long ago, see
> https://lore.kernel.org/all/20240606120038.GB22450@redhat.com/
> 
> I'll try to do something next week.

Sure! If it helps, you can add this patch to the end of your series. I
am not planning to pursue it any further.

Only patch 1 in this series is needed to fix the reported arch_status issue.

^ permalink raw reply	[flat|nested] 9+ messages in thread

* Re: [PATCH v3 2/2] x86/fpu: Update the debug flow for x86_task_fpu()
  2025-08-08 18:52           ` Sohil Mehta
@ 2025-08-12 12:57             ` Oleg Nesterov
  2025-08-14  1:00               ` Sohil Mehta
  0 siblings, 1 reply; 9+ messages in thread
From: Oleg Nesterov @ 2025-08-12 12:57 UTC (permalink / raw)
  To: Sohil Mehta
  Cc: Lai, Yi, Dave Hansen, x86, Borislav Petkov, H . Peter Anvin,
	Thomas Gleixner, Ingo Molnar, Dave Hansen, Sean Christopherson,
	Peter Zijlstra, Vignesh Balasubramanian, Rick Edgecombe,
	Chang S . Bae, Brian Gerst, Eric Biggers, Kees Cook, Chao Gao,
	Fushuai Wang, linux-kernel, yi1.lai

On 08/08, Sohil Mehta wrote:
>
> On 8/8/2025 8:11 AM, Oleg Nesterov wrote:
>
> >> However, independent of this warning, can xfpregs_get()->sync_fpstate()
> >> be called in the context of the PF_USER_WORKER thread?
> >
> > Probably not but I need to recheck.
>
> IIUC, if a PF_USER_WORKER thread encounters a fault, coredump could get
> triggered in its context. That could cause the above check in
> sync_fpstate() to pass. Maybe I am missing something?

A PF_USER_WORKER can't initiate the coredump, it blocks all signals except
SIGKILL and SIGSTOP. But this doesn't really matter.

First of all, I think that in the long term kthreads and PF_USER_WORKERs
should run without "struct fpu" attached to task_struct, so x86_task_fpu()
should return NULL regardless of CONFIG_X86_DEBUG_FPU in this case. That
is why I like your patch which adds the PF_USER_WORKER check. But this
needs more work.

So. The problem is that do_coredump() paths or ptrace can abuse
PF_USER_WORKER's FPU state for no reason.

To simplify, lets only discuss REGSET64_FP for now. As for xfpregs_get(),
everything looks simple, but needs some preparatory patches. membuf_write()
and copy_xstate_to_uabi_buf() should use &init_fpstate instead of
x86_task_fpu(target)->fpstate when target->flags & PF_USER_WORKER. This
matches the reality.

But what about xfpregs_set() ? Can it simply return, say, -EPERM ?

What do you think?

Oleg.

^ permalink raw reply	[flat|nested] 9+ messages in thread

* Re: [PATCH v3 2/2] x86/fpu: Update the debug flow for x86_task_fpu()
  2025-08-12 12:57             ` Oleg Nesterov
@ 2025-08-14  1:00               ` Sohil Mehta
  0 siblings, 0 replies; 9+ messages in thread
From: Sohil Mehta @ 2025-08-14  1:00 UTC (permalink / raw)
  To: Oleg Nesterov
  Cc: Lai, Yi, Dave Hansen, x86, Borislav Petkov, H . Peter Anvin,
	Thomas Gleixner, Ingo Molnar, Dave Hansen, Sean Christopherson,
	Peter Zijlstra, Vignesh Balasubramanian, Rick Edgecombe,
	Chang S . Bae, Brian Gerst, Eric Biggers, Kees Cook, Chao Gao,
	Fushuai Wang, linux-kernel, yi1.lai, Jens Axboe

On 8/12/2025 5:57 AM, Oleg Nesterov wrote:

> A PF_USER_WORKER can't initiate the coredump, it blocks all signals except
> SIGKILL and SIGSTOP. But this doesn't really matter.
> 

Thanks for checking.

> First of all, I think that in the long term kthreads and PF_USER_WORKERs
> should run without "struct fpu" attached to task_struct, so x86_task_fpu()
> should return NULL regardless of CONFIG_X86_DEBUG_FPU in this case. 

Having x86_task_fpu() to be consistent would definitely reduce the
number of corner cases we have to deal with :)

"struct fpu" seems to hold some metadata such as avx512_timestamp as
well as some permissions. Hopefully we never need any of that for
PF_USER_WORKERs.

Is the eventual goal of running without "struct fpu" to save memory or
having a cleaner implementation?

> So. The problem is that do_coredump() paths or ptrace can abuse
> PF_USER_WORKER's FPU state for no reason.
> 

Agreed, this requires immediate solving.

> To simplify, lets only discuss REGSET64_FP for now. As for xfpregs_get(),
> everything looks simple, but needs some preparatory patches. membuf_write()
> and copy_xstate_to_uabi_buf() should use &init_fpstate instead of
> x86_task_fpu(target)->fpstate when target->flags & PF_USER_WORKER. This
> matches the reality.
> 

Yeah, using init_fpstate seems reasonable. I don't know what userspace
would do with that information though. But, not returning anything would
likely break some debugging flows.

> But what about xfpregs_set() ? Can it simply return, say, -EPERM ?
> 

This would modify legacy behavior, but I don't think changing the
register values would have any effect either, right? Another option is
to silently ignore the set(). Maybe we can go with your approach unless
someone complains.

> What do you think?
> 

My knowledge about the FPU handling is fairly limited. So I don't have a
strong opinion on this. Just curiosity and questions! :)

Sohil

^ permalink raw reply	[flat|nested] 9+ messages in thread

end of thread, other threads:[~2025-08-14  1:00 UTC | newest]

Thread overview: 9+ messages (download: mbox.gz follow: Atom feed
-- links below jump to the message on this page --
2025-07-24  1:34 [PATCH v3 1/2] x86/fpu: Fix NULL dereference in avx512_status() Sohil Mehta
2025-07-24  1:34 ` [PATCH v3 2/2] x86/fpu: Update the debug flow for x86_task_fpu() Sohil Mehta
2025-08-08  3:26   ` Lai, Yi
2025-08-08  7:49     ` Oleg Nesterov
2025-08-08 13:59       ` Sohil Mehta
2025-08-08 15:11         ` Oleg Nesterov
2025-08-08 18:52           ` Sohil Mehta
2025-08-12 12:57             ` Oleg Nesterov
2025-08-14  1:00               ` Sohil Mehta

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).