* [Syzkaller & bisect] There is general protection fault in path_init in v6.11-rc2
@ 2024-08-10 14:04 Pengfei Xu
2025-01-27 9:18 ` Zicheng Qu
0 siblings, 1 reply; 18+ messages in thread
From: Pengfei Xu @ 2024-08-10 14:04 UTC
To: hch; +Cc: linux-kernel, linux-pm, axboe, syzkaller-bugs
Hi Christoph Hellwig,
Greetings!
There is a general protection fault in path_init in v6.11-rc2.
I bisected the issue and found it is related to:
1e8c813b083c PM: hibernate: don't use early_lookup_bdev in resume_store
All detailed info: https://github.com/xupengfe/syzkaller_logs/tree/main/240809_171408_path_init
Syzkaller repro code: https://github.com/xupengfe/syzkaller_logs/blob/main/240809_171408_path_init/repro.c
Syzkaller repro syscall steps: https://github.com/xupengfe/syzkaller_logs/blob/main/240809_171408_path_init/repro.prog
Syzkaller report: https://github.com/xupengfe/syzkaller_logs/blob/main/240809_171408_path_init/repro.report
Kconfig(make olddefconfig): https://github.com/xupengfe/syzkaller_logs/blob/main/240809_171408_path_init/kconfig_origin
Bisect info: https://github.com/xupengfe/syzkaller_logs/blob/main/240809_171408_path_init/bisect_info.log
Issue dmesg: https://github.com/xupengfe/syzkaller_logs/blob/main/240809_171408_path_init/de9c2c66ad8e787abec7c9d7eff4f8c3cdd28aed_dmesg.log
v6.11-rc2 bzImage: https://github.com/xupengfe/syzkaller_logs/raw/main/240809_171408_path_init/bzImage_de9c2c66ad8e787abec7c9d7eff4f8c3cdd28aed.tar.gz
"
[ 23.436545] cgroup: Unknown subsys name 'net'
[ 23.567369] cgroup: Unknown subsys name 'rlimit'
[ 23.737915] Process accounting resumed
[ 23.747674] cgroup: fork rejected by pids controller in /syz0
[ 23.749730] general protection fault, probably for non-canonical address 0xdffffc000000000a: 0000 [#1] PREEMPT SMP KASAN NOPTI
[ 23.750465] KASAN: null-ptr-deref in range [0x0000000000000050-0x0000000000000057]
[ 23.750937] CPU: 0 PID: 719 Comm: repro Not tainted 6.4.0-rc2-1e8c813b083c+ #1
[ 23.751395] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.16.3-0-ga6ed6b701f0a-prebuilt.qemu.org 04/01/2014
[ 23.752100] RIP: 0010:__lock_acquire+0xe83/0x5e10
[ 23.752422] Code: 00 00 3b 05 df 65 fc 08 0f 87 c8 08 00 00 41 bf 01 00 00 00 e9 84 00 00 00 48 b8 00 00 00 00 00 fc ff df 4c 89 da 48 c1 ea 0f
[ 23.753566] RSP: 0018:ff1100001446f060 EFLAGS: 00010006
[ 23.753896] RAX: dffffc0000000000 RBX: 1fe220000288de1f RCX: 0000000000000002
[ 23.754347] RDX: 000000000000000a RSI: 0000000000000000 RDI: 0000000000000001
[ 23.754794] RBP: ff1100001446f180 R08: 0000000000000001 R09: 0000000000000001
[ 23.755242] R10: fffffbfff0e70d4c R11: 0000000000000050 R12: 0000000000000001
[ 23.755694] R13: ff1100001a9f0000 R14: 0000000000000000 R15: 0000000000000002
[ 23.756134] FS: 0000000000000000(0000) GS:ff1100006c200000(0000) knlGS:0000000000000000
[ 23.756636] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[ 23.756989] CR2: 00007fec33ffca50 CR3: 000000000667e003 CR4: 0000000000771ef0
[ 23.757439] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
[ 23.757886] DR3: 0000000000000000 DR6: 00000000fffe07f0 DR7: 0000000000000400
[ 23.758338] PKRU: 55555554
[ 23.758525] Call Trace:
[ 23.758690] <TASK>
[ 23.758829] ? __kasan_check_read+0x15/0x20
[ 23.759101] ? __lock_acquire+0xc77/0x5e10
[ 23.759376] ? __pfx_mark_lock.part.0+0x10/0x10
[ 23.759688] ? __pfx___lock_acquire+0x10/0x10
[ 23.759977] ? __pfx___lock_acquire+0x10/0x10
[ 23.760268] ? lock_release+0x417/0x7e0
[ 23.760535] lock_acquire+0x1c9/0x530
[ 23.760782] ? path_init+0x8cd/0x16e0
[ 23.761034] ? __pfx_lock_acquire+0x10/0x10
[ 23.761308] ? __pfx_lock_acquire+0x10/0x10
[ 23.761591] ? seqcount_lockdep_reader_access+0x82/0xd0
[ 23.761933] ? seqcount_lockdep_reader_access+0x82/0xd0
[ 23.762272] ? path_init+0x8cd/0x16e0
[ 23.762524] ? debug_smp_processor_id+0x20/0x30
[ 23.762828] ? rcu_is_watching+0x19/0xc0
[ 23.763097] seqcount_lockdep_reader_access+0x9f/0xd0
[ 23.763423] ? path_init+0x8cd/0x16e0
[ 23.763675] path_init+0x8cd/0x16e0
[ 23.763913] ? getname_kernel+0x5c/0x380
[ 23.764174] path_lookupat+0x35/0x770
[ 23.764423] ? kasan_save_stack+0x2a/0x50
[ 23.764693] ? kasan_set_track+0x29/0x40
[ 23.764948] filename_lookup+0x1db/0x5a0
[ 23.765212] ? __pfx_filename_lookup+0x10/0x10
[ 23.765512] ? __this_cpu_preempt_check+0x21/0x30
[ 23.765821] ? lock_is_held_type+0xf0/0x150
[ 23.766104] ? kmem_cache_alloc+0x32d/0x370
[ 23.766382] ? __sanitizer_cov_trace_const_cmp4+0x1a/0x20
[ 23.766744] kern_path+0x42/0x60
[ 23.766964] lookup_bdev+0xda/0x2a0
[ 23.767203] ? __pfx_lookup_bdev+0x10/0x10
[ 23.767485] ? __kmalloc_node_track_caller+0xfb/0x180
[ 23.767812] resume_store+0x233/0x540
[ 23.768050] ? __pfx_resume_store+0x10/0x10
[ 23.768326] ? __this_cpu_preempt_check+0x21/0x30
[ 23.768641] ? lock_acquire+0x1d9/0x530
[ 23.768905] ? __this_cpu_preempt_check+0x21/0x30
[ 23.769217] ? __pfx_resume_store+0x10/0x10
[ 23.769488] kobj_attr_store+0x5b/0x90
[ 23.769741] ? __pfx_kobj_attr_store+0x10/0x10
[ 23.770031] sysfs_kf_write+0x11f/0x180
[ 23.770290] kernfs_fop_write_iter+0x411/0x630
[ 23.770584] ? __pfx_sysfs_kf_write+0x10/0x10
[ 23.770879] __kernel_write_iter+0x28c/0x7f0
[ 23.771164] ? __pfx___kernel_write_iter+0x10/0x10
[ 23.771485] ? __pfx___lock_acquire+0x10/0x10
[ 23.771785] ? __sanitizer_cov_trace_const_cmp4+0x1a/0x20
[ 23.772130] ? iov_iter_kvec+0x55/0x1f0
[ 23.772382] __kernel_write+0xe4/0x130
[ 23.772638] ? __pfx___kernel_write+0x10/0x10
[ 23.772922] ? __pfx_lock_acquire+0x10/0x10
[ 23.773209] ? __this_cpu_preempt_check+0x21/0x30
[ 23.773522] ? lock_is_held_type+0xf0/0x150
[ 23.773806] do_acct_process+0xd84/0x1580
[ 23.774075] ? __pfx_do_acct_process+0x10/0x10
[ 23.774374] ? __this_cpu_preempt_check+0x21/0x30
[ 23.774688] ? __pfx_lock_release+0x10/0x10
[ 23.774966] ? pin_kill+0x11e/0x980
[ 23.775201] acct_pin_kill+0x38/0x110
[ 23.775452] pin_kill+0x182/0x980
[ 23.775676] ? lock_acquire+0x1d9/0x530
[ 23.775935] ? __pfx_pin_kill+0x10/0x10
[ 23.776187] ? call_rcu+0x12/0x20
[ 23.776420] ? __pfx_autoremove_wake_function+0x10/0x10
[ 23.776761] ? __sanitizer_cov_trace_cmp8+0x1c/0x30
[ 23.777079] ? _find_next_bit+0x120/0x160
[ 23.777343] ? mnt_pin_kill+0x72/0x210
[ 23.777603] ? mnt_pin_kill+0x72/0x210
[ 23.777851] mnt_pin_kill+0x72/0x210
[ 23.778095] cleanup_mnt+0x343/0x400
[ 23.778335] __cleanup_mnt+0x1f/0x30
[ 23.778572] task_work_run+0x19d/0x2b0
[ 23.778823] ? __pfx_task_work_run+0x10/0x10
[ 23.779096] ? free_nsproxy+0x3b2/0x4e0
[ 23.779349] ? switch_task_namespaces+0xc8/0xe0
[ 23.779656] do_exit+0xaf5/0x2730
[ 23.779880] ? lock_release+0x417/0x7e0
[ 23.780139] ? __pfx_lock_release+0x10/0x10
[ 23.780427] ? __pfx_do_exit+0x10/0x10
[ 23.780673] ? __this_cpu_preempt_check+0x21/0x30
[ 23.780982] ? _raw_spin_unlock_irq+0x2c/0x60
[ 23.781272] ? lockdep_hardirqs_on+0x8a/0x110
[ 23.781564] ? _raw_spin_unlock_irq+0x2c/0x60
[ 23.781841] ? trace_hardirqs_on+0x26/0x120
[ 23.782120] do_group_exit+0xe5/0x2c0
[ 23.782369] __x64_sys_exit_group+0x4d/0x60
[ 23.782655] do_syscall_64+0x3c/0x90
[ 23.782899] entry_SYSCALL_64_after_hwframe+0x72/0xdc
[ 23.783225] RIP: 0033:0x7fec33f18a4d
[ 23.783460] Code: Unable to access opcode bytes at 0x7fec33f18a23.
[ 23.783838] RSP: 002b:00007fffdd22ee98 EFLAGS: 00000246 ORIG_RAX: 00000000000000e7
[ 23.784312] RAX: ffffffffffffffda RBX: 00007fec33ff69e0 RCX: 00007fec33f18a4d
[ 23.784763] RDX: 00000000000000e7 RSI: fffffffffffffeb0 RDI: 0000000000000001
[ 23.785209] RBP: 0000000000000001 R08: 0000000000000000 R09: 0000000000000020
[ 23.785663] R10: 00007fffdd22ed40 R11: 0000000000000246 R12: 00007fec33ff69e0
[ 23.786108] R13: 00007fec33ffbf00 R14: 0000000000000001 R15: 00007fec33ffbee8
[ 23.786560] </TASK>
[ 23.786707] Modules linked in:
[ 23.786910] ---[ end trace 0000000000000000 ]---
[ 23.787204] RIP: 0010:__lock_acquire+0xe83/0x5e10
[ 23.787517] Code: 00 00 3b 05 df 65 fc 08 0f 87 c8 08 00 00 41 bf 01 00 00 00 e9 84 00 00 00 48 b8 00 00 00 00 00 fc ff df 4c 89 da 48 c1 ea 0f
[ 23.788664] RSP: 0018:ff1100001446f060 EFLAGS: 00010006
[ 23.788992] RAX: dffffc0000000000 RBX: 1fe220000288de1f RCX: 0000000000000002
[ 23.789444] RDX: 000000000000000a RSI: 0000000000000000 RDI: 0000000000000001
[ 23.789890] RBP: ff1100001446f180 R08: 0000000000000001 R09: 0000000000000001
[ 23.790336] R10: fffffbfff0e70d4c R11: 0000000000000050 R12: 0000000000000001
[ 23.790788] R13: ff1100001a9f0000 R14: 0000000000000000 R15: 0000000000000002
[ 23.791233] FS: 0000000000000000(0000) GS:ff1100006c200000(0000) knlGS:0000000000000000
[ 23.791728] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[ 23.792093] CR2: 00007fec33ffca50 CR3: 000000000667e003 CR4: 0000000000771ef0
[ 23.792538] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
[ 23.792970] DR3: 0000000000000000 DR6: 00000000fffe07f0 DR7: 0000000000000400
[ 23.793423] PKRU: 55555554
[ 23.793608] note: repro[719] exited with irqs disabled
[ 23.793983] Fixing recursive fault but reboot is needed!
[ 23.794322] BUG: using smp_processor_id() in preemptible [00000000] code: repro/719
[ 23.794823] caller is debug_smp_processor_id+0x20/0x30
[ 23.795151] CPU: 0 PID: 719 Comm: repro Tainted: G D 6.4.0-rc2-1e8c813b083c+ #1
[ 23.795692] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.16.3-0-ga6ed6b701f0a-prebuilt.qemu.org 04/01/2014
[ 23.796391] Call Trace:
[ 23.796557] <TASK>
[ 23.796695] dump_stack_lvl+0xe1/0x110
[ 23.796943] dump_stack+0x19/0x20
[ 23.797164] check_preemption_disabled+0x16a/0x180
[ 23.797484] debug_smp_processor_id+0x20/0x30
[ 23.797771] __schedule+0x9a/0x3010
[ 23.797998] ? debug_smp_processor_id+0x20/0x30
[ 23.798293] ? rcu_is_watching+0x19/0xc0
[ 23.798558] ? __pfx___schedule+0x10/0x10
[ 23.798820] ? __pfx_lock_release+0x10/0x10
[ 23.799092] ? _raw_spin_unlock_irqrestore+0x35/0x70
[ 23.799404] ? do_task_dead+0xa6/0x110
[ 23.799655] ? debug_smp_processor_id+0x20/0x30
[ 23.799954] ? rcu_is_watching+0x19/0xc0
[ 23.800215] ? _raw_spin_unlock_irqrestore+0x35/0x70
[ 23.800537] ? trace_hardirqs_on+0x26/0x120
[ 23.800810] do_task_dead+0xde/0x110
[ 23.801046] make_task_dead+0x37f/0x3c0
[ 23.801304] ? __x64_sys_exit_group+0x4d/0x60
[ 23.801595] rewind_stack_and_make_dead+0x17/0x20
[ 23.801903] RIP: 0033:0x7fec33f18a4d
[ 23.802140] Code: Unable to access opcode bytes at 0x7fec33f18a23.
[ 23.802535] RSP: 002b:00007fffdd22ee98 EFLAGS: 00000246 ORIG_RAX: 00000000000000e7
[ 23.803013] RAX: ffffffffffffffda RBX: 00007fec33ff69e0 RCX: 00007fec33f18a4d
[ 23.803458] RDX: 00000000000000e7 RSI: fffffffffffffeb0 RDI: 0000000000000001
[ 23.803908] RBP: 0000000000000001 R08: 0000000000000000 R09: 0000000000000020
[ 23.804354] R10: 00007fffdd22ed40 R11: 0000000000000246 R12: 00007fec33ff69e0
[ 23.804803] R13: 00007fec33ffbf00 R14: 0000000000000001 R15: 00007fec33ffbee8
[ 23.805258] </TASK>
[ 23.805421] BUG: scheduling while atomic: repro/719/0x00000000
[ 23.805801] INFO: lockdep is turned off.
[ 23.806050] Modules linked in:
[ 23.806249] Preemption disabled at:
[ 23.806252] [<ffffffff813123e7>] do_task_dead+0x27/0x110
[ 23.806829] CPU: 0 PID: 719 Comm: repro Tainted: G D 6.4.0-rc2-1e8c813b083c+ #1
[ 23.807382] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.16.3-0-ga6ed6b701f0a-prebuilt.qemu.org 04/01/2014
[ 23.808097] Call Trace:
[ 23.808260] <TASK>
[ 23.808403] dump_stack_lvl+0xe1/0x110
[ 23.808656] ? do_task_dead+0x27/0x110
[ 23.808898] dump_stack+0x19/0x20
[ 23.809120] __schedule_bug+0x13f/0x190
[ 23.809379] __schedule+0x221f/0x3010
[ 23.809630] ? rcu_is_watching+0x19/0xc0
[ 23.809887] ? __pfx___schedule+0x10/0x10
[ 23.810142] ? __pfx_lock_release+0x10/0x10
[ 23.810419] ? _raw_spin_unlock_irqrestore+0x35/0x70
[ 23.810751] ? do_task_dead+0xa6/0x110
[ 23.810994] ? debug_smp_processor_id+0x20/0x30
[ 23.811289] ? rcu_is_watching+0x19/0xc0
[ 23.811553] ? _raw_spin_unlock_irqrestore+0x35/0x70
[ 23.811873] ? trace_hardirqs_on+0x26/0x120
[ 23.812148] do_task_dead+0xde/0x110
[ 23.812389] make_task_dead+0x37f/0x3c0
[ 23.812651] ? __x64_sys_exit_group+0x4d/0x60
[ 23.812938] rewind_stack_and_make_dead+0x17/0x20
[ 23.813246] RIP: 0033:0x7fec33f18a4d
[ 23.813486] Code: Unable to access opcode bytes at 0x7fec33f18a23.
[ 23.813872] RSP: 002b:00007fffdd22ee98 EFLAGS: 00000246 ORIG_RAX: 00000000000000e7
[ 23.814347] RAX: ffffffffffffffda RBX: 00007fec33ff69e0 RCX: 00007fec33f18a4d
[ 23.814794] RDX: 00000000000000e7 RSI: fffffffffffffeb0 RDI: 0000000000000001
[ 23.815231] RBP: 0000000000000001 R08: 0000000000000000 R09: 0000000000000020
[ 23.815680] R10: 00007fffdd22ed40 R11: 0000000000000246 R12: 00007fec33ff69e0
[ 23.816126] R13: 00007fec33ffbf00 R14: 0000000000000001 R15: 00007fec33ffbee8
[ 23.816580] </TASK>
[ 23.816734] ------------[ cut here ]------------
"
I hope it's helpful.
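The non-canonical address in the fault line and the range in the KASAN line appear to be two views of the same access. A minimal sketch of that arithmetic follows, assuming generic KASAN with an 8:1 shadow scale and a shadow offset of 0xdffffc0000000000 (the value held in RAX in the register dump; treating it as the shadow offset is an assumption). The helper names are borrowed from the kernel for readability, but this is standalone userspace arithmetic, not kernel code.

```c
#include <stdint.h>

/* Generic KASAN tracks 8 bytes of memory with 1 shadow byte (assumed here). */
#define KASAN_SHADOW_SCALE_SHIFT 3

/* Shadow address that would be checked for a given memory address. */
static uint64_t kasan_mem_to_shadow(uint64_t addr, uint64_t shadow_offset)
{
	return (addr >> KASAN_SHADOW_SCALE_SHIFT) + shadow_offset;
}

/* First byte of the 8-byte memory range covered by a shadow address. */
static uint64_t kasan_shadow_to_mem(uint64_t shadow, uint64_t shadow_offset)
{
	return (shadow - shadow_offset) << KASAN_SHADOW_SCALE_SHIFT;
}
```

With the values from the report, kasan_shadow_to_mem(0xdffffc000000000a, 0xdffffc0000000000) gives 0x50, the start of the reported range [0x50-0x57]. That is consistent with an access at offset 0x50 from a NULL base (R11 holds 0x50 in the dump).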
---
If you don't need the following environment to reproduce the problem or if you
already have one reproduced environment, please ignore the following information.
How to reproduce:
git clone https://gitlab.com/xupengfe/repro_vm_env.git
cd repro_vm_env
tar -xvf repro_vm_env.tar.gz
cd repro_vm_env; ./start3.sh // it needs qemu-system-x86_64 and I used v7.1.0
// start3.sh will load bzImage_2241ab53cbb5cdb08a6b2d4688feb13971058f65 v6.2-rc5 kernel
// You could change the bzImage_xxx as you want
// Maybe you need to remove line "-drive if=pflash,format=raw,readonly=on,file=./OVMF_CODE.fd \" for different qemu version
You can use the command below to log in; there is no password for root.
ssh -p 10023 root@localhost
After logging in to the VM (virtual machine) successfully, you can transfer the
reproducer binary to the VM as shown below and reproduce the problem there:
gcc -pthread -o repro repro.c
scp -P 10023 repro root@localhost:/root/
Get the bzImage for the target kernel:
Use the target kconfig and copy it to kernel_src/.config
make olddefconfig
make -jx bzImage //x should be equal to or less than the number of CPUs your PC has
Put the bzImage file into start3.sh above to load the target kernel in the VM.
Tips:
If you already have qemu-system-x86_64, please ignore below info.
If you want to install qemu v7.1.0 version:
git clone https://github.com/qemu/qemu.git
cd qemu
git checkout -f v7.1.0
mkdir build
cd build
yum install -y ninja-build.x86_64
yum -y install libslirp-devel.x86_64
../configure --target-list=x86_64-softmmu --enable-kvm --enable-vnc --enable-gtk --enable-sdl --enable-usb-redir --enable-slirp
make
make install
Best Regards,
Thanks!
* Re: [Syzkaller & bisect] There is general protection fault in path_init in v6.11-rc2
2024-08-10 14:04 [Syzkaller & bisect] There is general protection fault in path_init in v6.11-rc2 Pengfei Xu
@ 2025-01-27 9:18 ` Zicheng Qu
2025-02-10 13:17 ` [PATCH] acct: Prevent NULL pointer dereference when writing to sysfs Zicheng Qu
0 siblings, 1 reply; 18+ messages in thread
From: Zicheng Qu @ 2025-01-27 9:18 UTC
To: quzicheng
Cc: pengfei.xu, axboe, hch, jlayton, brauner, joel.granados, rafael,
len.brown, pavel, linux-kernel, linux-pm, syzkaller-bugs
Hi,
I am encountering a similar panic in v6.6 and would greatly
appreciate any guidance or suggestions you might have.
It seems that the sysfs path was passed to acct(), and when the process
exited, the fs_struct was released. However, acct_pin_kill() attempted
to write to the hibernate sysfs interface, triggering a null pointer
dereference.
I added a few more logs (labeled with the file path, the function name and
some key info) on top of Pengfei's report. Below are the relevant log
excerpts and details of the problem for the process/thread T9251:
[ 266.570716][ T9251] kernel/acct.c acct_on(): ./file0
[ 266.574701][ T7380] fs/namei.c path_init():, fs_struct is: not null
[ 266.576955][ T9251] fs/namei.c path_init():, fs_struct is: not null
[ 266.579385][ T9317] fs/fs_struct.c exit_fs(): the kill is: 1, fs_struct is released
[ 266.579674][ T7380] fs/namei.c path_init():, fs_struct is: not null
[ 266.584518][ T9244] fs/fs_struct.c exit_fs(): the kill is: 0, fs_struct is released
[ 266.587130][ T9268] fs/fs_struct.c exit_fs(): the kill is: 0, fs_struct is released
[ 266.587478][ T9251] fs/fs_struct.c exit_fs(): the kill is: 1, fs_struct is released
[ 266.591099][ T9278] Process accounting resumed
[ 266.592558][ T7380] fs/namei.c path_init():, fs_struct is: not null
[ 266.595184][ T9278] kernel/power/hibernate.c resume_store()
[ 266.598253][ T7380] fs/namei.c path_init():, fs_struct is: not null
[ 266.601043][ T9278] fs/namei.c path_init():, fs_struct is: not null
[ 266.605319][ T7380] fs/namei.c path_init():, fs_struct is: not null
[ 266.609439][ T9278] fs/fs_struct.c exit_fs(): the kill is: 1, fs_struct is released
[ 266.614479][ T9321] fs/namei.c path_init():, fs_struct is: not null
[ 266.615085][ T9320] fs/namei.c path_init():, fs_struct is: not null
[ 266.616612][ T9251] kernel/power/hibernate.c resume_store()
[ 266.620361][ T9321] fs/namei.c path_init():, fs_struct is: not null
[ 266.622487][ T9251] fs/namei.c path_init():, fs_struct is: null
[ 266.624631][ T9319] fs/fs_struct.c exit_fs(): the kill is: 1, fs_struct is released
[ 266.625668][ T9321] fs/namei.c path_init():, fs_struct is: not null
[ 266.625737][ T9321] fs/namei.c path_init():, fs_struct is: not null
[ 266.628762][ T9251] Unable to handle kernel paging request at virtual address dfff800000000001
[ 266.629149][ T9328] fs/fs_struct.c exit_fs(): the kill is: 1, fs_struct is released
[ 266.633753][ T9321] fs/namei.c path_init():, fs_struct is: not null
[ 266.635200][ T9251] KASAN: null-ptr-deref in range [0x0000000000000008-0x000000000000000f]
[ 266.637804][ T9331] fs/fs_struct.c exit_fs(): the kill is: 1, fs_struct is released
[ 266.641370][ T9251] Mem abort info:
[ 266.641375][ T9251] ESR = 0x0000000096000004
[ 266.643344][ T9334] fs/namei.c path_init():, fs_struct is: not null
[ 266.643533][ T9332] fs/fs_struct.c exit_fs(): the kill is: 1, fs_struct is released
[ 266.649985][ T9335] fs/fs_struct.c exit_fs(): the kill is: 1, fs_struct is released
[ 266.650571][ T9251] EC = 0x25: DABT (current EL), IL = 32 bits
[ 266.679306][ T9333] fs/namei.c path_init():, fs_struct is: not null
[ 266.681354][ T9251] SET = 0, FnV = 0
[ 266.681360][ T9251] EA = 0, S1PTW = 0
[ 267.132845][ T9280] fs/fs_struct.c exit_fs(): the kill is: 0, fs_struct is released
[ 267.132913][ T9274] fs/fs_struct.c exit_fs(): the kill is: 0, fs_struct is released
[ 267.133970][ T9251] FSC = 0x04: level 0 translation fault
[ 267.133978][ T9251] Data abort info:
[ 267.133981][ T9251] ISV = 0, ISS = 0x00000004, ISS2 = 0x00000000
[ 267.133984][ T9251] CM = 0, WnR = 0, TnD = 0, TagAccess = 0
[ 267.133988][ T9251] GCS = 0, Overlay = 0, DirtyBit = 0, Xs = 0
[ 267.133992][ T9251] [dfff800000000001] address between user and kernel address ranges
[ 267.134000][ T9251] Internal error: Oops: 0000000096000004 [#1] SMP
[ 267.134817][ T9320] fs/fs_struct.c exit_fs(): the kill is: 1, fs_struct is released
[ 267.137764][ T7101] fs/namei.c path_init():, fs_struct is: not null
[ 267.140527][ T9251] Modules linked in:
[ 267.140541][ T9251] CPU: 2 PID: 9251 Comm: syz.3.547 Not tainted 6.6.0-qzc-reproduct-1+ #12
[ 267.140550][ T9251] Hardware name: linux,dummy-virt (DT)
[ 267.140554][ T9251] pstate: 60000005 (nZCv daif -PAN -UAO -TCO -DIT -SSBS BTYPE=--)
[ 267.140561][ T9251] pc : path_init+0x5f0/0x16c0
[ 267.140582][ T9251] lr : path_init+0x5c8/0x16c0
[ 267.140588][ T9251] sp : ffff800082c06e40
[ 267.140592][ T9251] x29: ffff800082c06e40 x28: 0000000000000000
[ 267.143204][ T7101] fs/namei.c path_init():, fs_struct is: not null
[ 267.146064][ T9251] x27: dfff800000000000
[ 267.146073][ T9251] x26: 0000000000000000 x25: 0000000000000008 x24: 0000000000000041
[ 267.146082][ T9251] x23: ffff1f2bceea5520 x22: 1ffff00010580e0c x21: ffff800082c07080
[ 267.146091][ T9251] x20: 1ffff00010580e10 x19: ffff800082c07060 x18: 0000000000000000
[ 267.146100][ T9251] x17: 0000000000000000 x16: 0000000000000000 x15: 0000000000000000
[ 267.146108][ T9251] x14: 0000000000000000 x13: 205d343332395420 x12: 0000000000000005
[ 267.146115][ T9251] x11: ffff800082c07090 x10: ffff800082c07068
[ 267.148952][ T7101] fs/namei.c path_init():, fs_struct is: not null
[ 267.151802][ T9251] x9 : dfff800000000001
[ 267.151810][ T9251] x8 : 00008fffefa7f290 x7 : ffff800082c070a0 x6 : 0000000000000003
[ 267.151819][ T9251] x5 : ffff800082c06b80 x4 : ffff700010580d71 x3 : 1ffff00010580de0
[ 267.151827][ T9251] x2 : 0000000000000000 x1 : ffff1f2bc490bfc0 x0 : 000000000000002b
[ 267.151836][ T9251] Call trace:
[ 267.151840][ T9251] path_init+0x5f0/0x16c0
[ 267.151848][ T9251] path_lookupat+0x3c/0x590
[ 267.151855][ T9251] filename_lookup+0x144/0x410
[ 267.151859][ T9251] kern_path+0x44/0x70
[ 267.151863][ T9251] lookup_bdev+0xb8/0x220
[ 267.151871][ T9251] resume_store+0x184/0x320
[ 267.151878][ T9251] kobj_attr_store+0x3c/0x70
[ 267.154689][ T7101] fs/namei.c path_init():, fs_struct is: not null
[ 267.157783][ T9251] sysfs_kf_write+0xfc/0x188
[ 267.157796][ T9251] kernfs_fop_write_iter+0x274/0x3e0
[ 267.157800][ T9251] __kernel_write_iter+0x1c4/0x600
[ 267.157808][ T9251] __kernel_write+0xbc/0x100
[ 267.157813][ T9251] do_acct_process+0x3e8/0x620
[ 267.157821][ T9251] acct_pin_kill+0x3c/0x110
[ 267.157826][ T9251] pin_kill+0x164/0x610
[ 267.157832][ T9251] mnt_pin_kill+0x50/0x98
[ 267.157836][ T9251] cleanup_mnt+0x24c/0x2c8
[ 267.161037][ T7101] fs/namei.c path_init():, fs_struct is: not null
[ 267.164241][ T9251] __cleanup_mnt+0x1c/0x30
[ 267.164252][ T9251] task_work_run+0x17c/0x308
[ 267.164259][ T9251] do_exit+0x3ac/0xa30
[ 267.164267][ T9251] do_group_exit+0x100/0x348
[ 267.164272][ T9251] get_signal+0x107c/0x10f8
[ 267.164277][ T9251] do_signal+0x160/0x400
[ 267.164283][ T9251] do_notify_resume+0x1c4/0x470
[ 267.164287][ T9251] el0_svc+0x1c0/0x1e8
[ 267.164294][ T9251] el0t_64_sync_handler+0xc0/0xc8
[ 267.164299][ T9251] el0t_64_sync+0x188/0x190
[ 267.166883][ T7101] fs/namei.c path_init():, fs_struct is: not null
[ 267.170069][ T9251] Code: 91010267 d343fe76 9100c26b 9100226a (39c00120)
[ 267.170079][ T9251] ---[ end trace 0000000000000000 ]---
[ 267.170084][ T9251] Kernel panic - not syncing: Oops: Fatal exception
[ 267.170090][ T9251] SMP: stopping secondary CPUs
[ 267.170167][ T9251] Kernel Offset: 0x22a72e400000 from 0xffff800080000000
[ 267.170172][ T9251] PHYS_OFFSET: 0xffffe0d540000000
[ 267.170175][ T9251] CPU features: 0x00,00000008,00002009,e0080000,1000421b
[ 267.170182][ T9251] Memory Limit: none
[ 269.627565][ T9251] ---[ end Kernel panic - not syncing: Oops: Fatal exception ]---
* [PATCH] acct: Prevent NULL pointer dereference when writing to sysfs
2025-01-27 9:18 ` Zicheng Qu
@ 2025-02-10 13:17 ` Zicheng Qu
2025-02-10 15:12 ` Christian Brauner
0 siblings, 1 reply; 18+ messages in thread
From: Zicheng Qu @ 2025-02-10 13:17 UTC
To: jlayton, brauner, axboe, joel.granados, tglx, viro, linux-kernel
Cc: hch, len.brown, pavel, pengfei.xu, rafael, syzkaller-bugs,
linux-pm, tanghui20, zhangqiao22, judy.chenhui, quzicheng
The acct feature is designed to write process records to specified
files, typically paths like /var/log/pacct. However, writing to sysfs
paths (e.g., /sys/power/resume) may lead to a NULL pointer dereference
issue. The acct() should not write to sysfs.
When acct() is called with a sysfs path, such as /sys/power/resume, and the
process exits via do_exit(), do_exit() calls exit_fs() to clean up the
fs_struct. Subsequently, exit_task_work() calls acct_pin_kill(), triggering
sysfs operations. This invokes the hibernate resume_store(). Since the
fs_struct has been cleaned up, this results in a NULL pointer dereference.
This patch ensures that acct does not attempt to write to sysfs paths,
preventing the described issue.
[ 220.064848][ T4630] Unable to handle kernel paging request at virtual address dfff800000000001
[ 220.073744][ T4630] KASAN: null-ptr-deref in range [0x0000000000000008-0x000000000000000f]
[ 220.080847][ T4630] Mem abort info:
[ 220.088915][ T4630] ESR = 0x0000000096000004
[ 220.088921][ T4630] EC = 0x25: DABT (current EL), IL = 32 bits
[ 220.088925][ T4630] SET = 0, FnV = 0
[ 220.088927][ T4630] EA = 0, S1PTW = 0
[ 220.088930][ T4630] FSC = 0x04: level 0 translation fault
[ 220.088933][ T4630] Data abort info:
[ 220.088934][ T4630] ISV = 0, ISS = 0x00000004, ISS2 = 0x00000000
[ 220.088937][ T4630] CM = 0, WnR = 0, TnD = 0, TagAccess = 0
[ 220.088940][ T4630] GCS = 0, Overlay = 0, DirtyBit = 0, Xs = 0
[ 220.088943][ T4630] [dfff800000000001] address between user and kernel address ranges
[ 220.088949][ T4630] Internal error: Oops: 0000000096000004 [#1] SMP
[ 220.098020][ T4630] Modules linked in:
[ 220.104001][ T4630]
[ 220.104007][ T4630] CPU: 1 PID: 4630 Comm: syz.14.167 Not tainted 6.6.0-qzc-20250207-+ #16
[ 220.104014][ T4630] Hardware name: linux,dummy-virt (DT)
[ 220.104017][ T4630] pstate: 60000005 (nZCv daif -PAN -UAO -TCO -DIT -SSBS BTYPE=--)
[ 220.104023][ T4630] pc : path_init+0x61c/0x1710
[ 220.104036][ T4630] lr : path_init+0x5f4/0x1710
[ 220.104042][ T4630] sp : ffff800083c86e20
[ 220.104044][ T4630] x29: ffff800083c86e20 x28: 0000000000000000 x27: dfff800000000000
[ 220.109465][ T4630]
[ 220.109469][ T4630] x26: 0000000000000000
[ 220.114307][ T4630] x25: 0000000000000008 x24: 0000000000000041
[ 220.114316][ T4630] x23: ffff303e696a2220 x22: 1ffff00010790e08 x21: ffff800083c87060
[ 220.114324][ T4630] x20: 1ffff00010790e0c x19: ffff800083c87040 x18: 0000000000000000
[ 220.114331][ T4630] x17: 756e203a73692074 x16: 63757274735f7366 x15: 202c292865726f74
[ 220.114339][ T4630] x14: 735f656d75736572 x13: 205d303336345420 x12: 0000000000000005
[ 220.114346][ T4630] x11: ffff800083c87070 x10: ffff800083c87048 x9 : dfff800000000001
[ 220.114354][ T4630] x8 : 00008fffef86f294
[ 220.119872][ T4630] x7 : ffff800083c87080
[ 220.126082][ T4630] x6 : 0000000000000003
[ 220.126089][ T4630] x5 : ffff800083c86b60 x4 : ffff700010790d6d x3 : 1ffff00010790ddc
[ 220.126097][ T4630] x2 : 0000000000000000 x1 : ffff303e680f1540 x0 : 0000000000000032
[ 220.126105][ T4630] Call trace:
[ 220.126108][ T4630] path_init+0x61c/0x1710
[ 220.152138][ T4630] path_lookupat+0x3c/0x590
[ 220.152150][ T4630] filename_lookup+0x144/0x410
[ 220.152155][ T4630] kern_path+0x44/0x70
[ 220.152158][ T4630] lookup_bdev+0xb8/0x220
[ 220.158873][ T4630] resume_store+0x1d8/0x3f8
[ 220.158882][ T4630] kobj_attr_store+0x3c/0x70
[ 220.163343][ T4630] sysfs_kf_write+0xfc/0x188
[ 220.163352][ T4630] kernfs_fop_write_iter+0x274/0x3e0
[ 220.163356][ T4630] __kernel_write_iter+0x1c4/0x600
[ 220.163363][ T4630] __kernel_write+0xbc/0x100
[ 220.163368][ T4630] do_acct_process+0x3e8/0x620
[ 220.163374][ T4630] acct_pin_kill+0xa0/0x190
[ 220.163379][ T4630] pin_kill+0x164/0x610
[ 220.163384][ T4630] mnt_pin_kill+0x50/0x98
[ 220.169427][ T4630] cleanup_mnt+0x24c/0x2c8
[ 220.169438][ T4630] __cleanup_mnt+0x1c/0x30
[ 220.169443][ T4630] task_work_run+0x17c/0x308
[ 220.169449][ T4630] do_exit+0x3ac/0xa30
[ 220.169455][ T4630] do_group_exit+0x100/0x348
[ 220.169460][ T4630] get_signal+0x107c/0x10f8
[ 220.169464][ T4630] do_signal+0x160/0x400
[ 220.169468][ T4630] do_notify_resume+0x1c4/0x470
[ 220.169472][ T4630] el0_svc+0x1c0/0x1e8
[ 220.169479][ T4630] el0t_64_sync_handler+0xc0/0xc8
[ 220.169482][ T4630] el0t_64_sync+0x188/0x190
[ 220.169489][ T4630] Code: 91010267 d343fe76 9100c26b 9100226a (39c00120)
[ 220.174309][ T4630] ---[ end trace 0000000000000000 ]---
[ 220.174316][ T4630] Kernel panic - not syncing: Oops: Fatal exception
[ 220.174319][ T4630] SMP: stopping secondary CPUs
[ 220.174387][ T4630] Kernel Offset: 0x3dd279e00000 from 0xffff800080000000
[ 220.174391][ T4630] PHYS_OFFSET: 0xffffcfc2c0000000
[ 220.174394][ T4630] CPU features: 0x00,00000008,00002009,e0080000,1000421b
[ 220.174398][ T4630] Memory Limit: none
[ 220.347443][ T4630] ---[ end Kernel panic - not syncing: Oops: Fatal exception ]---
Fixes: 669abf4e5539 ("vfs: make path_openat take a struct filename pointer")
Cc: stable@vger.kernel.org
Signed-off-by: Zicheng Qu <quzicheng@huawei.com>
---
kernel/acct.c | 10 +++++++++-
1 file changed, 9 insertions(+), 1 deletion(-)
diff --git a/kernel/acct.c b/kernel/acct.c
index 31222e8cd534..0beee5effee7 100644
--- a/kernel/acct.c
+++ b/kernel/acct.c
@@ -239,6 +239,14 @@ static int acct_on(struct filename *pathname)
filp_close(file, NULL);
return -EIO;
}
+
+ mnt = file->f_path.mnt;
+ if (mnt->mnt_sb->s_magic == SYSFS_MAGIC) {
+ kfree(acct);
+ filp_close(file, NULL);
+ return -EINVAL;
+ }
+
internal = mnt_clone_internal(&file->f_path);
if (IS_ERR(internal)) {
kfree(acct);
@@ -252,7 +260,7 @@ static int acct_on(struct filename *pathname)
filp_close(file, NULL);
return err;
}
- mnt = file->f_path.mnt;
+
file->f_path.mnt = internal;
atomic_long_set(&acct->count, 1);
--
2.34.1
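For comparison, the same kind of check can be done from userspace before a path is ever handed to acct(). The following is a minimal sketch using statfs(2) and SYSFS_MAGIC from <linux/magic.h>; the helper names fs_magic_is_sysfs() and path_is_on_sysfs() are hypothetical and are not part of the patch above.

```c
#include <sys/vfs.h>       /* statfs(2) */
#include <linux/magic.h>   /* SYSFS_MAGIC */

/* Return 1 if the filesystem magic identifies sysfs, else 0. */
static int fs_magic_is_sysfs(long f_type)
{
	return f_type == SYSFS_MAGIC;
}

/*
 * Return 1 if path lives on sysfs, 0 if it lives on another filesystem,
 * and -1 if statfs() fails (errno is left as set by statfs()).
 */
static int path_is_on_sysfs(const char *path)
{
	struct statfs st;

	if (statfs(path, &st) != 0)
		return -1;
	return fs_magic_is_sysfs((long)st.f_type);
}
```

A caller could refuse any path for which path_is_on_sysfs() returns 1. Like the patch, this only covers sysfs; other special filesystems would need their own magic values.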
* Re: [PATCH] acct: Prevent NULL pointer dereference when writing to sysfs
2025-02-10 13:17 ` [PATCH] acct: Prevent NULL pointer dereference when writing to sysfs Zicheng Qu
@ 2025-02-10 15:12 ` Christian Brauner
2025-02-10 15:21 ` Al Viro
2025-02-11 17:15 ` [PATCH 0/2] acct: don't allow access to internal filesystems Christian Brauner
0 siblings, 2 replies; 18+ messages in thread
From: Christian Brauner @ 2025-02-10 15:12 UTC
To: Zicheng Qu, Linus Torvalds
Cc: jlayton, axboe, joel.granados, tglx, viro, linux-kernel, hch,
len.brown, pavel, pengfei.xu, rafael, syzkaller-bugs, linux-pm,
tanghui20, zhangqiao22, judy.chenhui
On Mon, Feb 10, 2025 at 01:17:19PM +0000, Zicheng Qu wrote:
> The acct feature is designed to write process records to specified
> files, typically paths like /var/log/pacct. However, writing to sysfs
> paths (e.g., /sys/power/resume) may lead to a NULL pointer dereference
> issue. The acct() should not write to sysfs.
>
> When acct() is called with a sysfs path, such as /sys/power/resume, and the
> process exits via do_exit(), do_exit() calls exit_fs() to clean up the
> fs_struct. Subsequently, exit_task_work() calls acct_pin_kill(), triggering
> sysfs operations. This invokes the hibernate resume_store(). Since the
> fs_struct has been cleaned up, this results in a NULL pointer dereference.
This is a mess.
As an immediate fix what you're doing will stop the bleeding for sysfs
only. But who knows what people can do if they pass in a procfs or some
other special sauce filesystem path.
There's no guarantee that there isn't an internal lookup that's somehow
triggered in some filesystem by this accounting nonsense.
One fix would be to move exit_fs() past exit_task_work(). It looks like
this should be doable without much of a problem and it would fix
the path_init() problem.
There should hopefully be nothing relying on task->fs == NULL in
exit_task_work().
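To make the ordering argument concrete, here is a toy userspace model of the two exit sequences. It is only a sketch: every toy_* name is hypothetical, the structures are stand-ins rather than the kernel's definitions, and the real do_exit() does far more than this.

```c
#include <stddef.h>

/* Toy stand-ins; none of these are the kernel's real definitions. */
struct toy_fs { int cwd; };

struct toy_task {
	struct toy_fs *fs;       /* models task->fs */
	int lookup_saw_null_fs;  /* set if a lookup ran with fs == NULL */
};

/* Models exit_fs(): detach the fs_struct from the task. */
static void toy_exit_fs(struct toy_task *t)
{
	t->fs = NULL;
}

/* Models path_init(): a path lookup that needs task->fs. */
static void toy_path_lookup(struct toy_task *t)
{
	if (t->fs == NULL)
		t->lookup_saw_null_fs = 1; /* the kernel would oops here */
}

/* Models exit_task_work(): deferred work that ends in a path lookup. */
static void toy_exit_task_work(struct toy_task *t)
{
	toy_path_lookup(t);
}

/* Run the exit sequence; returns 1 if a lookup ran without an fs. */
static int toy_do_exit(int exit_fs_first)
{
	struct toy_fs fs = { 0 };
	struct toy_task t = { &fs, 0 };

	if (exit_fs_first) {
		toy_exit_fs(&t);        /* current order */
		toy_exit_task_work(&t);
	} else {
		toy_exit_task_work(&t); /* proposed order */
		toy_exit_fs(&t);
	}
	return t.lookup_saw_null_fs;
}
```

In this model toy_do_exit(1) reports a lookup with a NULL fs, while toy_do_exit(0) does not. Whether anything in the real exit_task_work() relies on task->fs == NULL is exactly the open question raised above, and the model says nothing about it.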
There's other solutions but they all get increasingly disgusting or
maybe I'm not imaginative enough rn.
But bigger picture: Can we try and get rid of that accounting stuff? We
could start by making acct(2) return -ENOSYS unconditionally
and see if anything does actually break.
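From userspace, such a change would surface as acct() failing with errno set to ENOSYS, the same result as on a kernel built without BSD process accounting. A small sketch of how a caller might tell the cases apart is below; the enum and both function names are hypothetical. Note that acct(NULL) turns accounting off when the caller is privileged, so probe_acct() is not side-effect free.

```c
#define _DEFAULT_SOURCE   /* exposes acct() in <unistd.h> on glibc */
#include <errno.h>
#include <unistd.h>

enum acct_probe {
	ACCT_AVAILABLE,        /* the call succeeded */
	ACCT_NOT_IMPLEMENTED,  /* ENOSYS: no accounting support */
	ACCT_NOT_PERMITTED,    /* EPERM: insufficient privilege */
	ACCT_OTHER_ERROR,
};

/* Map an acct() return value and errno to a probe result. */
static enum acct_probe classify_acct_result(int rc, int err)
{
	if (rc == 0)
		return ACCT_AVAILABLE;
	switch (err) {
	case ENOSYS:
		return ACCT_NOT_IMPLEMENTED;
	case EPERM:
		return ACCT_NOT_PERMITTED;
	default:
		return ACCT_OTHER_ERROR;
	}
}

/* Call acct(NULL) and classify the outcome. */
static enum acct_probe probe_acct(void)
{
	int rc = acct(NULL);

	return classify_acct_result(rc, errno);
}
```

An unprivileged caller gets ACCT_NOT_PERMITTED on a kernel with accounting enabled, so the ENOSYS case is only distinguishable when the probe runs with sufficient privilege or when the kernel rejects the call before the permission check.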
> This patch ensures that acct does not attempt to write to sysfs paths,
> preventing the described issue.
>
> [ 220.064848][ T4630] Unable to handle kernel paging request at virtual address dfff800000000001
> [ 220.073744][ T4630] KASAN: null-ptr-deref in range [0x0000000000000008-0x000000000000000f]
> [ 220.080847][ T4630] Mem abort info:
> [ 220.088915][ T4630] ESR = 0x0000000096000004
> [ 220.088921][ T4630] EC = 0x25: DABT (current EL), IL = 32 bits
> [ 220.088925][ T4630] SET = 0, FnV = 0
> [ 220.088927][ T4630] EA = 0, S1PTW = 0
> [ 220.088930][ T4630] FSC = 0x04: level 0 translation fault
> [ 220.088933][ T4630] Data abort info:
> [ 220.088934][ T4630] ISV = 0, ISS = 0x00000004, ISS2 = 0x00000000
> [ 220.088937][ T4630] CM = 0, WnR = 0, TnD = 0, TagAccess = 0
> [ 220.088940][ T4630] GCS = 0, Overlay = 0, DirtyBit = 0, Xs = 0
> [ 220.088943][ T4630] [dfff800000000001] address between user and kernel address ranges
> [ 220.088949][ T4630] Internal error: Oops: 0000000096000004 [#1] SMP
> [ 220.098020][ T4630] Modules linked in:
> [ 220.104001][ T4630]
> [ 220.104007][ T4630] CPU: 1 PID: 4630 Comm: syz.14.167 Not tainted 6.6.0-qzc-20250207-+ #16
> [ 220.104014][ T4630] Hardware name: linux,dummy-virt (DT)
> [ 220.104017][ T4630] pstate: 60000005 (nZCv daif -PAN -UAO -TCO -DIT -SSBS BTYPE=--)
> [ 220.104023][ T4630] pc : path_init+0x61c/0x1710
> [ 220.104036][ T4630] lr : path_init+0x5f4/0x1710
> [ 220.104042][ T4630] sp : ffff800083c86e20
> [ 220.104044][ T4630] x29: ffff800083c86e20 x28: 0000000000000000 x27: dfff800000000000
> [ 220.109465][ T4630]
> [ 220.109469][ T4630] x26: 0000000000000000
> [ 220.114307][ T4630] x25: 0000000000000008 x24: 0000000000000041
> [ 220.114316][ T4630] x23: ffff303e696a2220 x22: 1ffff00010790e08 x21: ffff800083c87060
> [ 220.114324][ T4630] x20: 1ffff00010790e0c x19: ffff800083c87040 x18: 0000000000000000
> [ 220.114331][ T4630] x17: 756e203a73692074 x16: 63757274735f7366 x15: 202c292865726f74
> [ 220.114339][ T4630] x14: 735f656d75736572 x13: 205d303336345420 x12: 0000000000000005
> [ 220.114346][ T4630] x11: ffff800083c87070 x10: ffff800083c87048 x9 : dfff800000000001
> [ 220.114354][ T4630] x8 : 00008fffef86f294
> [ 220.119872][ T4630] x7 : ffff800083c87080
> [ 220.126082][ T4630] x6 : 0000000000000003
> [ 220.126089][ T4630] x5 : ffff800083c86b60 x4 : ffff700010790d6d x3 : 1ffff00010790ddc
> [ 220.126097][ T4630] x2 : 0000000000000000 x1 : ffff303e680f1540 x0 : 0000000000000032
> [ 220.126105][ T4630] Call trace:
> [ 220.126108][ T4630] path_init+0x61c/0x1710
> [ 220.152138][ T4630] path_lookupat+0x3c/0x590
> [ 220.152150][ T4630] filename_lookup+0x144/0x410
> [ 220.152155][ T4630] kern_path+0x44/0x70
> [ 220.152158][ T4630] lookup_bdev+0xb8/0x220
> [ 220.158873][ T4630] resume_store+0x1d8/0x3f8
> [ 220.158882][ T4630] kobj_attr_store+0x3c/0x70
> [ 220.163343][ T4630] sysfs_kf_write+0xfc/0x188
> [ 220.163352][ T4630] kernfs_fop_write_iter+0x274/0x3e0
> [ 220.163356][ T4630] __kernel_write_iter+0x1c4/0x600
> [ 220.163363][ T4630] __kernel_write+0xbc/0x100
> [ 220.163368][ T4630] do_acct_process+0x3e8/0x620
> [ 220.163374][ T4630] acct_pin_kill+0xa0/0x190
> [ 220.163379][ T4630] pin_kill+0x164/0x610
> [ 220.163384][ T4630] mnt_pin_kill+0x50/0x98
> [ 220.169427][ T4630] cleanup_mnt+0x24c/0x2c8
> [ 220.169438][ T4630] __cleanup_mnt+0x1c/0x30
> [ 220.169443][ T4630] task_work_run+0x17c/0x308
> [ 220.169449][ T4630] do_exit+0x3ac/0xa30
> [ 220.169455][ T4630] do_group_exit+0x100/0x348
> [ 220.169460][ T4630] get_signal+0x107c/0x10f8
> [ 220.169464][ T4630] do_signal+0x160/0x400
> [ 220.169468][ T4630] do_notify_resume+0x1c4/0x470
> [ 220.169472][ T4630] el0_svc+0x1c0/0x1e8
> [ 220.169479][ T4630] el0t_64_sync_handler+0xc0/0xc8
> [ 220.169482][ T4630] el0t_64_sync+0x188/0x190
> [ 220.169489][ T4630] Code: 91010267 d343fe76 9100c26b 9100226a (39c00120)
> [ 220.174309][ T4630] ---[ end trace 0000000000000000 ]---
> [ 220.174316][ T4630] Kernel panic - not syncing: Oops: Fatal exception
> [ 220.174319][ T4630] SMP: stopping secondary CPUs
> [ 220.174387][ T4630] Kernel Offset: 0x3dd279e00000 from 0xffff800080000000
> [ 220.174391][ T4630] PHYS_OFFSET: 0xffffcfc2c0000000
> [ 220.174394][ T4630] CPU features: 0x00,00000008,00002009,e0080000,1000421b
> [ 220.174398][ T4630] Memory Limit: none
> [ 220.347443][ T4630] ---[ end Kernel panic - not syncing: Oops: Fatal exception ]---
>
> Fixes: 669abf4e5539 ("vfs: make path_openat take a struct filename pointer")
> Cc: stable@vger.kernel.org
> Signed-off-by: Zicheng Qu <quzicheng@huawei.com>
> ---
> kernel/acct.c | 10 +++++++++-
> 1 file changed, 9 insertions(+), 1 deletion(-)
>
> diff --git a/kernel/acct.c b/kernel/acct.c
> index 31222e8cd534..0beee5effee7 100644
> --- a/kernel/acct.c
> +++ b/kernel/acct.c
> @@ -239,6 +239,14 @@ static int acct_on(struct filename *pathname)
> filp_close(file, NULL);
> return -EIO;
> }
> +
> + mnt = file->f_path.mnt;
> + if (mnt->mnt_sb->s_magic == SYSFS_MAGIC) {
> + kfree(acct);
> + filp_close(file, NULL);
> + return -EINVAL;
> + }
> +
> internal = mnt_clone_internal(&file->f_path);
> if (IS_ERR(internal)) {
> kfree(acct);
> @@ -252,7 +260,7 @@ static int acct_on(struct filename *pathname)
> filp_close(file, NULL);
> return err;
> }
> - mnt = file->f_path.mnt;
> +
> file->f_path.mnt = internal;
>
> atomic_long_set(&acct->count, 1);
> --
> 2.34.1
>
* Re: [PATCH] acct: Prevent NULL pointer dereference when writing to sysfs
2025-02-10 15:12 ` Christian Brauner
@ 2025-02-10 15:21 ` Al Viro
2025-02-10 16:02 ` Christian Brauner
2025-02-11 17:15 ` [PATCH 0/2] acct: don't allow access to internal filesystems Christian Brauner
1 sibling, 1 reply; 18+ messages in thread
From: Al Viro @ 2025-02-10 15:21 UTC (permalink / raw)
To: Christian Brauner
Cc: Zicheng Qu, Linus Torvalds, jlayton, axboe, joel.granados, tglx,
linux-kernel, hch, len.brown, pavel, pengfei.xu, rafael,
syzkaller-bugs, linux-pm, tanghui20, zhangqiao22, judy.chenhui
On Mon, Feb 10, 2025 at 04:12:54PM +0100, Christian Brauner wrote:
> One fix would be to move exit_fs() past exit_task_work(). It looks like
> this should be doable without much of a problem and it would fix
> the path_init() problem.
>
> There should hopefully be nothing relying on task->fs == NULL in
> exit_task_work().
There's a question of the task_work_add() issued by exit_task_fs(),
though.
* Re: [PATCH] acct: Prevent NULL pointer dereference when writing to sysfs
2025-02-10 15:21 ` Al Viro
@ 2025-02-10 16:02 ` Christian Brauner
2025-02-10 18:19 ` Al Viro
0 siblings, 1 reply; 18+ messages in thread
From: Christian Brauner @ 2025-02-10 16:02 UTC (permalink / raw)
To: Al Viro
Cc: Zicheng Qu, Linus Torvalds, jlayton, axboe, joel.granados, tglx,
linux-kernel, hch, len.brown, pavel, pengfei.xu, rafael,
syzkaller-bugs, linux-pm, tanghui20, zhangqiao22, judy.chenhui
On Mon, Feb 10, 2025 at 03:21:46PM +0000, Al Viro wrote:
> On Mon, Feb 10, 2025 at 04:12:54PM +0100, Christian Brauner wrote:
>
> > One fix would be to move exit_fs() past exit_task_work(). It looks like
> > this should be doable without much of a problem and it would fix
> > the path_init() problem.
> >
> > There should hopefully be nothing relying on task->fs == NULL in
> > exit_task_work().
>
> There's a question of the task_work_add() issued by exit_task_fs(),
> though.
Can't we simply remove the pins on the mounts of fs->root and fs->pwd in
exit_fs() explicitly? If that works I think that's a fair enough
compromise for this shite.
* Re: [PATCH] acct: Prevent NULL pointer dereference when writing to sysfs
2025-02-10 16:02 ` Christian Brauner
@ 2025-02-10 18:19 ` Al Viro
2025-02-11 0:23 ` Al Viro
0 siblings, 1 reply; 18+ messages in thread
From: Al Viro @ 2025-02-10 18:19 UTC (permalink / raw)
To: Christian Brauner
Cc: Zicheng Qu, Linus Torvalds, jlayton, axboe, joel.granados, tglx,
linux-kernel, hch, len.brown, pavel, pengfei.xu, rafael,
syzkaller-bugs, linux-pm, tanghui20, zhangqiao22, judy.chenhui
On Mon, Feb 10, 2025 at 05:02:35PM +0100, Christian Brauner wrote:
> On Mon, Feb 10, 2025 at 03:21:46PM +0000, Al Viro wrote:
> > On Mon, Feb 10, 2025 at 04:12:54PM +0100, Christian Brauner wrote:
> >
> > > One fix would be to move exit_fs() past exit_task_work(). It looks like
> > > this should be doable without much of a problem and it would fix
> > > the path_init() problem.
> > >
> > > There should hopefully be nothing relying on task->fs == NULL in
> > > exit_task_work().
> >
> > There's a question of the task_work_add() issued by exit_task_fs(),
> > though.
>
> Can't we simply remove the pins on the mounts of fs->root and fs->pwd in
> exit_fs() explicitly? If that works I think that's a fair enough
> compromise for this shite.
I'd rather go for a simpler approach... Why do we need those writes
to be done in context of exiting process in the first place? It's
not as if they needed to go out before it terminates, so what's to
stop us from having a kernel thread in background and queue the data
to be written for it to pick up?
Does anybody see problems with that approach?
* Re: [PATCH] acct: Prevent NULL pointer dereference when writing to sysfs
2025-02-10 18:19 ` Al Viro
@ 2025-02-11 0:23 ` Al Viro
2025-02-11 10:17 ` Christian Brauner
0 siblings, 1 reply; 18+ messages in thread
From: Al Viro @ 2025-02-11 0:23 UTC (permalink / raw)
To: Christian Brauner
Cc: Zicheng Qu, Linus Torvalds, jlayton, axboe, joel.granados, tglx,
linux-kernel, hch, len.brown, pavel, pengfei.xu, rafael,
syzkaller-bugs, linux-pm, tanghui20, zhangqiao22, judy.chenhui
On Mon, Feb 10, 2025 at 06:19:02PM +0000, Al Viro wrote:
> On Mon, Feb 10, 2025 at 05:02:35PM +0100, Christian Brauner wrote:
> > On Mon, Feb 10, 2025 at 03:21:46PM +0000, Al Viro wrote:
> > > On Mon, Feb 10, 2025 at 04:12:54PM +0100, Christian Brauner wrote:
> > >
> > > > One fix would be to move exit_fs() past exit_task_work(). It looks like
> > > > this should be doable without much of a problem and it would fix
> > > > the path_init() problem.
> > > >
> > > > There should hopefully be nothing relying on task->fs == NULL in
> > > > exit_task_work().
> > >
> > > There's a question of the task_work_add() issued by exit_task_fs(),
> > > though.
> >
> > Can't we simply remove the pins on the mounts of fs->root and fs->pwd in
> > exit_fs() explicitly? If that works I think that's a fair enough
> > compromise for this shite.
>
> I'd rather go for a simpler approach... Why do we need those writes
> to be done in context of exiting process in the first place? It's
> not as if they needed to go out before it terminates, so what's to
> stop us from having a kernel thread in background and queue the data
> to be written for it to pick up?
>
> Does anybody see problems with that approach?
Note, BTW, that games with rlimit and creds switching disappear if done
that way.
FWIW, I wonder if we should simply allocate a page worth of buffer,
occupied by acct_t array, with count + pointer to buffer kept in acct,
with acct->mutex used to protect the entire thing, so that do_acct_process()
would add a record to that sucker and wake the kthread up, with kthread
handling actual writes and emptying the buffer. No need for exit(2)
to wait unless the buffer is full...
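The buffering idea above can be sketched in user space roughly as follows. This is only an illustration under stated assumptions, not kernel code: `struct record` stands in for `acct_t`, a pthread stands in for the kthread, and `REC_CAP`, `acct_buf_add()`, `acct_writer()` and `acct_buf_stop()` are hypothetical names. A producer blocks only when the buffer is full, which mirrors "no need for exit(2) to wait unless the buffer is full".

```c
#include <pthread.h>
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

#define REC_CAP 64                      /* hypothetical: records per buffer */

struct record { int pid; int exitcode; };   /* stand-in for acct_t */

struct acct_buf {
	pthread_mutex_t lock;           /* protects everything below */
	pthread_cond_t nonempty;        /* wakes the writer thread */
	pthread_cond_t nonfull;         /* wakes producers blocked on a full buffer */
	struct record recs[REC_CAP];
	size_t count;                   /* records currently queued */
	size_t written;                 /* records flushed so far */
	bool stopping;
};

static void acct_buf_init(struct acct_buf *b)
{
	pthread_mutex_init(&b->lock, NULL);
	pthread_cond_init(&b->nonempty, NULL);
	pthread_cond_init(&b->nonfull, NULL);
	b->count = 0;
	b->written = 0;
	b->stopping = false;
}

/* Producer side: queue a record and wake the writer instead of writing. */
static void acct_buf_add(struct acct_buf *b, struct record r)
{
	pthread_mutex_lock(&b->lock);
	while (b->count == REC_CAP)     /* only wait when the buffer is full */
		pthread_cond_wait(&b->nonfull, &b->lock);
	b->recs[b->count++] = r;
	pthread_cond_signal(&b->nonempty);
	pthread_mutex_unlock(&b->lock);
}

/* Writer thread: drains the buffer and performs the actual "writes". */
static void *acct_writer(void *arg)
{
	struct acct_buf *b = arg;
	struct record batch[REC_CAP];

	for (;;) {
		size_t n;

		pthread_mutex_lock(&b->lock);
		while (b->count == 0 && !b->stopping)
			pthread_cond_wait(&b->nonempty, &b->lock);
		if (b->count == 0) {    /* stopping and nothing left to flush */
			pthread_mutex_unlock(&b->lock);
			return NULL;
		}
		n = b->count;
		memcpy(batch, b->recs, n * sizeof(batch[0]));
		b->count = 0;
		pthread_cond_broadcast(&b->nonfull);
		pthread_mutex_unlock(&b->lock);

		/* A real writer would append batch[0..n) to the file here. */
		pthread_mutex_lock(&b->lock);
		b->written += n;
		pthread_mutex_unlock(&b->lock);
	}
}

/* Ask the writer to flush whatever is queued and exit, then join it. */
static void acct_buf_stop(struct acct_buf *b, pthread_t writer)
{
	pthread_mutex_lock(&b->lock);
	b->stopping = true;
	pthread_cond_signal(&b->nonempty);
	pthread_mutex_unlock(&b->lock);
	pthread_join(writer, NULL);
}
```

Note that in this sketch a record may still be sitting in the buffer after its producer has returned, so anything that expects the record to be on disk as soon as the process has exited would observe a delay.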
* Re: [PATCH] acct: Prevent NULL pointer dereference when writing to sysfs
2025-02-11 0:23 ` Al Viro
@ 2025-02-11 10:17 ` Christian Brauner
0 siblings, 0 replies; 18+ messages in thread
From: Christian Brauner @ 2025-02-11 10:17 UTC (permalink / raw)
To: Al Viro
Cc: Zicheng Qu, Linus Torvalds, jlayton, axboe, joel.granados, tglx,
linux-kernel, hch, len.brown, pavel, pengfei.xu, rafael,
syzkaller-bugs, linux-pm, tanghui20, zhangqiao22, judy.chenhui
On Tue, Feb 11, 2025 at 12:23:08AM +0000, Al Viro wrote:
> On Mon, Feb 10, 2025 at 06:19:02PM +0000, Al Viro wrote:
> > On Mon, Feb 10, 2025 at 05:02:35PM +0100, Christian Brauner wrote:
> > > On Mon, Feb 10, 2025 at 03:21:46PM +0000, Al Viro wrote:
> > > > On Mon, Feb 10, 2025 at 04:12:54PM +0100, Christian Brauner wrote:
> > > >
> > > > > One fix would be to move exit_fs() past exit_task_work(). It looks like
> > > > > this should be doable without much of a problem and it would fix
> > > > > the path_init() problem.
> > > > >
> > > > > There should hopefully be nothing relying on task->fs == NULL in
> > > > > exit_task_work().
> > > >
> > > > There's a question of the task_work_add() issued by exit_task_fs(),
> > > > though.
> > >
> > > Can't we simply remove the pins on the mounts of fs->root and fs->pwd in
> > > exit_fs() explicitly? If that works I think that's a fair enough
> > > compromise for this shite.
> >
> > I'd rather go for a simpler approach... Why do we need those writes
> > to be done in context of exiting process in the first place? It's
> > not as if they needed to go out before it terminates, so what's to
> > stop us from having a kernel thread in background and queue the data
> > to be written for it to pick up?
> >
> > Does anybody see problems with that approach?
>
> Note, BTW, that games with rlimit and creds switching disappear if done
> that way.
>
> FWIW, I wonder if we should simply allocate a page worth of buffer,
> occupied by acct_t array, with count + pointer to buffer kept in acct,
> with acct->mutex used to protect the entire thing, so that do_acct_process()
> would add a record to that sucker and wake the kthread up, with kthread
> handling actual writes and emptying the buffer. No need for exit(2)
> to wait unless the buffer is full...
I had thought about it but both LTP and the selftests want the buffer to
be filled after the process exits.
So let's not overly complicate this. I want this to be as simple as
possible and then start deprecating this api asap.
I have a patch that just moves the final write into the workqueue. But
let's keep the cred override because who knows what security hole we
open up if we skip the override cred.
* [PATCH 0/2] acct: don't allow access to internal filesystems
2025-02-10 15:12 ` Christian Brauner
2025-02-10 15:21 ` Al Viro
@ 2025-02-11 17:15 ` Christian Brauner
2025-02-11 17:15 ` [PATCH 1/2] acct: perform last write from workqueue Christian Brauner
` (2 more replies)
1 sibling, 3 replies; 18+ messages in thread
From: Christian Brauner @ 2025-02-11 17:15 UTC (permalink / raw)
To: Zicheng Qu, Linus Torvalds
Cc: Christian Brauner, jlayton, axboe, joel.granados, tglx, viro, hch,
len.brown, pavel, pengfei.xu, rafael, tanghui20, zhangqiao22,
judy.chenhui, linux-kernel, linux-fsdevel, syzkaller-bugs,
linux-pm, stable
In [1] it was reported that the acct(2) system call can be used to
trigger a NULL deref in cases where it is set to write to a file that
triggers an internal lookup.
This can e.g., happen when pointing acct(2) to /sys/power/resume. At the
point where the write to this file happens the calling task has
already exited and called exit_fs() but an internal lookup might be
triggered through lookup_bdev(). This may trigger a NULL-deref
when accessing current->fs.
This series does two things:
- Reorganize the code so that the final write happens from the
workqueue but with the caller's credentials. This preserves the
(strange) permission model and has almost no regression risk.
- Block access to kernel internal filesystems as well as procfs and
sysfs in the first place.
This API should cease to exist imho.
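As a rough illustration of the ordering problem and of why moving the write helps, here is a toy user-space model. It is a sketch under stated assumptions, not kernel code: `struct task`, `struct fs_ctx`, `toy_lookup()`, `toy_exit_fs()` and the path string are all hypothetical stand-ins. A return value of -1 marks the spot where the real kernel would dereference a NULL current->fs.

```c
#include <stddef.h>

struct fs_ctx { const char *root; };    /* stand-in for fs_struct */
struct task { struct fs_ctx *fs; };     /* stand-in for task_struct */

/*
 * Toy path lookup: resolving an absolute path needs the caller's fs
 * context. Returns 0 on success and -1 where the real kernel would
 * dereference NULL.
 */
static int toy_lookup(const struct task *t, const char *path)
{
	if (t->fs == NULL)
		return -1;
	return (path != NULL && path[0] == '/') ? 0 : -1;
}

/* Toy exit_fs(): the exiting task drops its fs context. */
static void toy_exit_fs(struct task *t)
{
	t->fs = NULL;
}

/*
 * Problematic ordering: the final accounting write runs in the exiting
 * task after it dropped its fs context, and the write handler performs
 * a path lookup.
 */
static int write_from_exiting_task(struct task *exiting)
{
	toy_exit_fs(exiting);
	return toy_lookup(exiting, "/hypothetical/device");
}

/*
 * Reorganized ordering: the exiting task still drops its fs context,
 * but the write (and any lookup it triggers) runs in a worker that has
 * a valid fs context of its own.
 */
static int write_from_worker(struct task *exiting, const struct task *worker)
{
	toy_exit_fs(exiting);
	return toy_lookup(worker, "/hypothetical/device");
}
```

The model deliberately leaves out credentials; in the series the worker additionally overrides its credentials with those of whoever enabled accounting.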
Link: https://lore.kernel.org/r/20250127091811.3183623-1-quzicheng@huawei.com [1]
Signed-off-by: Christian Brauner <brauner@kernel.org>
---
Christian Brauner (2):
acct: perform last write from workqueue
acct: block access to kernel internal filesystems
kernel/acct.c | 134 ++++++++++++++++++++++++++++++++++++----------------------
1 file changed, 84 insertions(+), 50 deletions(-)
---
base-commit: af69e27b3c8240f7889b6c457d71084458984d8e
change-id: 20250211-work-acct-a6d8e92a5fe0
* [PATCH 1/2] acct: perform last write from workqueue
2025-02-11 17:15 ` [PATCH 0/2] acct: don't allow access to internal filesystems Christian Brauner
@ 2025-02-11 17:15 ` Christian Brauner
2025-02-11 17:16 ` [PATCH 2/2] acct: block access to kernel internal filesystems Christian Brauner
2025-02-11 18:56 ` [PATCH 0/2] acct: don't allow access to " Jeff Layton
2 siblings, 0 replies; 18+ messages in thread
From: Christian Brauner @ 2025-02-11 17:15 UTC (permalink / raw)
To: Zicheng Qu, Linus Torvalds
Cc: Christian Brauner, jlayton, axboe, joel.granados, tglx, viro, hch,
len.brown, pavel, pengfei.xu, rafael, tanghui20, zhangqiao22,
judy.chenhui, linux-kernel, linux-fsdevel, syzkaller-bugs,
linux-pm, stable
In [1] it was reported that the acct(2) system call can be used to
trigger a NULL deref in cases where it is set to write to a file that
triggers an internal lookup. This can e.g., happen when pointing acct(2)
to /sys/power/resume. At the point where the write to this file
happens the calling task has already exited and called exit_fs(). A
lookup will thus trigger a NULL-deref when accessing current->fs.
Reorganize the code so that the final write happens from the
workqueue but with the caller's credentials. This preserves the
(strange) permission model and has almost no regression risk.
This API should cease to exist though.
Reported-by: Zicheng Qu <quzicheng@huawei.com>
Link: https://lore.kernel.org/r/20250127091811.3183623-1-quzicheng@huawei.com [1]
Fixes: 1da177e4c3f4 ("Linux-2.6.12-rc2")
Cc: <stable@vger.kernel.org>
Signed-off-by: Christian Brauner <brauner@kernel.org>
---
kernel/acct.c | 120 ++++++++++++++++++++++++++++++++++------------------------
1 file changed, 70 insertions(+), 50 deletions(-)
diff --git a/kernel/acct.c b/kernel/acct.c
index 31222e8cd534..48283efe8a12 100644
--- a/kernel/acct.c
+++ b/kernel/acct.c
@@ -103,48 +103,50 @@ struct bsd_acct_struct {
atomic_long_t count;
struct rcu_head rcu;
struct mutex lock;
- int active;
+ bool active;
+ bool check_space;
unsigned long needcheck;
struct file *file;
struct pid_namespace *ns;
struct work_struct work;
struct completion done;
+ acct_t ac;
};
-static void do_acct_process(struct bsd_acct_struct *acct);
+static void fill_ac(struct bsd_acct_struct *acct);
+static void acct_write_process(struct bsd_acct_struct *acct);
/*
* Check the amount of free space and suspend/resume accordingly.
*/
-static int check_free_space(struct bsd_acct_struct *acct)
+static bool check_free_space(struct bsd_acct_struct *acct)
{
struct kstatfs sbuf;
- if (time_is_after_jiffies(acct->needcheck))
- goto out;
+ if (!acct->check_space)
+ return acct->active;
/* May block */
if (vfs_statfs(&acct->file->f_path, &sbuf))
- goto out;
+ return acct->active;
if (acct->active) {
u64 suspend = sbuf.f_blocks * SUSPEND;
do_div(suspend, 100);
if (sbuf.f_bavail <= suspend) {
- acct->active = 0;
+ acct->active = false;
pr_info("Process accounting paused\n");
}
} else {
u64 resume = sbuf.f_blocks * RESUME;
do_div(resume, 100);
if (sbuf.f_bavail >= resume) {
- acct->active = 1;
+ acct->active = true;
pr_info("Process accounting resumed\n");
}
}
acct->needcheck = jiffies + ACCT_TIMEOUT*HZ;
-out:
return acct->active;
}
@@ -189,7 +191,11 @@ static void acct_pin_kill(struct fs_pin *pin)
{
struct bsd_acct_struct *acct = to_acct(pin);
mutex_lock(&acct->lock);
- do_acct_process(acct);
+ /*
+ * Fill the accounting struct with the exiting task's info
+ * before punting to the workqueue.
+ */
+ fill_ac(acct);
schedule_work(&acct->work);
wait_for_completion(&acct->done);
cmpxchg(&acct->ns->bacct, pin, NULL);
@@ -202,6 +208,9 @@ static void close_work(struct work_struct *work)
{
struct bsd_acct_struct *acct = container_of(work, struct bsd_acct_struct, work);
struct file *file = acct->file;
+
+ /* We were fired by acct_pin_kill() which holds acct->lock. */
+ acct_write_process(acct);
if (file->f_op->flush)
file->f_op->flush(file, NULL);
__fput_sync(file);
@@ -430,13 +439,27 @@ static u32 encode_float(u64 value)
* do_exit() or when switching to a different output file.
*/
-static void fill_ac(acct_t *ac)
+static void fill_ac(struct bsd_acct_struct *acct)
{
struct pacct_struct *pacct = &current->signal->pacct;
+ struct file *file = acct->file;
+ acct_t *ac = &acct->ac;
u64 elapsed, run_time;
time64_t btime;
struct tty_struct *tty;
+ lockdep_assert_held(&acct->lock);
+
+ if (time_is_after_jiffies(acct->needcheck)) {
+ acct->check_space = false;
+
+ /* Don't fill in @ac if nothing will be written. */
+ if (!acct->active)
+ return;
+ } else {
+ acct->check_space = true;
+ }
+
/*
* Fill the accounting struct with the needed info as recorded
* by the different kernel functions.
@@ -484,64 +507,61 @@ static void fill_ac(acct_t *ac)
ac->ac_majflt = encode_comp_t(pacct->ac_majflt);
ac->ac_exitcode = pacct->ac_exitcode;
spin_unlock_irq(&current->sighand->siglock);
-}
-/*
- * do_acct_process does all actual work. Caller holds the reference to file.
- */
-static void do_acct_process(struct bsd_acct_struct *acct)
-{
- acct_t ac;
- unsigned long flim;
- const struct cred *orig_cred;
- struct file *file = acct->file;
-
- /*
- * Accounting records are not subject to resource limits.
- */
- flim = rlimit(RLIMIT_FSIZE);
- current->signal->rlim[RLIMIT_FSIZE].rlim_cur = RLIM_INFINITY;
- /* Perform file operations on behalf of whoever enabled accounting */
- orig_cred = override_creds(file->f_cred);
- /*
- * First check to see if there is enough free_space to continue
- * the process accounting system.
- */
- if (!check_free_space(acct))
- goto out;
-
- fill_ac(&ac);
/* we really need to bite the bullet and change layout */
- ac.ac_uid = from_kuid_munged(file->f_cred->user_ns, orig_cred->uid);
- ac.ac_gid = from_kgid_munged(file->f_cred->user_ns, orig_cred->gid);
+ ac->ac_uid = from_kuid_munged(file->f_cred->user_ns, current_uid());
+ ac->ac_gid = from_kgid_munged(file->f_cred->user_ns, current_gid());
#if ACCT_VERSION == 1 || ACCT_VERSION == 2
/* backward-compatible 16 bit fields */
- ac.ac_uid16 = ac.ac_uid;
- ac.ac_gid16 = ac.ac_gid;
+ ac->ac_uid16 = ac->ac_uid;
+ ac->ac_gid16 = ac->ac_gid;
#elif ACCT_VERSION == 3
{
struct pid_namespace *ns = acct->ns;
- ac.ac_pid = task_tgid_nr_ns(current, ns);
+ ac->ac_pid = task_tgid_nr_ns(current, ns);
rcu_read_lock();
- ac.ac_ppid = task_tgid_nr_ns(rcu_dereference(current->real_parent),
- ns);
+ ac->ac_ppid = task_tgid_nr_ns(rcu_dereference(current->real_parent), ns);
rcu_read_unlock();
}
#endif
+}
+
+static void acct_write_process(struct bsd_acct_struct *acct)
+{
+ struct file *file = acct->file;
+ const struct cred *cred;
+ acct_t *ac = &acct->ac;
+
+ /* Perform file operations on behalf of whoever enabled accounting */
+ cred = override_creds(file->f_cred);
+
/*
- * Get freeze protection. If the fs is frozen, just skip the write
- * as we could deadlock the system otherwise.
+ * First check to see if there is enough free_space to continue
+ * the process accounting system. Then get freeze protection. If
+ * the fs is frozen, just skip the write as we could deadlock
+ * the system otherwise.
*/
- if (file_start_write_trylock(file)) {
+ if (check_free_space(acct) && file_start_write_trylock(file)) {
/* it's been opened O_APPEND, so position is irrelevant */
loff_t pos = 0;
- __kernel_write(file, &ac, sizeof(acct_t), &pos);
+ __kernel_write(file, ac, sizeof(acct_t), &pos);
file_end_write(file);
}
-out:
+
+ revert_creds(cred);
+}
+
+static void do_acct_process(struct bsd_acct_struct *acct)
+{
+ unsigned long flim;
+
+ /* Accounting records are not subject to resource limits. */
+ flim = rlimit(RLIMIT_FSIZE);
+ current->signal->rlim[RLIMIT_FSIZE].rlim_cur = RLIM_INFINITY;
+ fill_ac(acct);
+ acct_write_process(acct);
current->signal->rlim[RLIMIT_FSIZE].rlim_cur = flim;
- revert_creds(orig_cred);
}
/**
--
2.47.2
* [PATCH 2/2] acct: block access to kernel internal filesystems
2025-02-11 17:15 ` [PATCH 0/2] acct: don't allow access to internal filesystems Christian Brauner
2025-02-11 17:15 ` [PATCH 1/2] acct: perform last write from workqueue Christian Brauner
@ 2025-02-11 17:16 ` Christian Brauner
2025-02-11 20:30 ` Amir Goldstein
2025-02-11 20:54 ` Al Viro
2025-02-11 18:56 ` [PATCH 0/2] acct: don't allow access to " Jeff Layton
2 siblings, 2 replies; 18+ messages in thread
From: Christian Brauner @ 2025-02-11 17:16 UTC (permalink / raw)
To: Zicheng Qu, Linus Torvalds
Cc: Christian Brauner, jlayton, axboe, joel.granados, tglx, viro, hch,
len.brown, pavel, pengfei.xu, rafael, tanghui20, zhangqiao22,
judy.chenhui, linux-kernel, linux-fsdevel, syzkaller-bugs,
linux-pm, stable
There's no point in allowing anything kernel internal, procfs, or
sysfs.
Reported-by: Zicheng Qu <quzicheng@huawei.com>
Link: https://lore.kernel.org/r/20250127091811.3183623-1-quzicheng@huawei.com
Fixes: 1da177e4c3f4 ("Linux-2.6.12-rc2")
Cc: <stable@vger.kernel.org>
Signed-off-by: Christian Brauner <brauner@kernel.org>
---
kernel/acct.c | 14 ++++++++++++++
1 file changed, 14 insertions(+)
diff --git a/kernel/acct.c b/kernel/acct.c
index 48283efe8a12..6520baa13669 100644
--- a/kernel/acct.c
+++ b/kernel/acct.c
@@ -243,6 +243,20 @@ static int acct_on(struct filename *pathname)
return -EACCES;
}
+ /* Exclude kernel internal filesystems. */
+ if (file_inode(file)->i_sb->s_flags & (SB_NOUSER | SB_KERNMOUNT)) {
+ kfree(acct);
+ filp_close(file, NULL);
+ return -EINVAL;
+ }
+
+ /* Exclude procfs and sysfs. */
+ if (file_inode(file)->i_sb->s_iflags & SB_I_USERNS_VISIBLE) {
+ kfree(acct);
+ filp_close(file, NULL);
+ return -EINVAL;
+ }
+
if (!(file->f_mode & FMODE_CAN_WRITE)) {
kfree(acct);
filp_close(file, NULL);
--
2.47.2
* Re: [PATCH 0/2] acct: don't allow access to internal filesystems
2025-02-11 17:15 ` [PATCH 0/2] acct: don't allow access to internal filesystems Christian Brauner
2025-02-11 17:15 ` [PATCH 1/2] acct: perform last write from workqueue Christian Brauner
2025-02-11 17:16 ` [PATCH 2/2] acct: block access to kernel internal filesystems Christian Brauner
@ 2025-02-11 18:56 ` Jeff Layton
2025-02-12 11:16 ` Christian Brauner
2 siblings, 1 reply; 18+ messages in thread
From: Jeff Layton @ 2025-02-11 18:56 UTC (permalink / raw)
To: Christian Brauner, Zicheng Qu, Linus Torvalds
Cc: axboe, joel.granados, tglx, viro, hch, len.brown, pavel,
pengfei.xu, rafael, tanghui20, zhangqiao22, judy.chenhui,
linux-kernel, linux-fsdevel, syzkaller-bugs, linux-pm, stable
On Tue, 2025-02-11 at 18:15 +0100, Christian Brauner wrote:
> In [1] it was reported that the acct(2) system call can be used to
> trigger a NULL deref in cases where it is set to write to a file that
> triggers an internal lookup.
>
> This can e.g., happen when pointing acct(2) to /sys/power/resume. At the
> point where the write to this file happens the calling task has
> already exited and called exit_fs() but an internal lookup might be
> triggered through lookup_bdev(). This may trigger a NULL-deref
> when accessing current->fs.
>
> This series does two things:
>
> - Reorganize the code so that the final write happens from the
> workqueue but with the caller's credentials. This preserves the
> (strange) permission model and has almost no regression risk.
>
> - Block access to kernel internal filesystems as well as procfs and
> sysfs in the first place.
>
> This API should cease to exist imho.
>
I wonder who uses it these days, and what would we suggest they replace
it with? Maybe syscall auditing?
config BSD_PROCESS_ACCT
bool "BSD Process Accounting"
depends on MULTIUSER
help
If you say Y here, a user level program will be able to instruct the
kernel (via a special system call) to write process accounting
information to a file: whenever a process exits, information about
that process will be appended to the file by the kernel. The
information includes things such as creation time, owning user,
command name, memory usage, controlling terminal etc. (the complete
list is in the struct acct in <file:include/linux/acct.h>). It is
up to the user level program to do useful things with this
information. This is generally a good idea, so say Y.
Maybe it's at least time to replace that last sentence and make this default
to 'n'?
> Link: https://lore.kernel.org/r/20250127091811.3183623-1-quzicheng@huawei.com [1]
>
> Signed-off-by: Christian Brauner <brauner@kernel.org>
> ---
> Christian Brauner (2):
> acct: perform last write from workqueue
> acct: block access to kernel internal filesystems
>
> kernel/acct.c | 134 ++++++++++++++++++++++++++++++++++++----------------------
> 1 file changed, 84 insertions(+), 50 deletions(-)
> ---
> base-commit: af69e27b3c8240f7889b6c457d71084458984d8e
> change-id: 20250211-work-acct-a6d8e92a5fe0
>
--
Jeff Layton <jlayton@kernel.org>
* Re: [PATCH 2/2] acct: block access to kernel internal filesystems
2025-02-11 17:16 ` [PATCH 2/2] acct: block access to kernel internal filesystems Christian Brauner
@ 2025-02-11 20:30 ` Amir Goldstein
2025-02-11 20:54 ` Al Viro
1 sibling, 0 replies; 18+ messages in thread
From: Amir Goldstein @ 2025-02-11 20:30 UTC (permalink / raw)
To: Christian Brauner
Cc: Zicheng Qu, Linus Torvalds, jlayton, axboe, joel.granados, tglx,
viro, hch, len.brown, pavel, pengfei.xu, rafael, tanghui20,
zhangqiao22, judy.chenhui, linux-kernel, linux-fsdevel,
syzkaller-bugs, linux-pm, stable
On Tue, Feb 11, 2025 at 6:17 PM Christian Brauner <brauner@kernel.org> wrote:
>
> There's no point in allowing anything kernel internal, procfs, or
> sysfs.
>
> Reported-by: Zicheng Qu <quzicheng@huawei.com>
> Link: https://lore.kernel.org/r/20250127091811.3183623-1-quzicheng@huawei.com
> Fixes: 1da177e4c3f4 ("Linux-2.6.12-rc2")
> Cc: <stable@vger.kernel.org>
> Signed-off-by: Christian Brauner <brauner@kernel.org>
Reviewed-by: Amir Goldstein <amir73il@gmail.com>
> ---
> kernel/acct.c | 14 ++++++++++++++
> 1 file changed, 14 insertions(+)
>
> diff --git a/kernel/acct.c b/kernel/acct.c
> index 48283efe8a12..6520baa13669 100644
> --- a/kernel/acct.c
> +++ b/kernel/acct.c
> @@ -243,6 +243,20 @@ static int acct_on(struct filename *pathname)
> return -EACCES;
> }
>
> + /* Exclude kernel internal filesystems. */
> + if (file_inode(file)->i_sb->s_flags & (SB_NOUSER | SB_KERNMOUNT)) {
> + kfree(acct);
> + filp_close(file, NULL);
> + return -EINVAL;
> + }
> +
> + /* Exclude procfs and sysfs. */
> + if (file_inode(file)->i_sb->s_iflags & SB_I_USERNS_VISIBLE) {
> + kfree(acct);
> + filp_close(file, NULL);
> + return -EINVAL;
> + }
> +
> if (!(file->f_mode & FMODE_CAN_WRITE)) {
> kfree(acct);
> filp_close(file, NULL);
>
> --
> 2.47.2
>
>
* Re: [PATCH 2/2] acct: block access to kernel internal filesystems
2025-02-11 17:16 ` [PATCH 2/2] acct: block access to kernel internal filesystems Christian Brauner
2025-02-11 20:30 ` Amir Goldstein
@ 2025-02-11 20:54 ` Al Viro
2025-02-12 10:32 ` Christian Brauner
1 sibling, 1 reply; 18+ messages in thread
From: Al Viro @ 2025-02-11 20:54 UTC (permalink / raw)
To: Christian Brauner
Cc: Zicheng Qu, Linus Torvalds, jlayton, axboe, joel.granados, tglx,
hch, len.brown, pavel, pengfei.xu, rafael, tanghui20, zhangqiao22,
judy.chenhui, linux-kernel, linux-fsdevel, syzkaller-bugs,
linux-pm, stable
On Tue, Feb 11, 2025 at 06:16:00PM +0100, Christian Brauner wrote:
> There's no point in allowing anything kernel internal, procfs, or
> sysfs.
> + /* Exclude kernel internal filesystems. */
> + if (file_inode(file)->i_sb->s_flags & (SB_NOUSER | SB_KERNMOUNT)) {
> + kfree(acct);
> + filp_close(file, NULL);
> + return -EINVAL;
> + }
> +
> + /* Exclude procfs and sysfs. */
> + if (file_inode(file)->i_sb->s_iflags & SB_I_USERNS_VISIBLE) {
> + kfree(acct);
> + filp_close(file, NULL);
> + return -EINVAL;
> + }
That looks like a really weird way to test it, especially the second
part...
* Re: [PATCH 2/2] acct: block access to kernel internal filesystems
2025-02-11 20:54 ` Al Viro
@ 2025-02-12 10:32 ` Christian Brauner
0 siblings, 0 replies; 18+ messages in thread
From: Christian Brauner @ 2025-02-12 10:32 UTC (permalink / raw)
To: Al Viro
Cc: Zicheng Qu, Linus Torvalds, jlayton, axboe, joel.granados, tglx,
hch, len.brown, pavel, pengfei.xu, rafael, tanghui20, zhangqiao22,
judy.chenhui, linux-kernel, linux-fsdevel, syzkaller-bugs,
linux-pm, stable
On Tue, Feb 11, 2025 at 08:54:18PM +0000, Al Viro wrote:
> On Tue, Feb 11, 2025 at 06:16:00PM +0100, Christian Brauner wrote:
> > There's no point in allowing anything kernel internal, procfs, or
> > sysfs.
>
> > + /* Exclude kernel internal filesystems. */
> > + if (file_inode(file)->i_sb->s_flags & (SB_NOUSER | SB_KERNMOUNT)) {
> > + kfree(acct);
> > + filp_close(file, NULL);
> > + return -EINVAL;
> > + }
> > +
> > + /* Exclude procfs and sysfs. */
> > + if (file_inode(file)->i_sb->s_iflags & SB_I_USERNS_VISIBLE) {
> > + kfree(acct);
> > + filp_close(file, NULL);
> > + return -EINVAL;
> > + }
>
> That looks like a really weird way to test it, especially the second
> part...
SB_I_USERNS_VISIBLE has only ever applied to procfs and sysfs.
Granted, its main purpose is to indicate that a caller in an
unprivileged userns might have a restricted view of sysfs/procfs already
so mounting it again must be prevented to not reveal any overmounted
entities (a strong candidate for the prize of least transparent cause of
EPERMs from the kernel imho).
That flag could reasonably go and be replaced by explicit checks for
procfs and sysfs in general because we haven't ever grown any additional
candidates for that mess and it's unlikely that we ever will. But as
long as we have this I don't mind using it. If it's important to you
I'll happily change it. If you can live with the comment I added I'll
leave it.
To be perfectly blunt: Imho, this api isn't worth massaging a single
line of VFS code which is why this isn't going to win the prize of
prettiest fix of a NULL-deref.
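For readers following the two checks in the quoted hunk, here is a minimal userspace model of the decision they make. It is only a sketch: the MODEL_* constants, struct model_sb and model_acct_target_allowed() are hypothetical names invented for illustration, and the flag values are placeholders rather than the kernel's real bit assignments.

```c
#include <stdbool.h>

/*
 * Placeholder stand-ins for SB_KERNMOUNT, SB_NOUSER and
 * SB_I_USERNS_VISIBLE. The values are arbitrary and chosen for this
 * model only; they are not the kernel's definitions.
 */
#define MODEL_SB_KERNMOUNT        (1UL << 0)
#define MODEL_SB_NOUSER           (1UL << 1)
#define MODEL_SB_I_USERNS_VISIBLE (1UL << 0)

/* Hypothetical reduction of a superblock to the two fields tested. */
struct model_sb {
	unsigned long s_flags;   /* models sb->s_flags */
	unsigned long s_iflags;  /* models sb->s_iflags */
};

/* Mirrors the order of the two rejections in the quoted hunk. */
static bool model_acct_target_allowed(const struct model_sb *sb)
{
	/* Kernel internal filesystems are rejected first. */
	if (sb->s_flags & (MODEL_SB_NOUSER | MODEL_SB_KERNMOUNT))
		return false;

	/* Then anything flagged the way procfs and sysfs are. */
	if (sb->s_iflags & MODEL_SB_I_USERNS_VISIBLE)
		return false;

	return true;
}
```

In the patch itself both rejections also free acct, close the file and return -EINVAL; the model leaves that cleanup out and keeps only the flag logic.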
* Re: [PATCH 0/2] acct: don't allow access to internal filesystems
2025-02-11 18:56 ` [PATCH 0/2] acct: don't allow access to " Jeff Layton
@ 2025-02-12 11:16 ` Christian Brauner
2025-02-13 14:56 ` Christian Brauner
0 siblings, 1 reply; 18+ messages in thread
From: Christian Brauner @ 2025-02-12 11:16 UTC (permalink / raw)
To: Jeff Layton
Cc: Zicheng Qu, Linus Torvalds, axboe, joel.granados, tglx, viro, hch,
len.brown, pavel, pengfei.xu, rafael, tanghui20, zhangqiao22,
judy.chenhui, linux-kernel, linux-fsdevel, syzkaller-bugs,
linux-pm, stable
On Tue, Feb 11, 2025 at 01:56:41PM -0500, Jeff Layton wrote:
> On Tue, 2025-02-11 at 18:15 +0100, Christian Brauner wrote:
> > In [1] it was reported that the acct(2) system call can be used to
> > trigger a NULL deref in cases where it is set to write to a file that
> > triggers an internal lookup.
> >
> > This can happen, e.g., when pointing acct(2) to /sys/power/resume. At the
> > point where the write to this file happens the calling task has
> > already exited and called exit_fs() but an internal lookup might be
> > triggered through lookup_bdev(). This may trigger a NULL-deref
> > when accessing current->fs.
> >
> > This series does two things:
> >
> > - Reorganize the code so that the final write happens from the
> > workqueue but with the caller's credentials. This preserves the
> > (strange) permission model and has almost no regression risk.
> >
> > - Block access to kernel internal filesystems as well as procfs and
> > sysfs in the first place.
> >
> > This api should stop to exist imho.
> >
>
> I wonder who uses it these days, and what would we suggest they replace
> it with? Maybe syscall auditing?
Someone pointed me to atop but that also works without it. Since this is
a privileged api I think the natural candidate to replace all of this is
bpf. I'm pretty sure that it's relatively straightforward to get a lot
more information out of it than with acct(2) and it will probably be
more performant too.
Without any limitations, as it is right now, acct(2) can quite easily be
used to lock up the system by pointing it at various things in sysfs and
I'm sure it can be abused in other ways. So I wouldn't enable it.
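As a rough userspace illustration of what an explicit "is this procfs or sysfs?" test looks like, the sketch below compares the filesystem magic reported by statfs(2) against PROC_SUPER_MAGIC and SYSFS_MAGIC from <linux/magic.h>. The helper names magic_is_proc_or_sysfs() and path_on_proc_or_sysfs() are hypothetical, and this is not how the kernel-side patch performs the check; it only shows how a privileged caller could vet a path before handing it to acct(2).

```c
#include <stdbool.h>
#include <sys/vfs.h>      /* statfs(2), struct statfs */
#include <linux/magic.h>  /* PROC_SUPER_MAGIC, SYSFS_MAGIC */

/* True if the filesystem magic belongs to procfs or sysfs. */
static bool magic_is_proc_or_sysfs(long f_type)
{
	return f_type == PROC_SUPER_MAGIC || f_type == SYSFS_MAGIC;
}

/*
 * Hypothetical helper: returns 1 if path lives on procfs or sysfs,
 * 0 if it lives elsewhere, and -1 if statfs() fails.
 */
static int path_on_proc_or_sysfs(const char *path)
{
	struct statfs st;

	if (statfs(path, &st) != 0)
		return -1;

	return magic_is_proc_or_sysfs((long)st.f_type) ? 1 : 0;
}
```

A caller would treat both 1 and -1 as reasons not to pass the path on. Such a pre-check is advisory only, since the file can change between the check and the acct(2) call.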
>
> config BSD_PROCESS_ACCT
> bool "BSD Process Accounting"
> depends on MULTIUSER
> help
> If you say Y here, a user level program will be able to instruct the
> kernel (via a special system call) to write process accounting
> information to a file: whenever a process exits, information about
> that process will be appended to the file by the kernel. The
> information includes things such as creation time, owning user,
> command name, memory usage, controlling terminal etc. (the complete
> list is in the struct acct in <file:include/linux/acct.h>). It is
> up to the user level program to do useful things with this
> information. This is generally a good idea, so say Y.
>
> Maybe at least time to replace that last sentence and make this default
> to 'n'?
I agree.
>
> > Link: https://lore.kernel.org/r/20250127091811.3183623-1-quzicheng@huawei.com [1]
> >
> > Signed-off-by: Christian Brauner <brauner@kernel.org>
> > ---
> > Christian Brauner (2):
> > acct: perform last write from workqueue
> > acct: block access to kernel internal filesystems
> >
> > kernel/acct.c | 134 ++++++++++++++++++++++++++++++++++++----------------------
> > 1 file changed, 84 insertions(+), 50 deletions(-)
> > ---
> > base-commit: af69e27b3c8240f7889b6c457d71084458984d8e
> > change-id: 20250211-work-acct-a6d8e92a5fe0
> >
>
> --
> Jeff Layton <jlayton@kernel.org>
* Re: [PATCH 0/2] acct: don't allow access to internal filesystems
2025-02-12 11:16 ` Christian Brauner
@ 2025-02-13 14:56 ` Christian Brauner
0 siblings, 0 replies; 18+ messages in thread
From: Christian Brauner @ 2025-02-13 14:56 UTC (permalink / raw)
To: Jeff Layton
Cc: Zicheng Qu, Linus Torvalds, axboe, joel.granados, tglx, viro, hch,
len.brown, pavel, pengfei.xu, rafael, tanghui20, zhangqiao22,
judy.chenhui, linux-kernel, linux-fsdevel, syzkaller-bugs,
linux-pm, stable
On Wed, Feb 12, 2025 at 12:16:44PM +0100, Christian Brauner wrote:
> On Tue, Feb 11, 2025 at 01:56:41PM -0500, Jeff Layton wrote:
> > On Tue, 2025-02-11 at 18:15 +0100, Christian Brauner wrote:
> > > In [1] it was reported that the acct(2) system call can be used to
> > > trigger a NULL deref in cases where it is set to write to a file that
> > > triggers an internal lookup.
> > >
> > > This can happen, e.g., when pointing acct(2) to /sys/power/resume. At the
> > > point where the write to this file happens the calling task has
> > > already exited and called exit_fs() but an internal lookup might be
> > > triggered through lookup_bdev(). This may trigger a NULL-deref
> > > when accessing current->fs.
> > >
> > > This series does two things:
> > >
> > > - Reorganize the code so that the final write happens from the
> > > workqueue but with the caller's credentials. This preserves the
> > > (strange) permission model and has almost no regression risk.
> > >
> > > - Block access to kernel internal filesystems as well as procfs and
> > > sysfs in the first place.
> > >
> > > This api should stop to exist imho.
> > >
> >
> > I wonder who uses it these days, and what would we suggest they replace
> > it with? Maybe syscall auditing?
>
> Someone pointed me to atop but that also works without it. Since this is
> a privileged api I think the natural candidate to replace all of this is
> bpf. I'm pretty sure that it's relatively straightforward to get a lot
> more information out of it than with acct(2) and it will probably be
> more performant too.
>
> Without any limitations, as it is right now, acct(2) can quite easily be
> used to lock up the system by pointing it at various things in sysfs and
> I'm sure it can be abused in other ways. So I wouldn't enable it.
And I totally forgot about taskstats via Netlink:
https://www.kernel.org/doc/Documentation/accounting/taskstats.txt
include/uapi/linux/taskstats.h
Thread overview: 18+ messages
2024-08-10 14:04 [Syzkaller & bisect] There is general protection fault in path_init in v6.11-rc2 Pengfei Xu
2025-01-27 9:18 ` Zicheng Qu
2025-02-10 13:17 ` [PATCH] acct: Prevent NULL pointer dereference when writing to sysfs Zicheng Qu
2025-02-10 15:12 ` Christian Brauner
2025-02-10 15:21 ` Al Viro
2025-02-10 16:02 ` Christian Brauner
2025-02-10 18:19 ` Al Viro
2025-02-11 0:23 ` Al Viro
2025-02-11 10:17 ` Christian Brauner
2025-02-11 17:15 ` [PATCH 0/2] acct: don't allow access to internal filesystems Christian Brauner
2025-02-11 17:15 ` [PATCH 1/2] acct: perform last write from workqueue Christian Brauner
2025-02-11 17:16 ` [PATCH 2/2] acct: block access to kernel internal filesystems Christian Brauner
2025-02-11 20:30 ` Amir Goldstein
2025-02-11 20:54 ` Al Viro
2025-02-12 10:32 ` Christian Brauner
2025-02-11 18:56 ` [PATCH 0/2] acct: don't allow access to " Jeff Layton
2025-02-12 11:16 ` Christian Brauner
2025-02-13 14:56 ` Christian Brauner