* sched, timers: use after free in __lock_task_sighand when exiting a process @ 2014-07-13 21:51 Sasha Levin 2014-07-13 23:45 ` Sasha Levin 0 siblings, 1 reply; 17+ messages in thread From: Sasha Levin @ 2014-07-13 21:51 UTC (permalink / raw) To: Peter Zijlstra, Ingo Molnar, John Stultz, Thomas Gleixner, Frederic Weisbecker Cc: LKML, Dave Jones, Andrey Ryabinin Hi all, While fuzzing with trinity inside a KVM tools guest running the latest -next kernel with the KASAN patchset, I've stumbled on the following spew: [ 4838.503887] ================================================================== [ 4838.510906] AddressSanitizer: use after free in do_raw_spin_lock+0x157/0x240 at addr ffff8800ac3af608 [ 4838.513433] page:ffffea0002b0ebc0 count:0 mapcount:0 mapping: (null) index:0x0 [ 4838.513433] page flags: 0x1fffff80008000(tail) [ 4838.513433] page dumped because: kasan error [ 4838.513433] CPU: 14 PID: 19619 Comm: trinity-c52 Tainted: G W 3.16.0-rc4-next-20140711-sasha-00046-g07d3099-dirty #813 [ 4838.513433] 00000000000000fb 0000000000000000 ffffea0002b0ebc0 ffff8805f2e03bc8 [ 4838.513433] ffffffff9ee47068 ffff8805f2e03c98 ffff8805f2e03c88 ffffffff9a426f5c [ 4838.513433] ffff8805f2fe2d80 ffff8805f2fe2d90 ffff8805f2fe2d80 ffff8805f2fe2d88 [ 4838.513433] Call Trace: [ 4838.513433] <IRQ> dump_stack (lib/dump_stack.c:52) [ 4838.513433] kasan_report_error (mm/kasan/report.c:98 mm/kasan/report.c:166) [ 4838.513433] __asan_load4 (mm/kasan/kasan.c:358) [ 4838.513433] do_raw_spin_lock (./arch/x86/include/asm/spinlock.h:108 kernel/locking/spinlock_debug.c:112 kernel/locking/spinlock_debug.c:137) [ 4838.513433] _raw_spin_lock (include/linux/spinlock_api_smp.h:143 kernel/locking/spinlock.c:151) [ 4838.513433] __lock_task_sighand (kernel/signal.c:1280) [ 4838.513433] run_posix_cpu_timers (kernel/time/posix-cpu-timers.c:1162) [ 4838.513433] update_process_times (kernel/time/timer.c:1392) [ 4838.513433] tick_sched_timer (kernel/time/tick-sched.c:152 kernel/time/tick-sched.c:1096) [ 
4838.513433] __run_hrtimer (kernel/time/hrtimer.c:1272 (discriminator 3)) [ 4838.513433] hrtimer_interrupt (kernel/time/hrtimer.c:1361) [ 4838.513433] local_apic_timer_interrupt (arch/x86/kernel/apic/apic.c:921) [ 4838.513433] smp_apic_timer_interrupt (./arch/x86/include/asm/apic.h:696 arch/x86/kernel/apic/apic.c:945) [ 4838.513433] apic_timer_interrupt (arch/x86/kernel/entry_64.S:1020) [ 4838.513433] <EOI> ? _raw_spin_unlock_irqrestore (./arch/x86/include/asm/paravirt.h:809 include/linux/spinlock_api_smp.h:160 kernel/locking/spinlock.c:191) [ 4838.513433] __slab_free (include/linux/spinlock.h:358 mm/slub.c:2565) [ 4838.513433] kmem_cache_free (mm/slub.c:2704 mm/slub.c:2714) [ 4838.513433] put_io_context (block/blk-ioc.c:154) [ 4838.513433] put_io_context_active (block/blk-ioc.c:197) [ 4838.513433] exit_io_context (block/blk-ioc.c:211) [ 4838.513433] do_exit (kernel/exit.c:801) [ 4838.513433] do_group_exit (./arch/x86/include/asm/current.h:14 kernel/exit.c:870) [ 4838.513433] get_signal_to_deliver (kernel/signal.c:2351) [ 4838.513433] do_signal (arch/x86/kernel/signal.c:698) [ 4838.513433] do_notify_resume (arch/x86/kernel/signal.c:751) [ 4838.513433] int_signal (arch/x86/kernel/entry_64.S:600) [ 4838.513433] Read of size 4 by thread T19619: [ 4838.513433] Memory state around the buggy address: [ 4838.513433] ffff8800ac3af380: fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb [ 4838.513433] ffff8800ac3af400: fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb [ 4838.513433] ffff8800ac3af480: fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb [ 4838.513433] ffff8800ac3af500: fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb [ 4838.513433] ffff8800ac3af580: fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb [ 4838.513433] >ffff8800ac3af600: fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb [ 4838.513433] ^ [ 4838.513433] ffff8800ac3af680: fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb [ 4838.513433] ffff8800ac3af700: fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb [ 4838.513433] 
ffff8800ac3af780: fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb [ 4838.513433] ffff8800ac3af800: fd fd fd fd fd fd fd fd fd fd fd fd fd fd fd fd [ 4838.513433] ffff8800ac3af880: fd fd fd fd fd fd fd fd fd fd fd fd fd fd fd fd [ 4838.513433] ================================================================== It looks like a race between an exiting process and an external access to its 'sighand'. Thanks, Sasha ^ permalink raw reply [flat|nested] 17+ messages in thread
* Re: sched, timers: use after free in __lock_task_sighand when exiting a process 2014-07-13 21:51 sched, timers: use after free in __lock_task_sighand when exiting a process Sasha Levin @ 2014-07-13 23:45 ` Sasha Levin 2014-07-14 9:04 ` Peter Zijlstra 0 siblings, 1 reply; 17+ messages in thread From: Sasha Levin @ 2014-07-13 23:45 UTC (permalink / raw) To: Peter Zijlstra, Ingo Molnar, John Stultz, Thomas Gleixner, Frederic Weisbecker Cc: LKML, Dave Jones, Andrey Ryabinin On 07/13/2014 05:51 PM, Sasha Levin wrote: > Hi all, > > While fuzzing with trinity inside a KVM tools guest running the latest -next > kernel with the KASAN patchset, I've stumbled on the following spew: Alrighty, I don't think it has anything to do with timers: [ 876.319044] ================================================================== [ 876.319044] AddressSanitizer: use after free in do_raw_spin_unlock+0x4b/0x1a0 at addr ffff8803e48cec18 [ 876.319044] page:ffffea000f923380 count:0 mapcount:0 mapping: (null) index:0x0 [ 876.319044] page flags: 0x2fffff80008000(tail) [ 876.319044] page dumped because: kasan error [ 876.319044] CPU: 26 PID: 8749 Comm: trinity-watchdo Tainted: G W 3.16.0-rc4-next-20140711-sasha-00046-g07d3099-dirty #817 [ 876.319044] 00000000000000fb 0000000000000000 ffffea000f923380 ffff8805c417fc70 [ 876.319044] ffffffff9de47068 ffff8805c417fd40 ffff8805c417fd30 ffffffff99426f5c [ 876.319044] 0000000000000010 0000000000000000 ffff8805c417fc9d 66666620000000a8 [ 876.319044] Call Trace: [ 876.319044] dump_stack (lib/dump_stack.c:52) [ 876.319044] kasan_report_error (mm/kasan/report.c:98 mm/kasan/report.c:166) [ 876.319044] ? debug_smp_processor_id (lib/smp_processor_id.c:57) [ 876.319044] ? check_chain_key (kernel/locking/lockdep.c:2188) [ 876.319044] __asan_load8 (mm/kasan/kasan.c:364) [ 876.319044] ? 
do_raw_spin_unlock (./arch/x86/include/asm/current.h:14 kernel/locking/spinlock_debug.c:99 kernel/locking/spinlock_debug.c:158) [ 876.319044] do_raw_spin_unlock (./arch/x86/include/asm/current.h:14 kernel/locking/spinlock_debug.c:99 kernel/locking/spinlock_debug.c:158) [ 876.319044] _raw_spin_unlock (include/linux/spinlock_api_smp.h:152 kernel/locking/spinlock.c:183) [ 876.319044] __lock_task_sighand (include/linux/rcupdate.h:858 kernel/signal.c:1285) [ 876.319044] ? __lock_task_sighand (./arch/x86/include/asm/paravirt.h:814 ./arch/x86/include/asm/paravirt.h:827 kernel/signal.c:1270) [ 876.319044] do_send_sig_info (kernel/signal.c:1191) [ 876.319044] group_send_sig_info (kernel/signal.c:1304) [ 876.319044] ? group_send_sig_info (kernel/signal.c:1296) [ 876.319044] kill_pid_info (kernel/signal.c:1339) [ 876.319044] ? kill_pid_info (kernel/signal.c:1330) [ 876.319044] SYSC_kill (kernel/signal.c:1423 kernel/signal.c:2900) [ 876.319044] ? SYSC_kill (include/linux/rcupdate.h:806 kernel/signal.c:1422 kernel/signal.c:2900) [ 876.319044] ? trace_hardirqs_on_caller (kernel/locking/lockdep.c:2600) [ 876.319044] ? syscall_trace_enter (arch/x86/kernel/ptrace.c:1500 (discriminator 2)) [ 876.319044] ? 
trace_hardirqs_on_caller (kernel/locking/lockdep.c:2600) [ 876.319044] SyS_kill (kernel/signal.c:2890) [ 876.319044] tracesys (arch/x86/kernel/entry_64.S:542) [ 876.319044] Read of size 8 by thread T8749: [ 876.319044] Memory state around the buggy address: [ 876.319044] ffff8803e48ce980: fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb [ 876.319044] ffff8803e48cea00: fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb [ 876.319044] ffff8803e48cea80: fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb [ 876.319044] ffff8803e48ceb00: fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb [ 876.319044] ffff8803e48ceb80: fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb [ 876.319044] >ffff8803e48cec00: fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb [ 876.319044] ^ [ 876.319044] ffff8803e48cec80: fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb [ 876.319044] ffff8803e48ced00: fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb [ 876.319044] ffff8803e48ced80: fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb [ 876.319044] ffff8803e48cee00: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 [ 876.319044] ffff8803e48cee80: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 [ 876.319044] ================================================================== Thanks, Sasha > Sasha > ^ permalink raw reply [flat|nested] 17+ messages in thread
* Re: sched, timers: use after free in __lock_task_sighand when exiting a process 2014-07-13 23:45 ` Sasha Levin @ 2014-07-14 9:04 ` Peter Zijlstra 2014-07-14 9:34 ` Andrey Ryabinin 2014-07-14 14:49 ` Oleg Nesterov 0 siblings, 2 replies; 17+ messages in thread From: Peter Zijlstra @ 2014-07-14 9:04 UTC (permalink / raw) To: Sasha Levin Cc: Ingo Molnar, John Stultz, Thomas Gleixner, Frederic Weisbecker, LKML, Dave Jones, Andrey Ryabinin, Oleg Nesterov [-- Attachment #1: Type: text/plain, Size: 2297 bytes --] On Sun, Jul 13, 2014 at 07:45:56PM -0400, Sasha Levin wrote: > On 07/13/2014 05:51 PM, Sasha Levin wrote: > > Hi all, > > > > While fuzzing with trinity inside a KVM tools guest running the latest -next > > kernel with the KASAN patchset, I've stumbled on the following spew: WTH is a KASAN? > Alrighty, I don't think it has anything to do with timers: > > [ 876.319044] ================================================================== > [ 876.319044] AddressSanitizer: use after free in do_raw_spin_unlock+0x4b/0x1a0 at addr ffff8803e48cec18 > [ 876.319044] page:ffffea000f923380 count:0 mapcount:0 mapping: (null) index:0x0 > [ 876.319044] page flags: 0x2fffff80008000(tail) > [ 876.319044] page dumped because: kasan error > [ 876.319044] CPU: 26 PID: 8749 Comm: trinity-watchdo Tainted: G W 3.16.0-rc4-next-20140711-sasha-00046-g07d3099-dirty #817 > [ 876.319044] 00000000000000fb 0000000000000000 ffffea000f923380 ffff8805c417fc70 > [ 876.319044] ffffffff9de47068 ffff8805c417fd40 ffff8805c417fd30 ffffffff99426f5c > [ 876.319044] 0000000000000010 0000000000000000 ffff8805c417fc9d 66666620000000a8 > [ 876.319044] Call Trace: > [ 876.319044] dump_stack (lib/dump_stack.c:52) > [ 876.319044] kasan_report_error (mm/kasan/report.c:98 mm/kasan/report.c:166) > [ 876.319044] __asan_load8 (mm/kasan/kasan.c:364) > [ 876.319044] do_raw_spin_unlock (./arch/x86/include/asm/current.h:14 kernel/locking/spinlock_debug.c:99 kernel/locking/spinlock_debug.c:158) > [ 876.319044] 
_raw_spin_unlock (include/linux/spinlock_api_smp.h:152 kernel/locking/spinlock.c:183) > [ 876.319044] __lock_task_sighand (include/linux/rcupdate.h:858 kernel/signal.c:1285) > [ 876.319044] do_send_sig_info (kernel/signal.c:1191) > [ 876.319044] group_send_sig_info (kernel/signal.c:1304) > [ 876.319044] kill_pid_info (kernel/signal.c:1339) > [ 876.319044] SYSC_kill (kernel/signal.c:1423 kernel/signal.c:2900) Oleg, what guarantees the RCU free of task-struct and sighand? The only RCU I can find is delayed_put_task_struct() but that's not often used. TASK_DEAD etc. use regular put_task_struct() and that doesn't seem to involve RCU. Could be I'm a moron, it's Monday, and it's relatively early, so I'll go have another look.. [-- Attachment #2: Type: application/pgp-signature, Size: 836 bytes --] ^ permalink raw reply [flat|nested] 17+ messages in thread
* Re: sched, timers: use after free in __lock_task_sighand when exiting a process 2014-07-14 9:04 ` Peter Zijlstra @ 2014-07-14 9:34 ` Andrey Ryabinin 2014-07-14 9:58 ` Peter Zijlstra 2014-07-14 14:49 ` Oleg Nesterov 1 sibling, 1 reply; 17+ messages in thread From: Andrey Ryabinin @ 2014-07-14 9:34 UTC (permalink / raw) To: Peter Zijlstra, Sasha Levin Cc: Ingo Molnar, John Stultz, Thomas Gleixner, Frederic Weisbecker, LKML, Dave Jones, Oleg Nesterov On 07/14/14 13:04, Peter Zijlstra wrote: > On Sun, Jul 13, 2014 at 07:45:56PM -0400, Sasha Levin wrote: >> On 07/13/2014 05:51 PM, Sasha Levin wrote: >>> Hi all, >>> >>> While fuzzing with trinity inside a KVM tools guest running the latest -next >>> kernel with the KASAN patchset, I've stumbled on the following spew: > > WTH is a KASAN? > It's a dynamic memory checker; it detects use-after-free and out-of-bounds accesses - https://lkml.org/lkml/2014/7/9/990 ^ permalink raw reply [flat|nested] 17+ messages in thread
* Re: sched, timers: use after free in __lock_task_sighand when exiting a process 2014-07-14 9:34 ` Andrey Ryabinin @ 2014-07-14 9:58 ` Peter Zijlstra 2014-07-14 10:25 ` Andrey Ryabinin 0 siblings, 1 reply; 17+ messages in thread From: Peter Zijlstra @ 2014-07-14 9:58 UTC (permalink / raw) To: Andrey Ryabinin Cc: Sasha Levin, Ingo Molnar, John Stultz, Thomas Gleixner, Frederic Weisbecker, LKML, Dave Jones, Oleg Nesterov [-- Attachment #1: Type: text/plain, Size: 660 bytes --] On Mon, Jul 14, 2014 at 01:34:40PM +0400, Andrey Ryabinin wrote: > On 07/14/14 13:04, Peter Zijlstra wrote: > > On Sun, Jul 13, 2014 at 07:45:56PM -0400, Sasha Levin wrote: > >> On 07/13/2014 05:51 PM, Sasha Levin wrote: > >>> Hi all, > >>> > >>> While fuzzing with trinity inside a KVM tools guest running the latest -next > >>> kernel with the KASAN patchset, I've stumbled on the following spew: > > > > WTH is a KASAN? > > > > It's dynamic memory checker, detects use after free, out of bounds accesses - https://lkml.org/lkml/2014/7/9/990 How is it better/different from the existing use after free stuff? That email fails to tell me. [-- Attachment #2: Type: application/pgp-signature, Size: 836 bytes --] ^ permalink raw reply [flat|nested] 17+ messages in thread
* Re: sched, timers: use after free in __lock_task_sighand when exiting a process 2014-07-14 9:58 ` Peter Zijlstra @ 2014-07-14 10:25 ` Andrey Ryabinin 0 siblings, 0 replies; 17+ messages in thread From: Andrey Ryabinin @ 2014-07-14 10:25 UTC (permalink / raw) To: Peter Zijlstra Cc: Sasha Levin, Ingo Molnar, John Stultz, Thomas Gleixner, Frederic Weisbecker, LKML, Dave Jones, Oleg Nesterov On 07/14/14 13:58, Peter Zijlstra wrote: > On Mon, Jul 14, 2014 at 01:34:40PM +0400, Andrey Ryabinin wrote: >> On 07/14/14 13:04, Peter Zijlstra wrote: >>> On Sun, Jul 13, 2014 at 07:45:56PM -0400, Sasha Levin wrote: >>>> On 07/13/2014 05:51 PM, Sasha Levin wrote: >>>>> Hi all, >>>>> >>>>> While fuzzing with trinity inside a KVM tools guest running the latest -next >>>>> kernel with the KASAN patchset, I've stumbled on the following spew: >>> >>> WTH is a KASAN? >>> >> >> It's dynamic memory checker, detects use after free, out of bounds accesses - https://lkml.org/lkml/2014/7/9/990 > > How is it better/different from the existing use after free stuff? That > email fails to tell me. > DEBUG_PAGEALLOC works at page granularity; KASAN can do checks at sub-page granularity (for slab objects). SLUB_DEBUG (poisoning, etc.) doesn't guarantee that you will catch a use-after-free: for example, if you only read and check a single bit in an already-freed object, it goes unnoticed, while KASAN will catch it. KASAN is very similar to kmemcheck, but definitely faster. The one thing kmemcheck can catch that KASAN currently can't is reads of uninitialized memory. ^ permalink raw reply [flat|nested] 17+ messages in thread
* Re: sched, timers: use after free in __lock_task_sighand when exiting a process 2014-07-14 9:04 ` Peter Zijlstra 2014-07-14 9:34 ` Andrey Ryabinin @ 2014-07-14 14:49 ` Oleg Nesterov 2014-07-14 15:13 ` Oleg Nesterov ` (3 more replies) 1 sibling, 4 replies; 17+ messages in thread From: Oleg Nesterov @ 2014-07-14 14:49 UTC (permalink / raw) To: Peter Zijlstra Cc: Sasha Levin, Ingo Molnar, John Stultz, Thomas Gleixner, Frederic Weisbecker, LKML, Dave Jones, Andrey Ryabinin On 07/14, Peter Zijlstra wrote: > > On Sun, Jul 13, 2014 at 07:45:56PM -0400, Sasha Levin wrote: > > > > [ 876.319044] ================================================================== > > [ 876.319044] AddressSanitizer: use after free in do_raw_spin_unlock+0x4b/0x1a0 at addr ffff8803e48cec18 > > [ 876.319044] page:ffffea000f923380 count:0 mapcount:0 mapping: (null) index:0x0 > > [ 876.319044] page flags: 0x2fffff80008000(tail) > > [ 876.319044] page dumped because: kasan error > > [ 876.319044] CPU: 26 PID: 8749 Comm: trinity-watchdo Tainted: G W 3.16.0-rc4-next-20140711-sasha-00046-g07d3099-dirty #817 > > [ 876.319044] 00000000000000fb 0000000000000000 ffffea000f923380 ffff8805c417fc70 > > [ 876.319044] ffffffff9de47068 ffff8805c417fd40 ffff8805c417fd30 ffffffff99426f5c > > [ 876.319044] 0000000000000010 0000000000000000 ffff8805c417fc9d 66666620000000a8 > > [ 876.319044] Call Trace: > > [ 876.319044] dump_stack (lib/dump_stack.c:52) > > [ 876.319044] kasan_report_error (mm/kasan/report.c:98 mm/kasan/report.c:166) > > [ 876.319044] __asan_load8 (mm/kasan/kasan.c:364) > > [ 876.319044] do_raw_spin_unlock (./arch/x86/include/asm/current.h:14 kernel/locking/spinlock_debug.c:99 kernel/locking/spinlock_debug.c:158) > > [ 876.319044] _raw_spin_unlock (include/linux/spinlock_api_smp.h:152 kernel/locking/spinlock.c:183) > > [ 876.319044] __lock_task_sighand (include/linux/rcupdate.h:858 kernel/signal.c:1285) > > [ 876.319044] do_send_sig_info (kernel/signal.c:1191) > > [ 876.319044] group_send_sig_info 
(kernel/signal.c:1304) > > [ 876.319044] kill_pid_info (kernel/signal.c:1339) > > [ 876.319044] SYSC_kill (kernel/signal.c:1423 kernel/signal.c:2900) Looks like a false alarm at first glance... > Oleg, what guarantees the RCU free of task-struct and sighand? > The only RCU I can find is delayed_put_task_struct() but that's not > often used. Yes, usually the code uses put_task_struct(). delayed_put_task_struct() acts almost like "if (dec_and_test(usage)) kfree_rcu()", but allows you to use get_task_struct() if you observe this task under rcu_read_lock(). Say, rcu_read_lock(); task = find_task_by_vpid(...); if (task) get_task_struct(task); rcu_read_unlock(); If release_task() used dec_and_test + kfree_rcu, the code above could not work. > TASK_DEAD etc. use regular put_task_struct() and that > doesn't seem to involve RCU. Yes, the task itself (or, depending on pov, the scheduler) has a reference. copy_process() does /* * One for us, one for whoever does the "release_task()" (usually * parent) */ atomic_set(&tsk->usage, 2); "us" actually means that put_task_struct(TASK_DEAD). As for ->sighand, note that sighand_cachep is SLAB_DESTROY_BY_RCU. So this memory is RCU-freed in the sense that it can't be returned to the system, but it can be reused by another task. This is fine; lock_task_sighand() rechecks sighand == task->sighand under ->siglock. So perhaps this tool misinterprets kmem_cache_free(sighand_cachep) as a use after free? We are going to add some comments to lock_task_sighand(). And clean it up; it can look much simpler. Oleg. ^ permalink raw reply [flat|nested] 17+ messages in thread
* Re: sched, timers: use after free in __lock_task_sighand when exiting a process 2014-07-14 14:49 ` Oleg Nesterov @ 2014-07-14 15:13 ` Oleg Nesterov 2014-07-14 15:31 ` Andrey Ryabinin ` (2 subsequent siblings) 3 siblings, 0 replies; 17+ messages in thread From: Oleg Nesterov @ 2014-07-14 15:13 UTC (permalink / raw) To: Peter Zijlstra Cc: Sasha Levin, Ingo Molnar, John Stultz, Thomas Gleixner, Frederic Weisbecker, LKML, Dave Jones, Andrey Ryabinin I'm afraid I wasn't clear... Let me try again. So yes, this "race" is of course possible: lock_task_sighand() release_task() sighand = task->sighand; sighand = task->sighand; spin_lock(sighand->siglock); task->sighand = NULL; spin_unlock(sighand->siglock); kmem_cache_free(sighand); spin_lock(sighand->siglock); but this is fine. lock_task_sighand() will notice task->sighand == NULL under ->siglock and fail. SLAB_DESTROY_BY_RCU guarantees that this memory is still a sighand_struct even if it is freed (or even reallocated). spin_lock/spin_unlock is safe because ->siglock is initialized by sighand_ctor(). And until the caller of lock_task_sighand() drops ->siglock, kmem_cache_free() is not possible: the task can't exit. As a reminder, this is one of the reasons why rt_mutex_unlock() must be "atomic" like spinlock_t's unlock. Without the recent fix from tglx, spin_unlock() (turned into rt_mutex_unlock()) could play with the freed memory: once "unlock" makes another "lock" possible, the task can take this lock and free this memory, and lock_task_sighand() can be called outside of rcu_read_lock().
On 07/14, Oleg Nesterov wrote: > > On 07/14, Peter Zijlstra wrote: > > > > On Sun, Jul 13, 2014 at 07:45:56PM -0400, Sasha Levin wrote: > > > > > > [ 876.319044] ================================================================== > > > [ 876.319044] AddressSanitizer: use after free in do_raw_spin_unlock+0x4b/0x1a0 at addr ffff8803e48cec18 > > > [ 876.319044] page:ffffea000f923380 count:0 mapcount:0 mapping: (null) index:0x0 > > > [ 876.319044] page flags: 0x2fffff80008000(tail) > > > [ 876.319044] page dumped because: kasan error > > > [ 876.319044] CPU: 26 PID: 8749 Comm: trinity-watchdo Tainted: G W 3.16.0-rc4-next-20140711-sasha-00046-g07d3099-dirty #817 > > > [ 876.319044] 00000000000000fb 0000000000000000 ffffea000f923380 ffff8805c417fc70 > > > [ 876.319044] ffffffff9de47068 ffff8805c417fd40 ffff8805c417fd30 ffffffff99426f5c > > > [ 876.319044] 0000000000000010 0000000000000000 ffff8805c417fc9d 66666620000000a8 > > > [ 876.319044] Call Trace: > > > [ 876.319044] dump_stack (lib/dump_stack.c:52) > > > [ 876.319044] kasan_report_error (mm/kasan/report.c:98 mm/kasan/report.c:166) > > > [ 876.319044] __asan_load8 (mm/kasan/kasan.c:364) > > > [ 876.319044] do_raw_spin_unlock (./arch/x86/include/asm/current.h:14 kernel/locking/spinlock_debug.c:99 kernel/locking/spinlock_debug.c:158) > > > [ 876.319044] _raw_spin_unlock (include/linux/spinlock_api_smp.h:152 kernel/locking/spinlock.c:183) > > > [ 876.319044] __lock_task_sighand (include/linux/rcupdate.h:858 kernel/signal.c:1285) > > > [ 876.319044] do_send_sig_info (kernel/signal.c:1191) > > > [ 876.319044] group_send_sig_info (kernel/signal.c:1304) > > > [ 876.319044] kill_pid_info (kernel/signal.c:1339) > > > [ 876.319044] SYSC_kill (kernel/signal.c:1423 kernel/signal.c:2900) > > Looks like a false alarm at first glance... > > > Oleg, what guarantees the RCU free of task-struct and sighand? > > > The only RCU I can find is delayed_put_task_struct() but that's not > > often used. 
> > Yes, usually the code uses put_task_struct(). delayed_put_task_struct() > acts almost as "if (dec_and_test(usage)) kfree_rcu(), but allows to use > get_task_struct() if you observe this task under rcu_read_lock(). > > Say, > rcu_read_lock(); > task = find_task_by_vpid(...); > if (task) > get_task_struct(task); > rcu_read_unlock(); > > If release_task() used dec_and_test + kfree_rcu, the code above could > not work. > > > TASK_DEAD etc. use regular put_task_struct() and that > > doesn't seem to involve RCU. > > Yes, the task itself (or, depending ob pov, scheduler) has a reference. > copy_process() does > > /* > * One for us, one for whoever does the "release_task()" (usually > * parent) > */ > atomic_set(&tsk->usage, 2); > > "us" actually means that put_task_struct(TASK_DEAD). > > As for ->sighand, note that sighand_cachep is SLAB_DESTROY_BY_RCU. So this > memory is RCU free in a sense that it can't be returned to system, but it > can be reused by another task. This is fine, lock_task_sighand() rechecks > sighand == task->sighand under ->siglock. > > So perhaps this tool misinterprets kmem_cache_free(sighand_cachep) as use > after free? > > We are going to add some comments into lock_task_sighand(). And cleanup it, > it can look much simpler. > > Oleg. ^ permalink raw reply [flat|nested] 17+ messages in thread
* Re: sched, timers: use after free in __lock_task_sighand when exiting a process 2014-07-14 14:49 ` Oleg Nesterov 2014-07-14 15:13 ` Oleg Nesterov @ 2014-07-14 15:31 ` Andrey Ryabinin 2014-07-14 16:01 ` finish_task_switch && prev_state (Was: sched, timers: use after free in __lock_task_sighand when exiting a process) Oleg Nesterov 2014-07-15 13:28 ` sched, timers: use after free in __lock_task_sighand when exiting a process Peter Zijlstra 3 siblings, 0 replies; 17+ messages in thread From: Andrey Ryabinin @ 2014-07-14 15:31 UTC (permalink / raw) To: Oleg Nesterov, Peter Zijlstra Cc: Sasha Levin, Ingo Molnar, John Stultz, Thomas Gleixner, Frederic Weisbecker, LKML, Dave Jones On 07/14/14 18:49, Oleg Nesterov wrote: > On 07/14, Peter Zijlstra wrote: >> >> On Sun, Jul 13, 2014 at 07:45:56PM -0400, Sasha Levin wrote: >>> >>> [ 876.319044] ================================================================== >>> [ 876.319044] AddressSanitizer: use after free in do_raw_spin_unlock+0x4b/0x1a0 at addr ffff8803e48cec18 >>> [ 876.319044] page:ffffea000f923380 count:0 mapcount:0 mapping: (null) index:0x0 >>> [ 876.319044] page flags: 0x2fffff80008000(tail) >>> [ 876.319044] page dumped because: kasan error >>> [ 876.319044] CPU: 26 PID: 8749 Comm: trinity-watchdo Tainted: G W 3.16.0-rc4-next-20140711-sasha-00046-g07d3099-dirty #817 >>> [ 876.319044] 00000000000000fb 0000000000000000 ffffea000f923380 ffff8805c417fc70 >>> [ 876.319044] ffffffff9de47068 ffff8805c417fd40 ffff8805c417fd30 ffffffff99426f5c >>> [ 876.319044] 0000000000000010 0000000000000000 ffff8805c417fc9d 66666620000000a8 >>> [ 876.319044] Call Trace: >>> [ 876.319044] dump_stack (lib/dump_stack.c:52) >>> [ 876.319044] kasan_report_error (mm/kasan/report.c:98 mm/kasan/report.c:166) >>> [ 876.319044] __asan_load8 (mm/kasan/kasan.c:364) >>> [ 876.319044] do_raw_spin_unlock (./arch/x86/include/asm/current.h:14 kernel/locking/spinlock_debug.c:99 kernel/locking/spinlock_debug.c:158) >>> [ 876.319044] _raw_spin_unlock 
(include/linux/spinlock_api_smp.h:152 kernel/locking/spinlock.c:183) >>> [ 876.319044] __lock_task_sighand (include/linux/rcupdate.h:858 kernel/signal.c:1285) >>> [ 876.319044] do_send_sig_info (kernel/signal.c:1191) >>> [ 876.319044] group_send_sig_info (kernel/signal.c:1304) >>> [ 876.319044] kill_pid_info (kernel/signal.c:1339) >>> [ 876.319044] SYSC_kill (kernel/signal.c:1423 kernel/signal.c:2900) > > Looks like a false alarm at first glance... > >> Oleg, what guarantees the RCU free of task-struct and sighand? > >> The only RCU I can find is delayed_put_task_struct() but that's not >> often used. > > Yes, usually the code uses put_task_struct(). delayed_put_task_struct() > acts almost as "if (dec_and_test(usage)) kfree_rcu(), but allows to use > get_task_struct() if you observe this task under rcu_read_lock(). > > Say, > rcu_read_lock(); > task = find_task_by_vpid(...); > if (task) > get_task_struct(task); > rcu_read_unlock(); > > If release_task() used dec_and_test + kfree_rcu, the code above could > not work. > >> TASK_DEAD etc. use regular put_task_struct() and that >> doesn't seem to involve RCU. > > Yes, the task itself (or, depending ob pov, scheduler) has a reference. > copy_process() does > > /* > * One for us, one for whoever does the "release_task()" (usually > * parent) > */ > atomic_set(&tsk->usage, 2); > > "us" actually means that put_task_struct(TASK_DEAD). > > As for ->sighand, note that sighand_cachep is SLAB_DESTROY_BY_RCU. So this > memory is RCU free in a sense that it can't be returned to system, but it > can be reused by another task. This is fine, lock_task_sighand() rechecks > sighand == task->sighand under ->siglock. > > So perhaps this tool misinterprets kmem_cache_free(sighand_cachep) as use > after free? > Yes, you are right. I have already seen such a false positive with RCU slabs and KASAN; it was fixed at some point, but somehow the fix didn't make it into the final patches. I'm sorry for wasting everyone's time.
> We are going to add some comments into lock_task_sighand(). And cleanup it, > it can look much simpler. > > Oleg. > > ^ permalink raw reply [flat|nested] 17+ messages in thread
* finish_task_switch && prev_state (Was: sched, timers: use after free in __lock_task_sighand when exiting a process) 2014-07-14 14:49 ` Oleg Nesterov 2014-07-14 15:13 ` Oleg Nesterov 2014-07-14 15:31 ` Andrey Ryabinin @ 2014-07-14 16:01 ` Oleg Nesterov 2014-07-15 13:12 ` Oleg Nesterov 2014-07-15 13:28 ` sched, timers: use after free in __lock_task_sighand when exiting a process Peter Zijlstra 3 siblings, 1 reply; 17+ messages in thread From: Oleg Nesterov @ 2014-07-14 16:01 UTC (permalink / raw) To: Peter Zijlstra Cc: Sasha Levin, Ingo Molnar, John Stultz, Thomas Gleixner, Frederic Weisbecker, LKML, Dave Jones, Andrey Ryabinin On 07/14, Oleg Nesterov wrote: > > Yes, the task itself (or, depending ob pov, scheduler) has a reference. > copy_process() does > > /* > * One for us, one for whoever does the "release_task()" (usually > * parent) > */ > atomic_set(&tsk->usage, 2); > > "us" actually means that put_task_struct(TASK_DEAD). Off-topic, but I do not understand the huge comment in finish_task_switch(). Perhaps this all was true long ago, but currently "prev could be scheduled on another cpu" is certainly impossible? And if it were possible, wouldn't we have far bigger problems? In particular, couldn't we still drop the reference twice in that case? I'll try to recheck, but do you see anything wrong in the patch below? Oleg. --- x/kernel/sched/core.c +++ x/kernel/sched/core.c @@ -2197,22 +2197,9 @@ static void finish_task_switch(struct rq __releases(rq->lock) { struct mm_struct *mm = rq->prev_mm; - long prev_state; rq->prev_mm = NULL; - /* - * A task struct has one reference for the use as "current". - * If a task dies, then it sets TASK_DEAD in tsk->state and calls - * schedule one last time. The schedule call will never return, and - * the scheduled task must drop that reference.
- * The test for TASK_DEAD must occur while the runqueue locks are - * still held, otherwise prev could be scheduled on another cpu, die - * there before we look at prev->state, and then the reference would - * be dropped twice. - * Manfred Spraul <manfred@colorfullife.com> - */ - prev_state = prev->state; vtime_task_switch(prev); finish_arch_switch(prev); perf_event_task_sched_in(prev, current); @@ -2222,7 +2209,7 @@ static void finish_task_switch(struct rq fire_sched_in_preempt_notifiers(current); if (mm) mmdrop(mm); - if (unlikely(prev_state == TASK_DEAD)) { + if (unlikely(prev->state == TASK_DEAD)) { if (prev->sched_class->task_dead) prev->sched_class->task_dead(prev); ^ permalink raw reply [flat|nested] 17+ messages in thread
* Re: finish_task_switch && prev_state (Was: sched, timers: use after free in __lock_task_sighand when exiting a process)
  2014-07-14 16:01 ` finish_task_switch && prev_state (Was: sched, timers: use after free in __lock_task_sighand when exiting a process) Oleg Nesterov
@ 2014-07-15 13:12   ` Oleg Nesterov
  2014-07-15 13:23     ` Peter Zijlstra
  0 siblings, 1 reply; 17+ messages in thread
From: Oleg Nesterov @ 2014-07-15 13:12 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Sasha Levin, Ingo Molnar, John Stultz, Thomas Gleixner,
	Frederic Weisbecker, LKML, Dave Jones, Andrey Ryabinin

Ah, I am stupid, please ignore.

Of course a TASK_DEAD task cannot schedule, but we can race with the
RUNNING -> DEAD transition. So we should only do put_task_struct() if
"prev" was already TASK_DEAD before we dropped the rq locks.

On 07/14, Oleg Nesterov wrote:
>
> On 07/14, Oleg Nesterov wrote:
> >
> > Yes, the task itself (or, depending on pov, the scheduler) has a
> > reference. copy_process() does
> >
> >	/*
> >	 * One for us, one for whoever does the "release_task()" (usually
> >	 * parent)
> >	 */
> >	atomic_set(&tsk->usage, 2);
> >
> > "us" actually means that put_task_struct(TASK_DEAD).
>
> Off-topic, but I do not understand the huge comment in finish_task_switch().
> Perhaps this was all true long ago, but currently "prev could be scheduled
> on another cpu" is certainly impossible?
>
> And if it were possible, wouldn't we have many more problems? In particular,
> in that case we could still drop the reference twice?
>
> I'll try to recheck, but do you see anything wrong in the patch below?
>
> Oleg.
>
> --- x/kernel/sched/core.c
> +++ x/kernel/sched/core.c
> @@ -2197,22 +2197,9 @@ static void finish_task_switch(struct rq
>  	__releases(rq->lock)
>  {
>  	struct mm_struct *mm = rq->prev_mm;
> -	long prev_state;
>
>  	rq->prev_mm = NULL;
>
> -	/*
> -	 * A task struct has one reference for the use as "current".
> -	 * If a task dies, then it sets TASK_DEAD in tsk->state and calls
> -	 * schedule one last time. The schedule call will never return, and
> -	 * the scheduled task must drop that reference.
> -	 * The test for TASK_DEAD must occur while the runqueue locks are
> -	 * still held, otherwise prev could be scheduled on another cpu, die
> -	 * there before we look at prev->state, and then the reference would
> -	 * be dropped twice.
> -	 * Manfred Spraul <manfred@colorfullife.com>
> -	 */
> -	prev_state = prev->state;
>  	vtime_task_switch(prev);
>  	finish_arch_switch(prev);
>  	perf_event_task_sched_in(prev, current);
> @@ -2222,7 +2209,7 @@ static void finish_task_switch(struct rq
>  	fire_sched_in_preempt_notifiers(current);
>  	if (mm)
>  		mmdrop(mm);
> -	if (unlikely(prev_state == TASK_DEAD)) {
> +	if (unlikely(prev->state == TASK_DEAD)) {
>  		if (prev->sched_class->task_dead)
>  			prev->sched_class->task_dead(prev);
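The race Oleg describes, dropping the "current" reference twice when prev->state is re-read after prev may already be running (and dying) on another CPU, can be replayed as a deterministic single-threaded sketch. Everything here (mock_task, mock_put, the state values) is an illustrative stand-in, not the kernel's real API; it only models the refcount arithmetic of the bad interleaving.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>

enum { TASK_RUNNING = 0, TASK_DEAD = 64 };

struct mock_task {
	atomic_int usage;	/* copy_process(): one for "current", one for release_task() */
	long state;
};

/* Returns true when this put went past zero, i.e. a drop of a reference
 * that was no longer held. */
static bool mock_put(struct mock_task *t)
{
	return atomic_fetch_sub(&t->usage, 1) <= 0;
}

/* Replay the bad interleaving in program order. */
static bool double_drop_demo(void)
{
	struct mock_task t = { .state = TASK_RUNNING };
	atomic_init(&t.usage, 2);
	bool over_released = false;

	/* CPU0 switches away from t while it is still RUNNING, then drops
	 * the rq lock *before* looking at t.state. */

	/* Meanwhile t gets scheduled on CPU1 and dies there: CPU1's own
	 * final finish_task_switch() sees TASK_DEAD and drops the
	 * "current" reference. */
	t.state = TASK_DEAD;
	over_released |= mock_put(&t);		/* usage 2 -> 1 */

	/* CPU0 only now looks at t.state, also sees TASK_DEAD, and drops
	 * the same reference again. */
	if (t.state == TASK_DEAD)
		over_released |= mock_put(&t);	/* usage 1 -> 0 */

	/* release_task() (usually the parent) still holds its reference. */
	over_released |= mock_put(&t);		/* usage 0 -> -1: use after free */

	return over_released;
}
```

The final put observes a count that was already zero, which is exactly the double drop the TASK_DEAD test under the rq locks (or, later in the thread, the snapshot of prev->state) is there to prevent.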
* Re: finish_task_switch && prev_state (Was: sched, timers: use after free in __lock_task_sighand when exiting a process)
  2014-07-15 13:12   ` Oleg Nesterov
@ 2014-07-15 13:23     ` Peter Zijlstra
  2014-07-15 14:25       ` Oleg Nesterov
  0 siblings, 1 reply; 17+ messages in thread
From: Peter Zijlstra @ 2014-07-15 13:23 UTC (permalink / raw)
  To: Oleg Nesterov
  Cc: Sasha Levin, Ingo Molnar, John Stultz, Thomas Gleixner,
	Frederic Weisbecker, LKML, Dave Jones, Andrey Ryabinin

On Tue, Jul 15, 2014 at 03:12:40PM +0200, Oleg Nesterov wrote:
> Ah, I am stupid, please ignore.
>
> Of course a TASK_DEAD task cannot schedule, but we can race with the
> RUNNING -> DEAD transition. So we should only do put_task_struct() if
> "prev" was already TASK_DEAD before we dropped the rq locks.

Not so stupid; that is, when I read your question yesterday I didn't have
a ready answer and queued the email to look at later. This does mean the
comment isn't clear enough, at the very least. Does this make it better?

---
 kernel/sched/core.c | 14 ++++++++------
 1 file changed, 8 insertions(+), 6 deletions(-)

diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index 7bc599dc4aa4..aa67f1cfa58e 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -2211,13 +2211,15 @@ static void finish_task_switch(struct rq *rq, struct task_struct *prev)
 
 	/*
 	 * A task struct has one reference for the use as "current".
+	 *
 	 * If a task dies, then it sets TASK_DEAD in tsk->state and calls
-	 * schedule one last time. The schedule call will never return, and
-	 * the scheduled task must drop that reference.
-	 * The test for TASK_DEAD must occur while the runqueue locks are
-	 * still held, otherwise prev could be scheduled on another cpu, die
-	 * there before we look at prev->state, and then the reference would
-	 * be dropped twice.
+	 * schedule one last time. The schedule call will never return, and the
+	 * scheduled task must drop that reference.
+	 *
+	 * The test for TASK_DEAD must occur while the runqueue locks are still
+	 * held, otherwise we can race with RUNNING -> DEAD transitions, and
+	 * then the reference would be dropped twice.
+	 *
 	 * Manfred Spraul <manfred@colorfullife.com>
 	 */
 	prev_state = prev->state;

^ permalink raw reply related	[flat|nested] 17+ messages in thread
* Re: finish_task_switch && prev_state (Was: sched, timers: use after free in __lock_task_sighand when exiting a process)
  2014-07-15 13:23     ` Peter Zijlstra
@ 2014-07-15 14:25       ` Oleg Nesterov
  2014-07-29  9:10         ` Peter Zijlstra
  0 siblings, 1 reply; 17+ messages in thread
From: Oleg Nesterov @ 2014-07-15 14:25 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Sasha Levin, Ingo Molnar, John Stultz, Thomas Gleixner,
	Frederic Weisbecker, LKML, Dave Jones, Andrey Ryabinin

On 07/15, Peter Zijlstra wrote:
>
> @@ -2211,13 +2211,15 @@ static void finish_task_switch(struct rq *rq, struct task_struct *prev)
>
>  	/*
>  	 * A task struct has one reference for the use as "current".
> +	 *
>  	 * If a task dies, then it sets TASK_DEAD in tsk->state and calls
> -	 * schedule one last time. The schedule call will never return, and
> -	 * the scheduled task must drop that reference.
> -	 * The test for TASK_DEAD must occur while the runqueue locks are
> -	 * still held, otherwise prev could be scheduled on another cpu, die
> -	 * there before we look at prev->state, and then the reference would
> -	 * be dropped twice.
> +	 * schedule one last time. The schedule call will never return, and the
> +	 * scheduled task must drop that reference.
> +	 *
> +	 * The test for TASK_DEAD must occur while the runqueue locks are still
> +	 * held, otherwise we can race with RUNNING -> DEAD transitions, and
> +	 * then the reference would be dropped twice.
> +	 *
>  	 * Manfred Spraul <manfred@colorfullife.com>
>  	 */

Agreed, this looks much more understandable!

And probably I missed something again, but it seems that this logic is
broken with __ARCH_WANT_UNLOCKED_CTXSW.

Of course, even if I am right this is purely theoretical, but smp_wmb()
before "->on_cpu = 0" is not enough and we need a full barrier?

Oleg.

^ permalink raw reply	[flat|nested] 17+ messages in thread
* Re: finish_task_switch && prev_state (Was: sched, timers: use after free in __lock_task_sighand when exiting a process)
  2014-07-15 14:25       ` Oleg Nesterov
@ 2014-07-29  9:10         ` Peter Zijlstra
  2014-07-29  9:22           ` Peter Zijlstra
  0 siblings, 1 reply; 17+ messages in thread
From: Peter Zijlstra @ 2014-07-29 9:10 UTC (permalink / raw)
  To: Oleg Nesterov
  Cc: Sasha Levin, Ingo Molnar, John Stultz, Thomas Gleixner,
	Frederic Weisbecker, LKML, Dave Jones, Andrey Ryabinin

On Tue, Jul 15, 2014 at 04:25:25PM +0200, Oleg Nesterov wrote:
> And probably I missed something again, but it seems that this logic is
> broken with __ARCH_WANT_UNLOCKED_CTXSW.
>
> Of course, even if I am right this is purely theoretical, but smp_wmb()
> before "->on_cpu = 0" is not enough and we need a full barrier?

(long delay there, forgot about this thread, sorry)

Yes, I think I see that.. but now I think the comment is further wrong.

It's not rq->lock that is important; remember, a concurrent wakeup onto
another CPU does not require our rq->lock at all.

It is the ->on_cpu = 0 store that is important (for both the
UNLOCKED_CTXSW cases). As soon as that store comes through, the task can
start running on the remote cpu.

Now the below patch 'fixes' this, but at the cost of adding a full
barrier, which is somewhat unfortunate to say the least.

wmb's are free on x86 and generally cheaper than mbs, so it would be
good to find another solution to this problem...

---
 kernel/sched/core.c  | 10 +++++-----
 kernel/sched/sched.h | 10 ++++++++--
 2 files changed, 13 insertions(+), 7 deletions(-)

diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index 2676866b4394..950264381644 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -2214,11 +2214,11 @@ static void finish_task_switch(struct rq *rq, struct task_struct *prev)
 	 * If a task dies, then it sets TASK_DEAD in tsk->state and calls
 	 * schedule one last time. The schedule call will never return, and
 	 * the scheduled task must drop that reference.
-	 * The test for TASK_DEAD must occur while the runqueue locks are
-	 * still held, otherwise prev could be scheduled on another cpu, die
-	 * there before we look at prev->state, and then the reference would
-	 * be dropped twice.
-	 * Manfred Spraul <manfred@colorfullife.com>
+	 *
+	 * We must observe prev->state before clearing prev->on_cpu (in
+	 * finish_lock_switch), otherwise a concurrent wakeup can get prev
+	 * running on another CPU and we could race with its RUNNING -> DEAD
+	 * transition, and then the reference would be dropped twice.
 	 */
 	prev_state = prev->state;
 	vtime_task_switch(prev);
diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
index 579712f4e9d5..259632c09c98 100644
--- a/kernel/sched/sched.h
+++ b/kernel/sched/sched.h
@@ -973,8 +973,11 @@ static inline void finish_lock_switch(struct rq *rq, struct task_struct *prev)
 	 * After ->on_cpu is cleared, the task can be moved to a different CPU.
 	 * We must ensure this doesn't happen until the switch is completely
 	 * finished.
+	 *
+	 * We must furthermore ensure the prev->state read in
+	 * finish_task_switch() is complete before allowing this store.
 	 */
-	smp_wmb();
+	smp_mb();
 	prev->on_cpu = 0;
 #endif
 #ifdef CONFIG_DEBUG_SPINLOCK
@@ -1012,8 +1015,11 @@ static inline void finish_lock_switch(struct rq *rq, struct task_struct *prev)
 	 * After ->on_cpu is cleared, the task can be moved to a different CPU.
 	 * We must ensure this doesn't happen until the switch is completely
 	 * finished.
+	 *
+	 * We must furthermore ensure the prev->state read in
+	 * finish_task_switch() is complete before allowing this store.
 	 */
-	smp_wmb();
+	smp_mb();
 	prev->on_cpu = 0;
 #endif
 	local_irq_enable();

^ permalink raw reply related	[flat|nested] 17+ messages in thread
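In C11 terms, the distinction being debated here is load->store ordering, which a store-only barrier cannot provide. A rough, hypothetical analogue of the patched finish_lock_switch() path follows; the names are invented for illustration, and modelling smp_mb() with a seq_cst fence is a simplification, not a claim about the kernel's barrier implementation.

```c
#include <assert.h>
#include <stdatomic.h>

enum { TASK_RUNNING = 0, TASK_DEAD = 64 };

struct mock_prev {
	atomic_long state;
	atomic_int on_cpu;
};

/* smp_wmb() orders stores only against later stores; the prev->state
 * *load* in finish_task_switch() could still be satisfied after the
 * ->on_cpu = 0 store becomes visible, letting a remote CPU run (and
 * kill) prev before this CPU has sampled its state. A full barrier
 * orders the load before the store. */
static long finish_switch_ordered(struct mock_prev *prev)
{
	/* The read finish_task_switch() performs... */
	long prev_state = atomic_load_explicit(&prev->state,
					       memory_order_relaxed);

	/* ...stands in for smp_mb(): forbid reordering the load above
	 * past the store below. */
	atomic_thread_fence(memory_order_seq_cst);

	/* The publishing store: after this, the task may run elsewhere. */
	atomic_store_explicit(&prev->on_cpu, 0, memory_order_relaxed);
	return prev_state;
}
```

Single-threaded this is trivially correct either way; the fence only matters to a concurrent waker spinning on ->on_cpu, which is why the bug is described in the thread as purely theoretical yet real.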
* Re: finish_task_switch && prev_state (Was: sched, timers: use after free in __lock_task_sighand when exiting a process)
  2014-07-29  9:10         ` Peter Zijlstra
@ 2014-07-29  9:22           ` Peter Zijlstra
  2014-07-29 15:53             ` Oleg Nesterov
  0 siblings, 1 reply; 17+ messages in thread
From: Peter Zijlstra @ 2014-07-29 9:22 UTC (permalink / raw)
  To: Oleg Nesterov
  Cc: Sasha Levin, Ingo Molnar, John Stultz, Thomas Gleixner,
	Frederic Weisbecker, LKML, Dave Jones, Andrey Ryabinin

On Tue, Jul 29, 2014 at 11:10:18AM +0200, Peter Zijlstra wrote:
> On Tue, Jul 15, 2014 at 04:25:25PM +0200, Oleg Nesterov wrote:
>
> > And probably I missed something again, but it seems that this logic is
> > broken with __ARCH_WANT_UNLOCKED_CTXSW.
> >
> > Of course, even if I am right this is purely theoretical, but smp_wmb()
> > before "->on_cpu = 0" is not enough and we need a full barrier?
>
> (long delay there, forgot about this thread, sorry)
>
> Yes, I think I see that.. but now I think the comment is further wrong.
>
> It's not rq->lock that is important; remember, a concurrent wakeup onto
> another CPU does not require our rq->lock at all.
>
> It is the ->on_cpu = 0 store that is important (for both the
> UNLOCKED_CTXSW cases). As soon as that store comes through, the task can
> start running on the remote cpu.
>
> Now the below patch 'fixes' this, but at the cost of adding a full
> barrier, which is somewhat unfortunate to say the least.
>
> wmb's are free on x86 and generally cheaper than mbs, so it would be
> good to find another solution to this problem...

Something like so then?

---
 kernel/sched/core.c | 36 ++++++++++++++++++++----------------
 1 file changed, 20 insertions(+), 16 deletions(-)

diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index 2676866b4394..179390f7380d 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -2190,6 +2190,7 @@ prepare_task_switch(struct rq *rq, struct task_struct *prev,
  * finish_task_switch - clean up after a task-switch
  * @rq: runqueue associated with task-switch
  * @prev: the thread we just switched away from.
+ * @prev_state: the state of @prev before we switched away from it.
  *
  * finish_task_switch must be called after the context switch, paired
  * with a prepare_task_switch call before the context switch.
@@ -2201,26 +2202,14 @@ prepare_task_switch(struct rq *rq, struct task_struct *prev,
  * with the lock held can cause deadlocks; see schedule() for
  * details.)
  */
-static void finish_task_switch(struct rq *rq, struct task_struct *prev)
+static void
+finish_task_switch(struct rq *rq, struct task_struct *prev, long prev_state)
 	__releases(rq->lock)
 {
 	struct mm_struct *mm = rq->prev_mm;
-	long prev_state;
 
 	rq->prev_mm = NULL;
 
-	/*
-	 * A task struct has one reference for the use as "current".
-	 * If a task dies, then it sets TASK_DEAD in tsk->state and calls
-	 * schedule one last time. The schedule call will never return, and
-	 * the scheduled task must drop that reference.
-	 * The test for TASK_DEAD must occur while the runqueue locks are
-	 * still held, otherwise prev could be scheduled on another cpu, die
-	 * there before we look at prev->state, and then the reference would
-	 * be dropped twice.
-	 * Manfred Spraul <manfred@colorfullife.com>
-	 */
-	prev_state = prev->state;
 	vtime_task_switch(prev);
 	finish_arch_switch(prev);
 	perf_event_task_sched_in(prev, current);
@@ -2279,7 +2268,7 @@ asmlinkage __visible void schedule_tail(struct task_struct *prev)
 {
 	struct rq *rq = this_rq();
 
-	finish_task_switch(rq, prev);
+	finish_task_switch(rq, prev, 0);
 
 	/*
 	 * FIXME: do we need to worry about rq being invalidated by the
@@ -2304,6 +2293,21 @@ context_switch(struct rq *rq, struct task_struct *prev,
 	       struct task_struct *next)
 {
 	struct mm_struct *mm, *oldmm;
+	/*
+	 * A task struct has one reference for the use as "current".
+	 * If a task dies, then it sets TASK_DEAD in tsk->state and calls
+	 * schedule one last time. The schedule call will never return, and
+	 * the scheduled task must drop that reference.
+	 *
+	 * We must observe prev->state before clearing prev->on_cpu (in
+	 * finish_lock_switch), otherwise a concurrent wakeup can get prev
+	 * running on another CPU and we could race with its RUNNING -> DEAD
+	 * transition, and then the reference would be dropped twice.
+	 *
+	 * We avoid the race by observing prev->state while it is still
+	 * current.
+	 */
+	long prev_state = prev->state;
 
 	prepare_task_switch(rq, prev, next);
 
@@ -2347,7 +2351,7 @@ context_switch(struct rq *rq, struct task_struct *prev,
 	 * CPUs since it called schedule(), thus the 'rq' on its stack
 	 * frame will be invalid.
 	 */
-	finish_task_switch(this_rq(), prev);
+	finish_task_switch(this_rq(), prev, prev_state);
 }
 
 /*

^ permalink raw reply related	[flat|nested] 17+ messages in thread
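The approach in the patch above, sampling prev->state while prev is still current and passing the snapshot into finish_task_switch(), closes the double-drop window: the snapshot can only be TASK_DEAD if prev had already died before the final switch, no matter when it starts running elsewhere. A small, hypothetical mock of the fixed lifecycle (names are illustrative, not the kernel API):

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>

enum { TASK_RUNNING = 0, TASK_DEAD = 64 };

struct mock_task {
	atomic_int usage;	/* one for "current", one for release_task() */
	long state;
	bool freed;
};

static void mock_put_task_struct(struct mock_task *t)
{
	if (atomic_fetch_sub(&t->usage, 1) == 1)
		t->freed = true;	/* stands in for the real free */
}

/* Fixed variant: acts on the snapshot, never re-reads t->state. */
static void mock_finish_task_switch(struct mock_task *t, long prev_state)
{
	if (prev_state == TASK_DEAD)
		mock_put_task_struct(t);	/* drop the "current" reference */
}

static bool snapshot_demo(void)
{
	struct mock_task t = { .state = TASK_RUNNING, .freed = false };
	atomic_init(&t.usage, 2);

	/* CPU0 switches away from a RUNNING t; the snapshot is taken while
	 * t is still current, so t cannot yet have died elsewhere. */
	long snap = t.state;

	/* t later dies on CPU1; only CPU1's snapshot is TASK_DEAD. */
	t.state = TASK_DEAD;
	mock_finish_task_switch(&t, snap);	/* snap == RUNNING: no drop */
	mock_finish_task_switch(&t, t.state);	/* CPU1's final switch: one drop */

	/* release_task() drops the second and last reference. */
	mock_put_task_struct(&t);

	return t.freed && atomic_load(&t.usage) == 0;
}
```

Exactly two puts happen and the count reaches zero exactly once, which is the invariant the original Manfred Spraul comment was guarding.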
* Re: finish_task_switch && prev_state (Was: sched, timers: use after free in __lock_task_sighand when exiting a process)
  2014-07-29  9:22           ` Peter Zijlstra
@ 2014-07-29 15:53             ` Oleg Nesterov
  0 siblings, 0 replies; 17+ messages in thread
From: Oleg Nesterov @ 2014-07-29 15:53 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Sasha Levin, Ingo Molnar, John Stultz, Thomas Gleixner,
	Frederic Weisbecker, LKML, Dave Jones, Andrey Ryabinin

On 07/29, Peter Zijlstra wrote:
>
> On Tue, Jul 29, 2014 at 11:10:18AM +0200, Peter Zijlstra wrote:
> > On Tue, Jul 15, 2014 at 04:25:25PM +0200, Oleg Nesterov wrote:
> >
> > > And probably I missed something again, but it seems that this logic is
> > > broken with __ARCH_WANT_UNLOCKED_CTXSW.
> > >
> > > Of course, even if I am right this is purely theoretical, but smp_wmb()
> > > before "->on_cpu = 0" is not enough and we need a full barrier?
> >
> > (long delay there, forgot about this thread, sorry)
> >
> > Yes, I think I see that.. but now I think the comment is further wrong.
> >
> > It's not rq->lock that is important; remember, a concurrent wakeup onto
> > another CPU does not require our rq->lock at all.
> >
> > It is the ->on_cpu = 0 store that is important (for both the
> > UNLOCKED_CTXSW cases). As soon as that store comes through, the task can
> > start running on the remote cpu.

Yes, I came to the same conclusion right after I sent that email.

> > Now the below patch 'fixes' this, but at the cost of adding a full
> > barrier, which is somewhat unfortunate to say the least.

And yes, this is obviously the "fix" I had in mind, but:

> > wmb's are free on x86 and generally cheaper than mbs, so it would be
> > good to find another solution to this problem...
>
> Something like so then?

Hmm, indeed! Unfortunately I didn't find this simple solution. Yes, I
think we should check current->state == TASK_DEAD,

> @@ -2304,6 +2293,21 @@ context_switch(struct rq *rq, struct task_struct *prev,
>  	       struct task_struct *next)
>  {
>  	struct mm_struct *mm, *oldmm;
> +	/*
> +	 * A task struct has one reference for the use as "current".
> +	 * If a task dies, then it sets TASK_DEAD in tsk->state and calls
> +	 * schedule one last time. The schedule call will never return, and
> +	 * the scheduled task must drop that reference.
> +	 *
> +	 * We must observe prev->state before clearing prev->on_cpu (in
> +	 * finish_lock_switch), otherwise a concurrent wakeup can get prev
> +	 * running on another CPU and we could race with its RUNNING -> DEAD
> +	 * transition, and then the reference would be dropped twice.
> +	 *
> +	 * We avoid the race by observing prev->state while it is still
> +	 * current.
> +	 */
> +	long prev_state = prev->state;

This doesn't really matter, but probably it would be better to do this
right before switch_to(); prev == current until this point.

Oleg.

^ permalink raw reply	[flat|nested] 17+ messages in thread
* Re: sched, timers: use after free in __lock_task_sighand when exiting a process
  2014-07-14 14:49 ` Oleg Nesterov
                    ` (2 preceding siblings ...)
  2014-07-14 16:01 ` finish_task_switch && prev_state (Was: sched, timers: use after free in __lock_task_sighand when exiting a process) Oleg Nesterov
@ 2014-07-15 13:28 ` Peter Zijlstra
  3 siblings, 0 replies; 17+ messages in thread
From: Peter Zijlstra @ 2014-07-15 13:28 UTC (permalink / raw)
  To: Oleg Nesterov
  Cc: Sasha Levin, Ingo Molnar, John Stultz, Thomas Gleixner,
	Frederic Weisbecker, LKML, Dave Jones, Andrey Ryabinin

On Mon, Jul 14, 2014 at 04:49:53PM +0200, Oleg Nesterov wrote:
> > Oleg, what guarantees the RCU free of task-struct and sighand?
> >
> > The only RCU I can find is delayed_put_task_struct() but that's not
> > often used.
>
> Yes, usually the code uses put_task_struct(). delayed_put_task_struct()
> acts almost as "if (dec_and_test(usage)) kfree_rcu()", but allows one to
> use get_task_struct() if you observe this task under rcu_read_lock().
>
> Say,
>	rcu_read_lock();
>	task = find_task_by_vpid(...);
>	if (task)
>		get_task_struct(task);
>	rcu_read_unlock();
>
> If release_task() used dec_and_test + kfree_rcu, the code above could
> not work.

Agreed.

> > TASK_DEAD etc. use regular put_task_struct() and that
> > doesn't seem to involve RCU.
>
> Yes, the task itself (or, depending on pov, the scheduler) has a
> reference. copy_process() does
>
>	/*
>	 * One for us, one for whoever does the "release_task()" (usually
>	 * parent)
>	 */
>	atomic_set(&tsk->usage, 2);
>
> "us" actually means that put_task_struct(TASK_DEAD).

Right, that's it. I got confused in the exit code. Thanks!

^ permalink raw reply	[flat|nested] 17+ messages in thread
end of thread, other threads: [~2014-07-29 15:55 UTC | newest]

Thread overview: 17+ messages (download: mbox.gz / follow: Atom feed
-- links below jump to the message on this page --)
2014-07-13 21:51 sched, timers: use after free in __lock_task_sighand when exiting a process Sasha Levin
2014-07-13 23:45 ` Sasha Levin
2014-07-14  9:04   ` Peter Zijlstra
2014-07-14  9:34     ` Andrey Ryabinin
2014-07-14  9:58       ` Peter Zijlstra
2014-07-14 10:25         ` Andrey Ryabinin
2014-07-14 14:49   ` Oleg Nesterov
2014-07-14 15:13     ` Oleg Nesterov
2014-07-14 15:31     ` Andrey Ryabinin
2014-07-14 16:01     ` finish_task_switch && prev_state (Was: sched, timers: use after free in __lock_task_sighand when exiting a process) Oleg Nesterov
2014-07-15 13:12       ` Oleg Nesterov
2014-07-15 13:23         ` Peter Zijlstra
2014-07-15 14:25           ` Oleg Nesterov
2014-07-29  9:10             ` Peter Zijlstra
2014-07-29  9:22               ` Peter Zijlstra
2014-07-29 15:53                 ` Oleg Nesterov
2014-07-15 13:28     ` sched, timers: use after free in __lock_task_sighand when exiting a process Peter Zijlstra