public inbox for linux-kernel@vger.kernel.org
* sched, timers: use after free in __lock_task_sighand when exiting a process
@ 2014-07-13 21:51 Sasha Levin
  2014-07-13 23:45 ` Sasha Levin
  0 siblings, 1 reply; 17+ messages in thread
From: Sasha Levin @ 2014-07-13 21:51 UTC (permalink / raw)
  To: Peter Zijlstra, Ingo Molnar, John Stultz, Thomas Gleixner,
	Frederic Weisbecker
  Cc: LKML, Dave Jones, Andrey Ryabinin

Hi all,

While fuzzing with trinity inside a KVM tools guest running the latest -next
kernel with the KASAN patchset, I've stumbled on the following spew:

[ 4838.503887] ==================================================================
[ 4838.510906] AddressSanitizer: use after free in do_raw_spin_lock+0x157/0x240 at addr ffff8800ac3af608
[ 4838.513433] page:ffffea0002b0ebc0 count:0 mapcount:0 mapping:          (null) index:0x0
[ 4838.513433] page flags: 0x1fffff80008000(tail)
[ 4838.513433] page dumped because: kasan error
[ 4838.513433] CPU: 14 PID: 19619 Comm: trinity-c52 Tainted: G        W      3.16.0-rc4-next-20140711-sasha-00046-g07d3099-dirty #813
[ 4838.513433]  00000000000000fb 0000000000000000 ffffea0002b0ebc0 ffff8805f2e03bc8
[ 4838.513433]  ffffffff9ee47068 ffff8805f2e03c98 ffff8805f2e03c88 ffffffff9a426f5c
[ 4838.513433]  ffff8805f2fe2d80 ffff8805f2fe2d90 ffff8805f2fe2d80 ffff8805f2fe2d88
[ 4838.513433] Call Trace:
[ 4838.513433] <IRQ> dump_stack (lib/dump_stack.c:52)
[ 4838.513433] kasan_report_error (mm/kasan/report.c:98 mm/kasan/report.c:166)
[ 4838.513433] __asan_load4 (mm/kasan/kasan.c:358)
[ 4838.513433] do_raw_spin_lock (./arch/x86/include/asm/spinlock.h:108 kernel/locking/spinlock_debug.c:112 kernel/locking/spinlock_debug.c:137)
[ 4838.513433] _raw_spin_lock (include/linux/spinlock_api_smp.h:143 kernel/locking/spinlock.c:151)
[ 4838.513433] __lock_task_sighand (kernel/signal.c:1280)
[ 4838.513433] run_posix_cpu_timers (kernel/time/posix-cpu-timers.c:1162)
[ 4838.513433] update_process_times (kernel/time/timer.c:1392)
[ 4838.513433] tick_sched_timer (kernel/time/tick-sched.c:152 kernel/time/tick-sched.c:1096)
[ 4838.513433] __run_hrtimer (kernel/time/hrtimer.c:1272 (discriminator 3))
[ 4838.513433] hrtimer_interrupt (kernel/time/hrtimer.c:1361)
[ 4838.513433] local_apic_timer_interrupt (arch/x86/kernel/apic/apic.c:921)
[ 4838.513433] smp_apic_timer_interrupt (./arch/x86/include/asm/apic.h:696 arch/x86/kernel/apic/apic.c:945)
[ 4838.513433] apic_timer_interrupt (arch/x86/kernel/entry_64.S:1020)
[ 4838.513433] <EOI> ? _raw_spin_unlock_irqrestore (./arch/x86/include/asm/paravirt.h:809 include/linux/spinlock_api_smp.h:160 kernel/locking/spinlock.c:191)
[ 4838.513433] __slab_free (include/linux/spinlock.h:358 mm/slub.c:2565)
[ 4838.513433] kmem_cache_free (mm/slub.c:2704 mm/slub.c:2714)
[ 4838.513433] put_io_context (block/blk-ioc.c:154)
[ 4838.513433] put_io_context_active (block/blk-ioc.c:197)
[ 4838.513433] exit_io_context (block/blk-ioc.c:211)
[ 4838.513433] do_exit (kernel/exit.c:801)
[ 4838.513433] do_group_exit (./arch/x86/include/asm/current.h:14 kernel/exit.c:870)
[ 4838.513433] get_signal_to_deliver (kernel/signal.c:2351)
[ 4838.513433] do_signal (arch/x86/kernel/signal.c:698)
[ 4838.513433] do_notify_resume (arch/x86/kernel/signal.c:751)
[ 4838.513433] int_signal (arch/x86/kernel/entry_64.S:600)
[ 4838.513433] Read of size 4 by thread T19619:
[ 4838.513433] Memory state around the buggy address:
[ 4838.513433]  ffff8800ac3af380: fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb
[ 4838.513433]  ffff8800ac3af400: fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb
[ 4838.513433]  ffff8800ac3af480: fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb
[ 4838.513433]  ffff8800ac3af500: fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb
[ 4838.513433]  ffff8800ac3af580: fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb
[ 4838.513433] >ffff8800ac3af600: fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb
[ 4838.513433]                       ^
[ 4838.513433]  ffff8800ac3af680: fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb
[ 4838.513433]  ffff8800ac3af700: fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb
[ 4838.513433]  ffff8800ac3af780: fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb
[ 4838.513433]  ffff8800ac3af800: fd fd fd fd fd fd fd fd fd fd fd fd fd fd fd fd
[ 4838.513433]  ffff8800ac3af880: fd fd fd fd fd fd fd fd fd fd fd fd fd fd fd fd
[ 4838.513433] ==================================================================

It looks like a race between an exiting process and an external access to
its 'sighand'.


Thanks,
Sasha

^ permalink raw reply	[flat|nested] 17+ messages in thread

* Re: sched, timers: use after free in __lock_task_sighand when exiting a process
  2014-07-13 21:51 sched, timers: use after free in __lock_task_sighand when exiting a process Sasha Levin
@ 2014-07-13 23:45 ` Sasha Levin
  2014-07-14  9:04   ` Peter Zijlstra
  0 siblings, 1 reply; 17+ messages in thread
From: Sasha Levin @ 2014-07-13 23:45 UTC (permalink / raw)
  To: Peter Zijlstra, Ingo Molnar, John Stultz, Thomas Gleixner,
	Frederic Weisbecker
  Cc: LKML, Dave Jones, Andrey Ryabinin

On 07/13/2014 05:51 PM, Sasha Levin wrote:
> Hi all,
> 
> While fuzzing with trinity inside a KVM tools guest running the latest -next
> kernel with the KASAN patchset, I've stumbled on the following spew:

Alrighty, I don't think it has anything to do with timers:

[  876.319044] ==================================================================
[  876.319044] AddressSanitizer: use after free in do_raw_spin_unlock+0x4b/0x1a0 at addr ffff8803e48cec18
[  876.319044] page:ffffea000f923380 count:0 mapcount:0 mapping:          (null) index:0x0
[  876.319044] page flags: 0x2fffff80008000(tail)
[  876.319044] page dumped because: kasan error
[  876.319044] CPU: 26 PID: 8749 Comm: trinity-watchdo Tainted: G        W      3.16.0-rc4-next-20140711-sasha-00046-g07d3099-dirty #817
[  876.319044]  00000000000000fb 0000000000000000 ffffea000f923380 ffff8805c417fc70
[  876.319044]  ffffffff9de47068 ffff8805c417fd40 ffff8805c417fd30 ffffffff99426f5c
[  876.319044]  0000000000000010 0000000000000000 ffff8805c417fc9d 66666620000000a8
[  876.319044] Call Trace:
[  876.319044] dump_stack (lib/dump_stack.c:52)
[  876.319044] kasan_report_error (mm/kasan/report.c:98 mm/kasan/report.c:166)
[  876.319044] ? debug_smp_processor_id (lib/smp_processor_id.c:57)
[  876.319044] ? check_chain_key (kernel/locking/lockdep.c:2188)
[  876.319044] __asan_load8 (mm/kasan/kasan.c:364)
[  876.319044] ? do_raw_spin_unlock (./arch/x86/include/asm/current.h:14 kernel/locking/spinlock_debug.c:99 kernel/locking/spinlock_debug.c:158)
[  876.319044] do_raw_spin_unlock (./arch/x86/include/asm/current.h:14 kernel/locking/spinlock_debug.c:99 kernel/locking/spinlock_debug.c:158)
[  876.319044] _raw_spin_unlock (include/linux/spinlock_api_smp.h:152 kernel/locking/spinlock.c:183)
[  876.319044] __lock_task_sighand (include/linux/rcupdate.h:858 kernel/signal.c:1285)
[  876.319044] ? __lock_task_sighand (./arch/x86/include/asm/paravirt.h:814 ./arch/x86/include/asm/paravirt.h:827 kernel/signal.c:1270)
[  876.319044] do_send_sig_info (kernel/signal.c:1191)
[  876.319044] group_send_sig_info (kernel/signal.c:1304)
[  876.319044] ? group_send_sig_info (kernel/signal.c:1296)
[  876.319044] kill_pid_info (kernel/signal.c:1339)
[  876.319044] ? kill_pid_info (kernel/signal.c:1330)
[  876.319044] SYSC_kill (kernel/signal.c:1423 kernel/signal.c:2900)
[  876.319044] ? SYSC_kill (include/linux/rcupdate.h:806 kernel/signal.c:1422 kernel/signal.c:2900)
[  876.319044] ? trace_hardirqs_on_caller (kernel/locking/lockdep.c:2600)
[  876.319044] ? syscall_trace_enter (arch/x86/kernel/ptrace.c:1500 (discriminator 2))
[  876.319044] ? trace_hardirqs_on_caller (kernel/locking/lockdep.c:2600)
[  876.319044] SyS_kill (kernel/signal.c:2890)
[  876.319044] tracesys (arch/x86/kernel/entry_64.S:542)
[  876.319044] Read of size 8 by thread T8749:
[  876.319044] Memory state around the buggy address:
[  876.319044]  ffff8803e48ce980: fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb
[  876.319044]  ffff8803e48cea00: fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb
[  876.319044]  ffff8803e48cea80: fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb
[  876.319044]  ffff8803e48ceb00: fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb
[  876.319044]  ffff8803e48ceb80: fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb
[  876.319044] >ffff8803e48cec00: fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb
[  876.319044]                             ^
[  876.319044]  ffff8803e48cec80: fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb
[  876.319044]  ffff8803e48ced00: fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb
[  876.319044]  ffff8803e48ced80: fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb
[  876.319044]  ffff8803e48cee00: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
[  876.319044]  ffff8803e48cee80: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
[  876.319044] ==================================================================


Thanks,
Sasha
> Sasha
> 



* Re: sched, timers: use after free in __lock_task_sighand when exiting a process
  2014-07-13 23:45 ` Sasha Levin
@ 2014-07-14  9:04   ` Peter Zijlstra
  2014-07-14  9:34     ` Andrey Ryabinin
  2014-07-14 14:49     ` Oleg Nesterov
  0 siblings, 2 replies; 17+ messages in thread
From: Peter Zijlstra @ 2014-07-14  9:04 UTC (permalink / raw)
  To: Sasha Levin
  Cc: Ingo Molnar, John Stultz, Thomas Gleixner, Frederic Weisbecker,
	LKML, Dave Jones, Andrey Ryabinin, Oleg Nesterov

On Sun, Jul 13, 2014 at 07:45:56PM -0400, Sasha Levin wrote:
> On 07/13/2014 05:51 PM, Sasha Levin wrote:
> > Hi all,
> > 
> > While fuzzing with trinity inside a KVM tools guest running the latest -next
> > kernel with the KASAN patchset, I've stumbled on the following spew:

WTH is a KASAN?

> Alrighty, I don't think it has anything to do with timers:
> 
> [  876.319044] ==================================================================
> [  876.319044] AddressSanitizer: use after free in do_raw_spin_unlock+0x4b/0x1a0 at addr ffff8803e48cec18
> [  876.319044] page:ffffea000f923380 count:0 mapcount:0 mapping:          (null) index:0x0
> [  876.319044] page flags: 0x2fffff80008000(tail)
> [  876.319044] page dumped because: kasan error
> [  876.319044] CPU: 26 PID: 8749 Comm: trinity-watchdo Tainted: G        W      3.16.0-rc4-next-20140711-sasha-00046-g07d3099-dirty #817
> [  876.319044]  00000000000000fb 0000000000000000 ffffea000f923380 ffff8805c417fc70
> [  876.319044]  ffffffff9de47068 ffff8805c417fd40 ffff8805c417fd30 ffffffff99426f5c
> [  876.319044]  0000000000000010 0000000000000000 ffff8805c417fc9d 66666620000000a8
> [  876.319044] Call Trace:
> [  876.319044] dump_stack (lib/dump_stack.c:52)
> [  876.319044] kasan_report_error (mm/kasan/report.c:98 mm/kasan/report.c:166)
> [  876.319044] __asan_load8 (mm/kasan/kasan.c:364)
> [  876.319044] do_raw_spin_unlock (./arch/x86/include/asm/current.h:14 kernel/locking/spinlock_debug.c:99 kernel/locking/spinlock_debug.c:158)
> [  876.319044] _raw_spin_unlock (include/linux/spinlock_api_smp.h:152 kernel/locking/spinlock.c:183)
> [  876.319044] __lock_task_sighand (include/linux/rcupdate.h:858 kernel/signal.c:1285)
> [  876.319044] do_send_sig_info (kernel/signal.c:1191)
> [  876.319044] group_send_sig_info (kernel/signal.c:1304)
> [  876.319044] kill_pid_info (kernel/signal.c:1339)
> [  876.319044] SYSC_kill (kernel/signal.c:1423 kernel/signal.c:2900)

Oleg, what guarantees the RCU free of task-struct and sighand?

The only RCU I can find is delayed_put_task_struct() but that's not
often used. TASK_DEAD etc. use regular put_task_struct() and that
doesn't seem to involve RCU.

Could be I'm a moron, its Monday, and its relatively early, so I'll go
have another look..



* Re: sched, timers: use after free in __lock_task_sighand when exiting a process
  2014-07-14  9:04   ` Peter Zijlstra
@ 2014-07-14  9:34     ` Andrey Ryabinin
  2014-07-14  9:58       ` Peter Zijlstra
  2014-07-14 14:49     ` Oleg Nesterov
  1 sibling, 1 reply; 17+ messages in thread
From: Andrey Ryabinin @ 2014-07-14  9:34 UTC (permalink / raw)
  To: Peter Zijlstra, Sasha Levin
  Cc: Ingo Molnar, John Stultz, Thomas Gleixner, Frederic Weisbecker,
	LKML, Dave Jones, Oleg Nesterov

On 07/14/14 13:04, Peter Zijlstra wrote:
> On Sun, Jul 13, 2014 at 07:45:56PM -0400, Sasha Levin wrote:
>> On 07/13/2014 05:51 PM, Sasha Levin wrote:
>>> Hi all,
>>>
>>> While fuzzing with trinity inside a KVM tools guest running the latest -next
>>> kernel with the KASAN patchset, I've stumbled on the following spew:
> 
> WTH is a KASAN?
> 

It's a dynamic memory checker that detects use-after-free and out-of-bounds accesses - https://lkml.org/lkml/2014/7/9/990



* Re: sched, timers: use after free in __lock_task_sighand when exiting a process
  2014-07-14  9:34     ` Andrey Ryabinin
@ 2014-07-14  9:58       ` Peter Zijlstra
  2014-07-14 10:25         ` Andrey Ryabinin
  0 siblings, 1 reply; 17+ messages in thread
From: Peter Zijlstra @ 2014-07-14  9:58 UTC (permalink / raw)
  To: Andrey Ryabinin
  Cc: Sasha Levin, Ingo Molnar, John Stultz, Thomas Gleixner,
	Frederic Weisbecker, LKML, Dave Jones, Oleg Nesterov

On Mon, Jul 14, 2014 at 01:34:40PM +0400, Andrey Ryabinin wrote:
> On 07/14/14 13:04, Peter Zijlstra wrote:
> > On Sun, Jul 13, 2014 at 07:45:56PM -0400, Sasha Levin wrote:
> >> On 07/13/2014 05:51 PM, Sasha Levin wrote:
> >>> Hi all,
> >>>
> >>> While fuzzing with trinity inside a KVM tools guest running the latest -next
> >>> kernel with the KASAN patchset, I've stumbled on the following spew:
> > 
> > WTH is a KASAN?
> > 
> 
> It's dynamic memory checker, detects use after free, out of bounds accesses - https://lkml.org/lkml/2014/7/9/990

How is it better/different from the existing use after free stuff? That
email fails to tell me.


* Re: sched, timers: use after free in __lock_task_sighand when exiting a process
  2014-07-14  9:58       ` Peter Zijlstra
@ 2014-07-14 10:25         ` Andrey Ryabinin
  0 siblings, 0 replies; 17+ messages in thread
From: Andrey Ryabinin @ 2014-07-14 10:25 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Sasha Levin, Ingo Molnar, John Stultz, Thomas Gleixner,
	Frederic Weisbecker, LKML, Dave Jones, Oleg Nesterov

On 07/14/14 13:58, Peter Zijlstra wrote:
> On Mon, Jul 14, 2014 at 01:34:40PM +0400, Andrey Ryabinin wrote:
>> On 07/14/14 13:04, Peter Zijlstra wrote:
>>> On Sun, Jul 13, 2014 at 07:45:56PM -0400, Sasha Levin wrote:
>>>> On 07/13/2014 05:51 PM, Sasha Levin wrote:
>>>>> Hi all,
>>>>>
>>>>> While fuzzing with trinity inside a KVM tools guest running the latest -next
>>>>> kernel with the KASAN patchset, I've stumbled on the following spew:
>>>
>>> WTH is a KASAN?
>>>
>>
>> It's dynamic memory checker, detects use after free, out of bounds accesses - https://lkml.org/lkml/2014/7/9/990
> 
> How is it better/different from the existing use after free stuff? That
> email fails to tell me.
> 

DEBUG_PAGEALLOC works at page granularity; kasan can do checks at sub-page granularity (for slab objects).

SLUB_DEBUG (poisoning, etc.) doesn't guarantee that you will catch a use-after-free,
for example if you are only reading and checking a single bit in some already-freed object. Kasan will catch this.

KASAN is very similar to kmemcheck, but definitely faster.
The only thing kasan currently can't catch, while kmemcheck can, is reads of uninitialized memory.


* Re: sched, timers: use after free in __lock_task_sighand when exiting a process
  2014-07-14  9:04   ` Peter Zijlstra
  2014-07-14  9:34     ` Andrey Ryabinin
@ 2014-07-14 14:49     ` Oleg Nesterov
  2014-07-14 15:13       ` Oleg Nesterov
                         ` (3 more replies)
  1 sibling, 4 replies; 17+ messages in thread
From: Oleg Nesterov @ 2014-07-14 14:49 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Sasha Levin, Ingo Molnar, John Stultz, Thomas Gleixner,
	Frederic Weisbecker, LKML, Dave Jones, Andrey Ryabinin

On 07/14, Peter Zijlstra wrote:
>
> On Sun, Jul 13, 2014 at 07:45:56PM -0400, Sasha Levin wrote:
> >
> > [  876.319044] ==================================================================
> > [  876.319044] AddressSanitizer: use after free in do_raw_spin_unlock+0x4b/0x1a0 at addr ffff8803e48cec18
> > [  876.319044] page:ffffea000f923380 count:0 mapcount:0 mapping:          (null) index:0x0
> > [  876.319044] page flags: 0x2fffff80008000(tail)
> > [  876.319044] page dumped because: kasan error
> > [  876.319044] CPU: 26 PID: 8749 Comm: trinity-watchdo Tainted: G        W      3.16.0-rc4-next-20140711-sasha-00046-g07d3099-dirty #817
> > [  876.319044]  00000000000000fb 0000000000000000 ffffea000f923380 ffff8805c417fc70
> > [  876.319044]  ffffffff9de47068 ffff8805c417fd40 ffff8805c417fd30 ffffffff99426f5c
> > [  876.319044]  0000000000000010 0000000000000000 ffff8805c417fc9d 66666620000000a8
> > [  876.319044] Call Trace:
> > [  876.319044] dump_stack (lib/dump_stack.c:52)
> > [  876.319044] kasan_report_error (mm/kasan/report.c:98 mm/kasan/report.c:166)
> > [  876.319044] __asan_load8 (mm/kasan/kasan.c:364)
> > [  876.319044] do_raw_spin_unlock (./arch/x86/include/asm/current.h:14 kernel/locking/spinlock_debug.c:99 kernel/locking/spinlock_debug.c:158)
> > [  876.319044] _raw_spin_unlock (include/linux/spinlock_api_smp.h:152 kernel/locking/spinlock.c:183)
> > [  876.319044] __lock_task_sighand (include/linux/rcupdate.h:858 kernel/signal.c:1285)
> > [  876.319044] do_send_sig_info (kernel/signal.c:1191)
> > [  876.319044] group_send_sig_info (kernel/signal.c:1304)
> > [  876.319044] kill_pid_info (kernel/signal.c:1339)
> > [  876.319044] SYSC_kill (kernel/signal.c:1423 kernel/signal.c:2900)

Looks like a false alarm at first glance...

> Oleg, what guarantees the RCU free of task-struct and sighand?

> The only RCU I can find is delayed_put_task_struct() but that's not
> often used.

Yes, usually the code uses put_task_struct(). delayed_put_task_struct()
acts almost as "if (dec_and_test(usage)) kfree_rcu()", but allows one to
use get_task_struct() if you observe this task under rcu_read_lock().

Say,
	rcu_read_lock();
	task = find_task_by_vpid(...);
	if (task)
		get_task_struct(task);
	rcu_read_unlock();

If release_task() used dec_and_test + kfree_rcu, the code above could
not work.

> TASK_DEAD etc. use regular put_task_struct() and that
> doesn't seem to involve RCU.

Yes, the task itself (or, depending on pov, the scheduler) has a reference.
copy_process() does

	/*
	 * One for us, one for whoever does the "release_task()" (usually
	 * parent)
	 */
	atomic_set(&tsk->usage, 2);

"us" actually means that put_task_struct(TASK_DEAD).

As for ->sighand, note that sighand_cachep is SLAB_DESTROY_BY_RCU. So this
memory is RCU-freed in the sense that it can't be returned to the system,
but it can be reused by another task. This is fine, lock_task_sighand()
rechecks sighand == task->sighand under ->siglock.

So perhaps this tool misinterprets kmem_cache_free(sighand_cachep) as a use
after free?

We are going to add some comments to lock_task_sighand(), and clean it up;
it can look much simpler.

Oleg.



* Re: sched, timers: use after free in __lock_task_sighand when exiting a process
  2014-07-14 14:49     ` Oleg Nesterov
@ 2014-07-14 15:13       ` Oleg Nesterov
  2014-07-14 15:31       ` Andrey Ryabinin
                         ` (2 subsequent siblings)
  3 siblings, 0 replies; 17+ messages in thread
From: Oleg Nesterov @ 2014-07-14 15:13 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Sasha Levin, Ingo Molnar, John Stultz, Thomas Gleixner,
	Frederic Weisbecker, LKML, Dave Jones, Andrey Ryabinin

I'm afraid I wasn't clear... Let me try again.

So yes, this "race" is of course possible:

	lock_task_sighand()				release_task()

	sighand = task->sighand;
							sighand = task->sighand;

							spin_lock(sighand->siglock);
							task->sighand = NULL;
							spin_unlock(sighand->siglock);

							kmem_cache_free(sighand);

	spin_lock(sighand->siglock);

but this is fine. lock_task_sighand() will notice task->sighand == NULL
under ->siglock and fail.

SLAB_DESTROY_BY_RCU guarantees that this memory is still a sighand_struct
even if it is freed (or even reallocated). spin_lock/spin_unlock is safe
because ->siglock is initialized by sighand_ctor(). And until the caller of
lock_task_sighand() drops ->siglock, kmem_cache_free() is not possible: the
task can't exit.
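For reference, a sketch of how the cache and its ctor are set up, close to what kernel/fork.c did around v3.16 (flags and field names are from memory and may differ slightly between versions):

```c
static void sighand_ctor(void *data)
{
	struct sighand_struct *sighand = data;

	/* Runs once per slab object, NOT on every allocation: the lock
	 * stays valid across kmem_cache_free() and reuse. */
	spin_lock_init(&sighand->siglock);
	init_waitqueue_head(&sighand->signalfd_wqh);
}

void __init proc_caches_init(void)
{
	sighand_cachep = kmem_cache_create("sighand_cache",
			sizeof(struct sighand_struct), 0,
			SLAB_HWCACHE_ALIGN | SLAB_PANIC | SLAB_DESTROY_BY_RCU |
			SLAB_NOTRACK, sighand_ctor);
	/* ... */
}
```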

To remind, this is one of the reasons why rt_mutex_unlock() must be as
"atomic" as a spinlock_t unlock. Without the recent fix from tglx,
spin_unlock() (turned into rt_mutex_unlock() on -rt) could play with the
freed memory: once "unlock" makes another "lock" possible, the new owner can
take this lock and free this memory, while lock_task_sighand() can be called
outside of rcu_read_lock().

On 07/14, Oleg Nesterov wrote:
>
> On 07/14, Peter Zijlstra wrote:
> >
> > On Sun, Jul 13, 2014 at 07:45:56PM -0400, Sasha Levin wrote:
> > >
> > > [  876.319044] ==================================================================
> > > [  876.319044] AddressSanitizer: use after free in do_raw_spin_unlock+0x4b/0x1a0 at addr ffff8803e48cec18
> > > [  876.319044] page:ffffea000f923380 count:0 mapcount:0 mapping:          (null) index:0x0
> > > [  876.319044] page flags: 0x2fffff80008000(tail)
> > > [  876.319044] page dumped because: kasan error
> > > [  876.319044] CPU: 26 PID: 8749 Comm: trinity-watchdo Tainted: G        W      3.16.0-rc4-next-20140711-sasha-00046-g07d3099-dirty #817
> > > [  876.319044]  00000000000000fb 0000000000000000 ffffea000f923380 ffff8805c417fc70
> > > [  876.319044]  ffffffff9de47068 ffff8805c417fd40 ffff8805c417fd30 ffffffff99426f5c
> > > [  876.319044]  0000000000000010 0000000000000000 ffff8805c417fc9d 66666620000000a8
> > > [  876.319044] Call Trace:
> > > [  876.319044] dump_stack (lib/dump_stack.c:52)
> > > [  876.319044] kasan_report_error (mm/kasan/report.c:98 mm/kasan/report.c:166)
> > > [  876.319044] __asan_load8 (mm/kasan/kasan.c:364)
> > > [  876.319044] do_raw_spin_unlock (./arch/x86/include/asm/current.h:14 kernel/locking/spinlock_debug.c:99 kernel/locking/spinlock_debug.c:158)
> > > [  876.319044] _raw_spin_unlock (include/linux/spinlock_api_smp.h:152 kernel/locking/spinlock.c:183)
> > > [  876.319044] __lock_task_sighand (include/linux/rcupdate.h:858 kernel/signal.c:1285)
> > > [  876.319044] do_send_sig_info (kernel/signal.c:1191)
> > > [  876.319044] group_send_sig_info (kernel/signal.c:1304)
> > > [  876.319044] kill_pid_info (kernel/signal.c:1339)
> > > [  876.319044] SYSC_kill (kernel/signal.c:1423 kernel/signal.c:2900)
>
> Looks like a false alarm at first glance...
>
> > Oleg, what guarantees the RCU free of task-struct and sighand?
>
> > The only RCU I can find is delayed_put_task_struct() but that's not
> > often used.
>
> Yes, usually the code uses put_task_struct(). delayed_put_task_struct()
> acts almost as "if (dec_and_test(usage)) kfree_rcu(), but allows to use
> get_task_struct() if you observe this task under rcu_read_lock().
>
> Say,
> 	rcu_read_lock();
> 	task = find_task_by_vpid(...);
> 	if (task)
> 		get_task_struct(task);
> 	rcu_read_unlock();
>
> If release_task() used dec_and_test + kfree_rcu, the code above could
> not work.
>
> > TASK_DEAD etc. use regular put_task_struct() and that
> > doesn't seem to involve RCU.
>
> Yes, the task itself (or, depending ob pov, scheduler) has a reference.
> copy_process() does
>
> 	/*
> 	 * One for us, one for whoever does the "release_task()" (usually
> 	 * parent)
> 	 */
> 	atomic_set(&tsk->usage, 2);
>
> "us" actually means that put_task_struct(TASK_DEAD).
>
> As for ->sighand, note that sighand_cachep is SLAB_DESTROY_BY_RCU. So this
> memory is RCU free in a sense that it can't be returned to system, but it
> can be reused by another task. This is fine, lock_task_sighand() rechecks
> sighand == task->sighand under ->siglock.
>
> So perhaps this tool misinterprets kmem_cache_free(sighand_cachep) as use
> after free?
>
> We are going to add some comments into lock_task_sighand(). And cleanup it,
> it can look much simpler.
>
> Oleg.



* Re: sched, timers: use after free in __lock_task_sighand when exiting a process
  2014-07-14 14:49     ` Oleg Nesterov
  2014-07-14 15:13       ` Oleg Nesterov
@ 2014-07-14 15:31       ` Andrey Ryabinin
  2014-07-14 16:01       ` finish_task_switch && prev_state (Was: sched, timers: use after free in __lock_task_sighand when exiting a process) Oleg Nesterov
  2014-07-15 13:28       ` sched, timers: use after free in __lock_task_sighand when exiting a process Peter Zijlstra
  3 siblings, 0 replies; 17+ messages in thread
From: Andrey Ryabinin @ 2014-07-14 15:31 UTC (permalink / raw)
  To: Oleg Nesterov, Peter Zijlstra
  Cc: Sasha Levin, Ingo Molnar, John Stultz, Thomas Gleixner,
	Frederic Weisbecker, LKML, Dave Jones

On 07/14/14 18:49, Oleg Nesterov wrote:
> On 07/14, Peter Zijlstra wrote:
>>
>> On Sun, Jul 13, 2014 at 07:45:56PM -0400, Sasha Levin wrote:
>>>
>>> [  876.319044] ==================================================================
>>> [  876.319044] AddressSanitizer: use after free in do_raw_spin_unlock+0x4b/0x1a0 at addr ffff8803e48cec18
>>> [  876.319044] page:ffffea000f923380 count:0 mapcount:0 mapping:          (null) index:0x0
>>> [  876.319044] page flags: 0x2fffff80008000(tail)
>>> [  876.319044] page dumped because: kasan error
>>> [  876.319044] CPU: 26 PID: 8749 Comm: trinity-watchdo Tainted: G        W      3.16.0-rc4-next-20140711-sasha-00046-g07d3099-dirty #817
>>> [  876.319044]  00000000000000fb 0000000000000000 ffffea000f923380 ffff8805c417fc70
>>> [  876.319044]  ffffffff9de47068 ffff8805c417fd40 ffff8805c417fd30 ffffffff99426f5c
>>> [  876.319044]  0000000000000010 0000000000000000 ffff8805c417fc9d 66666620000000a8
>>> [  876.319044] Call Trace:
>>> [  876.319044] dump_stack (lib/dump_stack.c:52)
>>> [  876.319044] kasan_report_error (mm/kasan/report.c:98 mm/kasan/report.c:166)
>>> [  876.319044] __asan_load8 (mm/kasan/kasan.c:364)
>>> [  876.319044] do_raw_spin_unlock (./arch/x86/include/asm/current.h:14 kernel/locking/spinlock_debug.c:99 kernel/locking/spinlock_debug.c:158)
>>> [  876.319044] _raw_spin_unlock (include/linux/spinlock_api_smp.h:152 kernel/locking/spinlock.c:183)
>>> [  876.319044] __lock_task_sighand (include/linux/rcupdate.h:858 kernel/signal.c:1285)
>>> [  876.319044] do_send_sig_info (kernel/signal.c:1191)
>>> [  876.319044] group_send_sig_info (kernel/signal.c:1304)
>>> [  876.319044] kill_pid_info (kernel/signal.c:1339)
>>> [  876.319044] SYSC_kill (kernel/signal.c:1423 kernel/signal.c:2900)
> 
> Looks like a false alarm at first glance...
> 
>> Oleg, what guarantees the RCU free of task-struct and sighand?
> 
>> The only RCU I can find is delayed_put_task_struct() but that's not
>> often used.
> 
> Yes, usually the code uses put_task_struct(). delayed_put_task_struct()
> acts almost as "if (dec_and_test(usage)) kfree_rcu(), but allows to use
> get_task_struct() if you observe this task under rcu_read_lock().
> 
> Say,
> 	rcu_read_lock();
> 	task = find_task_by_vpid(...);
> 	if (task)
> 		get_task_struct(task);
> 	rcu_read_unlock();
> 
> If release_task() used dec_and_test + kfree_rcu, the code above could
> not work.
> 
>> TASK_DEAD etc. use regular put_task_struct() and that
>> doesn't seem to involve RCU.
> 
> Yes, the task itself (or, depending ob pov, scheduler) has a reference.
> copy_process() does
> 
> 	/*
> 	 * One for us, one for whoever does the "release_task()" (usually
> 	 * parent)
> 	 */
> 	atomic_set(&tsk->usage, 2);
> 
> "us" actually means that put_task_struct(TASK_DEAD).
> 
> As for ->sighand, note that sighand_cachep is SLAB_DESTROY_BY_RCU. So this
> memory is RCU free in a sense that it can't be returned to system, but it
> can be reused by another task. This is fine, lock_task_sighand() rechecks
> sighand == task->sighand under ->siglock.
> 
> So perhaps this tool misinterprets kmem_cache_free(sighand_cachep) as use
> after free?
> 

Yes, you are right. I've already seen such a false positive with RCU slabs
and kasan; it was fixed at some point, but somehow the fix didn't make it
into the final patches.

I'm sorry for wasting everyone's time.

> We are going to add some comments into lock_task_sighand(). And cleanup it,
> it can look much simpler.
> 
> Oleg.
> 
> 



* finish_task_switch && prev_state (Was: sched, timers: use after free in __lock_task_sighand when exiting a process)
  2014-07-14 14:49     ` Oleg Nesterov
  2014-07-14 15:13       ` Oleg Nesterov
  2014-07-14 15:31       ` Andrey Ryabinin
@ 2014-07-14 16:01       ` Oleg Nesterov
  2014-07-15 13:12         ` Oleg Nesterov
  2014-07-15 13:28       ` sched, timers: use after free in __lock_task_sighand when exiting a process Peter Zijlstra
  3 siblings, 1 reply; 17+ messages in thread
From: Oleg Nesterov @ 2014-07-14 16:01 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Sasha Levin, Ingo Molnar, John Stultz, Thomas Gleixner,
	Frederic Weisbecker, LKML, Dave Jones, Andrey Ryabinin

On 07/14, Oleg Nesterov wrote:
>
> Yes, the task itself (or, depending ob pov, scheduler) has a reference.
> copy_process() does
>
> 	/*
> 	 * One for us, one for whoever does the "release_task()" (usually
> 	 * parent)
> 	 */
> 	atomic_set(&tsk->usage, 2);
>
> "us" actually means that put_task_struct(TASK_DEAD).

Off-topic, but I do not understand the huge comment in finish_task_switch().
Perhaps this was all true long ago, but currently "prev could be scheduled
on another cpu" is certainly impossible?

And if it were possible, we would have many more problems. In particular, in
that case we could still drop the reference twice?

I'll try to recheck, but do you see anything wrong in the patch below?

Oleg.

--- x/kernel/sched/core.c
+++ x/kernel/sched/core.c
@@ -2197,22 +2197,9 @@ static void finish_task_switch(struct rq
 	__releases(rq->lock)
 {
 	struct mm_struct *mm = rq->prev_mm;
-	long prev_state;
 
 	rq->prev_mm = NULL;
 
-	/*
-	 * A task struct has one reference for the use as "current".
-	 * If a task dies, then it sets TASK_DEAD in tsk->state and calls
-	 * schedule one last time. The schedule call will never return, and
-	 * the scheduled task must drop that reference.
-	 * The test for TASK_DEAD must occur while the runqueue locks are
-	 * still held, otherwise prev could be scheduled on another cpu, die
-	 * there before we look at prev->state, and then the reference would
-	 * be dropped twice.
-	 *		Manfred Spraul <manfred@colorfullife.com>
-	 */
-	prev_state = prev->state;
 	vtime_task_switch(prev);
 	finish_arch_switch(prev);
 	perf_event_task_sched_in(prev, current);
@@ -2222,7 +2209,7 @@ static void finish_task_switch(struct rq
 	fire_sched_in_preempt_notifiers(current);
 	if (mm)
 		mmdrop(mm);
-	if (unlikely(prev_state == TASK_DEAD)) {
+	if (unlikely(prev->state == TASK_DEAD)) {
 		if (prev->sched_class->task_dead)
 			prev->sched_class->task_dead(prev);
 



* Re: finish_task_switch && prev_state (Was: sched, timers: use after free in __lock_task_sighand when exiting a process)
  2014-07-14 16:01       ` finish_task_switch && prev_state (Was: sched, timers: use after free in __lock_task_sighand when exiting a process) Oleg Nesterov
@ 2014-07-15 13:12         ` Oleg Nesterov
  2014-07-15 13:23           ` Peter Zijlstra
  0 siblings, 1 reply; 17+ messages in thread
From: Oleg Nesterov @ 2014-07-15 13:12 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Sasha Levin, Ingo Molnar, John Stultz, Thomas Gleixner,
	Frederic Weisbecker, LKML, Dave Jones, Andrey Ryabinin

Ah, I am stupid, please ignore.

Of course a TASK_DEAD task cannot schedule, but we can race with the RUNNING ->
DEAD transition. So we should only do put_task_struct() if "prev" was already
TASK_DEAD before we drop the rq locks.

On 07/14, Oleg Nesterov wrote:
>
> On 07/14, Oleg Nesterov wrote:
> >
> > Yes, the task itself (or, depending on pov, the scheduler) has a reference.
> > copy_process() does
> >
> > 	/*
> > 	 * One for us, one for whoever does the "release_task()" (usually
> > 	 * parent)
> > 	 */
> > 	atomic_set(&tsk->usage, 2);
> >
> > "us" actually means the put_task_struct() done for the TASK_DEAD task.
>
> Off-topic, but I do not understand the huge comment in finish_task_switch().
> Perhaps this was all true long ago, but currently "prev could be scheduled
> on another cpu" is certainly impossible?
>
> And if it were possible, wouldn't we have much bigger problems? In particular,
> in this case we could still drop the reference twice?
>
> I'll try to recheck, but do you see anything wrong in the patch below?
>
> Oleg.
>
> --- x/kernel/sched/core.c
> +++ x/kernel/sched/core.c
> @@ -2197,22 +2197,9 @@ static void finish_task_switch(struct rq
>  	__releases(rq->lock)
>  {
>  	struct mm_struct *mm = rq->prev_mm;
> -	long prev_state;
>
>  	rq->prev_mm = NULL;
>
> -	/*
> -	 * A task struct has one reference for the use as "current".
> -	 * If a task dies, then it sets TASK_DEAD in tsk->state and calls
> -	 * schedule one last time. The schedule call will never return, and
> -	 * the scheduled task must drop that reference.
> -	 * The test for TASK_DEAD must occur while the runqueue locks are
> -	 * still held, otherwise prev could be scheduled on another cpu, die
> -	 * there before we look at prev->state, and then the reference would
> -	 * be dropped twice.
> -	 *		Manfred Spraul <manfred@colorfullife.com>
> -	 */
> -	prev_state = prev->state;
>  	vtime_task_switch(prev);
>  	finish_arch_switch(prev);
>  	perf_event_task_sched_in(prev, current);
> @@ -2222,7 +2209,7 @@ static void finish_task_switch(struct rq
>  	fire_sched_in_preempt_notifiers(current);
>  	if (mm)
>  		mmdrop(mm);
> -	if (unlikely(prev_state == TASK_DEAD)) {
> +	if (unlikely(prev->state == TASK_DEAD)) {
>  		if (prev->sched_class->task_dead)
>  			prev->sched_class->task_dead(prev);
>


^ permalink raw reply	[flat|nested] 17+ messages in thread

* Re: finish_task_switch && prev_state (Was: sched, timers: use after free in __lock_task_sighand when exiting a process)
  2014-07-15 13:12         ` Oleg Nesterov
@ 2014-07-15 13:23           ` Peter Zijlstra
  2014-07-15 14:25             ` Oleg Nesterov
  0 siblings, 1 reply; 17+ messages in thread
From: Peter Zijlstra @ 2014-07-15 13:23 UTC (permalink / raw)
  To: Oleg Nesterov
  Cc: Sasha Levin, Ingo Molnar, John Stultz, Thomas Gleixner,
	Frederic Weisbecker, LKML, Dave Jones, Andrey Ryabinin


On Tue, Jul 15, 2014 at 03:12:40PM +0200, Oleg Nesterov wrote:
> Ah, I am stupid, please ignore.
> 
> Of course a TASK_DEAD task can not schedule, but we can race with RUNNING ->
> DEAD transition. So we should only do put_task_struct() if "prev" was already
> TASK_DEAD before we drop the rq locks.

Not so stupid; that is, when I read your question yesterday I didn't
have a ready answer and queued the email to look at later.

This does mean the comment isn't clear enough at the very least.

Does this make it better?

---
 kernel/sched/core.c | 14 ++++++++------
 1 file changed, 8 insertions(+), 6 deletions(-)

diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index 7bc599dc4aa4..aa67f1cfa58e 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -2211,13 +2211,15 @@ static void finish_task_switch(struct rq *rq, struct task_struct *prev)
 
 	/*
 	 * A task struct has one reference for the use as "current".
+	 *
 	 * If a task dies, then it sets TASK_DEAD in tsk->state and calls
-	 * schedule one last time. The schedule call will never return, and
-	 * the scheduled task must drop that reference.
-	 * The test for TASK_DEAD must occur while the runqueue locks are
-	 * still held, otherwise prev could be scheduled on another cpu, die
-	 * there before we look at prev->state, and then the reference would
-	 * be dropped twice.
+	 * schedule one last time. The schedule call will never return, and the
+	 * scheduled task must drop that reference.
+	 *
+	 * The test for TASK_DEAD must occur while the runqueue locks are still
+	 * held, otherwise we can race with RUNNING -> DEAD transitions, and
+	 * then the reference would be dropped twice.
+	 *
 	 *		Manfred Spraul <manfred@colorfullife.com>
 	 */
 	prev_state = prev->state;


^ permalink raw reply related	[flat|nested] 17+ messages in thread

* Re: sched, timers: use after free in __lock_task_sighand when exiting a process
  2014-07-14 14:49     ` Oleg Nesterov
                         ` (2 preceding siblings ...)
  2014-07-14 16:01       ` finish_task_switch && prev_state (Was: sched, timers: use after free in __lock_task_sighand when exiting a process) Oleg Nesterov
@ 2014-07-15 13:28       ` Peter Zijlstra
  3 siblings, 0 replies; 17+ messages in thread
From: Peter Zijlstra @ 2014-07-15 13:28 UTC (permalink / raw)
  To: Oleg Nesterov
  Cc: Sasha Levin, Ingo Molnar, John Stultz, Thomas Gleixner,
	Frederic Weisbecker, LKML, Dave Jones, Andrey Ryabinin


On Mon, Jul 14, 2014 at 04:49:53PM +0200, Oleg Nesterov wrote:
> > Oleg, what guarantees the RCU free of task-struct and sighand?
> 
> > The only RCU I can find is delayed_put_task_struct() but that's not
> > often used.
> 
> Yes, usually the code uses put_task_struct(). delayed_put_task_struct()
> acts almost as "if (dec_and_test(usage)) kfree_rcu()", but allows you to use
> get_task_struct() if you observe this task under rcu_read_lock().
> 
> Say,
> 	rcu_read_lock();
> 	task = find_task_by_vpid(...);
> 	if (task)
> 		get_task_struct(task);
> 	rcu_read_unlock();
> 
> If release_task() used dec_and_test + kfree_rcu, the code above could
> not work.

Agreed.

> > TASK_DEAD etc. use regular put_task_struct() and that
> > doesn't seem to involve RCU.
> 
> Yes, the task itself (or, depending on pov, the scheduler) has a reference.
> copy_process() does
> 
> 	/*
> 	 * One for us, one for whoever does the "release_task()" (usually
> 	 * parent)
> 	 */
> 	atomic_set(&tsk->usage, 2);
> 
> "us" actually means the put_task_struct() done for the TASK_DEAD task.

Right, that's it. I got confused in the exit code. Thanks!


^ permalink raw reply	[flat|nested] 17+ messages in thread

* Re: finish_task_switch && prev_state (Was: sched, timers: use after free in __lock_task_sighand when exiting a process)
  2014-07-15 13:23           ` Peter Zijlstra
@ 2014-07-15 14:25             ` Oleg Nesterov
  2014-07-29  9:10               ` Peter Zijlstra
  0 siblings, 1 reply; 17+ messages in thread
From: Oleg Nesterov @ 2014-07-15 14:25 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Sasha Levin, Ingo Molnar, John Stultz, Thomas Gleixner,
	Frederic Weisbecker, LKML, Dave Jones, Andrey Ryabinin

On 07/15, Peter Zijlstra wrote:
>
> @@ -2211,13 +2211,15 @@ static void finish_task_switch(struct rq *rq, struct task_struct *prev)
>  
>  	/*
>  	 * A task struct has one reference for the use as "current".
> +	 *
>  	 * If a task dies, then it sets TASK_DEAD in tsk->state and calls
> -	 * schedule one last time. The schedule call will never return, and
> -	 * the scheduled task must drop that reference.
> -	 * The test for TASK_DEAD must occur while the runqueue locks are
> -	 * still held, otherwise prev could be scheduled on another cpu, die
> -	 * there before we look at prev->state, and then the reference would
> -	 * be dropped twice.
> +	 * schedule one last time. The schedule call will never return, and the
> +	 * scheduled task must drop that reference.
> +	 *
> +	 * The test for TASK_DEAD must occur while the runqueue locks are still
> +	 * held, otherwise we can race with RUNNING -> DEAD transitions, and
> +	 * then the reference would be dropped twice.
> +	 *
>  	 *		Manfred Spraul <manfred@colorfullife.com>
>  	 */

Agreed, this looks much more understandable!


And probably I missed something again, but it seems that this logic is broken
with __ARCH_WANT_UNLOCKED_CTXSW.

Of course, even if I am right this is purely theoretical, but smp_wmb() before
"->on_cpu = 0" is not enough and we need a full barrier?

Oleg.


^ permalink raw reply	[flat|nested] 17+ messages in thread

* Re: finish_task_switch && prev_state (Was: sched, timers: use after free in __lock_task_sighand when exiting a process)
  2014-07-15 14:25             ` Oleg Nesterov
@ 2014-07-29  9:10               ` Peter Zijlstra
  2014-07-29  9:22                 ` Peter Zijlstra
  0 siblings, 1 reply; 17+ messages in thread
From: Peter Zijlstra @ 2014-07-29  9:10 UTC (permalink / raw)
  To: Oleg Nesterov
  Cc: Sasha Levin, Ingo Molnar, John Stultz, Thomas Gleixner,
	Frederic Weisbecker, LKML, Dave Jones, Andrey Ryabinin

On Tue, Jul 15, 2014 at 04:25:25PM +0200, Oleg Nesterov wrote:

> And probably I missed something again, but it seems that this logic is broken
> with __ARCH_WANT_UNLOCKED_CTXSW.
> 
> Of course, even if I am right this is purely theoretical, but smp_wmb() before
> "->on_cpu = 0" is not enough and we need a full barrier?

(long delay there, forgot about this thread, sorry)

Yes, I think I see that... but now I think the comment is further wrong.

It's not rq->lock that is important; remember, a concurrent wakeup onto
another CPU does not require our rq->lock at all.

It is the ->on_cpu = 0 store that is important (for both the
UNLOCKED_CTXSW cases). As soon as that store comes through, the task can
start running on the remote cpu.

Now the below patch 'fixes' this but at the cost of adding a full
barrier which is somewhat unfortunate to say the least.

wmb's are free on x86 and generally cheaper than mbs, so it would be good to
find another solution to this problem...

---
 kernel/sched/core.c  | 10 +++++-----
 kernel/sched/sched.h | 10 ++++++++--
 2 files changed, 13 insertions(+), 7 deletions(-)

diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index 2676866b4394..950264381644 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -2214,11 +2214,11 @@ static void finish_task_switch(struct rq *rq, struct task_struct *prev)
 	 * If a task dies, then it sets TASK_DEAD in tsk->state and calls
 	 * schedule one last time. The schedule call will never return, and
 	 * the scheduled task must drop that reference.
-	 * The test for TASK_DEAD must occur while the runqueue locks are
-	 * still held, otherwise prev could be scheduled on another cpu, die
-	 * there before we look at prev->state, and then the reference would
-	 * be dropped twice.
-	 *		Manfred Spraul <manfred@colorfullife.com>
+	 *
+	 * We must observe prev->state before clearing prev->on_cpu (in
+	 * finish_lock_switch), otherwise a concurrent wakeup can get prev
+	 * running on another CPU and we could race with its RUNNING -> DEAD
+	 * transition, and then the reference would be dropped twice.
 	 */
 	prev_state = prev->state;
 	vtime_task_switch(prev);
diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
index 579712f4e9d5..259632c09c98 100644
--- a/kernel/sched/sched.h
+++ b/kernel/sched/sched.h
@@ -973,8 +973,11 @@ static inline void finish_lock_switch(struct rq *rq, struct task_struct *prev)
 	 * After ->on_cpu is cleared, the task can be moved to a different CPU.
 	 * We must ensure this doesn't happen until the switch is completely
 	 * finished.
+	 *
+	 * We must furthermore ensure the prev->state read in
+	 * finish_task_switch() is complete before allowing this store.
 	 */
-	smp_wmb();
+	smp_mb();
 	prev->on_cpu = 0;
 #endif
 #ifdef CONFIG_DEBUG_SPINLOCK
@@ -1012,8 +1015,11 @@ static inline void finish_lock_switch(struct rq *rq, struct task_struct *prev)
 	 * After ->on_cpu is cleared, the task can be moved to a different CPU.
 	 * We must ensure this doesn't happen until the switch is completely
 	 * finished.
+	 *
+	 * We must furthermore ensure the prev->state read in
+	 * finish_task_switch() is complete before allowing this store.
 	 */
-	smp_wmb();
+	smp_mb();
 	prev->on_cpu = 0;
 #endif
 	local_irq_enable();

^ permalink raw reply related	[flat|nested] 17+ messages in thread

* Re: finish_task_switch && prev_state (Was: sched, timers: use after free in __lock_task_sighand when exiting a process)
  2014-07-29  9:10               ` Peter Zijlstra
@ 2014-07-29  9:22                 ` Peter Zijlstra
  2014-07-29 15:53                   ` Oleg Nesterov
  0 siblings, 1 reply; 17+ messages in thread
From: Peter Zijlstra @ 2014-07-29  9:22 UTC (permalink / raw)
  To: Oleg Nesterov
  Cc: Sasha Levin, Ingo Molnar, John Stultz, Thomas Gleixner,
	Frederic Weisbecker, LKML, Dave Jones, Andrey Ryabinin

On Tue, Jul 29, 2014 at 11:10:18AM +0200, Peter Zijlstra wrote:
> On Tue, Jul 15, 2014 at 04:25:25PM +0200, Oleg Nesterov wrote:
> 
> > And probably I missed something again, but it seems that this logic is broken
> > with __ARCH_WANT_UNLOCKED_CTXSW.
> > 
> > Of course, even if I am right this is purely theoretical, but smp_wmb() before
> > "->on_cpu = 0" is not enough and we need a full barrier?
> 
> (long delay there, forgot about this thread, sorry)
> 
> Yes, I think I see that.. but now I think the comment is further wrong.
> 
> It's not rq->lock that is important; remember, a concurrent wakeup onto
> another CPU does not require our rq->lock at all.
> 
> It is the ->on_cpu = 0 store that is important (for both the
> UNLOCKED_CTXSW cases). As soon as that store comes through, the task can
> start running on the remote cpu.
> 
> Now the below patch 'fixes' this but at the cost of adding a full
> barrier which is somewhat unfortunate to say the least.
> 
> > wmb's are free on x86 and generally cheaper than mbs, so it would be good to
> > find another solution to this problem...

Something like so then?

---
 kernel/sched/core.c | 36 ++++++++++++++++++++----------------
 1 file changed, 20 insertions(+), 16 deletions(-)

diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index 2676866b4394..179390f7380d 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -2190,6 +2190,7 @@ prepare_task_switch(struct rq *rq, struct task_struct *prev,
  * finish_task_switch - clean up after a task-switch
  * @rq: runqueue associated with task-switch
  * @prev: the thread we just switched away from.
+ * @prev_state: the state of @prev before we switched away from it.
  *
  * finish_task_switch must be called after the context switch, paired
  * with a prepare_task_switch call before the context switch.
@@ -2201,26 +2202,14 @@ prepare_task_switch(struct rq *rq, struct task_struct *prev,
  * with the lock held can cause deadlocks; see schedule() for
  * details.)
  */
-static void finish_task_switch(struct rq *rq, struct task_struct *prev)
+static void 
+finish_task_switch(struct rq *rq, struct task_struct *prev, long prev_state)
 	__releases(rq->lock)
 {
 	struct mm_struct *mm = rq->prev_mm;
-	long prev_state;
 
 	rq->prev_mm = NULL;
 
-	/*
-	 * A task struct has one reference for the use as "current".
-	 * If a task dies, then it sets TASK_DEAD in tsk->state and calls
-	 * schedule one last time. The schedule call will never return, and
-	 * the scheduled task must drop that reference.
-	 * The test for TASK_DEAD must occur while the runqueue locks are
-	 * still held, otherwise prev could be scheduled on another cpu, die
-	 * there before we look at prev->state, and then the reference would
-	 * be dropped twice.
-	 *		Manfred Spraul <manfred@colorfullife.com>
-	 */
-	prev_state = prev->state;
 	vtime_task_switch(prev);
 	finish_arch_switch(prev);
 	perf_event_task_sched_in(prev, current);
@@ -2279,7 +2268,7 @@ asmlinkage __visible void schedule_tail(struct task_struct *prev)
 {
 	struct rq *rq = this_rq();
 
-	finish_task_switch(rq, prev);
+	finish_task_switch(rq, prev, 0);
 
 	/*
 	 * FIXME: do we need to worry about rq being invalidated by the
@@ -2304,6 +2293,21 @@ context_switch(struct rq *rq, struct task_struct *prev,
 	       struct task_struct *next)
 {
 	struct mm_struct *mm, *oldmm;
+	/*
+	 * A task struct has one reference for the use as "current".
+	 * If a task dies, then it sets TASK_DEAD in tsk->state and calls
+	 * schedule one last time. The schedule call will never return, and
+	 * the scheduled task must drop that reference.
+	 *
+	 * We must observe prev->state before clearing prev->on_cpu (in
+	 * finish_lock_switch), otherwise a concurrent wakeup can get prev
+	 * running on another CPU and we could race with its RUNNING -> DEAD
+	 * transition, and then the reference would be dropped twice.
+	 *
+	 * We avoid the race by observing prev->state while it is still
+	 * current.
+	 */
+	long prev_state = prev->state;
 
 	prepare_task_switch(rq, prev, next);
 
@@ -2347,7 +2351,7 @@ context_switch(struct rq *rq, struct task_struct *prev,
 	 * CPUs since it called schedule(), thus the 'rq' on its stack
 	 * frame will be invalid.
 	 */
-	finish_task_switch(this_rq(), prev);
+	finish_task_switch(this_rq(), prev, prev_state);
 }
 
 /*

^ permalink raw reply related	[flat|nested] 17+ messages in thread

* Re: finish_task_switch && prev_state (Was: sched, timers: use after free in __lock_task_sighand when exiting a process)
  2014-07-29  9:22                 ` Peter Zijlstra
@ 2014-07-29 15:53                   ` Oleg Nesterov
  0 siblings, 0 replies; 17+ messages in thread
From: Oleg Nesterov @ 2014-07-29 15:53 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Sasha Levin, Ingo Molnar, John Stultz, Thomas Gleixner,
	Frederic Weisbecker, LKML, Dave Jones, Andrey Ryabinin

On 07/29, Peter Zijlstra wrote:
>
> On Tue, Jul 29, 2014 at 11:10:18AM +0200, Peter Zijlstra wrote:
> > On Tue, Jul 15, 2014 at 04:25:25PM +0200, Oleg Nesterov wrote:
> >
> > > And probably I missed something again, but it seems that this logic is broken
> > > with __ARCH_WANT_UNLOCKED_CTXSW.
> > >
> > > Of course, even if I am right this is purely theoretical, but smp_wmb() before
> > > "->on_cpu = 0" is not enough and we need a full barrier?
> >
> > (long delay there, forgot about this thread, sorry)
> >
> > Yes, I think I see that.. but now I think the comment is further wrong.
> >
> > It's not rq->lock that is important; remember, a concurrent wakeup onto
> > another CPU does not require our rq->lock at all.
> >
> > It is the ->on_cpu = 0 store that is important (for both the
> > UNLOCKED_CTXSW cases). As soon as that store comes through, the task can
> > start running on the remote cpu.

Yes, I came to the same conclusion right after I sent that email.

> > Now the below patch 'fixes' this but at the cost of adding a full
> > barrier which is somewhat unfortunate to say the least.

And yes, this is obviously the "fix" I had in mind, but:

> > wmb's are free on x86 and generally cheaper than mbs, so it would be good to
> > find another solution to this problem...
>
> Something like so then?

Hmm, indeed! Unfortunately I didn't find this simple solution. Yes, I think
we should check current->state == TASK_DEAD,

> @@ -2304,6 +2293,21 @@ context_switch(struct rq *rq, struct task_struct *prev,
>  	       struct task_struct *next)
>  {
>  	struct mm_struct *mm, *oldmm;
> +	/*
> +	 * A task struct has one reference for the use as "current".
> +	 * If a task dies, then it sets TASK_DEAD in tsk->state and calls
> +	 * schedule one last time. The schedule call will never return, and
> +	 * the scheduled task must drop that reference.
> +	 *
> +	 * We must observe prev->state before clearing prev->on_cpu (in
> +	 * finish_lock_switch), otherwise a concurrent wakeup can get prev
> +	 * running on another CPU and we could race with its RUNNING -> DEAD
> +	 * transition, and then the reference would be dropped twice.
> +	 *
> +	 * We avoid the race by observing prev->state while it is still
> +	 * current.
> +	 */
> +	long prev_state = prev->state;

This doesn't really matter, but probably it would be better to do this right
before switch_to(); prev == current until that point.

Oleg.


^ permalink raw reply	[flat|nested] 17+ messages in thread

end of thread, other threads:[~2014-07-29 15:55 UTC | newest]

Thread overview: 17+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2014-07-13 21:51 sched, timers: use after free in __lock_task_sighand when exiting a process Sasha Levin
2014-07-13 23:45 ` Sasha Levin
2014-07-14  9:04   ` Peter Zijlstra
2014-07-14  9:34     ` Andrey Ryabinin
2014-07-14  9:58       ` Peter Zijlstra
2014-07-14 10:25         ` Andrey Ryabinin
2014-07-14 14:49     ` Oleg Nesterov
2014-07-14 15:13       ` Oleg Nesterov
2014-07-14 15:31       ` Andrey Ryabinin
2014-07-14 16:01       ` finish_task_switch && prev_state (Was: sched, timers: use after free in __lock_task_sighand when exiting a process) Oleg Nesterov
2014-07-15 13:12         ` Oleg Nesterov
2014-07-15 13:23           ` Peter Zijlstra
2014-07-15 14:25             ` Oleg Nesterov
2014-07-29  9:10               ` Peter Zijlstra
2014-07-29  9:22                 ` Peter Zijlstra
2014-07-29 15:53                   ` Oleg Nesterov
2014-07-15 13:28       ` sched, timers: use after free in __lock_task_sighand when exiting a process Peter Zijlstra
