* arm64/v4.16-rc1: KASAN: use-after-free Read in finish_task_switch
@ 2018-02-14 12:02 Mark Rutland
From: Mark Rutland @ 2018-02-14 12:02 UTC (permalink / raw)
To: linux-arm-kernel
Hi,
As a heads-up, I hit the splat below when fuzzing v4.16-rc1 on arm64.
Evidently, we get to finish_task_switch() with rq->prev_mm != NULL, even
though the mm it points to has already been freed. KASAN spots the read of
mm->membarrier_state in membarrier_mm_sync_core_before_usermode(mm), but
AFAICT the underlying issue is independent of the membarrier code, and we
could equally get a splat on the subsequent mmdrop(mm).
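For reference, the tail of finish_task_switch() that we trip over looks
roughly like this (a trimmed sketch of kernel/sched/core.c as of v4.16-rc1,
with the locking and accounting details omitted):

	static struct rq *finish_task_switch(struct task_struct *prev)
	{
		struct rq *rq = this_rq();
		struct mm_struct *mm = rq->prev_mm;

		rq->prev_mm = NULL;

		/* ... */

		if (mm) {
			/* KASAN fires on the mm->membarrier_state read in here ... */
			membarrier_mm_sync_core_before_usermode(mm);
			/* ... and the refcount access in here hits the same freed mm. */
			mmdrop(mm);
		}

		/* ... */

		return rq;
	}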
I've seen this once in ~2500 CPU hours of fuzzing, so it looks pretty
difficult to hit, and I have no reproducer so far.
Syzkaller report below; it's mirrored, along with the Syzkaller log, at [1].
If I hit this again, I'll upload the new info there.
Thanks,
Mark.
[1] https://www.kernel.org/pub/linux/kernel/people/mark/bugs/20180214-finish_task_switch-stale-mm/
==================================================================
BUG: KASAN: use-after-free in __read_once_size include/linux/compiler.h:183 [inline]
BUG: KASAN: use-after-free in membarrier_mm_sync_core_before_usermode include/linux/sched/mm.h:216 [inline]
BUG: KASAN: use-after-free in finish_task_switch+0x590/0x5e8 kernel/sched/core.c:2720
Read of size 4 at addr ffff800073dd4980 by task swapper/1/0
CPU: 1 PID: 0 Comm: swapper/1 Not tainted 4.16.0-rc1-00001-g3fffa6166965-dirty #13
Hardware name: linux,dummy-virt (DT)
Call trace:
dump_backtrace+0x0/0x388 arch/arm64/kernel/time.c:52
show_stack+0x20/0x30 arch/arm64/kernel/traps.c:151
__dump_stack lib/dump_stack.c:17 [inline]
dump_stack+0xd0/0x120 lib/dump_stack.c:53
print_address_description+0x5c/0x2a8 mm/kasan/report.c:262
kasan_report_error mm/kasan/report.c:362 [inline]
kasan_report+0x210/0x340 mm/kasan/report.c:420
__asan_report_load4_noabort+0x18/0x20 mm/kasan/report.c:440
__read_once_size include/linux/compiler.h:183 [inline]
membarrier_mm_sync_core_before_usermode include/linux/sched/mm.h:216 [inline]
finish_task_switch+0x590/0x5e8 kernel/sched/core.c:2720
context_switch kernel/sched/core.c:2860 [inline]
__schedule+0x4f0/0x1558 kernel/sched/core.c:3435
schedule_idle+0x40/0x80 kernel/sched/core.c:3521
do_idle+0x114/0x2e0 kernel/sched/idle.c:269
cpu_startup_entry+0x24/0x28 kernel/sched/idle.c:351
secondary_start_kernel+0x290/0x318 arch/arm64/kernel/smp.c:283
Allocated by task 12964:
save_stack mm/kasan/kasan.c:447 [inline]
set_track mm/kasan/kasan.c:459 [inline]
kasan_kmalloc+0xd0/0x180 mm/kasan/kasan.c:552
kasan_slab_alloc+0x14/0x20 mm/kasan/kasan.c:489
slab_post_alloc_hook mm/slab.h:443 [inline]
slab_alloc_node mm/slub.c:2725 [inline]
slab_alloc mm/slub.c:2733 [inline]
kmem_cache_alloc+0x150/0x240 mm/slub.c:2738
dup_mm kernel/fork.c:1235 [inline]
copy_mm kernel/fork.c:1298 [inline]
copy_process.isra.7.part.8+0x1a10/0x5018 kernel/fork.c:1804
copy_process kernel/fork.c:1617 [inline]
_do_fork+0x178/0x998 kernel/fork.c:2098
SYSC_clone kernel/fork.c:2205 [inline]
SyS_clone+0x48/0x60 kernel/fork.c:2183
el0_svc_naked+0x30/0x34
Freed by task 10882:
save_stack mm/kasan/kasan.c:447 [inline]
set_track mm/kasan/kasan.c:459 [inline]
__kasan_slab_free+0x114/0x220 mm/kasan/kasan.c:520
kasan_slab_free+0x10/0x18 mm/kasan/kasan.c:527
slab_free_hook mm/slub.c:1393 [inline]
slab_free_freelist_hook mm/slub.c:1414 [inline]
slab_free mm/slub.c:2968 [inline]
kmem_cache_free+0x88/0x270 mm/slub.c:2990
__mmdrop+0x164/0x248 kernel/fork.c:604
mmdrop+0x50/0x60 kernel/fork.c:615
__mmput kernel/fork.c:981 [inline]
mmput+0x270/0x338 kernel/fork.c:992
exit_mm kernel/exit.c:544 [inline]
do_exit+0x640/0x2100 kernel/exit.c:852
do_group_exit+0xdc/0x260 kernel/exit.c:968
get_signal+0x2ec/0xfc8 kernel/signal.c:2469
do_signal+0x270/0x4f8 arch/arm64/kernel/signal.c:869
do_notify_resume+0x150/0x1c0 arch/arm64/kernel/signal.c:927
work_pending+0x8/0x14
The buggy address belongs to the object at ffff800073dd4600
which belongs to the cache mm_struct of size 1104
The buggy address is located 896 bytes inside of
1104-byte region [ffff800073dd4600, ffff800073dd4a50)
The buggy address belongs to the page:
page:ffff7e0001cf7400 count:1 mapcount:0 mapping: (null) index:0x0 compound_mapcount: 0
flags: 0x4fffc00000008100(slab|head)
raw: 4fffc00000008100 0000000000000000 0000000000000000 0000000180190019
raw: ffff7e0001d2b000 0000000700000007 ffff800075803000 0000000000000000
page dumped because: kasan: bad access detected
Memory state around the buggy address:
ffff800073dd4880: fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb
ffff800073dd4900: fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb
>ffff800073dd4980: fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb
^
ffff800073dd4a00: fb fb fb fb fb fb fb fb fb fb fc fc fc fc fc fc
ffff800073dd4a80: fc fc fc fc fc fc fc fc fc fc fc fc fc fc fc fc
==================================================================
^ permalink raw reply [flat|nested] 17+ messages in thread* arm64/v4.16-rc1: KASAN: use-after-free Read in finish_task_switch 2018-02-14 12:02 arm64/v4.16-rc1: KASAN: use-after-free Read in finish_task_switch Mark Rutland @ 2018-02-14 15:07 ` Will Deacon 2018-02-14 16:51 ` Mark Rutland 0 siblings, 1 reply; 17+ messages in thread From: Will Deacon @ 2018-02-14 15:07 UTC (permalink / raw) To: linux-arm-kernel Hi Mark, Cheers for the report. These things tend to be a pain to debug, but I've had a go. On Wed, Feb 14, 2018 at 12:02:54PM +0000, Mark Rutland wrote: > As a heads-up, I hit the splat below when fuzzing v4.16-rc1 on arm64. > > Evidently, we get to finish_task_switch() with rq->prev_mm != NULL, > despite rq->prev_mm having been freed. KASAN spots the dereference of > mm->membarrier_state in membarrier_mm_sync_core_before_usermode(mm), but > AFAICT the underlying issue is independent of the membarrier code, and > we could get a splat on the subsequent mmdrop(mm). > > I've seen this once in ~2500 CPU hours of fuzzing, so it looks pretty > difficult to hit, and I have no reproducer so far. > > Syzkaller report below, mirrored with Syzkaller log at [1]. If I hit > this again, I'll upload new info there. The interesting thing here is on the exit path: > Freed by task 10882: > save_stack mm/kasan/kasan.c:447 [inline] > set_track mm/kasan/kasan.c:459 [inline] > __kasan_slab_free+0x114/0x220 mm/kasan/kasan.c:520 > kasan_slab_free+0x10/0x18 mm/kasan/kasan.c:527 > slab_free_hook mm/slub.c:1393 [inline] > slab_free_freelist_hook mm/slub.c:1414 [inline] > slab_free mm/slub.c:2968 [inline] > kmem_cache_free+0x88/0x270 mm/slub.c:2990 > __mmdrop+0x164/0x248 kernel/fork.c:604 ^^ This should never run, because there's an mmgrab() about 8 lines above the mmput() in exit_mm. > mmdrop+0x50/0x60 kernel/fork.c:615 > __mmput kernel/fork.c:981 [inline] > mmput+0x270/0x338 kernel/fork.c:992 > exit_mm kernel/exit.c:544 [inline] Looking at exit_mm: mmgrab(mm); BUG_ON(mm != current->active_mm); /* more a memory barrier than a real lock */ task_lock(current); current->mm = NULL; up_read(&mm->mmap_sem); enter_lazy_tlb(mm, current); task_unlock(current); mm_update_next_owner(mm); mmput(mm); Then the comment already rings some alarm bells: our spin_lock (as used by task_lock) has ACQUIRE semantics, so the mmgrab (which is unordered due to being an atomic_inc) can be reordered with respect to the assignment of NULL to current->mm. If the exit()ing task had recently migrated from another CPU, then that CPU could concurrently run context_switch() and take this path: if (!prev->mm) { prev->active_mm = NULL; rq->prev_mm = oldmm; } which then means finish_task_switch will call mmdrop(): struct mm_struct *mm = rq->prev_mm; [...] if (mm) { membarrier_mm_sync_core_before_usermode(mm); mmdrop(mm); } note that KASAN will be ok at this point, but it explains how the exit_mm path ends up freeing the mm. Then, when the exit()ing CPU calls context_switch, *it* will explode accessing the freed mm. Easiest way to fix this is by guaranteeing the barrier semantics in the exit path. Patch below. 
I guess we'll have to wait another 2500 hours to see if it works :)

Will

--->8

diff --git a/kernel/exit.c b/kernel/exit.c
index 995453d9fb55..f91e8d56b03f 100644
--- a/kernel/exit.c
+++ b/kernel/exit.c
@@ -534,8 +534,9 @@ static void exit_mm(void)
 	}
 	mmgrab(mm);
 	BUG_ON(mm != current->active_mm);
-	/* more a memory barrier than a real lock */
 	task_lock(current);
+	/* Ensure we've grabbed the mm before setting current->mm to NULL */
+	smp_mb__after_spin_lock();
 	current->mm = NULL;
 	up_read(&mm->mmap_sem);
 	enter_lazy_tlb(mm, current);
* arm64/v4.16-rc1: KASAN: use-after-free Read in finish_task_switch 2018-02-14 15:07 ` Will Deacon @ 2018-02-14 16:51 ` Mark Rutland 2018-02-14 18:53 ` Mathieu Desnoyers 0 siblings, 1 reply; 17+ messages in thread From: Mark Rutland @ 2018-02-14 16:51 UTC (permalink / raw) To: linux-arm-kernel On Wed, Feb 14, 2018 at 03:07:41PM +0000, Will Deacon wrote: > Hi Mark, Hi Will, > Cheers for the report. These things tend to be a pain to debug, but I've had > a go. Thanks for taking a look! > On Wed, Feb 14, 2018 at 12:02:54PM +0000, Mark Rutland wrote: > The interesting thing here is on the exit path: > > > Freed by task 10882: > > save_stack mm/kasan/kasan.c:447 [inline] > > set_track mm/kasan/kasan.c:459 [inline] > > __kasan_slab_free+0x114/0x220 mm/kasan/kasan.c:520 > > kasan_slab_free+0x10/0x18 mm/kasan/kasan.c:527 > > slab_free_hook mm/slub.c:1393 [inline] > > slab_free_freelist_hook mm/slub.c:1414 [inline] > > slab_free mm/slub.c:2968 [inline] > > kmem_cache_free+0x88/0x270 mm/slub.c:2990 > > __mmdrop+0x164/0x248 kernel/fork.c:604 > > ^^ This should never run, because there's an mmgrab() about 8 lines above > the mmput() in exit_mm. > > > mmdrop+0x50/0x60 kernel/fork.c:615 > > __mmput kernel/fork.c:981 [inline] > > mmput+0x270/0x338 kernel/fork.c:992 > > exit_mm kernel/exit.c:544 [inline] > > Looking at exit_mm: > > mmgrab(mm); > BUG_ON(mm != current->active_mm); > /* more a memory barrier than a real lock */ > task_lock(current); > current->mm = NULL; > up_read(&mm->mmap_sem); > enter_lazy_tlb(mm, current); > task_unlock(current); > mm_update_next_owner(mm); > mmput(mm); > > Then the comment already rings some alarm bells: our spin_lock (as used > by task_lock) has ACQUIRE semantics, so the mmgrab (which is unordered > due to being an atomic_inc) can be reordered with respect to the assignment > of NULL to current->mm. > > If the exit()ing task had recently migrated from another CPU, then that > CPU could concurrently run context_switch() and take this path: > > if (!prev->mm) { > prev->active_mm = NULL; > rq->prev_mm = oldmm; > } IIUC, on the prior context_switch, next->mm == NULL, so we set next->active_mm to prev->mm. Then, in this context_switch we set oldmm = prev->active_mm (where prev is next from the prior context switch). ... right? > which then means finish_task_switch will call mmdrop(): > > struct mm_struct *mm = rq->prev_mm; > [...] > if (mm) { > membarrier_mm_sync_core_before_usermode(mm); > mmdrop(mm); > } ... then here we use what was prev->active_mm in the most recent context switch. So AFAICT, we're never concurrently accessing a task_struct::mm field here, only prev::{mm,active_mm} while prev is current... [...] > diff --git a/kernel/exit.c b/kernel/exit.c > index 995453d9fb55..f91e8d56b03f 100644 > --- a/kernel/exit.c > +++ b/kernel/exit.c > @@ -534,8 +534,9 @@ static void exit_mm(void) > } > mmgrab(mm); > BUG_ON(mm != current->active_mm); > - /* more a memory barrier than a real lock */ > task_lock(current); > + /* Ensure we've grabbed the mm before setting current->mm to NULL */ > + smp_mb__after_spin_lock(); > current->mm = NULL; ... and thus I don't follow why we would need to order these with anything more than a compiler barrier (if we're preemptible here). What have I completely misunderstood? ;) Thanks, Mark. ^ permalink raw reply [flat|nested] 17+ messages in thread
* arm64/v4.16-rc1: KASAN: use-after-free Read in finish_task_switch 2018-02-14 16:51 ` Mark Rutland @ 2018-02-14 18:53 ` Mathieu Desnoyers 2018-02-15 11:49 ` Peter Zijlstra 2018-02-15 14:22 ` Will Deacon 0 siblings, 2 replies; 17+ messages in thread From: Mathieu Desnoyers @ 2018-02-14 18:53 UTC (permalink / raw) To: linux-arm-kernel ----- On Feb 14, 2018, at 11:51 AM, Mark Rutland mark.rutland at arm.com wrote: > On Wed, Feb 14, 2018 at 03:07:41PM +0000, Will Deacon wrote: >> Hi Mark, > > Hi Will, > >> Cheers for the report. These things tend to be a pain to debug, but I've had >> a go. > > Thanks for taking a look! > >> On Wed, Feb 14, 2018 at 12:02:54PM +0000, Mark Rutland wrote: >> The interesting thing here is on the exit path: >> >> > Freed by task 10882: >> > save_stack mm/kasan/kasan.c:447 [inline] >> > set_track mm/kasan/kasan.c:459 [inline] >> > __kasan_slab_free+0x114/0x220 mm/kasan/kasan.c:520 >> > kasan_slab_free+0x10/0x18 mm/kasan/kasan.c:527 >> > slab_free_hook mm/slub.c:1393 [inline] >> > slab_free_freelist_hook mm/slub.c:1414 [inline] >> > slab_free mm/slub.c:2968 [inline] >> > kmem_cache_free+0x88/0x270 mm/slub.c:2990 >> > __mmdrop+0x164/0x248 kernel/fork.c:604 >> >> ^^ This should never run, because there's an mmgrab() about 8 lines above >> the mmput() in exit_mm. >> >> > mmdrop+0x50/0x60 kernel/fork.c:615 >> > __mmput kernel/fork.c:981 [inline] >> > mmput+0x270/0x338 kernel/fork.c:992 >> > exit_mm kernel/exit.c:544 [inline] >> >> Looking at exit_mm: >> >> mmgrab(mm); >> BUG_ON(mm != current->active_mm); >> /* more a memory barrier than a real lock */ >> task_lock(current); >> current->mm = NULL; >> up_read(&mm->mmap_sem); >> enter_lazy_tlb(mm, current); >> task_unlock(current); >> mm_update_next_owner(mm); >> mmput(mm); >> >> Then the comment already rings some alarm bells: our spin_lock (as used >> by task_lock) has ACQUIRE semantics, so the mmgrab (which is unordered >> due to being an atomic_inc) can be reordered with respect to the assignment >> of NULL to current->mm. >> >> If the exit()ing task had recently migrated from another CPU, then that >> CPU could concurrently run context_switch() and take this path: >> >> if (!prev->mm) { >> prev->active_mm = NULL; >> rq->prev_mm = oldmm; >> } > > IIUC, on the prior context_switch, next->mm == NULL, so we set > next->active_mm to prev->mm. > > Then, in this context_switch we set oldmm = prev->active_mm (where prev > is next from the prior context switch). > > ... right? > >> which then means finish_task_switch will call mmdrop(): >> >> struct mm_struct *mm = rq->prev_mm; >> [...] >> if (mm) { >> membarrier_mm_sync_core_before_usermode(mm); >> mmdrop(mm); >> } > > ... then here we use what was prev->active_mm in the most recent context > switch. > > So AFAICT, we're never concurrently accessing a task_struct::mm field > here, only prev::{mm,active_mm} while prev is current... > > [...] > >> diff --git a/kernel/exit.c b/kernel/exit.c >> index 995453d9fb55..f91e8d56b03f 100644 >> --- a/kernel/exit.c >> +++ b/kernel/exit.c >> @@ -534,8 +534,9 @@ static void exit_mm(void) >> } >> mmgrab(mm); >> BUG_ON(mm != current->active_mm); >> - /* more a memory barrier than a real lock */ >> task_lock(current); >> + /* Ensure we've grabbed the mm before setting current->mm to NULL */ >> + smp_mb__after_spin_lock(); >> current->mm = NULL; > > ... and thus I don't follow why we would need to order these with > anything more than a compiler barrier (if we're preemptible here). > > What have I completely misunderstood? 
;) The compiler barrier would not change anything, because task_lock() already implies a compiler barrier (provided by the arch spin lock inline asm memory clobber). So compiler-wise, it cannot move the mmgrab(mm) after the store "current->mm = NULL". However, given the scenario involves multiples CPUs (one doing exit_mm(), the other doing context switch), the actual order of perceived load/store can be shuffled. And AFAIU nothing prevents the CPU from ordering the atomic_inc() done by mmgrab(mm) _after_ the store to current->mm. I wonder if we should not simply add a smp_mb__after_atomic() into mmgrab() instead ? I see that e.g. futex.c does: static inline void futex_get_mm(union futex_key *key) { mmgrab(key->private.mm); /* * Ensure futex_get_mm() implies a full barrier such that * get_futex_key() implies a full barrier. This is relied upon * as smp_mb(); (B), see the ordering comment above. */ smp_mb__after_atomic(); } It could prevent nasty subtle bugs in other mmgrab() users. Thoughts ? Thanks, Mathieu > > Thanks, > Mark. -- Mathieu Desnoyers EfficiOS Inc. http://www.efficios.com ^ permalink raw reply [flat|nested] 17+ messages in thread
* arm64/v4.16-rc1: KASAN: use-after-free Read in finish_task_switch
From: Peter Zijlstra @ 2018-02-15 11:49 UTC (permalink / raw)
To: linux-arm-kernel

On Wed, Feb 14, 2018 at 06:53:44PM +0000, Mathieu Desnoyers wrote:
> However, given the scenario involves multiples CPUs (one doing exit_mm(),
> the other doing context switch), the actual order of perceived load/store
> can be shuffled. And AFAIU nothing prevents the CPU from ordering the
> atomic_inc() done by mmgrab(mm) _after_ the store to current->mm.
>
> I wonder if we should not simply add a smp_mb__after_atomic() into
> mmgrab() instead ? I see that e.g. futex.c does:

Don't think so, the futex case is really rather special and I suspect
this one is too. I would much rather have explicit comments than
implicit works-by-magic.

As per the rationale used for refcount_t, increments should be
unordered, because you ACQUIRE your object _before_ you can do the
increment.

The futex thing is simply abusing a bunch of implied barriers and
patching up the holes in paths that didn't already imply a barrier in
order to avoid having to add explicit barriers (which had measurable
performance issues).

And here we have explicit ordering outside of the reference counting
too: we want to ensure the reference is incremented before we modify a
second object. This ordering is not at all related to acquiring the
reference, so bundling it seems odd.
* arm64/v4.16-rc1: KASAN: use-after-free Read in finish_task_switch
From: Mathieu Desnoyers @ 2018-02-15 13:13 UTC (permalink / raw)
To: linux-arm-kernel

----- On Feb 15, 2018, at 6:49 AM, Peter Zijlstra peterz at infradead.org wrote:

> On Wed, Feb 14, 2018 at 06:53:44PM +0000, Mathieu Desnoyers wrote:
>> However, given the scenario involves multiples CPUs (one doing exit_mm(),
>> the other doing context switch), the actual order of perceived load/store
>> can be shuffled. And AFAIU nothing prevents the CPU from ordering the
>> atomic_inc() done by mmgrab(mm) _after_ the store to current->mm.
>>
>> I wonder if we should not simply add a smp_mb__after_atomic() into
>> mmgrab() instead ? I see that e.g. futex.c does:
>
> Don't think so, the futex case is really rather special and I suspect
> this one is too. I would much rather have explicit comments than
> implicit works-by-magic.
>
> As per the rationale used for refcount_t, increments should be
> unordered, because you ACQUIRE your object _before_ you can do the
> increment.
>
> The futex thing is simply abusing a bunch of implied barriers and
> patching up the holes in paths that didn't already imply a barrier in
> order to avoid having to add explicit barriers (which had measurable
> performance issues).
>
> And here we have explicit ordering outside of the reference counting
> too: we want to ensure the reference is incremented before we modify a
> second object. This ordering is not at all related to acquiring the
> reference, so bundling it seems odd.

I understand your point. Will's added barrier and comment are fine.

Thanks,

Mathieu

--
Mathieu Desnoyers
EfficiOS Inc.
http://www.efficios.com
* arm64/v4.16-rc1: KASAN: use-after-free Read in finish_task_switch 2018-02-14 18:53 ` Mathieu Desnoyers 2018-02-15 11:49 ` Peter Zijlstra @ 2018-02-15 14:22 ` Will Deacon 2018-02-15 15:33 ` Will Deacon ` (2 more replies) 1 sibling, 3 replies; 17+ messages in thread From: Will Deacon @ 2018-02-15 14:22 UTC (permalink / raw) To: linux-arm-kernel On Wed, Feb 14, 2018 at 06:53:44PM +0000, Mathieu Desnoyers wrote: > ----- On Feb 14, 2018, at 11:51 AM, Mark Rutland mark.rutland at arm.com wrote: > > On Wed, Feb 14, 2018 at 03:07:41PM +0000, Will Deacon wrote: > >> If the exit()ing task had recently migrated from another CPU, then that > >> CPU could concurrently run context_switch() and take this path: > >> > >> if (!prev->mm) { > >> prev->active_mm = NULL; > >> rq->prev_mm = oldmm; > >> } > > > > IIUC, on the prior context_switch, next->mm == NULL, so we set > > next->active_mm to prev->mm. > > > > Then, in this context_switch we set oldmm = prev->active_mm (where prev > > is next from the prior context switch). > > > > ... right? > > > >> which then means finish_task_switch will call mmdrop(): > >> > >> struct mm_struct *mm = rq->prev_mm; > >> [...] > >> if (mm) { > >> membarrier_mm_sync_core_before_usermode(mm); > >> mmdrop(mm); > >> } > > > > ... then here we use what was prev->active_mm in the most recent context > > switch. > > > > So AFAICT, we're never concurrently accessing a task_struct::mm field > > here, only prev::{mm,active_mm} while prev is current... > > > > [...] > > > >> diff --git a/kernel/exit.c b/kernel/exit.c > >> index 995453d9fb55..f91e8d56b03f 100644 > >> --- a/kernel/exit.c > >> +++ b/kernel/exit.c > >> @@ -534,8 +534,9 @@ static void exit_mm(void) > >> } > >> mmgrab(mm); > >> BUG_ON(mm != current->active_mm); > >> - /* more a memory barrier than a real lock */ > >> task_lock(current); > >> + /* Ensure we've grabbed the mm before setting current->mm to NULL */ > >> + smp_mb__after_spin_lock(); > >> current->mm = NULL; > > > > ... and thus I don't follow why we would need to order these with > > anything more than a compiler barrier (if we're preemptible here). > > > > What have I completely misunderstood? ;) > > The compiler barrier would not change anything, because task_lock() > already implies a compiler barrier (provided by the arch spin lock > inline asm memory clobber). So compiler-wise, it cannot move the > mmgrab(mm) after the store "current->mm = NULL". > > However, given the scenario involves multiples CPUs (one doing exit_mm(), > the other doing context switch), the actual order of perceived load/store > can be shuffled. And AFAIU nothing prevents the CPU from ordering the > atomic_inc() done by mmgrab(mm) _after_ the store to current->mm. Mark and I have spent most of the morning looking at this and realised I made a mistake in my original guesswork: prev can't migrate until half way down finish_task_switch when on_cpu = 0, so the access of prev->mm in context_switch can't race with exit_mm() for that task. Furthermore, although the mmgrab() could in theory be reordered with current->mm = NULL (and the ARMv8 architecture allows this too), it's pretty unlikely with LL/SC atomics and the backwards branch, where the CPU would have to pull off quite a few tricks for this to happen. 
Instead, we've come up with a more plausible sequence that can in theory happen on a single CPU: <task foo calls exit()> do_exit exit_mm mmgrab(mm); // foo's mm has count +1 BUG_ON(mm != current->active_mm); task_lock(current); current->mm = NULL; task_unlock(current); <irq and ctxsw to kthread> context_switch(prev=foo, next=kthread) mm = next->mm; oldmm = prev->active_mm; if (!mm) { // True for kthread next->active_mm = oldmm; mmgrab(oldmm); // foo's mm has count +2 } if (!prev->mm) { // True for foo rq->prev_mm = oldmm; } finish_task_switch mm = rq->prev_mm; if (mm) { // True (foo's mm) mmdrop(mm); // foo's mm has count +1 } [...] <ctxsw to task bar> context_switch(prev=kthread, next=bar) mm = next->mm; oldmm = prev->active_mm; // foo's mm! if (!prev->mm) { // True for kthread rq->prev_mm = oldmm; } finish_task_switch mm = rq->prev_mm; if (mm) { // True (foo's mm) mmdrop(mm); // foo's mm has count +0 } [...] <ctxsw back to task foo> context_switch(prev=bar, next=foo) mm = next->mm; oldmm = prev->active_mm; if (!mm) { // True for foo next->active_mm = oldmm; // This is bar's mm mmgrab(oldmm); // bar's mm has count +1 } [return back to exit_mm] mmdrop(mm); // foo's mm has count -1 At this point, we've got an imbalanced count on the mm and could free it prematurely as seen in the KASAN log. A subsequent context-switch away from foo would therefore result in a use-after-free. Assuming others agree with this diagnosis, I'm not sure how to fix it. It's basically not safe to set current->mm = NULL with preemption enabled. Will ^ permalink raw reply [flat|nested] 17+ messages in thread
* arm64/v4.16-rc1: KASAN: use-after-free Read in finish_task_switch
From: Will Deacon @ 2018-02-15 15:33 UTC (permalink / raw)
To: linux-arm-kernel

On Thu, Feb 15, 2018 at 02:22:39PM +0000, Will Deacon wrote:
> Assuming others agree with this diagnosis, I'm not sure how to fix it.
> It's basically not safe to set current->mm = NULL with preemption enabled.

One thing we could try would be to leave current->mm alone and just do the
mmdrop in finish_task_switch at the point where we put the task_struct for
DEAD tasks. mm_update_next_owner might need updating so that it doesn't
re-assign current as the owner and run into a BUG_ON.

Will
* arm64/v4.16-rc1: KASAN: use-after-free Read in finish_task_switch 2018-02-15 14:22 ` Will Deacon 2018-02-15 15:33 ` Will Deacon @ 2018-02-15 16:47 ` Peter Zijlstra 2018-02-15 18:21 ` Will Deacon 2018-02-19 11:26 ` Catalin Marinas 2 siblings, 1 reply; 17+ messages in thread From: Peter Zijlstra @ 2018-02-15 16:47 UTC (permalink / raw) To: linux-arm-kernel On Thu, Feb 15, 2018 at 02:22:39PM +0000, Will Deacon wrote: > Instead, we've come up with a more plausible sequence that can in theory > happen on a single CPU: > > <task foo calls exit()> > > do_exit > exit_mm If this is the last task of the process, we would expect: mm_count == 1 mm_users == 1 at this point. > mmgrab(mm); // foo's mm has count +1 > BUG_ON(mm != current->active_mm); > task_lock(current); > current->mm = NULL; > task_unlock(current); So the whole active_mm is basically the last 'real' mm, and its purpose is to avoid switch_mm() between user tasks and kernel tasks. A kernel task has !->mm. We do this by incrementing mm_count when switching from user to kernel task and decrementing when switching from kernel to user. What exit_mm() does is change a user task into a 'kernel' task. So it should increment mm_count to mirror the context switch. I suspect this is what the mmgrab() in exit_mm() is for. > <irq and ctxsw to kthread> > > context_switch(prev=foo, next=kthread) > mm = next->mm; > oldmm = prev->active_mm; > > if (!mm) { // True for kthread > next->active_mm = oldmm; > mmgrab(oldmm); // foo's mm has count +2 > } > > if (!prev->mm) { // True for foo > rq->prev_mm = oldmm; > } > > finish_task_switch > mm = rq->prev_mm; > if (mm) { // True (foo's mm) > mmdrop(mm); // foo's mm has count +1 > } > > [...] > > <ctxsw to task bar> > > context_switch(prev=kthread, next=bar) > mm = next->mm; > oldmm = prev->active_mm; // foo's mm! > > if (!prev->mm) { // True for kthread > rq->prev_mm = oldmm; > } > > finish_task_switch > mm = rq->prev_mm; > if (mm) { // True (foo's mm) > mmdrop(mm); // foo's mm has count +0 The context switch into the next user task will then decrement. At this point foo no longer has a reference to its mm, except on the stack. > } > > [...] > > <ctxsw back to task foo> > > context_switch(prev=bar, next=foo) > mm = next->mm; > oldmm = prev->active_mm; > > if (!mm) { // True for foo > next->active_mm = oldmm; // This is bar's mm > mmgrab(oldmm); // bar's mm has count +1 > } > > > [return back to exit_mm] Enter mm_users, this counts the number of tasks associated with the mm. We start with 1 in mm_init(), and when it drops to 0, we decrement mm_count. Since we also start with mm_count == 1, this would appear consistent. mmput() // --mm_users == 0, which then results in: > mmdrop(mm); // foo's mm has count -1 In the above case, that's the very last reference to the mm, and since we started out with mm_count == 1, this -1 makes 0 and we do the actual free. > At this point, we've got an imbalanced count on the mm and could free it > prematurely as seen in the KASAN log. I'm not sure I see premature. At this point mm_users==0, mm_count==0 and we freed mm and there is no further use of the on-stack mm pointer and foo no longer has a pointer to it in either ->mm or ->active_mm. It's well and proper dead. > A subsequent context-switch away from foo would therefore result in a > use-after-free. At the above point, foo no longer has a reference to mm, we cleared ->mm early, and the context switch to bar cleared ->active_mm. The switch back into foo then results with foo->active_mm == bar->mm, which is fine. 
* arm64/v4.16-rc1: KASAN: use-after-free Read in finish_task_switch 2018-02-15 16:47 ` Peter Zijlstra @ 2018-02-15 18:21 ` Will Deacon 2018-02-15 22:08 ` Mathieu Desnoyers 0 siblings, 1 reply; 17+ messages in thread From: Will Deacon @ 2018-02-15 18:21 UTC (permalink / raw) To: linux-arm-kernel On Thu, Feb 15, 2018 at 05:47:54PM +0100, Peter Zijlstra wrote: > On Thu, Feb 15, 2018 at 02:22:39PM +0000, Will Deacon wrote: > > Instead, we've come up with a more plausible sequence that can in theory > > happen on a single CPU: > > > > <task foo calls exit()> > > > > do_exit > > exit_mm > > If this is the last task of the process, we would expect: > > mm_count == 1 > mm_users == 1 > > at this point. > > > mmgrab(mm); // foo's mm has count +1 > > BUG_ON(mm != current->active_mm); > > task_lock(current); > > current->mm = NULL; > > task_unlock(current); > > So the whole active_mm is basically the last 'real' mm, and its purpose > is to avoid switch_mm() between user tasks and kernel tasks. > > A kernel task has !->mm. We do this by incrementing mm_count when > switching from user to kernel task and decrementing when switching from > kernel to user. > > What exit_mm() does is change a user task into a 'kernel' task. So it > should increment mm_count to mirror the context switch. I suspect this > is what the mmgrab() in exit_mm() is for. > > > <irq and ctxsw to kthread> > > > > context_switch(prev=foo, next=kthread) > > mm = next->mm; > > oldmm = prev->active_mm; > > > > if (!mm) { // True for kthread > > next->active_mm = oldmm; > > mmgrab(oldmm); // foo's mm has count +2 > > } > > > > if (!prev->mm) { // True for foo > > rq->prev_mm = oldmm; > > } > > > > finish_task_switch > > mm = rq->prev_mm; > > if (mm) { // True (foo's mm) > > mmdrop(mm); // foo's mm has count +1 > > } > > > > [...] > > > > <ctxsw to task bar> > > > > context_switch(prev=kthread, next=bar) > > mm = next->mm; > > oldmm = prev->active_mm; // foo's mm! > > > > if (!prev->mm) { // True for kthread > > rq->prev_mm = oldmm; > > } > > > > finish_task_switch > > mm = rq->prev_mm; > > if (mm) { // True (foo's mm) > > mmdrop(mm); // foo's mm has count +0 > > The context switch into the next user task will then decrement. At this > point foo no longer has a reference to its mm, except on the stack. > > > } > > > > [...] > > > > <ctxsw back to task foo> > > > > context_switch(prev=bar, next=foo) > > mm = next->mm; > > oldmm = prev->active_mm; > > > > if (!mm) { // True for foo > > next->active_mm = oldmm; // This is bar's mm > > mmgrab(oldmm); // bar's mm has count +1 > > } > > > > > > [return back to exit_mm] > > Enter mm_users, this counts the number of tasks associated with the mm. > We start with 1 in mm_init(), and when it drops to 0, we decrement > mm_count. Since we also start with mm_count == 1, this would appear > consistent. > > mmput() // --mm_users == 0, which then results in: > > > mmdrop(mm); // foo's mm has count -1 > > In the above case, that's the very last reference to the mm, and since > we started out with mm_count == 1, this -1 makes 0 and we do the actual > free. > > > At this point, we've got an imbalanced count on the mm and could free it > > prematurely as seen in the KASAN log. > > I'm not sure I see premature. At this point mm_users==0, mm_count==0 and > we freed mm and there is no further use of the on-stack mm pointer and > foo no longer has a pointer to it in either ->mm or ->active_mm. It's > well and proper dead. 
> > > A subsequent context-switch away from foo would therefore result in a > > use-after-free. > > At the above point, foo no longer has a reference to mm, we cleared ->mm > early, and the context switch to bar cleared ->active_mm. The switch > back into foo then results with foo->active_mm == bar->mm, which is > fine. Bugger, you're right. When we switch off foo after freeing the mm, we'll actually access it's active mm which points to bar's mm. So whilst this can explain part of the kasan splat, it doesn't explain the actual use-after-free. More head-scratching required :( Will ^ permalink raw reply [flat|nested] 17+ messages in thread
* arm64/v4.16-rc1: KASAN: use-after-free Read in finish_task_switch 2018-02-15 18:21 ` Will Deacon @ 2018-02-15 22:08 ` Mathieu Desnoyers 2018-02-16 0:02 ` Mathieu Desnoyers 2018-02-16 16:53 ` Mark Rutland 0 siblings, 2 replies; 17+ messages in thread From: Mathieu Desnoyers @ 2018-02-15 22:08 UTC (permalink / raw) To: linux-arm-kernel ----- On Feb 15, 2018, at 1:21 PM, Will Deacon will.deacon at arm.com wrote: > On Thu, Feb 15, 2018 at 05:47:54PM +0100, Peter Zijlstra wrote: >> On Thu, Feb 15, 2018 at 02:22:39PM +0000, Will Deacon wrote: >> > Instead, we've come up with a more plausible sequence that can in theory >> > happen on a single CPU: >> > >> > <task foo calls exit()> >> > >> > do_exit >> > exit_mm >> >> If this is the last task of the process, we would expect: >> >> mm_count == 1 >> mm_users == 1 >> >> at this point. >> >> > mmgrab(mm); // foo's mm has count +1 >> > BUG_ON(mm != current->active_mm); >> > task_lock(current); >> > current->mm = NULL; >> > task_unlock(current); >> >> So the whole active_mm is basically the last 'real' mm, and its purpose >> is to avoid switch_mm() between user tasks and kernel tasks. >> >> A kernel task has !->mm. We do this by incrementing mm_count when >> switching from user to kernel task and decrementing when switching from >> kernel to user. >> >> What exit_mm() does is change a user task into a 'kernel' task. So it >> should increment mm_count to mirror the context switch. I suspect this >> is what the mmgrab() in exit_mm() is for. >> >> > <irq and ctxsw to kthread> >> > >> > context_switch(prev=foo, next=kthread) >> > mm = next->mm; >> > oldmm = prev->active_mm; >> > >> > if (!mm) { // True for kthread >> > next->active_mm = oldmm; >> > mmgrab(oldmm); // foo's mm has count +2 >> > } >> > >> > if (!prev->mm) { // True for foo >> > rq->prev_mm = oldmm; >> > } >> > >> > finish_task_switch >> > mm = rq->prev_mm; >> > if (mm) { // True (foo's mm) >> > mmdrop(mm); // foo's mm has count +1 >> > } >> > >> > [...] >> > >> > <ctxsw to task bar> >> > >> > context_switch(prev=kthread, next=bar) >> > mm = next->mm; >> > oldmm = prev->active_mm; // foo's mm! >> > >> > if (!prev->mm) { // True for kthread >> > rq->prev_mm = oldmm; >> > } >> > >> > finish_task_switch >> > mm = rq->prev_mm; >> > if (mm) { // True (foo's mm) >> > mmdrop(mm); // foo's mm has count +0 >> >> The context switch into the next user task will then decrement. At this >> point foo no longer has a reference to its mm, except on the stack. >> >> > } >> > >> > [...] >> > >> > <ctxsw back to task foo> >> > >> > context_switch(prev=bar, next=foo) >> > mm = next->mm; >> > oldmm = prev->active_mm; >> > >> > if (!mm) { // True for foo >> > next->active_mm = oldmm; // This is bar's mm >> > mmgrab(oldmm); // bar's mm has count +1 >> > } >> > >> > >> > [return back to exit_mm] >> >> Enter mm_users, this counts the number of tasks associated with the mm. >> We start with 1 in mm_init(), and when it drops to 0, we decrement >> mm_count. Since we also start with mm_count == 1, this would appear >> consistent. >> >> mmput() // --mm_users == 0, which then results in: >> >> > mmdrop(mm); // foo's mm has count -1 >> >> In the above case, that's the very last reference to the mm, and since >> we started out with mm_count == 1, this -1 makes 0 and we do the actual >> free. >> >> > At this point, we've got an imbalanced count on the mm and could free it >> > prematurely as seen in the KASAN log. >> >> I'm not sure I see premature. 
At this point mm_users==0, mm_count==0 and >> we freed mm and there is no further use of the on-stack mm pointer and >> foo no longer has a pointer to it in either ->mm or ->active_mm. It's >> well and proper dead. >> >> > A subsequent context-switch away from foo would therefore result in a >> > use-after-free. >> >> At the above point, foo no longer has a reference to mm, we cleared ->mm >> early, and the context switch to bar cleared ->active_mm. The switch >> back into foo then results with foo->active_mm == bar->mm, which is >> fine. > > Bugger, you're right. When we switch off foo after freeing the mm, we'll > actually access it's active mm which points to bar's mm. So whilst this > can explain part of the kasan splat, it doesn't explain the actual > use-after-free. > > More head-scratching required :( My current theory: do_exit() gets preempted after having set current->mm to NULL, and after having issued mmput(), which brings the mm_count down to 0. Unfortunately, if the scheduler switches from a userspace thread to a kernel thread, context_switch() loads prev->active_mm which still points to the now-freed mm, mmgrab the mm, and eventually does mmdrop in finish_task_switch(). If my understanding is correct, the following patch should help. The idea is to keep a reference on the mm_count until after we are sure the scheduler cannot schedule the task anymore. What I'm not sure is where exactly in do_exit() are we sure the task cannot ever be preempted anymore ? diff --git a/kernel/exit.c b/kernel/exit.c index 995453d..fefba68 100644 --- a/kernel/exit.c +++ b/kernel/exit.c @@ -764,6 +764,7 @@ void __noreturn do_exit(long code) { struct task_struct *tsk = current; int group_dead; + struct mm_struct *mm; profile_task_exit(tsk); kcov_task_exit(tsk); @@ -849,6 +850,10 @@ void __noreturn do_exit(long code) tsk->exit_code = code; taskstats_exit(tsk, group_dead); + mm = current->mm; + if (mm) + mmgrab(mm); + exit_mm(); if (group_dead) @@ -920,6 +925,9 @@ void __noreturn do_exit(long code) lockdep_free_task(tsk); do_task_dead(); + + if (mm) + mmdrop(mm); } EXPORT_SYMBOL_GPL(do_exit); -- Mathieu Desnoyers EfficiOS Inc. http://www.efficios.com ^ permalink raw reply related [flat|nested] 17+ messages in thread
* arm64/v4.16-rc1: KASAN: use-after-free Read in finish_task_switch 2018-02-15 22:08 ` Mathieu Desnoyers @ 2018-02-16 0:02 ` Mathieu Desnoyers 2018-02-16 8:11 ` Peter Zijlstra 2018-02-16 16:53 ` Mark Rutland 1 sibling, 1 reply; 17+ messages in thread From: Mathieu Desnoyers @ 2018-02-16 0:02 UTC (permalink / raw) To: linux-arm-kernel ----- On Feb 15, 2018, at 5:08 PM, Mathieu Desnoyers mathieu.desnoyers at efficios.com wrote: > ----- On Feb 15, 2018, at 1:21 PM, Will Deacon will.deacon at arm.com wrote: > >> On Thu, Feb 15, 2018 at 05:47:54PM +0100, Peter Zijlstra wrote: >>> On Thu, Feb 15, 2018 at 02:22:39PM +0000, Will Deacon wrote: >>> > Instead, we've come up with a more plausible sequence that can in theory >>> > happen on a single CPU: >>> > >>> > <task foo calls exit()> >>> > >>> > do_exit >>> > exit_mm >>> >>> If this is the last task of the process, we would expect: >>> >>> mm_count == 1 >>> mm_users == 1 >>> >>> at this point. >>> >>> > mmgrab(mm); // foo's mm has count +1 >>> > BUG_ON(mm != current->active_mm); >>> > task_lock(current); >>> > current->mm = NULL; >>> > task_unlock(current); >>> >>> So the whole active_mm is basically the last 'real' mm, and its purpose >>> is to avoid switch_mm() between user tasks and kernel tasks. >>> >>> A kernel task has !->mm. We do this by incrementing mm_count when >>> switching from user to kernel task and decrementing when switching from >>> kernel to user. >>> >>> What exit_mm() does is change a user task into a 'kernel' task. So it >>> should increment mm_count to mirror the context switch. I suspect this >>> is what the mmgrab() in exit_mm() is for. >>> >>> > <irq and ctxsw to kthread> >>> > >>> > context_switch(prev=foo, next=kthread) >>> > mm = next->mm; >>> > oldmm = prev->active_mm; >>> > >>> > if (!mm) { // True for kthread >>> > next->active_mm = oldmm; >>> > mmgrab(oldmm); // foo's mm has count +2 >>> > } >>> > >>> > if (!prev->mm) { // True for foo >>> > rq->prev_mm = oldmm; >>> > } >>> > >>> > finish_task_switch >>> > mm = rq->prev_mm; >>> > if (mm) { // True (foo's mm) >>> > mmdrop(mm); // foo's mm has count +1 >>> > } >>> > >>> > [...] >>> > >>> > <ctxsw to task bar> >>> > >>> > context_switch(prev=kthread, next=bar) >>> > mm = next->mm; >>> > oldmm = prev->active_mm; // foo's mm! >>> > >>> > if (!prev->mm) { // True for kthread >>> > rq->prev_mm = oldmm; >>> > } >>> > >>> > finish_task_switch >>> > mm = rq->prev_mm; >>> > if (mm) { // True (foo's mm) >>> > mmdrop(mm); // foo's mm has count +0 >>> >>> The context switch into the next user task will then decrement. At this >>> point foo no longer has a reference to its mm, except on the stack. >>> >>> > } >>> > >>> > [...] >>> > >>> > <ctxsw back to task foo> >>> > >>> > context_switch(prev=bar, next=foo) >>> > mm = next->mm; >>> > oldmm = prev->active_mm; >>> > >>> > if (!mm) { // True for foo >>> > next->active_mm = oldmm; // This is bar's mm >>> > mmgrab(oldmm); // bar's mm has count +1 >>> > } >>> > >>> > >>> > [return back to exit_mm] >>> >>> Enter mm_users, this counts the number of tasks associated with the mm. >>> We start with 1 in mm_init(), and when it drops to 0, we decrement >>> mm_count. Since we also start with mm_count == 1, this would appear >>> consistent. >>> >>> mmput() // --mm_users == 0, which then results in: >>> >>> > mmdrop(mm); // foo's mm has count -1 >>> >>> In the above case, that's the very last reference to the mm, and since >>> we started out with mm_count == 1, this -1 makes 0 and we do the actual >>> free. 
>>> >>> > At this point, we've got an imbalanced count on the mm and could free it >>> > prematurely as seen in the KASAN log. >>> >>> I'm not sure I see premature. At this point mm_users==0, mm_count==0 and >>> we freed mm and there is no further use of the on-stack mm pointer and >>> foo no longer has a pointer to it in either ->mm or ->active_mm. It's >>> well and proper dead. >>> >>> > A subsequent context-switch away from foo would therefore result in a >>> > use-after-free. >>> >>> At the above point, foo no longer has a reference to mm, we cleared ->mm >>> early, and the context switch to bar cleared ->active_mm. The switch >>> back into foo then results with foo->active_mm == bar->mm, which is >>> fine. >> >> Bugger, you're right. When we switch off foo after freeing the mm, we'll >> actually access it's active mm which points to bar's mm. So whilst this >> can explain part of the kasan splat, it doesn't explain the actual >> use-after-free. >> >> More head-scratching required :( > > My current theory: do_exit() gets preempted after having set current->mm > to NULL, and after having issued mmput(), which brings the mm_count down > to 0. Unfortunately, if the scheduler switches from a userspace thread > to a kernel thread, context_switch() loads prev->active_mm which still > points to the now-freed mm, mmgrab the mm, and eventually does mmdrop > in finish_task_switch(). > > If my understanding is correct, the following patch should help. The idea > is to keep a reference on the mm_count until after we are sure the scheduler > cannot schedule the task anymore. What I'm not sure is where exactly in > do_exit() are we sure the task cannot ever be preempted anymore ? > Actually, it's the preempt_disable() at the end of do_exit() I was looking for. The following patch moves the mmdrop() right after preempte_disable. In my previous patch, the mmdrop() after do_task_dead (which is noreturn) was rather dumb (leak). diff --git a/kernel/exit.c b/kernel/exit.c index 995453d..2804655 100644 --- a/kernel/exit.c +++ b/kernel/exit.c @@ -764,6 +764,7 @@ void __noreturn do_exit(long code) { struct task_struct *tsk = current; int group_dead; + struct mm_struct *mm; profile_task_exit(tsk); kcov_task_exit(tsk); @@ -849,6 +850,10 @@ void __noreturn do_exit(long code) tsk->exit_code = code; taskstats_exit(tsk, group_dead); + mm = current->mm; + if (mm) + mmgrab(mm); + exit_mm(); if (group_dead) @@ -913,6 +918,8 @@ void __noreturn do_exit(long code) check_stack_usage(); preempt_disable(); + if (mm) + mmdrop(mm); if (tsk->nr_dirtied) __this_cpu_add(dirty_throttle_leaks, tsk->nr_dirtied); exit_rcu(); -- Mathieu Desnoyers EfficiOS Inc. http://www.efficios.com ^ permalink raw reply related [flat|nested] 17+ messages in thread
* arm64/v4.16-rc1: KASAN: use-after-free Read in finish_task_switch
From: Peter Zijlstra @ 2018-02-16 8:11 UTC (permalink / raw)
To: linux-arm-kernel

Please trim..

On Fri, Feb 16, 2018 at 12:02:42AM +0000, Mathieu Desnoyers wrote:
> > My current theory: do_exit() gets preempted after having set current->mm
> > to NULL, and after having issued mmput(), which brings the mm_count down
> > to 0.

No it doesn't.. remember, the last thread of a process enters exit_mm() with:

	mm_count == 1 && mm_users == 1

exit_mm() then does mmgrab():

	mm_count == 2 && mm_users == 1

we then do mmput(), which results in:

	mm_count == 1 && mm_users == 0

That last mm_count reference is for ->active_mm, and will be dropped once
we schedule to the next user task.
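For reference, the two counters being juggled here are managed by the
following helpers. This is a simplified sketch modelled on
include/linux/sched/mm.h and kernel/fork.c around v4.16; the might_sleep()
annotation and the actual address-space teardown in __mmput() are omitted.

	/* mm_count pins the mm_struct itself (kernel/"lazy" references,
	 * including ->active_mm); mm_users pins the address space. */
	static inline void mmgrab(struct mm_struct *mm)
	{
		atomic_inc(&mm->mm_count);
	}

	static inline void mmdrop(struct mm_struct *mm)
	{
		if (unlikely(atomic_dec_and_test(&mm->mm_count)))
			__mmdrop(mm);		/* frees the mm_struct */
	}

	static inline void mmget(struct mm_struct *mm)
	{
		atomic_inc(&mm->mm_users);
	}

	void mmput(struct mm_struct *mm)
	{
		if (atomic_dec_and_test(&mm->mm_users))
			__mmput(mm);		/* tears down mappings, then mmdrop()s */
	}

So the mm_struct itself is only freed once mm_count reaches zero, and the
mm_count reference taken at mm_init() time is only dropped once mm_users
falls to zero, via the mmdrop() at the end of __mmput().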
* arm64/v4.16-rc1: KASAN: use-after-free Read in finish_task_switch
From: Mark Rutland @ 2018-02-16 16:53 UTC (permalink / raw)
To: linux-arm-kernel

Hi,

On Thu, Feb 15, 2018 at 10:08:56PM +0000, Mathieu Desnoyers wrote:
> My current theory: do_exit() gets preempted after having set current->mm
> to NULL, and after having issued mmput(), which brings the mm_count down
> to 0.
>
> Unfortunately, if the scheduler switches from a userspace thread
> to a kernel thread, context_switch() loads prev->active_mm which still
> points to the now-freed mm, mmgrab the mm, and eventually does mmdrop
> in finish_task_switch().

For this to happen, we need to get to the mmput() in exit_mm() with:

	mm->mm_count == 1
	mm->mm_users == 1
	mm == active_mm

... but AFAICT, this cannot happen.

If there's no context_switch between clearing current->mm and the
mmput(), then mm->mm_count >= 2, thanks to the prior mmgrab() and the
active_mm reference (in mm_count) that context_switch+finish_task_switch
manage.

If there is a context_switch between the two, then AFAICT, either:

a) The task re-inherits its old mm as active_mm, and mm_count >= 2. In
   context_switch we mmgrab() the active_mm to inherit it, and in
   finish_task_switch() we drop the oldmm, balancing the mmgrab() with
   an mmdrop().

   e.g. we go task -> kernel_task -> task

b) At some point, another user task is scheduled, and we switch to its
   mm. We don't mmgrab() the active_mm, but we mmdrop() the oldmm, which
   means mm_count >= 1. Since we switched to a new mm, if we switch back
   to the first task, it cannot have its own mm as active_mm.

   e.g. we go task -> other_task -> task

I suspect we have a bogus mmdrop or mmput elsewhere, and do_exit() and
finish_task_switch() aren't to blame.

Thanks,
Mark.
* arm64/v4.16-rc1: KASAN: use-after-free Read in finish_task_switch
From: Mathieu Desnoyers @ 2018-02-16 17:17 UTC (permalink / raw)
To: linux-arm-kernel

----- On Feb 16, 2018, at 11:53 AM, Mark Rutland mark.rutland at arm.com wrote:

> Hi,
>
> On Thu, Feb 15, 2018 at 10:08:56PM +0000, Mathieu Desnoyers wrote:
>> My current theory: do_exit() gets preempted after having set current->mm
>> to NULL, and after having issued mmput(), which brings the mm_count down
>> to 0.
>>
>> Unfortunately, if the scheduler switches from a userspace thread
>> to a kernel thread, context_switch() loads prev->active_mm which still
>> points to the now-freed mm, mmgrab the mm, and eventually does mmdrop
>> in finish_task_switch().
>
> For this to happen, we need to get to the mmput() in exit_mm() with:
>
> 	mm->mm_count == 1
> 	mm->mm_users == 1
> 	mm == active_mm
>
> ... but AFAICT, this cannot happen.
>
> If there's no context_switch between clearing current->mm and the
> mmput(), then mm->mm_count >= 2, thanks to the prior mmgrab() and the
> active_mm reference (in mm_count) that context_switch+finish_task_switch
> manage.
>
> If there is a context_switch between the two, then AFAICT, either:
>
> a) The task re-inherits its old mm as active_mm, and mm_count >= 2. In
>    context_switch we mmgrab() the active_mm to inherit it, and in
>    finish_task_switch() we drop the oldmm, balancing the mmgrab() with
>    an mmdrop().
>
>    e.g. we go task -> kernel_task -> task
>
> b) At some point, another user task is scheduled, and we switch to its
>    mm. We don't mmgrab() the active_mm, but we mmdrop() the oldmm, which
>    means mm_count >= 1. Since we switched to a new mm, if we switch back
>    to the first task, it cannot have its own mm as active_mm.
>
>    e.g. we go task -> other_task -> task
>
> I suspect we have a bogus mmdrop or mmput elsewhere, and do_exit() and
> finish_task_switch() aren't to blame.

Currently reviewing: fs/proc/base.c: __set_oom_adj()

	/*
	 * Make sure we will check other processes sharing the mm if this is
	 * not vfrok which wants its own oom_score_adj.
	 * pin the mm so it doesn't go away and get reused after task_unlock
	 */
	if (!task->vfork_done) {
		struct task_struct *p = find_lock_task_mm(task);

		if (p) {
			if (atomic_read(&p->mm->mm_users) > 1) {
				mm = p->mm;
				mmgrab(mm);
			}
			task_unlock(p);
		}
	}

Considering that mmput() done by exit_mm() is done outside of the
task_lock critical section, I wonder how the mm_users load is
synchronized ?

Thanks,

Mathieu

--
Mathieu Desnoyers
EfficiOS Inc.
http://www.efficios.com
* arm64/v4.16-rc1: KASAN: use-after-free Read in finish_task_switch
From: Mark Rutland @ 2018-02-16 18:33 UTC (permalink / raw)
To: linux-arm-kernel

On Fri, Feb 16, 2018 at 05:17:57PM +0000, Mathieu Desnoyers wrote:
> ----- On Feb 16, 2018, at 11:53 AM, Mark Rutland mark.rutland at arm.com wrote:
> > I suspect we have a bogus mmdrop or mmput elsewhere, and do_exit() and
> > finish_task_switch() aren't to blame.
>
> Currently reviewing: fs/proc/base.c: __set_oom_adj()
>
> 	/*
> 	 * Make sure we will check other processes sharing the mm if this is
> 	 * not vfrok which wants its own oom_score_adj.
> 	 * pin the mm so it doesn't go away and get reused after task_unlock
> 	 */
> 	if (!task->vfork_done) {
> 		struct task_struct *p = find_lock_task_mm(task);
>
> 		if (p) {
> 			if (atomic_read(&p->mm->mm_users) > 1) {
> 				mm = p->mm;
> 				mmgrab(mm);
> 			}
> 			task_unlock(p);
> 		}
> 	}
>
> Considering that mmput() done by exit_mm() is done outside of the
> task_lock critical section, I wonder how the mm_users load is
> synchronized ?

That looks suspicious, but I don't think it can result in this particular
problem.

In find_lock_task_mm() we get the task lock, and check mm != NULL, which
means that mm->mm_count >= 1 (thanks to the implicit reference
context_switch()+finish_task_switch() manage). While we hold the task
lock, task->mm can't change beneath our feet, and hence that reference
can't be dropped by finish_task_switch().

Thus, immediately after the mmgrab(), we know mm->mm_count >= 2. That
shouldn't drop below 1 until the subsequent mmdrop(), even after we drop
the task lock, unless there's a misplaced mmdrop() elsewhere. Locally,
mmgrab() and mmdrop() are balanced.

However, if mm_users can be incremented behind our back, we might skip
updating the oom adjustments for other users of the mm.

Thanks,
Mark.
* arm64/v4.16-rc1: KASAN: use-after-free Read in finish_task_switch
From: Catalin Marinas @ 2018-02-19 11:26 UTC (permalink / raw)
To: linux-arm-kernel

On Thu, Feb 15, 2018 at 02:22:39PM +0000, Will Deacon wrote:
> Instead, we've come up with a more plausible sequence that can in theory
> happen on a single CPU:
>
> <task foo calls exit()>
>
> do_exit
>   exit_mm
>     mmgrab(mm);			// foo's mm has count +1
>     BUG_ON(mm != current->active_mm);
>     task_lock(current);
>     current->mm = NULL;
>     task_unlock(current);
>
> <irq and ctxsw to kthread>

[...]

> mmdrop(mm);			// foo's mm has count -1
>
> At this point, we've got an imbalanced count on the mm and could free it
> prematurely as seen in the KASAN log. A subsequent context-switch away
> from foo would therefore result in a use-after-free.

Peter already dismissed an algorithm issue here but I thought I'd give
model checking a go (it only deals with mm_users/mm_count; also added a
heavily simplified exec_mmap() in the loop):

https://git.kernel.org/pub/scm/linux/kernel/git/cmarinas/kernel-tla.git/tree/ctxsw.tla

As expected, it didn't show any problems (though it does not take memory
ordering into account). Now, there are lots of other mmget/mmput and
mmgrab/mmdrop throughout the kernel and finding an imbalanced call needs
more work.

--
Catalin