* arm64/v4.16-rc1: KASAN: use-after-free Read in finish_task_switch
@ 2018-02-14 12:02 Mark Rutland
From: Mark Rutland @ 2018-02-14 12:02 UTC (permalink / raw)
To: linux-arm-kernel
Hi,
As a heads-up, I hit the splat below when fuzzing v4.16-rc1 on arm64.
Evidently, we get to finish_task_switch() with rq->prev_mm != NULL, even
though the mm it points to has already been freed. KASAN spots the read of
mm->membarrier_state in membarrier_mm_sync_core_before_usermode(mm), but
AFAICT the underlying issue is independent of the membarrier code, and we
could equally get a splat on the subsequent mmdrop(mm).
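For reference, the tail of finish_task_switch() that we trip over looks
roughly like this (a trimmed sketch of kernel/sched/core.c as of v4.16-rc1,
with the locking and accounting details omitted):

	static struct rq *finish_task_switch(struct task_struct *prev)
	{
		struct rq *rq = this_rq();
		struct mm_struct *mm = rq->prev_mm;

		rq->prev_mm = NULL;

		/* ... */

		if (mm) {
			/* KASAN fires on the mm->membarrier_state read in here ... */
			membarrier_mm_sync_core_before_usermode(mm);
			/* ... and the refcount access in here hits the same freed mm. */
			mmdrop(mm);
		}

		/* ... */

		return rq;
	}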
I've seen this once in ~2500 CPU hours of fuzzing, so it looks pretty
difficult to hit, and I have no reproducer so far.
Syzkaller report below; it's mirrored, along with the Syzkaller log, at [1].
If I hit this again, I'll upload the new info there.
Thanks,
Mark.
[1] https://www.kernel.org/pub/linux/kernel/people/mark/bugs/20180214-finish_task_switch-stale-mm/
==================================================================
BUG: KASAN: use-after-free in __read_once_size include/linux/compiler.h:183 [inline]
BUG: KASAN: use-after-free in membarrier_mm_sync_core_before_usermode include/linux/sched/mm.h:216 [inline]
BUG: KASAN: use-after-free in finish_task_switch+0x590/0x5e8 kernel/sched/core.c:2720
Read of size 4 at addr ffff800073dd4980 by task swapper/1/0
CPU: 1 PID: 0 Comm: swapper/1 Not tainted 4.16.0-rc1-00001-g3fffa6166965-dirty #13
Hardware name: linux,dummy-virt (DT)
Call trace:
dump_backtrace+0x0/0x388 arch/arm64/kernel/time.c:52
show_stack+0x20/0x30 arch/arm64/kernel/traps.c:151
__dump_stack lib/dump_stack.c:17 [inline]
dump_stack+0xd0/0x120 lib/dump_stack.c:53
print_address_description+0x5c/0x2a8 mm/kasan/report.c:262
kasan_report_error mm/kasan/report.c:362 [inline]
kasan_report+0x210/0x340 mm/kasan/report.c:420
__asan_report_load4_noabort+0x18/0x20 mm/kasan/report.c:440
__read_once_size include/linux/compiler.h:183 [inline]
membarrier_mm_sync_core_before_usermode include/linux/sched/mm.h:216 [inline]
finish_task_switch+0x590/0x5e8 kernel/sched/core.c:2720
context_switch kernel/sched/core.c:2860 [inline]
__schedule+0x4f0/0x1558 kernel/sched/core.c:3435
schedule_idle+0x40/0x80 kernel/sched/core.c:3521
do_idle+0x114/0x2e0 kernel/sched/idle.c:269
cpu_startup_entry+0x24/0x28 kernel/sched/idle.c:351
secondary_start_kernel+0x290/0x318 arch/arm64/kernel/smp.c:283
Allocated by task 12964:
save_stack mm/kasan/kasan.c:447 [inline]
set_track mm/kasan/kasan.c:459 [inline]
kasan_kmalloc+0xd0/0x180 mm/kasan/kasan.c:552
kasan_slab_alloc+0x14/0x20 mm/kasan/kasan.c:489
slab_post_alloc_hook mm/slab.h:443 [inline]
slab_alloc_node mm/slub.c:2725 [inline]
slab_alloc mm/slub.c:2733 [inline]
kmem_cache_alloc+0x150/0x240 mm/slub.c:2738
dup_mm kernel/fork.c:1235 [inline]
copy_mm kernel/fork.c:1298 [inline]
copy_process.isra.7.part.8+0x1a10/0x5018 kernel/fork.c:1804
copy_process kernel/fork.c:1617 [inline]
_do_fork+0x178/0x998 kernel/fork.c:2098
SYSC_clone kernel/fork.c:2205 [inline]
SyS_clone+0x48/0x60 kernel/fork.c:2183
el0_svc_naked+0x30/0x34
Freed by task 10882:
save_stack mm/kasan/kasan.c:447 [inline]
set_track mm/kasan/kasan.c:459 [inline]
__kasan_slab_free+0x114/0x220 mm/kasan/kasan.c:520
kasan_slab_free+0x10/0x18 mm/kasan/kasan.c:527
slab_free_hook mm/slub.c:1393 [inline]
slab_free_freelist_hook mm/slub.c:1414 [inline]
slab_free mm/slub.c:2968 [inline]
kmem_cache_free+0x88/0x270 mm/slub.c:2990
__mmdrop+0x164/0x248 kernel/fork.c:604
mmdrop+0x50/0x60 kernel/fork.c:615
__mmput kernel/fork.c:981 [inline]
mmput+0x270/0x338 kernel/fork.c:992
exit_mm kernel/exit.c:544 [inline]
do_exit+0x640/0x2100 kernel/exit.c:852
do_group_exit+0xdc/0x260 kernel/exit.c:968
get_signal+0x2ec/0xfc8 kernel/signal.c:2469
do_signal+0x270/0x4f8 arch/arm64/kernel/signal.c:869
do_notify_resume+0x150/0x1c0 arch/arm64/kernel/signal.c:927
work_pending+0x8/0x14
The buggy address belongs to the object at ffff800073dd4600
which belongs to the cache mm_struct of size 1104
The buggy address is located 896 bytes inside of
1104-byte region [ffff800073dd4600, ffff800073dd4a50)
The buggy address belongs to the page:
page:ffff7e0001cf7400 count:1 mapcount:0 mapping: (null) index:0x0 compound_mapcount: 0
flags: 0x4fffc00000008100(slab|head)
raw: 4fffc00000008100 0000000000000000 0000000000000000 0000000180190019
raw: ffff7e0001d2b000 0000000700000007 ffff800075803000 0000000000000000
page dumped because: kasan: bad access detected
Memory state around the buggy address:
ffff800073dd4880: fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb
ffff800073dd4900: fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb
>ffff800073dd4980: fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb
^
ffff800073dd4a00: fb fb fb fb fb fb fb fb fb fb fc fc fc fc fc fc
ffff800073dd4a80: fc fc fc fc fc fc fc fc fc fc fc fc fc fc fc fc
==================================================================
^ permalink raw reply [flat|nested] 17+ messages in thread* arm64/v4.16-rc1: KASAN: use-after-free Read in finish_task_switch 2018-02-14 12:02 arm64/v4.16-rc1: KASAN: use-after-free Read in finish_task_switch Mark Rutland @ 2018-02-14 15:07 ` Will Deacon 2018-02-14 16:51 ` Mark Rutland 0 siblings, 1 reply; 17+ messages in thread From: Will Deacon @ 2018-02-14 15:07 UTC (permalink / raw) To: linux-arm-kernel Hi Mark, Cheers for the report. These things tend to be a pain to debug, but I've had a go. On Wed, Feb 14, 2018 at 12:02:54PM +0000, Mark Rutland wrote: > As a heads-up, I hit the splat below when fuzzing v4.16-rc1 on arm64. > > Evidently, we get to finish_task_switch() with rq->prev_mm != NULL, > despite rq->prev_mm having been freed. KASAN spots the dereference of > mm->membarrier_state in membarrier_mm_sync_core_before_usermode(mm), but > AFAICT the underlying issue is independent of the membarrier code, and > we could get a splat on the subsequent mmdrop(mm). > > I've seen this once in ~2500 CPU hours of fuzzing, so it looks pretty > difficult to hit, and I have no reproducer so far. > > Syzkaller report below, mirrored with Syzkaller log at [1]. If I hit > this again, I'll upload new info there. The interesting thing here is on the exit path: > Freed by task 10882: > save_stack mm/kasan/kasan.c:447 [inline] > set_track mm/kasan/kasan.c:459 [inline] > __kasan_slab_free+0x114/0x220 mm/kasan/kasan.c:520 > kasan_slab_free+0x10/0x18 mm/kasan/kasan.c:527 > slab_free_hook mm/slub.c:1393 [inline] > slab_free_freelist_hook mm/slub.c:1414 [inline] > slab_free mm/slub.c:2968 [inline] > kmem_cache_free+0x88/0x270 mm/slub.c:2990 > __mmdrop+0x164/0x248 kernel/fork.c:604 ^^ This should never run, because there's an mmgrab() about 8 lines above the mmput() in exit_mm. > mmdrop+0x50/0x60 kernel/fork.c:615 > __mmput kernel/fork.c:981 [inline] > mmput+0x270/0x338 kernel/fork.c:992 > exit_mm kernel/exit.c:544 [inline] Looking at exit_mm: mmgrab(mm); BUG_ON(mm != current->active_mm); /* more a memory barrier than a real lock */ task_lock(current); current->mm = NULL; up_read(&mm->mmap_sem); enter_lazy_tlb(mm, current); task_unlock(current); mm_update_next_owner(mm); mmput(mm); Then the comment already rings some alarm bells: our spin_lock (as used by task_lock) has ACQUIRE semantics, so the mmgrab (which is unordered due to being an atomic_inc) can be reordered with respect to the assignment of NULL to current->mm. If the exit()ing task had recently migrated from another CPU, then that CPU could concurrently run context_switch() and take this path: if (!prev->mm) { prev->active_mm = NULL; rq->prev_mm = oldmm; } which then means finish_task_switch will call mmdrop(): struct mm_struct *mm = rq->prev_mm; [...] if (mm) { membarrier_mm_sync_core_before_usermode(mm); mmdrop(mm); } note that KASAN will be ok at this point, but it explains how the exit_mm path ends up freeing the mm. Then, when the exit()ing CPU calls context_switch, *it* will explode accessing the freed mm. Easiest way to fix this is by guaranteeing the barrier semantics in the exit path. Patch below. 
I guess we'll have to wait another 2500 hours to see if it works :)

Will

--->8

diff --git a/kernel/exit.c b/kernel/exit.c
index 995453d9fb55..f91e8d56b03f 100644
--- a/kernel/exit.c
+++ b/kernel/exit.c
@@ -534,8 +534,9 @@ static void exit_mm(void)
 	}
 	mmgrab(mm);
 	BUG_ON(mm != current->active_mm);
-	/* more a memory barrier than a real lock */
 	task_lock(current);
+	/* Ensure we've grabbed the mm before setting current->mm to NULL */
+	smp_mb__after_spin_lock();
 	current->mm = NULL;
 	up_read(&mm->mmap_sem);
 	enter_lazy_tlb(mm, current);
* arm64/v4.16-rc1: KASAN: use-after-free Read in finish_task_switch 2018-02-14 15:07 ` Will Deacon @ 2018-02-14 16:51 ` Mark Rutland 2018-02-14 18:53 ` Mathieu Desnoyers 0 siblings, 1 reply; 17+ messages in thread From: Mark Rutland @ 2018-02-14 16:51 UTC (permalink / raw) To: linux-arm-kernel On Wed, Feb 14, 2018 at 03:07:41PM +0000, Will Deacon wrote: > Hi Mark, Hi Will, > Cheers for the report. These things tend to be a pain to debug, but I've had > a go. Thanks for taking a look! > On Wed, Feb 14, 2018 at 12:02:54PM +0000, Mark Rutland wrote: > The interesting thing here is on the exit path: > > > Freed by task 10882: > > save_stack mm/kasan/kasan.c:447 [inline] > > set_track mm/kasan/kasan.c:459 [inline] > > __kasan_slab_free+0x114/0x220 mm/kasan/kasan.c:520 > > kasan_slab_free+0x10/0x18 mm/kasan/kasan.c:527 > > slab_free_hook mm/slub.c:1393 [inline] > > slab_free_freelist_hook mm/slub.c:1414 [inline] > > slab_free mm/slub.c:2968 [inline] > > kmem_cache_free+0x88/0x270 mm/slub.c:2990 > > __mmdrop+0x164/0x248 kernel/fork.c:604 > > ^^ This should never run, because there's an mmgrab() about 8 lines above > the mmput() in exit_mm. > > > mmdrop+0x50/0x60 kernel/fork.c:615 > > __mmput kernel/fork.c:981 [inline] > > mmput+0x270/0x338 kernel/fork.c:992 > > exit_mm kernel/exit.c:544 [inline] > > Looking at exit_mm: > > mmgrab(mm); > BUG_ON(mm != current->active_mm); > /* more a memory barrier than a real lock */ > task_lock(current); > current->mm = NULL; > up_read(&mm->mmap_sem); > enter_lazy_tlb(mm, current); > task_unlock(current); > mm_update_next_owner(mm); > mmput(mm); > > Then the comment already rings some alarm bells: our spin_lock (as used > by task_lock) has ACQUIRE semantics, so the mmgrab (which is unordered > due to being an atomic_inc) can be reordered with respect to the assignment > of NULL to current->mm. > > If the exit()ing task had recently migrated from another CPU, then that > CPU could concurrently run context_switch() and take this path: > > if (!prev->mm) { > prev->active_mm = NULL; > rq->prev_mm = oldmm; > } IIUC, on the prior context_switch, next->mm == NULL, so we set next->active_mm to prev->mm. Then, in this context_switch we set oldmm = prev->active_mm (where prev is next from the prior context switch). ... right? > which then means finish_task_switch will call mmdrop(): > > struct mm_struct *mm = rq->prev_mm; > [...] > if (mm) { > membarrier_mm_sync_core_before_usermode(mm); > mmdrop(mm); > } ... then here we use what was prev->active_mm in the most recent context switch. So AFAICT, we're never concurrently accessing a task_struct::mm field here, only prev::{mm,active_mm} while prev is current... [...] > diff --git a/kernel/exit.c b/kernel/exit.c > index 995453d9fb55..f91e8d56b03f 100644 > --- a/kernel/exit.c > +++ b/kernel/exit.c > @@ -534,8 +534,9 @@ static void exit_mm(void) > } > mmgrab(mm); > BUG_ON(mm != current->active_mm); > - /* more a memory barrier than a real lock */ > task_lock(current); > + /* Ensure we've grabbed the mm before setting current->mm to NULL */ > + smp_mb__after_spin_lock(); > current->mm = NULL; ... and thus I don't follow why we would need to order these with anything more than a compiler barrier (if we're preemptible here). What have I completely misunderstood? ;) Thanks, Mark. ^ permalink raw reply [flat|nested] 17+ messages in thread
* arm64/v4.16-rc1: KASAN: use-after-free Read in finish_task_switch 2018-02-14 16:51 ` Mark Rutland @ 2018-02-14 18:53 ` Mathieu Desnoyers 2018-02-15 11:49 ` Peter Zijlstra 2018-02-15 14:22 ` Will Deacon 0 siblings, 2 replies; 17+ messages in thread From: Mathieu Desnoyers @ 2018-02-14 18:53 UTC (permalink / raw) To: linux-arm-kernel ----- On Feb 14, 2018, at 11:51 AM, Mark Rutland mark.rutland at arm.com wrote: > On Wed, Feb 14, 2018 at 03:07:41PM +0000, Will Deacon wrote: >> Hi Mark, > > Hi Will, > >> Cheers for the report. These things tend to be a pain to debug, but I've had >> a go. > > Thanks for taking a look! > >> On Wed, Feb 14, 2018 at 12:02:54PM +0000, Mark Rutland wrote: >> The interesting thing here is on the exit path: >> >> > Freed by task 10882: >> > save_stack mm/kasan/kasan.c:447 [inline] >> > set_track mm/kasan/kasan.c:459 [inline] >> > __kasan_slab_free+0x114/0x220 mm/kasan/kasan.c:520 >> > kasan_slab_free+0x10/0x18 mm/kasan/kasan.c:527 >> > slab_free_hook mm/slub.c:1393 [inline] >> > slab_free_freelist_hook mm/slub.c:1414 [inline] >> > slab_free mm/slub.c:2968 [inline] >> > kmem_cache_free+0x88/0x270 mm/slub.c:2990 >> > __mmdrop+0x164/0x248 kernel/fork.c:604 >> >> ^^ This should never run, because there's an mmgrab() about 8 lines above >> the mmput() in exit_mm. >> >> > mmdrop+0x50/0x60 kernel/fork.c:615 >> > __mmput kernel/fork.c:981 [inline] >> > mmput+0x270/0x338 kernel/fork.c:992 >> > exit_mm kernel/exit.c:544 [inline] >> >> Looking at exit_mm: >> >> mmgrab(mm); >> BUG_ON(mm != current->active_mm); >> /* more a memory barrier than a real lock */ >> task_lock(current); >> current->mm = NULL; >> up_read(&mm->mmap_sem); >> enter_lazy_tlb(mm, current); >> task_unlock(current); >> mm_update_next_owner(mm); >> mmput(mm); >> >> Then the comment already rings some alarm bells: our spin_lock (as used >> by task_lock) has ACQUIRE semantics, so the mmgrab (which is unordered >> due to being an atomic_inc) can be reordered with respect to the assignment >> of NULL to current->mm. >> >> If the exit()ing task had recently migrated from another CPU, then that >> CPU could concurrently run context_switch() and take this path: >> >> if (!prev->mm) { >> prev->active_mm = NULL; >> rq->prev_mm = oldmm; >> } > > IIUC, on the prior context_switch, next->mm == NULL, so we set > next->active_mm to prev->mm. > > Then, in this context_switch we set oldmm = prev->active_mm (where prev > is next from the prior context switch). > > ... right? > >> which then means finish_task_switch will call mmdrop(): >> >> struct mm_struct *mm = rq->prev_mm; >> [...] >> if (mm) { >> membarrier_mm_sync_core_before_usermode(mm); >> mmdrop(mm); >> } > > ... then here we use what was prev->active_mm in the most recent context > switch. > > So AFAICT, we're never concurrently accessing a task_struct::mm field > here, only prev::{mm,active_mm} while prev is current... > > [...] > >> diff --git a/kernel/exit.c b/kernel/exit.c >> index 995453d9fb55..f91e8d56b03f 100644 >> --- a/kernel/exit.c >> +++ b/kernel/exit.c >> @@ -534,8 +534,9 @@ static void exit_mm(void) >> } >> mmgrab(mm); >> BUG_ON(mm != current->active_mm); >> - /* more a memory barrier than a real lock */ >> task_lock(current); >> + /* Ensure we've grabbed the mm before setting current->mm to NULL */ >> + smp_mb__after_spin_lock(); >> current->mm = NULL; > > ... and thus I don't follow why we would need to order these with > anything more than a compiler barrier (if we're preemptible here). > > What have I completely misunderstood? 
;) The compiler barrier would not change anything, because task_lock() already implies a compiler barrier (provided by the arch spin lock inline asm memory clobber). So compiler-wise, it cannot move the mmgrab(mm) after the store "current->mm = NULL". However, given the scenario involves multiples CPUs (one doing exit_mm(), the other doing context switch), the actual order of perceived load/store can be shuffled. And AFAIU nothing prevents the CPU from ordering the atomic_inc() done by mmgrab(mm) _after_ the store to current->mm. I wonder if we should not simply add a smp_mb__after_atomic() into mmgrab() instead ? I see that e.g. futex.c does: static inline void futex_get_mm(union futex_key *key) { mmgrab(key->private.mm); /* * Ensure futex_get_mm() implies a full barrier such that * get_futex_key() implies a full barrier. This is relied upon * as smp_mb(); (B), see the ordering comment above. */ smp_mb__after_atomic(); } It could prevent nasty subtle bugs in other mmgrab() users. Thoughts ? Thanks, Mathieu > > Thanks, > Mark. -- Mathieu Desnoyers EfficiOS Inc. http://www.efficios.com ^ permalink raw reply [flat|nested] 17+ messages in thread
* arm64/v4.16-rc1: KASAN: use-after-free Read in finish_task_switch
From: Peter Zijlstra @ 2018-02-15 11:49 UTC (permalink / raw)
To: linux-arm-kernel

On Wed, Feb 14, 2018 at 06:53:44PM +0000, Mathieu Desnoyers wrote:
> However, given the scenario involves multiples CPUs (one doing exit_mm(),
> the other doing context switch), the actual order of perceived load/store
> can be shuffled. And AFAIU nothing prevents the CPU from ordering the
> atomic_inc() done by mmgrab(mm) _after_ the store to current->mm.
>
> I wonder if we should not simply add a smp_mb__after_atomic() into
> mmgrab() instead ? I see that e.g. futex.c does:

Don't think so, the futex case is really rather special and I suspect
this one is too. I would much rather have explicit comments than
implicit works-by-magic.

As per the rationale used for refcount_t, increments should be
unordered, because you ACQUIRE your object _before_ you can do the
increment.

The futex thing is simply abusing a bunch of implied barriers and
patching up the holes in paths that didn't already imply a barrier in
order to avoid having to add explicit barriers (which had measurable
performance issues).

And here we have explicit ordering outside of the reference counting
too: we want to ensure the reference is incremented before we modify a
second object. This ordering is not at all related to acquiring the
reference, so bundling it seems odd.
* arm64/v4.16-rc1: KASAN: use-after-free Read in finish_task_switch
From: Mathieu Desnoyers @ 2018-02-15 13:13 UTC (permalink / raw)
To: linux-arm-kernel

----- On Feb 15, 2018, at 6:49 AM, Peter Zijlstra peterz at infradead.org wrote:

> On Wed, Feb 14, 2018 at 06:53:44PM +0000, Mathieu Desnoyers wrote:
>> However, given the scenario involves multiples CPUs (one doing exit_mm(),
>> the other doing context switch), the actual order of perceived load/store
>> can be shuffled. And AFAIU nothing prevents the CPU from ordering the
>> atomic_inc() done by mmgrab(mm) _after_ the store to current->mm.
>>
>> I wonder if we should not simply add a smp_mb__after_atomic() into
>> mmgrab() instead ? I see that e.g. futex.c does:
>
> Don't think so, the futex case is really rather special and I suspect
> this one is too. I would much rather have explicit comments than
> implicit works-by-magic.
>
> As per the rationale used for refcount_t, increments should be
> unordered, because you ACQUIRE your object _before_ you can do the
> increment.
>
> The futex thing is simply abusing a bunch of implied barriers and
> patching up the holes in paths that didn't already imply a barrier in
> order to avoid having to add explicit barriers (which had measurable
> performance issues).
>
> And here we have explicit ordering outside of the reference counting
> too: we want to ensure the reference is incremented before we modify a
> second object. This ordering is not at all related to acquiring the
> reference, so bundling it seems odd.

I understand your point. Will's added barrier and comment are fine.

Thanks,

Mathieu

--
Mathieu Desnoyers
EfficiOS Inc.
http://www.efficios.com
* arm64/v4.16-rc1: KASAN: use-after-free Read in finish_task_switch 2018-02-14 18:53 ` Mathieu Desnoyers 2018-02-15 11:49 ` Peter Zijlstra @ 2018-02-15 14:22 ` Will Deacon 2018-02-15 15:33 ` Will Deacon ` (2 more replies) 1 sibling, 3 replies; 17+ messages in thread From: Will Deacon @ 2018-02-15 14:22 UTC (permalink / raw) To: linux-arm-kernel On Wed, Feb 14, 2018 at 06:53:44PM +0000, Mathieu Desnoyers wrote: > ----- On Feb 14, 2018, at 11:51 AM, Mark Rutland mark.rutland at arm.com wrote: > > On Wed, Feb 14, 2018 at 03:07:41PM +0000, Will Deacon wrote: > >> If the exit()ing task had recently migrated from another CPU, then that > >> CPU could concurrently run context_switch() and take this path: > >> > >> if (!prev->mm) { > >> prev->active_mm = NULL; > >> rq->prev_mm = oldmm; > >> } > > > > IIUC, on the prior context_switch, next->mm == NULL, so we set > > next->active_mm to prev->mm. > > > > Then, in this context_switch we set oldmm = prev->active_mm (where prev > > is next from the prior context switch). > > > > ... right? > > > >> which then means finish_task_switch will call mmdrop(): > >> > >> struct mm_struct *mm = rq->prev_mm; > >> [...] > >> if (mm) { > >> membarrier_mm_sync_core_before_usermode(mm); > >> mmdrop(mm); > >> } > > > > ... then here we use what was prev->active_mm in the most recent context > > switch. > > > > So AFAICT, we're never concurrently accessing a task_struct::mm field > > here, only prev::{mm,active_mm} while prev is current... > > > > [...] > > > >> diff --git a/kernel/exit.c b/kernel/exit.c > >> index 995453d9fb55..f91e8d56b03f 100644 > >> --- a/kernel/exit.c > >> +++ b/kernel/exit.c > >> @@ -534,8 +534,9 @@ static void exit_mm(void) > >> } > >> mmgrab(mm); > >> BUG_ON(mm != current->active_mm); > >> - /* more a memory barrier than a real lock */ > >> task_lock(current); > >> + /* Ensure we've grabbed the mm before setting current->mm to NULL */ > >> + smp_mb__after_spin_lock(); > >> current->mm = NULL; > > > > ... and thus I don't follow why we would need to order these with > > anything more than a compiler barrier (if we're preemptible here). > > > > What have I completely misunderstood? ;) > > The compiler barrier would not change anything, because task_lock() > already implies a compiler barrier (provided by the arch spin lock > inline asm memory clobber). So compiler-wise, it cannot move the > mmgrab(mm) after the store "current->mm = NULL". > > However, given the scenario involves multiples CPUs (one doing exit_mm(), > the other doing context switch), the actual order of perceived load/store > can be shuffled. And AFAIU nothing prevents the CPU from ordering the > atomic_inc() done by mmgrab(mm) _after_ the store to current->mm. Mark and I have spent most of the morning looking at this and realised I made a mistake in my original guesswork: prev can't migrate until half way down finish_task_switch when on_cpu = 0, so the access of prev->mm in context_switch can't race with exit_mm() for that task. Furthermore, although the mmgrab() could in theory be reordered with current->mm = NULL (and the ARMv8 architecture allows this too), it's pretty unlikely with LL/SC atomics and the backwards branch, where the CPU would have to pull off quite a few tricks for this to happen. 
Instead, we've come up with a more plausible sequence that can in theory happen on a single CPU: <task foo calls exit()> do_exit exit_mm mmgrab(mm); // foo's mm has count +1 BUG_ON(mm != current->active_mm); task_lock(current); current->mm = NULL; task_unlock(current); <irq and ctxsw to kthread> context_switch(prev=foo, next=kthread) mm = next->mm; oldmm = prev->active_mm; if (!mm) { // True for kthread next->active_mm = oldmm; mmgrab(oldmm); // foo's mm has count +2 } if (!prev->mm) { // True for foo rq->prev_mm = oldmm; } finish_task_switch mm = rq->prev_mm; if (mm) { // True (foo's mm) mmdrop(mm); // foo's mm has count +1 } [...] <ctxsw to task bar> context_switch(prev=kthread, next=bar) mm = next->mm; oldmm = prev->active_mm; // foo's mm! if (!prev->mm) { // True for kthread rq->prev_mm = oldmm; } finish_task_switch mm = rq->prev_mm; if (mm) { // True (foo's mm) mmdrop(mm); // foo's mm has count +0 } [...] <ctxsw back to task foo> context_switch(prev=bar, next=foo) mm = next->mm; oldmm = prev->active_mm; if (!mm) { // True for foo next->active_mm = oldmm; // This is bar's mm mmgrab(oldmm); // bar's mm has count +1 } [return back to exit_mm] mmdrop(mm); // foo's mm has count -1 At this point, we've got an imbalanced count on the mm and could free it prematurely as seen in the KASAN log. A subsequent context-switch away from foo would therefore result in a use-after-free. Assuming others agree with this diagnosis, I'm not sure how to fix it. It's basically not safe to set current->mm = NULL with preemption enabled. Will ^ permalink raw reply [flat|nested] 17+ messages in thread
* arm64/v4.16-rc1: KASAN: use-after-free Read in finish_task_switch
From: Will Deacon @ 2018-02-15 15:33 UTC (permalink / raw)
To: linux-arm-kernel

On Thu, Feb 15, 2018 at 02:22:39PM +0000, Will Deacon wrote:
> Assuming others agree with this diagnosis, I'm not sure how to fix it.
> It's basically not safe to set current->mm = NULL with preemption enabled.

One thing we could try would be to leave current->mm alone and just do the
mmdrop in finish_task_switch at the point where we put the task_struct for
DEAD tasks. mm_update_next_owner might need updating so that it doesn't
re-assign current as the owner and run into a BUG_ON.

Will
* arm64/v4.16-rc1: KASAN: use-after-free Read in finish_task_switch 2018-02-15 14:22 ` Will Deacon 2018-02-15 15:33 ` Will Deacon @ 2018-02-15 16:47 ` Peter Zijlstra 2018-02-15 18:21 ` Will Deacon 2018-02-19 11:26 ` Catalin Marinas 2 siblings, 1 reply; 17+ messages in thread From: Peter Zijlstra @ 2018-02-15 16:47 UTC (permalink / raw) To: linux-arm-kernel On Thu, Feb 15, 2018 at 02:22:39PM +0000, Will Deacon wrote: > Instead, we've come up with a more plausible sequence that can in theory > happen on a single CPU: > > <task foo calls exit()> > > do_exit > exit_mm If this is the last task of the process, we would expect: mm_count == 1 mm_users == 1 at this point. > mmgrab(mm); // foo's mm has count +1 > BUG_ON(mm != current->active_mm); > task_lock(current); > current->mm = NULL; > task_unlock(current); So the whole active_mm is basically the last 'real' mm, and its purpose is to avoid switch_mm() between user tasks and kernel tasks. A kernel task has !->mm. We do this by incrementing mm_count when switching from user to kernel task and decrementing when switching from kernel to user. What exit_mm() does is change a user task into a 'kernel' task. So it should increment mm_count to mirror the context switch. I suspect this is what the mmgrab() in exit_mm() is for. > <irq and ctxsw to kthread> > > context_switch(prev=foo, next=kthread) > mm = next->mm; > oldmm = prev->active_mm; > > if (!mm) { // True for kthread > next->active_mm = oldmm; > mmgrab(oldmm); // foo's mm has count +2 > } > > if (!prev->mm) { // True for foo > rq->prev_mm = oldmm; > } > > finish_task_switch > mm = rq->prev_mm; > if (mm) { // True (foo's mm) > mmdrop(mm); // foo's mm has count +1 > } > > [...] > > <ctxsw to task bar> > > context_switch(prev=kthread, next=bar) > mm = next->mm; > oldmm = prev->active_mm; // foo's mm! > > if (!prev->mm) { // True for kthread > rq->prev_mm = oldmm; > } > > finish_task_switch > mm = rq->prev_mm; > if (mm) { // True (foo's mm) > mmdrop(mm); // foo's mm has count +0 The context switch into the next user task will then decrement. At this point foo no longer has a reference to its mm, except on the stack. > } > > [...] > > <ctxsw back to task foo> > > context_switch(prev=bar, next=foo) > mm = next->mm; > oldmm = prev->active_mm; > > if (!mm) { // True for foo > next->active_mm = oldmm; // This is bar's mm > mmgrab(oldmm); // bar's mm has count +1 > } > > > [return back to exit_mm] Enter mm_users, this counts the number of tasks associated with the mm. We start with 1 in mm_init(), and when it drops to 0, we decrement mm_count. Since we also start with mm_count == 1, this would appear consistent. mmput() // --mm_users == 0, which then results in: > mmdrop(mm); // foo's mm has count -1 In the above case, that's the very last reference to the mm, and since we started out with mm_count == 1, this -1 makes 0 and we do the actual free. > At this point, we've got an imbalanced count on the mm and could free it > prematurely as seen in the KASAN log. I'm not sure I see premature. At this point mm_users==0, mm_count==0 and we freed mm and there is no further use of the on-stack mm pointer and foo no longer has a pointer to it in either ->mm or ->active_mm. It's well and proper dead. > A subsequent context-switch away from foo would therefore result in a > use-after-free. At the above point, foo no longer has a reference to mm, we cleared ->mm early, and the context switch to bar cleared ->active_mm. The switch back into foo then results with foo->active_mm == bar->mm, which is fine. 
* arm64/v4.16-rc1: KASAN: use-after-free Read in finish_task_switch 2018-02-15 16:47 ` Peter Zijlstra @ 2018-02-15 18:21 ` Will Deacon 2018-02-15 22:08 ` Mathieu Desnoyers 0 siblings, 1 reply; 17+ messages in thread From: Will Deacon @ 2018-02-15 18:21 UTC (permalink / raw) To: linux-arm-kernel On Thu, Feb 15, 2018 at 05:47:54PM +0100, Peter Zijlstra wrote: > On Thu, Feb 15, 2018 at 02:22:39PM +0000, Will Deacon wrote: > > Instead, we've come up with a more plausible sequence that can in theory > > happen on a single CPU: > > > > <task foo calls exit()> > > > > do_exit > > exit_mm > > If this is the last task of the process, we would expect: > > mm_count == 1 > mm_users == 1 > > at this point. > > > mmgrab(mm); // foo's mm has count +1 > > BUG_ON(mm != current->active_mm); > > task_lock(current); > > current->mm = NULL; > > task_unlock(current); > > So the whole active_mm is basically the last 'real' mm, and its purpose > is to avoid switch_mm() between user tasks and kernel tasks. > > A kernel task has !->mm. We do this by incrementing mm_count when > switching from user to kernel task and decrementing when switching from > kernel to user. > > What exit_mm() does is change a user task into a 'kernel' task. So it > should increment mm_count to mirror the context switch. I suspect this > is what the mmgrab() in exit_mm() is for. > > > <irq and ctxsw to kthread> > > > > context_switch(prev=foo, next=kthread) > > mm = next->mm; > > oldmm = prev->active_mm; > > > > if (!mm) { // True for kthread > > next->active_mm = oldmm; > > mmgrab(oldmm); // foo's mm has count +2 > > } > > > > if (!prev->mm) { // True for foo > > rq->prev_mm = oldmm; > > } > > > > finish_task_switch > > mm = rq->prev_mm; > > if (mm) { // True (foo's mm) > > mmdrop(mm); // foo's mm has count +1 > > } > > > > [...] > > > > <ctxsw to task bar> > > > > context_switch(prev=kthread, next=bar) > > mm = next->mm; > > oldmm = prev->active_mm; // foo's mm! > > > > if (!prev->mm) { // True for kthread > > rq->prev_mm = oldmm; > > } > > > > finish_task_switch > > mm = rq->prev_mm; > > if (mm) { // True (foo's mm) > > mmdrop(mm); // foo's mm has count +0 > > The context switch into the next user task will then decrement. At this > point foo no longer has a reference to its mm, except on the stack. > > > } > > > > [...] > > > > <ctxsw back to task foo> > > > > context_switch(prev=bar, next=foo) > > mm = next->mm; > > oldmm = prev->active_mm; > > > > if (!mm) { // True for foo > > next->active_mm = oldmm; // This is bar's mm > > mmgrab(oldmm); // bar's mm has count +1 > > } > > > > > > [return back to exit_mm] > > Enter mm_users, this counts the number of tasks associated with the mm. > We start with 1 in mm_init(), and when it drops to 0, we decrement > mm_count. Since we also start with mm_count == 1, this would appear > consistent. > > mmput() // --mm_users == 0, which then results in: > > > mmdrop(mm); // foo's mm has count -1 > > In the above case, that's the very last reference to the mm, and since > we started out with mm_count == 1, this -1 makes 0 and we do the actual > free. > > > At this point, we've got an imbalanced count on the mm and could free it > > prematurely as seen in the KASAN log. > > I'm not sure I see premature. At this point mm_users==0, mm_count==0 and > we freed mm and there is no further use of the on-stack mm pointer and > foo no longer has a pointer to it in either ->mm or ->active_mm. It's > well and proper dead. 
> > > A subsequent context-switch away from foo would therefore result in a > > use-after-free. > > At the above point, foo no longer has a reference to mm, we cleared ->mm > early, and the context switch to bar cleared ->active_mm. The switch > back into foo then results with foo->active_mm == bar->mm, which is > fine. Bugger, you're right. When we switch off foo after freeing the mm, we'll actually access it's active mm which points to bar's mm. So whilst this can explain part of the kasan splat, it doesn't explain the actual use-after-free. More head-scratching required :( Will ^ permalink raw reply [flat|nested] 17+ messages in thread
* arm64/v4.16-rc1: KASAN: use-after-free Read in finish_task_switch 2018-02-15 18:21 ` Will Deacon @ 2018-02-15 22:08 ` Mathieu Desnoyers 2018-02-16 0:02 ` Mathieu Desnoyers 2018-02-16 16:53 ` Mark Rutland 0 siblings, 2 replies; 17+ messages in thread From: Mathieu Desnoyers @ 2018-02-15 22:08 UTC (permalink / raw) To: linux-arm-kernel ----- On Feb 15, 2018, at 1:21 PM, Will Deacon will.deacon at arm.com wrote: > On Thu, Feb 15, 2018 at 05:47:54PM +0100, Peter Zijlstra wrote: >> On Thu, Feb 15, 2018 at 02:22:39PM +0000, Will Deacon wrote: >> > Instead, we've come up with a more plausible sequence that can in theory >> > happen on a single CPU: >> > >> > <task foo calls exit()> >> > >> > do_exit >> > exit_mm >> >> If this is the last task of the process, we would expect: >> >> mm_count == 1 >> mm_users == 1 >> >> at this point. >> >> > mmgrab(mm); // foo's mm has count +1 >> > BUG_ON(mm != current->active_mm); >> > task_lock(current); >> > current->mm = NULL; >> > task_unlock(current); >> >> So the whole active_mm is basically the last 'real' mm, and its purpose >> is to avoid switch_mm() between user tasks and kernel tasks. >> >> A kernel task has !->mm. We do this by incrementing mm_count when >> switching from user to kernel task and decrementing when switching from >> kernel to user. >> >> What exit_mm() does is change a user task into a 'kernel' task. So it >> should increment mm_count to mirror the context switch. I suspect this >> is what the mmgrab() in exit_mm() is for. >> >> > <irq and ctxsw to kthread> >> > >> > context_switch(prev=foo, next=kthread) >> > mm = next->mm; >> > oldmm = prev->active_mm; >> > >> > if (!mm) { // True for kthread >> > next->active_mm = oldmm; >> > mmgrab(oldmm); // foo's mm has count +2 >> > } >> > >> > if (!prev->mm) { // True for foo >> > rq->prev_mm = oldmm; >> > } >> > >> > finish_task_switch >> > mm = rq->prev_mm; >> > if (mm) { // True (foo's mm) >> > mmdrop(mm); // foo's mm has count +1 >> > } >> > >> > [...] >> > >> > <ctxsw to task bar> >> > >> > context_switch(prev=kthread, next=bar) >> > mm = next->mm; >> > oldmm = prev->active_mm; // foo's mm! >> > >> > if (!prev->mm) { // True for kthread >> > rq->prev_mm = oldmm; >> > } >> > >> > finish_task_switch >> > mm = rq->prev_mm; >> > if (mm) { // True (foo's mm) >> > mmdrop(mm); // foo's mm has count +0 >> >> The context switch into the next user task will then decrement. At this >> point foo no longer has a reference to its mm, except on the stack. >> >> > } >> > >> > [...] >> > >> > <ctxsw back to task foo> >> > >> > context_switch(prev=bar, next=foo) >> > mm = next->mm; >> > oldmm = prev->active_mm; >> > >> > if (!mm) { // True for foo >> > next->active_mm = oldmm; // This is bar's mm >> > mmgrab(oldmm); // bar's mm has count +1 >> > } >> > >> > >> > [return back to exit_mm] >> >> Enter mm_users, this counts the number of tasks associated with the mm. >> We start with 1 in mm_init(), and when it drops to 0, we decrement >> mm_count. Since we also start with mm_count == 1, this would appear >> consistent. >> >> mmput() // --mm_users == 0, which then results in: >> >> > mmdrop(mm); // foo's mm has count -1 >> >> In the above case, that's the very last reference to the mm, and since >> we started out with mm_count == 1, this -1 makes 0 and we do the actual >> free. >> >> > At this point, we've got an imbalanced count on the mm and could free it >> > prematurely as seen in the KASAN log. >> >> I'm not sure I see premature. 
At this point mm_users==0, mm_count==0 and >> we freed mm and there is no further use of the on-stack mm pointer and >> foo no longer has a pointer to it in either ->mm or ->active_mm. It's >> well and proper dead. >> >> > A subsequent context-switch away from foo would therefore result in a >> > use-after-free. >> >> At the above point, foo no longer has a reference to mm, we cleared ->mm >> early, and the context switch to bar cleared ->active_mm. The switch >> back into foo then results with foo->active_mm == bar->mm, which is >> fine. > > Bugger, you're right. When we switch off foo after freeing the mm, we'll > actually access it's active mm which points to bar's mm. So whilst this > can explain part of the kasan splat, it doesn't explain the actual > use-after-free. > > More head-scratching required :( My current theory: do_exit() gets preempted after having set current->mm to NULL, and after having issued mmput(), which brings the mm_count down to 0. Unfortunately, if the scheduler switches from a userspace thread to a kernel thread, context_switch() loads prev->active_mm which still points to the now-freed mm, mmgrab the mm, and eventually does mmdrop in finish_task_switch(). If my understanding is correct, the following patch should help. The idea is to keep a reference on the mm_count until after we are sure the scheduler cannot schedule the task anymore. What I'm not sure is where exactly in do_exit() are we sure the task cannot ever be preempted anymore ? diff --git a/kernel/exit.c b/kernel/exit.c index 995453d..fefba68 100644 --- a/kernel/exit.c +++ b/kernel/exit.c @@ -764,6 +764,7 @@ void __noreturn do_exit(long code) { struct task_struct *tsk = current; int group_dead; + struct mm_struct *mm; profile_task_exit(tsk); kcov_task_exit(tsk); @@ -849,6 +850,10 @@ void __noreturn do_exit(long code) tsk->exit_code = code; taskstats_exit(tsk, group_dead); + mm = current->mm; + if (mm) + mmgrab(mm); + exit_mm(); if (group_dead) @@ -920,6 +925,9 @@ void __noreturn do_exit(long code) lockdep_free_task(tsk); do_task_dead(); + + if (mm) + mmdrop(mm); } EXPORT_SYMBOL_GPL(do_exit); -- Mathieu Desnoyers EfficiOS Inc. http://www.efficios.com ^ permalink raw reply related [flat|nested] 17+ messages in thread
* arm64/v4.16-rc1: KASAN: use-after-free Read in finish_task_switch 2018-02-15 22:08 ` Mathieu Desnoyers @ 2018-02-16 0:02 ` Mathieu Desnoyers 2018-02-16 8:11 ` Peter Zijlstra 2018-02-16 16:53 ` Mark Rutland 1 sibling, 1 reply; 17+ messages in thread From: Mathieu Desnoyers @ 2018-02-16 0:02 UTC (permalink / raw) To: linux-arm-kernel ----- On Feb 15, 2018, at 5:08 PM, Mathieu Desnoyers mathieu.desnoyers at efficios.com wrote: > ----- On Feb 15, 2018, at 1:21 PM, Will Deacon will.deacon at arm.com wrote: > >> On Thu, Feb 15, 2018 at 05:47:54PM +0100, Peter Zijlstra wrote: >>> On Thu, Feb 15, 2018 at 02:22:39PM +0000, Will Deacon wrote: >>> > Instead, we've come up with a more plausible sequence that can in theory >>> > happen on a single CPU: >>> > >>> > <task foo calls exit()> >>> > >>> > do_exit >>> > exit_mm >>> >>> If this is the last task of the process, we would expect: >>> >>> mm_count == 1 >>> mm_users == 1 >>> >>> at this point. >>> >>> > mmgrab(mm); // foo's mm has count +1 >>> > BUG_ON(mm != current->active_mm); >>> > task_lock(current); >>> > current->mm = NULL; >>> > task_unlock(current); >>> >>> So the whole active_mm is basically the last 'real' mm, and its purpose >>> is to avoid switch_mm() between user tasks and kernel tasks. >>> >>> A kernel task has !->mm. We do this by incrementing mm_count when >>> switching from user to kernel task and decrementing when switching from >>> kernel to user. >>> >>> What exit_mm() does is change a user task into a 'kernel' task. So it >>> should increment mm_count to mirror the context switch. I suspect this >>> is what the mmgrab() in exit_mm() is for. >>> >>> > <irq and ctxsw to kthread> >>> > >>> > context_switch(prev=foo, next=kthread) >>> > mm = next->mm; >>> > oldmm = prev->active_mm; >>> > >>> > if (!mm) { // True for kthread >>> > next->active_mm = oldmm; >>> > mmgrab(oldmm); // foo's mm has count +2 >>> > } >>> > >>> > if (!prev->mm) { // True for foo >>> > rq->prev_mm = oldmm; >>> > } >>> > >>> > finish_task_switch >>> > mm = rq->prev_mm; >>> > if (mm) { // True (foo's mm) >>> > mmdrop(mm); // foo's mm has count +1 >>> > } >>> > >>> > [...] >>> > >>> > <ctxsw to task bar> >>> > >>> > context_switch(prev=kthread, next=bar) >>> > mm = next->mm; >>> > oldmm = prev->active_mm; // foo's mm! >>> > >>> > if (!prev->mm) { // True for kthread >>> > rq->prev_mm = oldmm; >>> > } >>> > >>> > finish_task_switch >>> > mm = rq->prev_mm; >>> > if (mm) { // True (foo's mm) >>> > mmdrop(mm); // foo's mm has count +0 >>> >>> The context switch into the next user task will then decrement. At this >>> point foo no longer has a reference to its mm, except on the stack. >>> >>> > } >>> > >>> > [...] >>> > >>> > <ctxsw back to task foo> >>> > >>> > context_switch(prev=bar, next=foo) >>> > mm = next->mm; >>> > oldmm = prev->active_mm; >>> > >>> > if (!mm) { // True for foo >>> > next->active_mm = oldmm; // This is bar's mm >>> > mmgrab(oldmm); // bar's mm has count +1 >>> > } >>> > >>> > >>> > [return back to exit_mm] >>> >>> Enter mm_users, this counts the number of tasks associated with the mm. >>> We start with 1 in mm_init(), and when it drops to 0, we decrement >>> mm_count. Since we also start with mm_count == 1, this would appear >>> consistent. >>> >>> mmput() // --mm_users == 0, which then results in: >>> >>> > mmdrop(mm); // foo's mm has count -1 >>> >>> In the above case, that's the very last reference to the mm, and since >>> we started out with mm_count == 1, this -1 makes 0 and we do the actual >>> free. 
>>> >>> > At this point, we've got an imbalanced count on the mm and could free it >>> > prematurely as seen in the KASAN log. >>> >>> I'm not sure I see premature. At this point mm_users==0, mm_count==0 and >>> we freed mm and there is no further use of the on-stack mm pointer and >>> foo no longer has a pointer to it in either ->mm or ->active_mm. It's >>> well and proper dead. >>> >>> > A subsequent context-switch away from foo would therefore result in a >>> > use-after-free. >>> >>> At the above point, foo no longer has a reference to mm, we cleared ->mm >>> early, and the context switch to bar cleared ->active_mm. The switch >>> back into foo then results with foo->active_mm == bar->mm, which is >>> fine. >> >> Bugger, you're right. When we switch off foo after freeing the mm, we'll >> actually access it's active mm which points to bar's mm. So whilst this >> can explain part of the kasan splat, it doesn't explain the actual >> use-after-free. >> >> More head-scratching required :( > > My current theory: do_exit() gets preempted after having set current->mm > to NULL, and after having issued mmput(), which brings the mm_count down > to 0. Unfortunately, if the scheduler switches from a userspace thread > to a kernel thread, context_switch() loads prev->active_mm which still > points to the now-freed mm, mmgrab the mm, and eventually does mmdrop > in finish_task_switch(). > > If my understanding is correct, the following patch should help. The idea > is to keep a reference on the mm_count until after we are sure the scheduler > cannot schedule the task anymore. What I'm not sure is where exactly in > do_exit() are we sure the task cannot ever be preempted anymore ? > Actually, it's the preempt_disable() at the end of do_exit() I was looking for. The following patch moves the mmdrop() right after preempte_disable. In my previous patch, the mmdrop() after do_task_dead (which is noreturn) was rather dumb (leak). diff --git a/kernel/exit.c b/kernel/exit.c index 995453d..2804655 100644 --- a/kernel/exit.c +++ b/kernel/exit.c @@ -764,6 +764,7 @@ void __noreturn do_exit(long code) { struct task_struct *tsk = current; int group_dead; + struct mm_struct *mm; profile_task_exit(tsk); kcov_task_exit(tsk); @@ -849,6 +850,10 @@ void __noreturn do_exit(long code) tsk->exit_code = code; taskstats_exit(tsk, group_dead); + mm = current->mm; + if (mm) + mmgrab(mm); + exit_mm(); if (group_dead) @@ -913,6 +918,8 @@ void __noreturn do_exit(long code) check_stack_usage(); preempt_disable(); + if (mm) + mmdrop(mm); if (tsk->nr_dirtied) __this_cpu_add(dirty_throttle_leaks, tsk->nr_dirtied); exit_rcu(); -- Mathieu Desnoyers EfficiOS Inc. http://www.efficios.com ^ permalink raw reply related [flat|nested] 17+ messages in thread
* arm64/v4.16-rc1: KASAN: use-after-free Read in finish_task_switch
From: Peter Zijlstra @ 2018-02-16 8:11 UTC (permalink / raw)
To: linux-arm-kernel

Please trim..

On Fri, Feb 16, 2018 at 12:02:42AM +0000, Mathieu Desnoyers wrote:
> > My current theory: do_exit() gets preempted after having set current->mm
> > to NULL, and after having issued mmput(), which brings the mm_count down
> > to 0.

No it doesn't.. remember, the last thread of a process enters exit_mm() with:

	mm_count == 1 && mm_users == 1

exit_mm() then does mmgrab():

	mm_count == 2 && mm_users == 1

we then do mmput(), which results in:

	mm_count == 1 && mm_users == 0

That last mm_count reference is for ->active_mm, and will be dropped once
we schedule to the next user task.
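For reference, the two counters being juggled here are managed by the
following helpers. This is a simplified sketch modelled on
include/linux/sched/mm.h and kernel/fork.c around v4.16; the might_sleep()
annotation and the actual address-space teardown in __mmput() are omitted.

	/* mm_count pins the mm_struct itself (kernel/"lazy" references,
	 * including ->active_mm); mm_users pins the address space. */
	static inline void mmgrab(struct mm_struct *mm)
	{
		atomic_inc(&mm->mm_count);
	}

	static inline void mmdrop(struct mm_struct *mm)
	{
		if (unlikely(atomic_dec_and_test(&mm->mm_count)))
			__mmdrop(mm);		/* frees the mm_struct */
	}

	static inline void mmget(struct mm_struct *mm)
	{
		atomic_inc(&mm->mm_users);
	}

	void mmput(struct mm_struct *mm)
	{
		if (atomic_dec_and_test(&mm->mm_users))
			__mmput(mm);		/* tears down mappings, then mmdrop()s */
	}

So the mm_struct itself is only freed once mm_count reaches zero, and the
mm_count reference taken at mm_init() time is only dropped once mm_users
falls to zero, via the mmdrop() at the end of __mmput().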
* arm64/v4.16-rc1: KASAN: use-after-free Read in finish_task_switch
From: Mark Rutland @ 2018-02-16 16:53 UTC (permalink / raw)
To: linux-arm-kernel

Hi,

On Thu, Feb 15, 2018 at 10:08:56PM +0000, Mathieu Desnoyers wrote:
> My current theory: do_exit() gets preempted after having set current->mm
> to NULL, and after having issued mmput(), which brings the mm_count down
> to 0.
>
> Unfortunately, if the scheduler switches from a userspace thread
> to a kernel thread, context_switch() loads prev->active_mm which still
> points to the now-freed mm, mmgrab the mm, and eventually does mmdrop
> in finish_task_switch().

For this to happen, we need to get to the mmput() in exit_mm() with:

	mm->mm_count == 1
	mm->mm_users == 1
	mm == active_mm

... but AFAICT, this cannot happen.

If there's no context_switch between clearing current->mm and the
mmput(), then mm->mm_count >= 2, thanks to the prior mmgrab() and the
active_mm reference (in mm_count) that context_switch+finish_task_switch
manage.

If there is a context_switch between the two, then AFAICT, either:

a) The task re-inherits its old mm as active_mm, and mm_count >= 2. In
   context_switch we mmgrab() the active_mm to inherit it, and in
   finish_task_switch() we drop the oldmm, balancing the mmgrab() with
   an mmdrop().

   e.g. we go task -> kernel_task -> task

b) At some point, another user task is scheduled, and we switch to its
   mm. We don't mmgrab() the active_mm, but we mmdrop() the oldmm, which
   means mm_count >= 1. Since we switched to a new mm, if we switch back
   to the first task, it cannot have its own mm as active_mm.

   e.g. we go task -> other_task -> task

I suspect we have a bogus mmdrop or mmput elsewhere, and do_exit() and
finish_task_switch() aren't to blame.

Thanks,
Mark.
* arm64/v4.16-rc1: KASAN: use-after-free Read in finish_task_switch
From: Mathieu Desnoyers @ 2018-02-16 17:17 UTC (permalink / raw)
To: linux-arm-kernel

----- On Feb 16, 2018, at 11:53 AM, Mark Rutland mark.rutland at arm.com wrote:

> Hi,
>
> On Thu, Feb 15, 2018 at 10:08:56PM +0000, Mathieu Desnoyers wrote:
>> My current theory: do_exit() gets preempted after having set current->mm
>> to NULL, and after having issued mmput(), which brings the mm_count down
>> to 0.
>>
>> Unfortunately, if the scheduler switches from a userspace thread
>> to a kernel thread, context_switch() loads prev->active_mm which still
>> points to the now-freed mm, mmgrab the mm, and eventually does mmdrop
>> in finish_task_switch().
>
> For this to happen, we need to get to the mmput() in exit_mm() with:
>
> 	mm->mm_count == 1
> 	mm->mm_users == 1
> 	mm == active_mm
>
> ... but AFAICT, this cannot happen.
>
> If there's no context_switch between clearing current->mm and the
> mmput(), then mm->mm_count >= 2, thanks to the prior mmgrab() and the
> active_mm reference (in mm_count) that context_switch+finish_task_switch
> manage.
>
> If there is a context_switch between the two, then AFAICT, either:
>
> a) The task re-inherits its old mm as active_mm, and mm_count >= 2. In
>    context_switch we mmgrab() the active_mm to inherit it, and in
>    finish_task_switch() we drop the oldmm, balancing the mmgrab() with
>    an mmdrop().
>
>    e.g. we go task -> kernel_task -> task
>
> b) At some point, another user task is scheduled, and we switch to its
>    mm. We don't mmgrab() the active_mm, but we mmdrop() the oldmm, which
>    means mm_count >= 1. Since we switched to a new mm, if we switch back
>    to the first task, it cannot have its own mm as active_mm.
>
>    e.g. we go task -> other_task -> task
>
> I suspect we have a bogus mmdrop or mmput elsewhere, and do_exit() and
> finish_task_switch() aren't to blame.

Currently reviewing: fs/proc/base.c: __set_oom_adj()

	/*
	 * Make sure we will check other processes sharing the mm if this is
	 * not vfrok which wants its own oom_score_adj.
	 * pin the mm so it doesn't go away and get reused after task_unlock
	 */
	if (!task->vfork_done) {
		struct task_struct *p = find_lock_task_mm(task);

		if (p) {
			if (atomic_read(&p->mm->mm_users) > 1) {
				mm = p->mm;
				mmgrab(mm);
			}
			task_unlock(p);
		}
	}

Considering that mmput() done by exit_mm() is done outside of the
task_lock critical section, I wonder how the mm_users load is
synchronized ?

Thanks,

Mathieu

--
Mathieu Desnoyers
EfficiOS Inc.
http://www.efficios.com
* arm64/v4.16-rc1: KASAN: use-after-free Read in finish_task_switch
From: Mark Rutland @ 2018-02-16 18:33 UTC (permalink / raw)
To: linux-arm-kernel

On Fri, Feb 16, 2018 at 05:17:57PM +0000, Mathieu Desnoyers wrote:
> ----- On Feb 16, 2018, at 11:53 AM, Mark Rutland mark.rutland at arm.com wrote:
> > I suspect we have a bogus mmdrop or mmput elsewhere, and do_exit() and
> > finish_task_switch() aren't to blame.
>
> Currently reviewing: fs/proc/base.c: __set_oom_adj()
>
> 	/*
> 	 * Make sure we will check other processes sharing the mm if this is
> 	 * not vfrok which wants its own oom_score_adj.
> 	 * pin the mm so it doesn't go away and get reused after task_unlock
> 	 */
> 	if (!task->vfork_done) {
> 		struct task_struct *p = find_lock_task_mm(task);
>
> 		if (p) {
> 			if (atomic_read(&p->mm->mm_users) > 1) {
> 				mm = p->mm;
> 				mmgrab(mm);
> 			}
> 			task_unlock(p);
> 		}
> 	}
>
> Considering that mmput() done by exit_mm() is done outside of the
> task_lock critical section, I wonder how the mm_users load is
> synchronized ?

That looks suspicious, but I don't think it can result in this particular
problem.

In find_lock_task_mm() we get the task lock, and check mm != NULL, which
means that mm->mm_count >= 1 (thanks to the implicit reference
context_switch()+finish_task_switch() manage). While we hold the task
lock, task->mm can't change beneath our feet, and hence that reference
can't be dropped by finish_task_switch().

Thus, immediately after the mmgrab(), we know mm->mm_count >= 2. That
shouldn't drop below 1 until the subsequent mmdrop(), even after we drop
the task lock, unless there's a misplaced mmdrop() elsewhere. Locally,
mmgrab() and mmdrop() are balanced.

However, if mm_users can be incremented behind our back, we might skip
updating the oom adjustments for other users of the mm.

Thanks,
Mark.
* arm64/v4.16-rc1: KASAN: use-after-free Read in finish_task_switch
From: Catalin Marinas @ 2018-02-19 11:26 UTC (permalink / raw)
To: linux-arm-kernel

On Thu, Feb 15, 2018 at 02:22:39PM +0000, Will Deacon wrote:
> Instead, we've come up with a more plausible sequence that can in theory
> happen on a single CPU:
>
> <task foo calls exit()>
>
> do_exit
>   exit_mm
>     mmgrab(mm);			// foo's mm has count +1
>     BUG_ON(mm != current->active_mm);
>     task_lock(current);
>     current->mm = NULL;
>     task_unlock(current);
>
> <irq and ctxsw to kthread>

[...]

> mmdrop(mm);			// foo's mm has count -1
>
> At this point, we've got an imbalanced count on the mm and could free it
> prematurely as seen in the KASAN log. A subsequent context-switch away
> from foo would therefore result in a use-after-free.

Peter already dismissed an algorithm issue here but I thought I'd give
model checking a go (it only deals with mm_users/mm_count; also added a
heavily simplified exec_mmap() in the loop):

https://git.kernel.org/pub/scm/linux/kernel/git/cmarinas/kernel-tla.git/tree/ctxsw.tla

As expected, it didn't show any problems (though it does not take memory
ordering into account). Now, there are lots of other mmget/mmput and
mmgrab/mmdrop throughout the kernel and finding an imbalanced call needs
more work.

--
Catalin