From: mathieu.desnoyers@efficios.com (Mathieu Desnoyers)
To: linux-arm-kernel@lists.infradead.org
Subject: arm64/v4.16-rc1: KASAN: use-after-free Read in finish_task_switch
Date: Thu, 15 Feb 2018 22:08:56 +0000 (UTC) [thread overview]
Message-ID: <1144433342.22716.1518732536600.JavaMail.zimbra@efficios.com> (raw)
In-Reply-To: <20180215182154.GD15274@arm.com>
----- On Feb 15, 2018, at 1:21 PM, Will Deacon will.deacon at arm.com wrote:
> On Thu, Feb 15, 2018 at 05:47:54PM +0100, Peter Zijlstra wrote:
>> On Thu, Feb 15, 2018 at 02:22:39PM +0000, Will Deacon wrote:
>> > Instead, we've come up with a more plausible sequence that can in theory
>> > happen on a single CPU:
>> >
>> > <task foo calls exit()>
>> >
>> > do_exit
>> > exit_mm
>>
>> If this is the last task of the process, we would expect:
>>
>> mm_count == 1
>> mm_users == 1
>>
>> at this point.
>>
>> > mmgrab(mm); // foo's mm has count +1
>> > BUG_ON(mm != current->active_mm);
>> > task_lock(current);
>> > current->mm = NULL;
>> > task_unlock(current);
>>
>> So the whole active_mm is basically the last 'real' mm, and its purpose
>> is to avoid switch_mm() between user tasks and kernel tasks.
>>
>> A kernel task has !->mm. We do this by incrementing mm_count when
>> switching from user to kernel task and decrementing when switching from
>> kernel to user.
>>
>> What exit_mm() does is change a user task into a 'kernel' task. So it
>> should increment mm_count to mirror the context switch. I suspect this
>> is what the mmgrab() in exit_mm() is for.
>>
>> > <irq and ctxsw to kthread>
>> >
>> > context_switch(prev=foo, next=kthread)
>> > mm = next->mm;
>> > oldmm = prev->active_mm;
>> >
>> > if (!mm) { // True for kthread
>> > next->active_mm = oldmm;
>> > mmgrab(oldmm); // foo's mm has count +2
>> > }
>> >
>> > if (!prev->mm) { // True for foo
>> > rq->prev_mm = oldmm;
>> > }
>> >
>> > finish_task_switch
>> > mm = rq->prev_mm;
>> > if (mm) { // True (foo's mm)
>> > mmdrop(mm); // foo's mm has count +1
>> > }
>> >
>> > [...]
>> >
>> > <ctxsw to task bar>
>> >
>> > context_switch(prev=kthread, next=bar)
>> > mm = next->mm;
>> > oldmm = prev->active_mm; // foo's mm!
>> >
>> > if (!prev->mm) { // True for kthread
>> > rq->prev_mm = oldmm;
>> > }
>> >
>> > finish_task_switch
>> > mm = rq->prev_mm;
>> > if (mm) { // True (foo's mm)
>> > mmdrop(mm); // foo's mm has count +0
>>
>> The context switch into the next user task will then decrement. At this
>> point foo no longer has a reference to its mm, except on the stack.
>>
>> > }
>> >
>> > [...]
>> >
>> > <ctxsw back to task foo>
>> >
>> > context_switch(prev=bar, next=foo)
>> > mm = next->mm;
>> > oldmm = prev->active_mm;
>> >
>> > if (!mm) { // True for foo
>> > next->active_mm = oldmm; // This is bar's mm
>> > mmgrab(oldmm); // bar's mm has count +1
>> > }
>> >
>> >
>> > [return back to exit_mm]
>>
>> Enter mm_users, this counts the number of tasks associated with the mm.
>> We start with 1 in mm_init(), and when it drops to 0, we decrement
>> mm_count. Since we also start with mm_count == 1, this would appear
>> consistent.
>>
>> mmput() // --mm_users == 0, which then results in:
>>
>> > mmdrop(mm); // foo's mm has count -1
>>
>> In the above case, that's the very last reference to the mm, and since
>> we started out with mm_count == 1, this -1 makes 0 and we do the actual
>> free.
>>
>> > At this point, we've got an imbalanced count on the mm and could free it
>> > prematurely as seen in the KASAN log.
>>
>> I'm not sure I see premature. At this point mm_users==0, mm_count==0 and
>> we freed mm and there is no further use of the on-stack mm pointer and
>> foo no longer has a pointer to it in either ->mm or ->active_mm. It's
>> well and proper dead.
>>
>> > A subsequent context-switch away from foo would therefore result in a
>> > use-after-free.
>>
>> At the above point, foo no longer has a reference to mm, we cleared ->mm
>> early, and the context switch to bar cleared ->active_mm. The switch
>> back into foo then results with foo->active_mm == bar->mm, which is
>> fine.
>
> Bugger, you're right. When we switch off foo after freeing the mm, we'll
> actually access it's active mm which points to bar's mm. So whilst this
> can explain part of the kasan splat, it doesn't explain the actual
> use-after-free.
>
> More head-scratching required :(
My current theory: do_exit() gets preempted after having set current->mm
to NULL, and after having issued mmput(), which brings the mm_count down
to 0. Unfortunately, if the scheduler switches from a userspace thread
to a kernel thread, context_switch() loads prev->active_mm which still
points to the now-freed mm, mmgrab the mm, and eventually does mmdrop
in finish_task_switch().
If my understanding is correct, the following patch should help. The idea
is to keep a reference on the mm_count until after we are sure the scheduler
cannot schedule the task anymore. What I'm not sure is where exactly in
do_exit() are we sure the task cannot ever be preempted anymore ?
diff --git a/kernel/exit.c b/kernel/exit.c
index 995453d..fefba68 100644
--- a/kernel/exit.c
+++ b/kernel/exit.c
@@ -764,6 +764,7 @@ void __noreturn do_exit(long code)
{
struct task_struct *tsk = current;
int group_dead;
+ struct mm_struct *mm;
profile_task_exit(tsk);
kcov_task_exit(tsk);
@@ -849,6 +850,10 @@ void __noreturn do_exit(long code)
tsk->exit_code = code;
taskstats_exit(tsk, group_dead);
+ mm = current->mm;
+ if (mm)
+ mmgrab(mm);
+
exit_mm();
if (group_dead)
@@ -920,6 +925,9 @@ void __noreturn do_exit(long code)
lockdep_free_task(tsk);
do_task_dead();
+
+ if (mm)
+ mmdrop(mm);
}
EXPORT_SYMBOL_GPL(do_exit);
--
Mathieu Desnoyers
EfficiOS Inc.
http://www.efficios.com
WARNING: multiple messages have this Message-ID (diff)
From: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
To: Will Deacon <will.deacon@arm.com>
Cc: Peter Zijlstra <peterz@infradead.org>,
Mark Rutland <mark.rutland@arm.com>,
linux-kernel <linux-kernel@vger.kernel.org>,
linux-arm-kernel <linux-arm-kernel@lists.infradead.org>,
Ingo Molnar <mingo@kernel.org>
Subject: Re: arm64/v4.16-rc1: KASAN: use-after-free Read in finish_task_switch
Date: Thu, 15 Feb 2018 22:08:56 +0000 (UTC) [thread overview]
Message-ID: <1144433342.22716.1518732536600.JavaMail.zimbra@efficios.com> (raw)
In-Reply-To: <20180215182154.GD15274@arm.com>
----- On Feb 15, 2018, at 1:21 PM, Will Deacon will.deacon@arm.com wrote:
> On Thu, Feb 15, 2018 at 05:47:54PM +0100, Peter Zijlstra wrote:
>> On Thu, Feb 15, 2018 at 02:22:39PM +0000, Will Deacon wrote:
>> > Instead, we've come up with a more plausible sequence that can in theory
>> > happen on a single CPU:
>> >
>> > <task foo calls exit()>
>> >
>> > do_exit
>> > exit_mm
>>
>> If this is the last task of the process, we would expect:
>>
>> mm_count == 1
>> mm_users == 1
>>
>> at this point.
>>
>> > mmgrab(mm); // foo's mm has count +1
>> > BUG_ON(mm != current->active_mm);
>> > task_lock(current);
>> > current->mm = NULL;
>> > task_unlock(current);
>>
>> So the whole active_mm is basically the last 'real' mm, and its purpose
>> is to avoid switch_mm() between user tasks and kernel tasks.
>>
>> A kernel task has !->mm. We do this by incrementing mm_count when
>> switching from user to kernel task and decrementing when switching from
>> kernel to user.
>>
>> What exit_mm() does is change a user task into a 'kernel' task. So it
>> should increment mm_count to mirror the context switch. I suspect this
>> is what the mmgrab() in exit_mm() is for.
>>
>> > <irq and ctxsw to kthread>
>> >
>> > context_switch(prev=foo, next=kthread)
>> > mm = next->mm;
>> > oldmm = prev->active_mm;
>> >
>> > if (!mm) { // True for kthread
>> > next->active_mm = oldmm;
>> > mmgrab(oldmm); // foo's mm has count +2
>> > }
>> >
>> > if (!prev->mm) { // True for foo
>> > rq->prev_mm = oldmm;
>> > }
>> >
>> > finish_task_switch
>> > mm = rq->prev_mm;
>> > if (mm) { // True (foo's mm)
>> > mmdrop(mm); // foo's mm has count +1
>> > }
>> >
>> > [...]
>> >
>> > <ctxsw to task bar>
>> >
>> > context_switch(prev=kthread, next=bar)
>> > mm = next->mm;
>> > oldmm = prev->active_mm; // foo's mm!
>> >
>> > if (!prev->mm) { // True for kthread
>> > rq->prev_mm = oldmm;
>> > }
>> >
>> > finish_task_switch
>> > mm = rq->prev_mm;
>> > if (mm) { // True (foo's mm)
>> > mmdrop(mm); // foo's mm has count +0
>>
>> The context switch into the next user task will then decrement. At this
>> point foo no longer has a reference to its mm, except on the stack.
>>
>> > }
>> >
>> > [...]
>> >
>> > <ctxsw back to task foo>
>> >
>> > context_switch(prev=bar, next=foo)
>> > mm = next->mm;
>> > oldmm = prev->active_mm;
>> >
>> > if (!mm) { // True for foo
>> > next->active_mm = oldmm; // This is bar's mm
>> > mmgrab(oldmm); // bar's mm has count +1
>> > }
>> >
>> >
>> > [return back to exit_mm]
>>
>> Enter mm_users, this counts the number of tasks associated with the mm.
>> We start with 1 in mm_init(), and when it drops to 0, we decrement
>> mm_count. Since we also start with mm_count == 1, this would appear
>> consistent.
>>
>> mmput() // --mm_users == 0, which then results in:
>>
>> > mmdrop(mm); // foo's mm has count -1
>>
>> In the above case, that's the very last reference to the mm, and since
>> we started out with mm_count == 1, this -1 makes 0 and we do the actual
>> free.
>>
>> > At this point, we've got an imbalanced count on the mm and could free it
>> > prematurely as seen in the KASAN log.
>>
>> I'm not sure I see premature. At this point mm_users==0, mm_count==0 and
>> we freed mm and there is no further use of the on-stack mm pointer and
>> foo no longer has a pointer to it in either ->mm or ->active_mm. It's
>> well and proper dead.
>>
>> > A subsequent context-switch away from foo would therefore result in a
>> > use-after-free.
>>
>> At the above point, foo no longer has a reference to mm, we cleared ->mm
>> early, and the context switch to bar cleared ->active_mm. The switch
>> back into foo then results with foo->active_mm == bar->mm, which is
>> fine.
>
> Bugger, you're right. When we switch off foo after freeing the mm, we'll
> actually access it's active mm which points to bar's mm. So whilst this
> can explain part of the kasan splat, it doesn't explain the actual
> use-after-free.
>
> More head-scratching required :(
My current theory: do_exit() gets preempted after having set current->mm
to NULL, and after having issued mmput(), which brings the mm_count down
to 0. Unfortunately, if the scheduler switches from a userspace thread
to a kernel thread, context_switch() loads prev->active_mm which still
points to the now-freed mm, mmgrab the mm, and eventually does mmdrop
in finish_task_switch().
If my understanding is correct, the following patch should help. The idea
is to keep a reference on the mm_count until after we are sure the scheduler
cannot schedule the task anymore. What I'm not sure is where exactly in
do_exit() are we sure the task cannot ever be preempted anymore ?
diff --git a/kernel/exit.c b/kernel/exit.c
index 995453d..fefba68 100644
--- a/kernel/exit.c
+++ b/kernel/exit.c
@@ -764,6 +764,7 @@ void __noreturn do_exit(long code)
{
struct task_struct *tsk = current;
int group_dead;
+ struct mm_struct *mm;
profile_task_exit(tsk);
kcov_task_exit(tsk);
@@ -849,6 +850,10 @@ void __noreturn do_exit(long code)
tsk->exit_code = code;
taskstats_exit(tsk, group_dead);
+ mm = current->mm;
+ if (mm)
+ mmgrab(mm);
+
exit_mm();
if (group_dead)
@@ -920,6 +925,9 @@ void __noreturn do_exit(long code)
lockdep_free_task(tsk);
do_task_dead();
+
+ if (mm)
+ mmdrop(mm);
}
EXPORT_SYMBOL_GPL(do_exit);
--
Mathieu Desnoyers
EfficiOS Inc.
http://www.efficios.com
next prev parent reply other threads:[~2018-02-15 22:08 UTC|newest]
Thread overview: 34+ messages / expand[flat|nested] mbox.gz Atom feed top
2018-02-14 12:02 arm64/v4.16-rc1: KASAN: use-after-free Read in finish_task_switch Mark Rutland
2018-02-14 12:02 ` Mark Rutland
2018-02-14 15:07 ` Will Deacon
2018-02-14 15:07 ` Will Deacon
2018-02-14 16:51 ` Mark Rutland
2018-02-14 16:51 ` Mark Rutland
2018-02-14 18:53 ` Mathieu Desnoyers
2018-02-14 18:53 ` Mathieu Desnoyers
2018-02-15 11:49 ` Peter Zijlstra
2018-02-15 11:49 ` Peter Zijlstra
2018-02-15 13:13 ` Mathieu Desnoyers
2018-02-15 13:13 ` Mathieu Desnoyers
2018-02-15 14:22 ` Will Deacon
2018-02-15 14:22 ` Will Deacon
2018-02-15 15:33 ` Will Deacon
2018-02-15 15:33 ` Will Deacon
2018-02-15 16:47 ` Peter Zijlstra
2018-02-15 16:47 ` Peter Zijlstra
2018-02-15 18:21 ` Will Deacon
2018-02-15 18:21 ` Will Deacon
2018-02-15 22:08 ` Mathieu Desnoyers [this message]
2018-02-15 22:08 ` Mathieu Desnoyers
2018-02-16 0:02 ` Mathieu Desnoyers
2018-02-16 0:02 ` Mathieu Desnoyers
2018-02-16 8:11 ` Peter Zijlstra
2018-02-16 8:11 ` Peter Zijlstra
2018-02-16 16:53 ` Mark Rutland
2018-02-16 16:53 ` Mark Rutland
2018-02-16 17:17 ` Mathieu Desnoyers
2018-02-16 17:17 ` Mathieu Desnoyers
2018-02-16 18:33 ` Mark Rutland
2018-02-16 18:33 ` Mark Rutland
2018-02-19 11:26 ` Catalin Marinas
2018-02-19 11:26 ` Catalin Marinas
Reply instructions:
You may reply publicly to this message via plain-text email
using any one of the following methods:
* Save the following mbox file, import it into your mail client,
and reply-to-all from there: mbox
Avoid top-posting and favor interleaved quoting:
https://en.wikipedia.org/wiki/Posting_style#Interleaved_style
* Reply using the --to, --cc, and --in-reply-to
switches of git-send-email(1):
git send-email \
--in-reply-to=1144433342.22716.1518732536600.JavaMail.zimbra@efficios.com \
--to=mathieu.desnoyers@efficios.com \
--cc=linux-arm-kernel@lists.infradead.org \
/path/to/YOUR_REPLY
https://kernel.org/pub/software/scm/git/docs/git-send-email.html
* If your mail client supports setting the In-Reply-To header
via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line
before the message body.
This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.